key: cord-018459-isbc1r2o authors: munjal, geetika; hanmandlu, madasu; srivastava, sangeet title: phylogenetics algorithms and applications date: 2018-12-10 journal: ambient communications and computer systems doi: 10.1007/978-981-13-5934-7_17 sha: doc_id: 18459 cord_uid: isbc1r2o phylogenetics is a powerful approach in finding evolution of current day species. by studying phylogenetic trees, scientists gain a better understanding of how species have evolved while explaining the similarities and differences among species. the phylogenetic study can help in analysing the evolution and the similarities among diseases and viruses, and further help in prescribing their vaccines against them. this paper explores computational solutions for building phylogeny of species along with highlighting benefits of alignment-free methods of phylogenetics. the paper has also discussed the application of phylogenetic study in disease diagnosis and evolution. phylogenetics can be considered as one of the best tools for understanding the spread of contagious disease, for example, transmission of the human immunodeficiency virus (hiv) and the origin and subsequent evolution of the severe acute respiratory syndrome (sars) associated coronavirus (scov) [1] . earlier, morphological traits were used for assessing similarities between species and building phylogenetic trees. presently, phylogenetics relies on information extracted from genetic material such as deoxyribonucleic acid (dna), ribonucleic acid (rna) or protein sequences [2] . methods used for phylogenetic inference have changed drastically during the past two decades: from alignment-based to alignment-free methods [3] . this paper has reviewed various methods under phylogenetic tree construction from character to distance methods and alignment-based to alignment-free methods. a brief review of phylogenetic tree applications is also given in cancer studies. 
a phylogenetic tree can be unrooted or rooted, implying directions corresponding to evolutionary time, i.e. the species at the leaves of a tree relate to the current day species. the species can be expressed as dna strings which are formed by combining four nucleotides a, t, c and g (a-adenine, t-thymine, c-cytosine and g-guanine). in literature, various string processing algorithms are reported which can quickly analyse these dna and rna sequences and build a phylogeny of sequences or species based on their similarity and dissimilarity. a high similarity among two sequences usually implies significant functional or structural likeliness, and these sequences are closely related in the phylogenetic tree. to get more precise information about the extent of similarity to some other sequence stored in a database, we must be able to compare sequences quickly with a set of sequences. for this, we need to perform the multiple sequence comparison. dynamic programming concepts facilitate this comparison using alignment methods, but it involves more computation. moreover, the iterative computational steps limit its utility for long length sequences [3] . alignment-free methods overcome this limitation as they follow alternative metrics like word frequency or sequence entropy for finding similarity between sequences. phylogenetic tree generation consists of sequence alignment where the resulting tree reveals how alignment can influence the tree formation. alignment-based methodologies are probably the most widely used tools in sequence analysis problems [4] . they consist of arranging two sequences: one on the top of another to highlight their common symbols and substrings. an alignment method is based on alignment parameters including insertion, deletions and gaps which play a pivotal role in the construction of the phylogenetic tree. a phylogenetic tree is formed as an outcome of sequence analysis performed on the dna or rna strings [5] . 
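the dynamic-programming comparison mentioned above can be illustrated with a minimal sketch of needleman-wunsch global alignment scoring. the function name and the scoring values (match, mismatch, gap) are illustrative assumptions, not parameters given in the paper. the nested loops make the quadratic cost visible, which is why such methods become expensive for long sequences.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Return the Needleman-Wunsch score of the best global alignment of a and b.

    The scoring values are illustrative defaults, not taken from the paper.
    """
    # prev[j] is the best score for aligning the processed prefix of a with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # aligning a[:i] against an empty prefix costs i gaps
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            gap_in_b = prev[j] + gap      # a[i-1] aligned to a gap
            gap_in_a = curr[j - 1] + gap  # b[j-1] aligned to a gap
            curr.append(max(diagonal, gap_in_b, gap_in_a))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(global_alignment_score("ACGT", "AGT"))
```

only two rows of the table are kept, so memory grows with the shorter dimension while time still grows with the product of the two sequence lengths.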
sequence comparison reveals the patterns of shared history between species, helping in the prediction of ancestral states. the comparison of sequences also helps in understanding the biology of living organisms, which is required to find similarities and relationships among species. for sequence comparison, we can follow alignment-based or alignment-free methods [3, 6, 7] . sequence alignment is a method to identify homologous sequences. it is categorized as pairwise alignment, in which only two sequences are compared at a time, or multiple sequence alignment, in which more than two sequences are compared. an alignment can be global or local [8, 9] . these alignment-based algorithms can also be used with distance methods to express the similarity between two sequences, reflecting the number of changes in each sequence. figure 1 gives a hierarchical view of various methods for phylogenetic tree building. the character-based methods compare all sequences simultaneously, considering one character/site at a time. these are maximum parsimony and maximum likelihood. these methods consider the variation in a set of sequences [10] . both approaches search for the tree with the best score: maximum parsimony prefers the tree that requires the smallest number of changes, while maximum likelihood prefers the tree under which the observed sequences are most probable. the maximum parsimony method suffers badly from long-branch attraction and gives the least information about the branch lengths [10] . in such cases, if two long external branches are separated by a short internal branch, the method can produce an incorrect tree. some of the salient features of character-based methods are mentioned in table 1 . distance-based methods use the dissimilarity (the distance) between two sequences to construct trees. they are much less computationally intensive than the character-based methods and are mostly accurate, as they take mutations into account. for tree generation, hierarchical clustering is generally used, in which clusters are merged step by step into a dendrogram.
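as a hedged illustration of the distance-based approach, the sketch below computes the proportion of differing sites (the p-distance) between already aligned sequences of equal length and picks the closest pair, which is the first merge a hierarchical clustering would perform. the function names and the toy sequences are assumptions made for illustration only.

```python
from itertools import combinations


def p_distance(a, b):
    """Proportion of sites at which two aligned, equal-length sequences differ."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be aligned, non-empty and of equal length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def closest_pair(sequences):
    """Return (name1, name2, distance) for the least distant pair of sequences.

    sequences maps a label to an aligned sequence string.
    """
    best = None
    for name1, name2 in combinations(sorted(sequences), 2):
        d = p_distance(sequences[name1], sequences[name2])
        if best is None or d < best[2]:
            best = (name1, name2, d)
    return best


if __name__ == "__main__":
    toy = {"s1": "ACGTAC", "s2": "ACGTAA", "s3": "TTGTCA"}  # hypothetical data
    print(closest_pair(toy))
```

repeating this step on merged clusters, with a rule for the distance between clusters, gives agglomerative schemes such as upgma.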
table 1 briefly compares various phylogenetic tree construction methods. multiple alignments of related sequences may often yield the most helpful information on its phylogeny. however, it can produce incorrect results when applied to more divergent sequence rearrangements [3] . some computationally intensive multiple alignment methods align sequences strictly based on the order in which they receive them. multiple sequence alignment methods emphasize that more closely related sequences should be aligned first. in cases of sequences being less related to one another, however, sharing a common ancestor may be clustered separately [11] . this implies that they can be more accurately aligned, but may result in incorrect phylogeny. alignment can provide an optimized tree if a recursive approach is followed; however, this will increase the complexity of the problem. if the differences among the lengths of sequences are very high, the alignment performance significantly impacts tree generation. the use of dynamic programming in alignment makes computation more complicated, and iterative steps limit their utility for large datasets. therefore, consistent efforts have been made in developing and improving multiple sequence alignment methods for supporting variable length sequences with high accuracy and also for aligning a larger number of sequences simultaneously. because of the problems associated with alignment-based phylogeny the importance of alignment-free methods is apparent [3] . hence, the alignment quality affects the relationship created in a phylogenetic tree based on the consideration discussed above. alignment-free methods proposed in recent years can be classified into various categories as shown in fig. 1 . these include k-tuple based on the word frequencies, methods that represent the sequence without using the word frequencies, i.e. compression algorithms probabilistic methods and information theory-based method. 
in the k-tuple method, a genetic sequence is represented by a frequency vector of fixed-length subsequences, and similarity or dissimilarity measures are computed from these frequency vectors. the probabilistic methods represent the sequences using the transition matrix of a markov chain [12] of a pre-specified order, and two sequences are compared by finding the distance between their transition matrices. graphical representation comprising 2d or 3d or even 20d methods provides an easy way to view, sort and compare various sequences. graphical representation further helps in recognizing major characteristics among similar biological sequences. as discussed, the k-tuple method uses k-words to characterize the compositional features of a sequence numerically. a biological sequence is converted into a vector or a matrix composed of word frequencies. k-word frequencies are fast to compute and can be applied to full sequences. the problem with the k-tuple method is that a large value of k poses a challenge in computing time and space, and k-word methods underestimate or even ignore the importance of where a word occurs in the sequence. the string-based distance measure uses substring matches with k mismatches. cancer research is considered one of the most significant areas in the medical community. mutations in genomic sequences are responsible for cancer development and increased aggressiveness in patients [13, 14] . the combination of all such gene mutations, or progression pathways, across a population can be summarized in a phylogeny describing the different evolutionary pathways [9] . the phylogenetic tree can be applied to finding similarities among breast cancer subtypes based on gene data [14, 15] . the discovery of genes associated with cancer subtypes helps researchers to map different pathways and to classify cancer subtypes according to their mutations.
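the k-tuple representation described in this section can be sketched as follows: each sequence becomes a vector of normalized k-word frequencies, and two sequences are compared by a distance between their vectors. the use of euclidean distance, the function names and the four-letter dna alphabet are illustrative assumptions; the paper does not prescribe a particular distance.

```python
from itertools import product
from math import sqrt


def kmer_frequencies(seq, k, alphabet="acgt"):
    """Return the normalized k-word frequency vector of seq.

    The vector has len(alphabet) ** k entries, in lexicographic word order.
    Windows containing symbols outside the alphabet are skipped.
    """
    words = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = dict.fromkeys(words, 0)
    total = 0
    seq = seq.lower()
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if word in counts:
            counts[word] += 1
            total += 1
    return [counts[w] / total if total else 0.0 for w in words]


def euclidean(u, v):
    """Euclidean distance between two equal-length frequency vectors."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


if __name__ == "__main__":
    a = kmer_frequencies("acgtacgtac", 2)  # hypothetical sequences
    b = kmer_frequencies("acgtttgtac", 2)
    print(euclidean(a, b))
```

the vector length grows as 4^k for dna, which is the time and space cost of large k noted above, and the counts discard where each word occurs.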
methods of phylogenetic tree inference have proliferated in cancer genome studies such as breast cancer [13] . phylogenetics can capture important mutational events among different cancer types; a network approach can also capture tumour similarities. it has been observed from the literature that in cancer the driver genes change the disease progression and even affect the participation of other genes, thus generating a gene interaction network. phylogenetic methods can solve the problem of class prediction by using a classification tree. phylogenetic methods give us a deeper understanding of biological heterogeneity among cancer subtypes. this paper has focused on the various methods of sequence analysis used to generate phylogenetic trees. the limitations associated with sequence alignment methods led to the development of alignment-free sequence analysis. however, most of the existing alignment-free methods are unable to build an accurate tree, so more refinement is required in alignment-free methods. the phylogenetic study is not limited to species evolution but extends to disease evolution as well. extending phylogenetics to disease diagnosis can open up new treatment options and improve the understanding of disease progression.
use of phylogenetics in the molecular epidemiology and evolutionary studies of viral infections reconstructing optimal phylogenetic trees : a challenge in experimental algorithmics editorial: alignment-free methods in computational biology analyzing dna strings using information theory concepts modified k-tuple method for the construction of phylogenetic trees the evolution of tumour phylogenetics: principles and practice on maximum entropy principle for sequence comparison a general method applicable to the search for similarity in the amino acid sequence of two proteins comparison of biosequences approximate maximum parsimony and ancestral maximum likelihood phylogenetic trees in bioinformatics weighted relative entropy for phylogenetic tree based on 2-step markov model phylooncology: understanding cancer through phylogenetic analysis tumor classification using phylogenetic methods on expression data novel gene selection method for breast cancer classification sequence analysis integrating alignment-based and alignmentfree sequence similarity measures for biological sequence classification hidden markov models with applications to dna sequence analysis is multiple-sequence alignment required for accurate inference of phylogeny? 
constructing phylogenetic trees using multiple sequence alignment phylogenetic trees in bioinformatics constructing phylogenetic trees using maximum likelihood applications and algorithms for inference of huge phylogenetic trees: a review upgma clustering revisited: a weight-driven approach to transitive approximation an improved phylogenetic tree comparison method neighbor-net: an agglomerative method for the construction of phylogenetic networks bioinformatics: a practical guide to the analysis of genes and proteins sequence similarity using composition method sequence analysis kmacs: the k-mismatch average common substring approach to alignment-free sequence comparison
key: cord-014674-ey29970v authors: nan title: dreizehnter bericht nach inkrafttreten des gentechnikgesetzes (gentg) für den zeitraum vom 1.1.2002 bis 31.12.2002 : die arbeit der zentralen kommission für die biologische sicherheit (zkbs) im jahr 2002 date: 2003 journal: bundesgesundheitsblatt gesundheitsforschung gesundheitsschutz doi: 10.1007/s00103-003-0614-5 sha: doc_id: 14674 cord_uid: ey29970v nan the central commission for biological safety (zentrale kommission für die biologische sicherheit, zkbs) examines and assesses safety-relevant questions under the provisions of the german genetic engineering act (gentechnikgesetz, gentg), issues recommendations on them, and advises the federal government and the federal states (länder) on safety-relevant questions of genetic engineering. since the gentg emerged mainly from the national implementation of the eu genetic engineering directives, developments in international and national genetic engineering regulation are of particular interest to the zkbs. in the area of international regulation, it is worth highlighting for the reporting year 2002 that the "intergovernmental committee for the cartagena protocol" (iccp) was established, which accompanies the preparations for the ratification and proper implementation of the "biosafety protocol".
in preparation for the "third act amending the genetic engineering act" (drittes gesetz zur änderung des gentechnikgesetzes), which primarily transposes directive 2001/18/eu into national law, 2 working groups of zkbs members dealt with additional, technically difficult aspects (including "safety classification", § 7 gentsv). this third amending act is also expected to take the implementation of the cartagena protocol into account (see above). from december 2001 to september 2002, on the initiative of the federal ministry of consumer protection, food and agriculture (bmvel), a "discourse on green genetic engineering" (diskurs zur grünen gentechnik) took place, in which representatives of societal groups and affected associations participated. over the 10 months, various aspects of the use of genetic engineering in agriculture and food were discussed 1 . at the end of each sub-discourse a summary was drawn up. a jointly supported results report, which also presented minority positions, was presented at a closing event in september 2002. the zkbs noted that this "discourse on green genetic engineering" was not a continuation of the so-called "chancellor's initiative" (kanzlerinitiative), because the aspects essential for assessing biological safety, which were meant to play a priority role in the "chancellor's initiative" (gaining practical experience from completed or ongoing cultivation projects with gmos), had already been excluded in the design of the discourse. the zkbs could not identify any factual arguments for conducting this "discourse on green genetic engineering".
overall, the zkbs cannot necessarily rate as positive the changes in the international and national regulation of "green genetic engineering", nor its development and application. for this reason, for the year 2003, in which political responsibility for genetic engineering will pass from the federal ministry of health and social security (bmgs) to the federal ministry of consumer protection, food and agriculture (bmvel), the zkbs again expresses its expectation of a turnaround which, while maintaining appropriate, scientifically justifiable precautionary measures, also creates "the legal framework for the research, development, use and promotion of the scientific, technical and economic possibilities of genetic engineering" (gentg § 1, para. 2). the situation within the european union regarding the authorisation procedures for placing on the market products that contain genetically modified organisms has now been stagnating for the fourth year, unchanged since 1998. neither the authorisation procedures under directive 90/220/eec, some of which have been pending for several years, nor those under the novel foods regulation have been concluded (table 4 in [3] ). to fulfil the tasks of the zkbs in examining safety-relevant questions of genetic engineering, the members of the commission are appointed from different disciplines. the composition of the zkbs is governed by § 4 of the genetic engineering act.
it stipulates that the commission is composed of ◗ 10 experts who have particular, and where possible international, experience in the fields of microbiology, cell biology, virology, genetics, hygiene, ecology and safety engineering; of these, at least 6 must work in the field of the recombination of nucleic acids; each of the fields named must be represented by at least one expert, and the field of ecology by at least 2 experts,
in their publication "transgenic dna introgressed into traditional maize landraces in oaxaca, mexico" [7]
appendix: "letter to the editor" of the journal 'nature'
ladies and gentlemen, in their article quist and chapela (1) report on the detection of transgenic dna constructs in native maize landraces grown in remote mountains in oaxaca, mexico. they thereby raise concerns about the unintended introgression of transgenic maize traits into landraces ('criollo') in the centre of their origin, resulting in a danger for the natural diversity of this crop plant. by use of molecular methods including pcr, inverted pcr (ipcr) and sequencing of the amplified dna, they obtained data from which they conclude (i) that the nucleotide sequence of the cauliflower mosaic virus (cmv) 35s promoter [p-35s; contained in various lines of genetically modified maize; 2] is present in the maize genomes of several 'criollo' samples, (ii) that in two instances these promoter sequences were flanked by adh1 sequences which are also neighbouring the p-35s in the transgenic construct of novartis bt11 maize, and (iii) that the transgenic p-35s sequences were "embedded within various genomic contexts" of the 'criollo' samples.
we have closely examined the experimental data and the analyses of the nucleotide sequences presented in the report. we find that, aside from problematic details of the experimental design and some erratic presentations of the data, the results of the study do not provide evidence for the introgression of recombinant dna from transgenic crop plants into the genomes of 'criollo' maize. our detailed analyses of the data, including the nucleotide sequences (1) which the authors have deposited in the nucleotide sequence database of genbank, clearly show that none of the authors' conclusions is justified and that the far-reaching interpretations on the endangered diversity of landraces therefore lack any basis. our position with respect to the presented data is detailed in the following. 1. in order to prove that the p-35s sequences that were amplified by pcr from the dna samples prepared from the corn cobs were not derived from contaminating cmv, it is necessary to show that the observed p-35s sequences are linked on one side or on both to maize dna. to identify the sequences flanking the p-35s the authors applied ipcr. the templates for ipcr were ecorv restriction fragments of the maize dna circularized by ligation. ecorv cuts at a site in the middle of the p-35s sequence, and therefore dna amplified by ipcr from ligation products with primer pairs matching the right and left parts of the p-35s sequence should contain one restored ecorv cleavage site. eight sequenced ipcr products were presented in fig. 2 of ref. 1 (sequences af434754 to af434761). none of them contains the ecorv site (see box in our fig.). this casts doubt on the authors' assumption that restriction by ecorv and ligation had created the circular dna products necessary for ipcr. 2. next we examined whether the nucleotides directly ahead of the four applied primers and expected to be identical to the p-35s sequence were present in the eight sequenced ipcr amplification products. as shown in our fig.
the primers icmv2 and icmv1 were used for ipcr on the left side of the 35s promoter sequence, the primers icmv4 and icmv3 on the right side. [at this point two details of the authors' experimental setup must be criticized. first, the binding sites of icmv2 and icmv3 are located outside of the p-35s region initially amplified by primers cm01 and cm02 from the dna samples, and therefore the presence of these binding sites in the sample dna was not certain. second, icmv2 has 10 nucleotides at the 3' end which do not match the 35s promoter region (wavy line in icmv2 in our fig.) and therefore is not expected to allow specific amplification of p-35s sequences.] in the sequences of the amplification products af434754, -55, -56, -57, in which the ipcr primer sequences can be identified, the nucleotide sequences ahead of the primers are not from p-35s. the expected p-35s sequence is only partially present ahead of icmv1 in af434758. in five cases the primer sequences were not discernible (af434758, -59, -60). in one case (af434761) the expected p-35s nucleotides were present. these data indicate that, perhaps with the exception of af434761, the template of the pcr amplifications was not the 35s promoter region. three other inconsistencies between the sequences af434754 to af434761 and fig. 2 are apparent (1) . first, the sequences described as "downstream" relative to the p-35s by quist and chapela are in fact "upstream" sequences, and correspondingly the "upstream" sequences of quist and chapela are in fact "downstream" (compare our fig. with fig. 2 of 1) . second, the vertical lines in fig. 2 which according to quist and chapela indicate the ends of cmv sequences are misleading since, as outlined above, they only mark the 3' ends of the primers employed (except for the sequence af434761; the source of this sequence is termed b3 in fig. 2, b2 in the sequence deposited in genbank, and a2 in the supplement to ref. 1). thirdly, the thin lines in fig. 2 of ref.
1 supposedly indicating the parts of cmv dna in the amplified sequences are unduly extended (they essentially always represent only the pcr primers), and several of them should not be there at all because primer sequences cannot be identified. this is the case at one side of a2 (af434758) and both sides of b3 (af434759). 3. with the help of blast searches we characterized those parts of the sequences of the ipcr amplification products that were denoted by quist and chapela in their fig. 2 as regions flanking the cmv p-35s sequence. we find that the sequence of af434754 denoted adh1 in the k1 source of fig. 2 does not match the maize adh1 gene. rather, it matches a sequence located about 40,000 nucleotides away from the adh1 gene (but still within the database entry of about 160,000 bp termed adh1). a corresponding blast search result was also obtained with af434755 from the a3 source. therefore, the conclusion that "sequences adjacent to the p-35s dna were diverse" in the maize genome cannot be drawn. 4. we examined whether the identified regions in the maize genomic dna from which pcr amplification products were obtained by the authors would perhaps be flanked by primer binding sites. for this we performed pairwise blast alignments of the ipcr sequences with the five matching maize genomic sequences. by adjusting the alignment parameters we were able to identify putative primer binding sites at the expected distances in the maize genome target sequences corresponding to af434754 to af434757, and also one binding site for the single primer sequence identified in af434758. this indicates that these five assumed ipcr amplification products were most likely obtained by normal pcr amplification directly from continuous sequences of the maize genome that happened to have flanking regions with similarity to the primers. no primer binding sites were found in sequences af434759 and af434760.
these findings support our conclusion from section 2 that the templates for at least seven of the eight ipcr products were not p-35s sequences. rather, the templates were sequences of the maize genome related to retroviral sequences, which frequently had reasonably matching primer binding sites. evidence for the integration of p-35s sequences into the 'criollo' genome was not obtained because in none of the eight cases studied could a linkage of p-35s sequences (aside from the primers used for pcr) to maize dna be demonstrated. in one case p-35s sequences were linked to a non-identified sequence. this can be a consequence of the use of ipcr, in which the essential ligation step always bears the risk of fusing (restriction) fragments that were not naturally contiguous in the sample dna. in this case the authors did not perform the necessary pcr control experiment using primers from p-35s and the unknown sequence to show that these sequences are in fact contiguously present in the 'criollo' genome. the fact that the p-35s specific primers used by the authors had considerable similarity to retroviral sequences explains the formation of pcr products from 'criollo' dna under conditions in which the hybridization stringency is not sufficiently controlled. five of the eight amplified sequences in fact gave matches with retroviral or retrotransposon elements. the claim of the authors that two of their sequences were related to sequences present in the transgenic construct of novartis event bt11 corn was disproven by careful analysis of the sequence and its target. the low amounts of p-35s sequences detected in the 'criollo' dna preparation can easily be explained by contamination of the samples with cmv. if the samples had been tested for other cmv sequences, these would probably have been present in amounts equivalent to the p-35s sequence.
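the search for putative primer binding sites in point 4 was done by the authors of the letter with pairwise blast alignments. as a simplified stand-in, and only under that stated assumption, the sketch below scans a template on both strands for windows that match a primer with a limited number of mismatches. the function names, the mismatch threshold and the toy sequences are hypothetical; the real primer and accession sequences are not reproduced here.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def approximate_sites(template, primer, max_mismatches=2):
    """List (position, mismatches) for windows of template close to primer.

    Only substitutions are counted (Hamming distance); gaps are not modelled,
    so this is cruder than a BLAST alignment.
    """
    template = template.upper()
    primer = primer.upper()
    hits = []
    for i in range(len(template) - len(primer) + 1):
        window = template[i:i + len(primer)]
        mismatches = sum(x != y for x, y in zip(window, primer))
        if mismatches <= max_mismatches:
            hits.append((i, mismatches))
    return hits


def sites_on_both_strands(template, primer, max_mismatches=2):
    """Search the given strand and its reverse complement."""
    return {
        "forward": approximate_sites(template, primer, max_mismatches),
        "reverse": approximate_sites(reverse_complement(template), primer, max_mismatches),
    }


if __name__ == "__main__":
    # Hypothetical template and primer, for illustration only.
    print(sites_on_both_strands("TTACGTTCGGTTAA", "ACGTACGG", max_mismatches=1))
```

two such hits on opposite strands at a plausible distance would be consistent with ordinary pcr from a continuous genomic region, which is the interpretation the letter puts forward.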
achter bericht nach inkrafttreten des gentechnikgesetzes (gentg) für den zeitraum vom 1.1.1997 bis 31.12 gentechnisch veränderte pflanzen der elfter bericht nach inkrafttreten des gentechnikgesetzes (gentg) für den zeitraum vom 1.1.2000 bis 31 yellow head complex viruses: transmission cycles and topographical distribution in the asia-pacific region observation of measles virus cell-to-cell spread in astrocytoma cells by using a green fluorescent protein-expressing recombinant virus neunter bericht nach inkrafttreten des gentechnikgesetzes (gentg) für den zeitraum vom 1.1.1998 bis 31 transgenic dna introgressed into traditional maize landraces in oaxaca, mexiko further tests at cimmyt find no presence of promoter associated with transgenes in mexican landraces in gene bank or from recent field collections no credible scientific evidence is presented to support claims that transgenic dna was introgressed into traditional maize landraces in oaxaca doubts linger over mexican corn analysis transgenic dna introgressed into traditional maize landraces in oaxaca, mexico a method of detecting recombinant dnas from four lines of genetically modified maize fig.2 .these two sequences thus incorrectly associated with maize adh1 gene were a strong argument for the authors that a transgenic construct was identified in the 'criollo' because adh1 sequences are in fact present in the transgenic construct of the bt11 event of novartis. however, in the bt11 construct the adh1-related sequences are the introns ivs6 and ivs2 of the adh1 gene and are located downstream of p-35s (2). 
different from the bt11 construct the so called adh1 sequences in af434755 and af434756 are located upstream of p-35s (see previous section).thus,the adh1 hits of two ipcr products presented by the authors as evidence for "the integrity as an unaltered construct" retained in the 'criollo' genomes are wrong in two ways: (i) the sequences are not from the adh1 gene and (ii) they are located on the wrong side of p-35s. the sequence af434758 (a2 sample) was denoted as zea mays alpha zein gene, although the matching region in genbank sequence af031569 is not an alpha zein gene. instead, the target sequence is part of a region denoted as "similar to retrovirus-related pol polyprotein sequence".similarly, our blast search also identified the previously discussed "adh1" sequences of af434754 and af434755 as being highly similar to a putative gag-pol precurser, e.g. in the genbank sequence af464738. in case of the sequence af434757 (a3 sample) the similarity (bit score 44) with the dull1 gene could not be reproduced. a match of only 14 identical nucleotides (bit score 28) was obtained when using decreased stringency parameters in a pairwise alignment with "blast 2 sequences". in summary, five of the eight ipcr sequences are retro element sequences, the other three are not (af434760,-61) or not closely (af434757) related to any known maize dna sekey: cord-005060-n901y2d4 authors: zhang, feiyun; toriyama, shigemitsu; takahashi, mami title: complete nucleotide sequence of ryegrass mottle virus : a new species of the genus sobemovirus date: 2001 journal: j doi: 10.1007/pl00012989 sha: doc_id: 5060 cord_uid: n901y2d4 the genome of ryegrass mottle virus (rgmov) comprises 4210 nucleotides. the genomic rna contains four open reading frames (orfs). the largest orf 2 encodes a polyprotein of 947 amino acids (103.6 kda), which codes for a serine protease and an rna-dependent rna polymerase. the viral coat protein is encoded on orf 4 present at the 3′-proximal region. 
other orfs 1 and 3 encode the predicted 14.6 kda and 19.8 kda proteins of unknown function. the consensus signal for frameshifting, heptanucleotide uuuaaac and a stem-loop structure just downstream is in front of the aug codon of orf 3. analysis of the in vitro translation products of rgmov rna suggests that the 68 kda protein may represent a fusion protein of orf 2-orf 3 produced by frameshifting. the protease region of the polyprotein and coat protein have a low similarity with that of the sobemoviruses (approximately 25% amino acid identity), while the rna-dependent rna polymerase region has particularly strong similarity (54 to 60% of more than 350 amino acid residues). the sequence similarities of rgmov to the sobemoviruses, together with the characteristic genome organization indicate that rgmov is a new species of the genus sobemovirus. ryegrass mottle virus (rgmov) was first isolated from stunted italian ryegrass (lolium multiflorum) and cocksfoot (dactylis glomerata) having mottling and necrotic symptoms on leaves"). the isometric particle, 28 nm in diameter contains a species of single-stranded rna with a molecular weight of 1.5 x lo6. the physical properties and some biological ones of the virus are similar to cocksfoot mottle virus (cfmv), which is prevalent in cocksfoot pastures in japanlg). however, rgmov is serologically distinct from cfmv, cocksfoot mild mosaic virus, cynosurus mottle virus and phleum mottle virus, which occur in european countries2'). last year, an isometric virus, isolated from italian ryegrass in germany was found to be serologically related to rgmov ; in agar gel double diffusion tests, a spur formed between rgmov and the germany isolate (frank rabenstein, germany ; personal communication). in spite of serological differences between rgmov and sobemoviruses20), general properties of rgmov are similar to those of grass viruses that belong to sobemo-viruses4). 
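as a small illustration of how the frameshift signal mentioned above can be located, the sketch below scans an rna or cdna sequence for the heptanucleotide uuuaaac and reports each position together with its offset relative to frame 0. the function name and the toy sequence are assumptions for illustration; the rgmov genome sequence itself is not reproduced, and the downstream stem-loop is not checked.

```python
SLIPPERY_HEPTANUCLEOTIDE = "UUUAAAC"


def find_slippery_sites(sequence, motif=SLIPPERY_HEPTANUCLEOTIDE):
    """Return (position, frame) for every occurrence of motif in sequence.

    position is 0-based; frame is position % 3, i.e. the offset relative to a
    reading frame that starts at the first nucleotide. cDNA input is accepted
    because T is converted to U before searching.
    """
    rna = sequence.upper().replace("T", "U")
    motif = motif.upper().replace("T", "U")
    return [
        (i, i % 3)
        for i in range(len(rna) - len(motif) + 1)
        if rna.startswith(motif, i)
    ]


if __name__ == "__main__":
    # Hypothetical fragment, not taken from the RGMoV genome.
    print(find_slippery_sites("augcuuuaaacggaug"))
```

comparing the reported frame with the frames of the flanking orfs indicates whether a -1 shift at that site would move ribosomes from one orf into the other.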
the genome sequences of several sobemoviruses have been determined: southern bean mosaic virus (sbmv) 12,24), cfmv 8,15), rice yellow mottle virus (rymv)") and lucerne transient streak virus (ltsv, accession number u31286). the genomic rna of sobemoviruses is a single-stranded molecule, approximately 4100 to 4500 nucleotides (nt) in size. the 5' terminus has a genome-linked viral protein (vpg) and the 3' end does not have a poly(a) tail. the genome encodes four orfs; the largest orf encodes a polyprotein of approximately 100 kda, which contains protease and rna polymerase motifs. only in cfmv is the polyprotein encoded by two smaller overlapping orfs and expressed by -1 frameshifting 7). recently, we determined the complete nucleotide sequence of the japanese isolate of cfmv (cfmv/jp) (zhang and toriyama, unpublished data; accession number ab040447). the nucleotide sequence is 96.8% identical to that of the norwegian isolate of cfmv (cfmv/no)') and 95.8% identical to that of the russian isolate'~). its genome organization is identical to that of cfmv. so far, the genome sequences of rgmov and the german isolate have not been determined, so their genus is still unknown. in this paper, we report the complete nucleotide sequence of rgmov and compare it to those of the sobemoviruses. ryegrass mottle virus (rgmov) was propagated in barley plants (cv. shunsei) and purified as described previously 20). a purified preparation of cfmv/jp was stored at -80°c 19) and used for the in vitro translation experiment. in a preliminary experiment, we found that rgmov rna does not have a poly(a) tail at the 3'-terminus. thus, we determined the 3'-terminal sequence by two-dimensional mobility shift analysis as described previously'~). in this experiment, the homomix (alkaline-digested yeast rna mixture) was prepared by using rna from torula utilis, a product of fluka (riedel-de haen; seelze, germany). 
terminal sequence of viral rna. the 5'-terminal sequence of rgmov rna was identified by sequencing the pcr clones amplified by using the 5' race abridged anchor primer system (gibco brl, gaithersburg, usa). cloning and dna sequence. cdna synthesis was done as described previously2') using m-mlv reverse transcriptase (gibco brl), random hexanucleotide primer and a synthetic oligonucleotide primer (pl), 5'-actagtcgacacgaaaacccc-3'; the underlined sequence at the 3' end was determined by two-dimensional sequence analysis. the synthesized second-strand cdna was blunt-ended with t4 dna polymerase and ligated into smai-digested puc18. recombinant plasmids were transformed into competent escherichia coli dh5α (toyobo, osaka, japan). the cdna clones shown in fig. 1 were made by primer extension and pcr amplification and used for sequencing of rgmov rna. ambiguous nucleotide sequences were confirmed by using pcr clones prepared independently (not shown in fig. 1). nucleotide sequences were determined using the pharmacia dna sequencing kit and an alfred dna sequencer (pharmacia, uppsala, sweden). the sequence data were assembled and analyzed using the dnasis (macintosh) program (hitachi software engineering co., yokohama, japan). the genbank/embl, nbrf and pir databases were searched for nucleic acid and amino acid sequence identity. cell-free translation. in vitro translation using wheat germ extract (promega, madison, usa) was performed as described in the manufacturer's manual in a final volume of 50 µl in the presence of redivue l-[35s] methionine (amersham pharmacia biotech, buckinghamshire, uk) for 1 hr at 25°c. translation products were separated by sds-page (10% polyacrylamide) and detected using a molecular imager system (biorad, richmond, usa). a set of prestained sds-page standards (biorad) was used as protein size markers. 
purified rgmov was electrophoresed on 10% polyacrylamide-sds gels and electro-blotted onto a pvdf membrane (immobilon-psq; millipore, middlesex, uk). the portion of the pvdf membrane corresponding to the coat protein was excised, and the n-terminal sequence of the coat protein was analyzed using a gas-phase protein sequencer (model 477a/120a, applied biosystems, foster city, usa). [table 1 footnotes: a) references of sequence data: sbmv (m23021), ltsv (u31286), rymv (l20893) and cfmv (z48630). b) the percentage values indicate the identity over the stretch of amino acid residues indicated in parentheses. c) this similarity was found between the n-terminal region of the 56.3 k orf of cfmv (refer to fig. 2).] nucleotide sequence and genome organization. the complete nucleotide sequence of rgmov comprises 4210 nt with a base composition of 24.3% a, 22.2% u, 25.2% c and 28.3% g. the g + c content is 53.5%. the sequence contains four major orfs flanked by 5'- and 3'-untranslated sequences of 99 and 198 nt, respectively. database searches indicated that the genome sequence of rgmov is significantly similar to those of the sobemoviruses, for which the genome organization is summarized in fig. 2. as shown in fig. 2, the largest orf, orf 2, extends from nucleotide 643 to 3486. the predicted 103.6 kda protein consists of 947 amino acids. database searches revealed a significant similarity to the polyproteins of sobemoviruses: sbmv (accession number m23021) 24), rymv (l20893)"), cfmv (z48630) 8) and ltsv (u31286). the polyprotein of rgmov contains serine and p3c proteases~,~~) and an rna-dependent rna polymerase'~) (fig. 3). a conserved sequence, gxpxfdpxyg*), is found in the n-terminal region (amino acid residues 70 to 90) of the 103.6 kda polyprotein. the protease motifs appear immediately downstream of the conserved sequence: the serine protease in amino acids 148 to 220 from the n-terminus and the p3c protease in amino acids 272 to 300 (fig. 3). 
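the figures quoted above for base composition and orf 2 can be cross-checked with a few lines of arithmetic. the following is a minimal sketch rather than part of the original analysis; it assumes that the reported orf 2 coordinates (643 to 3486) are 1-based, inclusive and include the stop codon, which is an assumption made here for illustration.

```python
# base composition of rgmov genomic rna as reported in the text (percent)
composition = {"a": 24.3, "u": 22.2, "c": 25.2, "g": 28.3}

# the four percentages should account for the whole genome
total = round(sum(composition.values()), 1)

# g + c content is the sum of the g and c percentages
gc_content = round(composition["g"] + composition["c"], 1)

# orf 2 coordinates as reported; assumed 1-based, inclusive, stop codon included
orf2_start, orf2_end = 643, 3486
orf2_length_nt = orf2_end - orf2_start + 1
orf2_codons = orf2_length_nt // 3
orf2_amino_acids = orf2_codons - 1  # subtract the assumed stop codon

print(total, gc_content, orf2_length_nt, orf2_amino_acids)
```

under these assumptions the span is a whole number of codons and the derived protein length agrees with the 947 amino acids stated for the polyprotein.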
the serine protease motif is well conserved between rgmov, sobemoviruses and poliovirus~,~~). in addition, the p3c protease motif, ...xgxs*/c*gxxxxxxxxgxxxxgxh*... (catalytic amino acid residues are marked with asterisks), is present just downstream. however, instead of serine (s*) or cysteine (c*), alanine is found in rgmov. thus, it is uncertain whether the p3c protease domain is catalytic in rgmov or not. the rna-dependent rna polymerase is encoded near the c-terminal region of the polyprotein. this region showed very strong similarity, 54 to 60% identity over a 350 amino acid stretch (table 1). the rna polymerase motifs 13) are distributed between amino acids 680 and 810. the sequence of this domain is particularly well conserved, with approximately 75% identity between rgmov and sobemoviruses. database searches also showed that the sequence of the rgmov polymerase is highly similar to those of the rna polymerases of beet mild yellowing virus (s65829), cucurbit aphid-borne yellowing virus (x76931), potato leaf roll virus (x74789) and barley yellow dwarf virus (l25299) of the family luteoviridae. the similarity is approximately 50% identity over a 240 amino acid stretch, suggesting an evolutionarily close relationship between rgmov, sobemoviruses and luteoviruses (subgroup ii) 10). van der wilk et al.") found that the vpg of sbmv is encoded by orf 2, downstream of the protease domain and in front of the rna polymerase. we compared the amino acid sequence of the vpg region of sbmv orf 2 with that of the corresponding region of rgmov orf 2. the search revealed no significant similarity. sequence diversity in the vpg region 22) is also seen between sbmv, cfmv and rymv. however, the conserved sequence, wag followed by an e/d-rich sequence, is detected in the region, and putative e/s cleavage sites are present on both sides of the region; proteolytic cleavage would result in a protein of 9 kda. 
possibly, the vpg of rgmov is located between the protease and the rna-dependent rna polymerase domains, in the same order as in sbmv orf 2 22) (fig. 3). rgmov orf 3 lies completely within orf 2. the predicted 19.8 kda protein has distinct similarity (40% identity) to the products of the corresponding orfs of sbmv and ltsv. however, it is unknown whether the 19.8 kda protein is independently translated in vivo, because orf 3 may be expressed as a fusion protein, as will be discussed. orf 4 encodes a 25.6 kda coat protein of 198 amino acids. the 16 amino acid sequence of the n-terminus of the viral coat protein was identical to that deduced from the orf 4 nucleotide sequence (data not shown). sequence similarity searches indicated that the rgmov coat protein has weak but significant similarity (24 to 27% identity) to those of sbmv, ltsv and rymv, but only 15% identity with that of cfmv (table 1). in the wheat germ extract system, rgmov rna directs the synthesis of two products of 103 kda and 68 kda, but no other distinct product was detected. in contrast, the translational products synthesized in vitro with cfmv/jp rna are four major proteins with sizes almost identical to those previously reported for cfmv/no 18) (fig. 4). the translational activity of cfmv/jp rna was low in our present system, as reported for other sobemoviruses 16). rgmov rna is a poorer message in our wheat germ extract system. the largest product of rgmov rna was 103 kda and seems to be derived from the largest orf, orf 2, which encodes the polyprotein. in the rgmov rna sequence, no orf corresponds to the second largest product of 68 kda. the putative replicase of cfmv is translated as part of a single polyprotein by -1 ribosomal frameshifting between two overlapping orfs having a coding capacity for 60.9 kda and 56.3 kda proteins 7,8). translational frameshifts are known in coronavirus ibv 2), in the polymerase genes of retroviruses~'~) and in plant viruses7+'). 
as consensus signals for frameshifting, a heptanucleotide sequence (e.g., uuuaaac) and a stem-loop structure immediately downstream have been proposed by jacks et al. 5). as found in cfmv, sbmv and rymv 7), identical signals are found in rgmov rna just preceding the initiation codon of orf 3 (fig. 5). tamm et al. 18) proposed that the 70 kda in vitro translation product of sbmv and rymv rnas may represent the orf 2-orf 3 transframe fusion protein. thus, the 68 kda translational product of rgmov rna is probably derived from -1 ribosomal frameshifting (fig. 3), not from proteolytic cleavage of the polyprotein~~). in this experiment, we tried to detect the rgmov coat protein in the in vitro translation products by immunoprecipitation. however, we could not detect any signal for the coat protein. the coat protein of sbmv is translated only from a smaller, subgenomic rna, which is detected in virus-infected tissues as well as in virus particles'^). as smaller rnas were not detectable in our rgmov rna preparation, the amount of subgenomic rna, if any, may have been insufficient for the detection of in vitro translated coat protein. we conclude that rgmov is a member of the genus sobemovirus based on sequence similarities. the levels of nucleic acid (approximately 50% identity) and protein (table 1) similarity are low enough to demarcate rgmov from every other sobemovirus species, whereas the genome organization of rgmov is closely related to that of the sobemoviruses. the biological and serological properties of rgmov are distinct from those of other characterized grass viruses 20). thus, rgmov is a unique species of the genus sobemovirus~,~~). the polyprotein gene organization of rgmov is the same as that of sbmv, rymv and ltsv, but different from that of cfmv, in which the polyprotein is produced as a single fusion protein by frameshifting between two orfs 7). 
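the first half of the frameshifting signal discussed above, the slippery heptanucleotide, is simple to locate computationally. the sketch below is illustrative only: the function name and the toy sequence are hypothetical and not taken from rgmov, and the code does not attempt to predict the downstream stem-loop, which would require rna secondary-structure analysis.

```python
def find_slippery_sites(rna, heptamer="uuuaaac"):
    """return 0-based start positions of every occurrence of the heptamer.

    the input may be given as rna or dna in either case; t is treated as u.
    overlapping occurrences are reported.
    """
    rna = rna.lower().replace("t", "u")
    heptamer = heptamer.lower().replace("t", "u")
    positions = []
    start = rna.find(heptamer)
    while start != -1:
        positions.append(start)
        start = rna.find(heptamer, start + 1)
    return positions


# hypothetical toy sequence used only to exercise the function
toy = "ggcauuuaaacgggcccauggcgcccaug"
print(find_slippery_sites(toy))
```

a candidate site found this way would still need to be checked for a stem-loop immediately downstream and for its position relative to the overlapping orfs before it could be called a frameshifting signal.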
expression of rice yellow mottle virus p1 protein in vitro and in vivo and its involvement in virus spread an efficient ribosomal frame-shifting signal in the polymerase-encoding region of the coronavirus ibv sobemovirus genome appears to encode a serine protease related to cysteine proteases of picornaviruses genus sobemovirus signals for ribosomal frameshifting in the rous sarcoma virus gag-pol region characterization of ribosomal frameshift in hiv-1 gag-pol expression the putative replicase of the cocksfoot mottle sobemovirus is translated as a part of the polyprotein by -1 ribosomal frameshift sequence and organization of barley yellow dwarf virus genomic rna luteovirus gene expression genome characterization of rice yellow mottle virus rna nucleotide sequence of the bean strain of southern bean mosaic virus identification of four conserved motifs among the rna-dependent polymerases encoding elements messenger rna for the coat protein of southern bean mosaic virus nucleotide sequence of rna from the sobemovirus found in infected cocksfoot shows a luteovirus-like arrangement of the putative replicase and protease genes translation of southern bean mosaic virus rna in wheat embryo and rabbit reticulocyte extracts complementarity between the 5'-and 3'-terminal sequences of rice stripe virus rnas identification of genes encoding for the cocksfoot mottle virus proteins cocksfoot mottle virus in japan ryegrass mottle virus, a new virus from lolium multiflorum in japan nucleotide sequence of rna 1, the largest genomic segment of rice stripe virus, the prototype of the tenuivirus the genome-linked protein (vpg) of southern bean mosaic virus is encoded by the orf2 guidelines to the demarcation of virus species sequence and organization of southern bean mosaic virus genomic rna evolution of rna viruses the nucleotide sequence data reported in this paper have been submitted to ddbj, embl and genbank under the accession number ab040446. 
national institute of agro-environmental sciences, tsukuba 305-8604, japan present address : tokyo university of agriculture and technology, united graduate school of agriculture, fuchu 183-8509, japan present address : tokyo university of agriculture, sakuragaoka 1, setagaya-ku, tokyo 156-8502, japan we wish to thank the late professor dr. d. hosokawa, tokyo university of agriculture and technology, for his encouragement and dr. t. teraoka for his help with the amino acid sequence analysis. key: cord-001340-kqcx7lrq authors: ladner, jason t.; beitzel, brett; chain, patrick s. g.; davenport, matthew g.; donaldson, eric; frieman, matthew; kugelman, jeffrey; kuhn, jens h.; o’rear, jules; sabeti, pardis c.; wentworth, david e.; wiley, michael r.; yu, guo-yun; sozhamannan, shanmuga; bradburne, christopher; palacios, gustavo title: standards for sequencing viral genomes in the era of high-throughput sequencing date: 2014-06-17 journal: mbio doi: 10.1128/mbio.01360-14 sha: doc_id: 1340 cord_uid: kqcx7lrq thanks to high-throughput sequencing technologies, genome sequencing has become a common component in nearly all aspects of viral research; thus, we are experiencing an explosion in both the number of available genome sequences and the number of institutions producing such data. however, there are currently no common standards used to convey the quality, and therefore utility, of these various genome sequences. here, we propose five “standard” categories that encompass all stages of viral genome finishing, and we define them using simple criteria that are agnostic to the technology used for sequencing. we also provide genome finishing recommendations for various downstream applications, keeping in mind the cost-benefit trade-offs associated with different levels of finishing. our goal is to define a common vocabulary that will allow comparison of genome quality across different research groups, sequencing platforms, and assembly techniques. 
viruses represent the greatest source of biological diversity on earth, and with the help of high-throughput (ht) sequencing technologies, great strides are being made toward the genomic characterization of this diversity (1-3). genome sequences play a critical role in our understanding of viral evolution, disease epidemiology, surveillance, diagnosis, and countermeasure development and thus represent valuable resources which must be properly documented and curated to ensure future utility. here, we outline a set of viral genome quality standards, similar in concept to those proposed for large dna genomes (4) but focused on the particular challenges of and needs for research on small rna/dna viruses, including characterization of the genomic diversity inherent in all viral samples/populations. our goal is to define a common vocabulary that will allow comparison of genome quality across different research groups, sequencing platforms, and assembly techniques. despite the small sizes of viral genomes, complications related to limited rna quantities, host "contamination," and secondary structure mean that it is often not time- or cost-effective to finish every genome, and given the intended use, finishing may be unnecessary (5). therefore, we have used technology-agnostic criteria to define five standard categories designed to encompass the levels of completeness most often encountered in viral sequencing projects. each viral family/species comes with its own challenges (e.g., secondary structure and gc content); therefore, we provide only loose guidance on the depth of sequence coverage likely required to obtain different levels of finishing. in reality, a similar amount of data will generate genomes with different levels of finishing for different viruses. to alleviate any reliance on particular aspects of the different sequencing technologies, we have made two assumptions that should be valid in most viral sequencing projects. 
the first assumption is a basic understanding of the genomic structure of the virus being sequenced, including the expected size of the genome, the number of segments, and the number and distribution of major open reading frames (orfs). fortunately, genome structure is highly conserved within viral groups (6), and although new viruses are constantly being uncovered, the discovery of a novel family or even genus remains relatively uncommon (7). in the absence of such information, the defined standards can still be applied following further analysis to determine genome structure. the second assumption is that the genetic material of the virus being described can be accurately separated from the genomes of the host and/or other microbes, either physically or bioinformatically. depending on the technology used, it is critical that the potential for cross-contamination of samples during the sample indexing/bar coding process and sequencing procedure be addressed with appropriate internal controls and procedural methods (8). for a summary of the proposed categories for whole-genome sequencing of viruses, see fig. 1 and table 1. the "standard draft" (sd) category is for whole shotgun genome assemblies with coverage that is low and/or uneven enough to prevent the assembly of a single contig for ≥1 genome segments. genomes in this category are likely to result from samples with low viral titers, such as clinical and environmental samples, or to be those containing regions that are difficult to sequence across (e.g., intergenic hairpin regions) (9). to distinguish standard drafts from targeted amplification of partial viral sequences, standard drafts should contain at least 1 contig for each genomic segment and should be prepared in a manner that allows the possibility of sequencing the vast majority of a virus's genome. to avoid the inclusion of small pieces of genomes as "drafts," there needs to be some type of minimum cutoff for breadth of coverage. 
therefore, we suggest that at least a majority (≥50%) of the genome be present for a set of sequences to be considered a draft genome. high quality (hq). genomes should be considered high quality if no gaps remain (i.e., a single contig per genome/segment), even if one or more orfs remain incomplete due to missing sequence at the ends of segments. an hq genome can often be achieved with modest levels of ht sequencing coverage (~15 to 30×) or through sanger-mediated gap resolution of an sd. coding complete (cc). the "coding complete" category indicates that in addition to the lack of gaps, all orfs are complete. this level of completion is typically possible with high levels of ht sequencing coverage (>100×) or may require the use of conserved pcr primers targeting the ends of the segments. complete. a genome is complete when the genome sequence has been fully resolved, including all non-protein-coding sequences at the ends of the segment(s). this is typically achieved through rapid amplification of cdna ends (race) or similar procedures. finished. this final category represents a special instance in which, in addition to having a completed consensus genome sequence, there has been a population-level characterization of genomic diversity. typically this requires ~400 to 1,000× coverage (see below). this provides the most complete picture of a viral population; however, this designation will apply only for a single stock. additional characterizations will be necessary for future passages. population-level characterization. ht sequencing technologies provide powerful platforms for investigating the genetic diversity within viral populations, which is integral to our understanding of viral evolution and pathogenesis (10, 11). population-level characterization requires very high levels of ht sequencing coverage (12, 13); however, the exact level will depend on the background error profiles of the sequencing technology and the desired level of sensitivity. 
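the five categories defined above nest inside one another, so they can be read as a short decision cascade. the following sketch is one possible encoding, not something proposed in the paper: the function name, its parameters and the "below standard draft" label are hypothetical, and each input is assumed to have been determined beforehand by the assembler and annotator.

```python
def classify_genome(fraction_covered, contigs_per_segment, orfs_complete=False,
                    ends_resolved=False, population_characterized=False):
    """assign a finishing category from the criteria described in the text.

    fraction_covered: fraction of the expected genome present (0.0 to 1.0)
    contigs_per_segment: list with the number of contigs for each segment
    """
    # a draft needs at least one contig per segment and at least 50% breadth
    if (not contigs_per_segment or min(contigs_per_segment) < 1
            or fraction_covered < 0.5):
        return "below standard draft"
    # any segment split over several contigs means gaps remain
    if max(contigs_per_segment) > 1:
        return "standard draft"
    # from here on every segment is a single contig
    if not orfs_complete:
        return "high quality"
    if not ends_resolved:
        return "coding complete"
    if not population_characterized:
        return "complete"
    return "finished"


print(classify_genome(0.8, [2, 1]))
print(classify_genome(0.99, [1, 1, 1], orfs_complete=True))
```

the cascade deliberately ignores coverage depth, because the text ties each category to the state of the assembly rather than to a particular amount of sequencing.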
as an example, wang et al. (12) determined that for pyrosequencing data, ~400× coverage is necessary to identify minor variants present at 1% frequency with 99.999% confidence, and ~1,000× coverage is needed for variants with a frequency of 0.5%. targeted amplification of the viral genome is often necessary to achieve these coverage requirements. due to the modest sequence lengths of most ht technologies, the state of the art for population-level analysis has been the characterization of unphased polymorphisms. however, single-molecule technologies, with maximum read lengths of >20 kb, are opening the door for complete genome haplotype phasing (14). identification of contaminants or adventitious agents. after isolation, viruses are often maintained as stocks, which are propagated within host cells in tissue culture and thus amplified and preserved for future use. despite careful laboratory practices, it is possible for these stocks to become contaminated with additional microbes. contaminating microbes are often detrimental to subsequent applications such as vaccine development or the testing of therapeutics, making it imperative to monitor the purity of viral stocks. ht sequencing provides a powerful method not only for detecting the presence of contaminants within a sample but also for identifying and characterizing any contaminants. the level of sequencing required for contamination analysis is dependent on the desired sensitivity, with more sequencing required to ensure detection of contaminants present at very low levels. for most approaches, hq-level sequencing should be sufficient. depending on the intended applications, analysis may need to be repeated after further passaging to ensure that no additional contaminants have been introduced. description of novel viruses. despite the rapidly growing collection of viral sequences, the description of novel viruses is likely to remain an important aspect of viral genome sequencing (7, 15, 16). 
this is true in part because viruses evolve rapidly and are capable of recombining to form novel genotypes (17, 18). it is also true that most of the viruses that are currently circulating remain uncharacterized (15). particularly lacking are representatives from groups that are not currently known to infect humans or organisms of economic importance. it would be imprudent, however, to continue to ignore these uncharacterized reservoirs of diversity, because it is difficult to predict the source of future emerging diseases (19-21). additionally, with the current suite of primarily sequence similarity-based pathogen identification tools, the ability to detect novel pathogens is wholly dependent on high-quality reference databases (22). there is a trend toward requiring a complete genome sequence when a description of a novel virus is being published, and we agree that this is a good goal; however, the amount of time and resources required to complete the last 1 to 2% of a viral genome is often cost- and time-prohibitive for projects sequencing a large number of samples, and in most cases the very ends of the segments are not essential for proper identification and characterization. therefore, for the majority of viral characterization projects, we recommend, at a minimum, a cc genome. this will ensure a complete description of the viral proteome and will allow accurate phylogenetic placement. molecular epidemiology. one of the most common and important applications for viral genomes is in the study of viral epidemiology, which encompasses our understanding of the patterns, causes, and effects of disease. early studies of molecular epidemiology targeted small pieces of viral genomes; however, this type of analysis is likely to miss important changes elsewhere in the genome. therefore, there has been a strong focus in recent years toward the sequencing of "full" viral genomes. institutes such as the broad institute and the j. 
craig venter institute (jcvi) have been instrumental in breaking ground in the collection of large numbers of good-quality viral sequences. their newly identified genomes typically fall within our cc category. this is likely to remain the gold standard for studies involving a large number of genome sequences, especially when some samples come from low-titer clinical samples, often necessitating amplicon-based sequencing methods. cc genomes allow for interrogation of changes throughout the coding portion of the viral genome and often include partial noncoding regions. in the absence of high-throughput race alternatives, the time and resources required to complete hundreds or thousands of genomes are likely to continue to outweigh the potential information gained from completing the terminal sequences. countermeasure development. advancements in our capabilities to sequence viral genomes are changing the way we counteract global pandemics and acts of bioterrorism. there are two important aspects of countermeasure development that can benefit strongly from the availability of genome sequences and ht sequencing data: the detection of the infectious agent and the treatment of the disease caused by the agent. taxonomic classification and detection through dna/rna-based inclusivity assays (i.e., using techniques such as pcr to detect the presence of a pathogen) can be designed using fragmented and incomplete genomes (e.g., sd and hq sequences). fully resolved orfs (cc) further enable the development of immunological assays, such as enzyme-linked immunosorbent assays (elisa) and immunofluorescence assays (ifa), for protein-based detection, and obtaining a complete genome opens the door to a plethora of additional downstream applications, including the design of exclusivity tests, the establishment of reverse genetics systems, and the design of robust forensics protocols. 
however, for effective development and testing of animal models, therapeutics, vaccines, and prophylactics, it is necessary to obtain a complete picture of the variability present within both the challenge stock and postinfection populations, thereby necessitating finished genomes. in these medical applications, it is also important to demonstrate the absence of adventitious agents. in addition to standardizing the vocabulary of viral genome assemblies, it is also critical for researchers to routinely provide raw sequencing reads. without these, it is impossible for others to independently verify the quality of an assembly. data repositories such as genbank already provide a platform for depositing ht sequencing reads, but this is not a requirement for the submission of a genome, nor is this option typically utilized. wider analysis of data will ultimately result in higher-quality assemblies. it is worth considering broader implementation of a wiki-like, crowdsourcing strategy to genome assembly, similar to the annotation strategies that have been adopted for specific genomes of high interest (23, 24) . this approach would allow multiple parties to work on genome assembly and annotation at the same time and would provide instant updates for the entire community to evaluate and utilize in their own research. our primary goal here is to initiate a conversation. the rate at which viral genomes are being sequenced is only going to increase in the coming years, and without some standardization, it will be impossible for these valuable resources to be utilized to their full potential. we present these categories as a starting point, with the goal of adjusting and refining them over time as our capabilities and needs continue to change. crystal ball. 
the viriosphere: the greatest biological diversity on earth and driver of global processes metagenomic analysis of coastal rna virus communities the search for meaning in virus discovery genome project standards in a new era of sequencing next generation sequencing of viral rna genomes 2012. virus taxonomy. ninth report of the international committee on taxonomy of viruses human viruses: discovery and emergence double indexing overcomes inaccuracies in multiplex sequencing on the illumina platform rescue of the prototypic arenavirus lcmv entirely from plasmid viruses as quasispecies: biological implications quasispecies diversity determines pathogenesis through cooperative interactions in a viral population characterization of mutation spectra with ultra-deep pyrosequencing: application to hiv-1 drug resistance highly sensitive and specific detection of rare variants in mixed viral populations from massively parallel sequence data the advantages of smrt sequencing a strategy to estimate unknown viral diversity in mammals the changing face of pathogen discovery and surveillance the evolution of epidemic influenza characterization of the candiru antigenic complex (bunyaviridae: phlebovirus), a highly diverse and reassorting group of viruses affecting humans in tropical america isolation and characterization of viruses related to the sars coronavirus from animals in southern china the emerging novel middle east respiratory syndrome coronavirus: the "knowns" and "unknowns relationship between domestic and wild birds in live poultry market and a novel human h7n9 virus in china computational tools for viral metagenomics and their application in clinical research web apollo: a web-based genomic annotation editing platform pseudomonas genome database: improved comparative analysis and population genomics capability for pseudomonas genomes key: cord-012975-u87ol3fs authors: ogiwara, atsushi; uchiyama, ikuo; seto, yasuhiko; kanehisa, minoru title: construction of a 
dictionary of sequence motifs that characterize groups of related proteins date: 1992-09-17 journal: protein eng doi: 10.1093/protein/5.6.479 sha: doc_id: 12975 cord_uid: u87ol3fs an automatic procedure is proposed to identify, from the protein sequence database, conserved amino acid patterns (or sequence motifs) that are exclusive to a group of functionally related proteins. this procedure is applied to the pir database and a dictionary of sequence motifs that relate to specific superfamilies constructed. the motifs have a practical relevance in identifying the membership of specific superfamilies without the need to perform sequence database searches in 20% of newly determined sequences. the sequence motifs identified represent functionally important sites on protein molecules. when multiple blocks exist in a single motif they are often close together in the 3-d structure. furthermore, occasionally these motif blocks were found to be split by introns when the correlation with exon structures was examined. when the amino acid sequences of two proteins are similar, they probably belong to the same group of functionally related proteins. thus, when a new protein sequence is determined, it is customary to perform a database search for similar sequences in the hope of obtaining a clue to its biological function. the search involves pairwise comparisons against individual sequences in the database. this is becoming more time-consuming with the rapid growth in database size. an alternative approach is to search a library of signature patterns, each of which uniquely identifies a group of related proteins. whether all protein groups can be represented by such diagnostic patterns is arguable, but this approach is certainly more effective because the comparison is made against individual groups rather than individual sequences in the database. it is common knowledge that functionally important sites are well conserved in the amino acid sequences of related proteins. 
conserved regions are not necessarily contiguous in the primary structure, because a functional site in the 3-d structure can be composed of separate pieces of conserved segments. the conserved amino acid patterns, often called consensus patterns or sequence motifs (taylor, 1988; hodgman, 1989) , are usually identified by the tedious method of multiple aligning and comparing a group of functionally related sequences. these published motifs are then manually collected, verified and organized in a motif library (bairoch, 1989; seto et al., 1990 ). an additional constraint to the conserved regions is introduced in this study: the uniqueness of amino acid patterns when compared with all other sequences outside the group. this has enabled the design of an automatic procedure to define from the protein sequence database a collection of signature patterns that uniquely identify specific protein groups. this procedure is applied to the superfamily grouping of the pir database and a library of sequence motifs is constructed that identifies specific superfamilies. the amino acid sequences were obtained from the pir database release 26.0 (september 1990) . the pir database is divided into three sections (two sections before release 26.0): pir1, annotated and classified entries; pir2, preliminary entries; and pir3, unverified entries. only the pir1 section is used when constructing a motif library. the releases of 19.0 (december 1988) to 29.0 (june 1991) were also used for comparison purposes. the 3-d coordinates of the protein structures were acquired from the brookhaven protein data bank (april 1991) . functional groups of proteins suppose that a protein sequence database is divided into groups, each containing functionally related members, and that the diagnostic amino acid patterns that uniquely identify the membership to each functional group are required. the pir superfamily classification is used to define a protein group, but there may also be other definitions. 
a superfamily is a group of proteins bearing significant sequence similarity and represents the probable evolutionary relationships of the proteins (dayhoff, 1978; dayhoff et al., 1983). it is not always the case that a protein is uniquely assigned to one superfamily, because it can contain multiple domains with different functions. for simplicity, however, the pir superfamily numbering scheme is used, which assumes that each protein in the database belongs to one, and only one, superfamily. dictionary of unique peptide words. a three-step procedure is employed to identify the sequence motifs. the first step involves an exhaustive search for unique peptide words (upws) which, in our definition, are short oligopeptide patterns that are well conserved and found exclusively in one protein group. a group is usually a single superfamily, but it can be extended to comprise a few superfamilies. in practice, as illustrated in figure 1, we make a tally of all possible tetra-, penta- and hexapeptide patterns in the superfamilies of the pir database. let $n_s$ and $n_t$ be the numbers of sequences containing a given pattern in a given superfamily and in the entire database, respectively. the pattern is unique to this superfamily when $n_s = n_t$. the pattern is conserved when $n_s = n_t \geq f \cdot m$, where $m$ is the number of members belonging to the superfamily and $f$ is the parameter defining the majority. we consider different cases ranging from $f = 1$ (100% conservation) to $f = 0.7$ (70% conservation). although the distinction between 100 and 70% conservation is highly dependent on the superfamily size and variability of its members, the uniqueness is mostly determined by the size and variability of the entire database. figure 1. screening of unique peptide words. this figure shows the numbers of sequences containing given tetrapeptide patterns. the superfamily 95 has 12 member sequences and all contain the pattern qwyw, while all other sequences outside this superfamily do not possess this pattern. 
thus, this pattern is unique to, and conserved in, the superfamily 95. the unique pattern whfv is not 100% conserved in superfamily 96, but this pattern can be detected by setting a lower threshold value for the conservation. in the second step, the order of unique peptide words in each sequence of a given group is examined and a consensus pattern is constructed. as illustrated in figure 2, each amino acid sequence is converted to an abstract structure, which may be called a unique peptide sentence, consisting of the upw pattern numbers and the numbers of residues separating the first residues of two successive upws. one amino acid mutation is allowed when searching for the occurrence of each upw pattern. when the separation is smaller than the length of the preceding upw there is actually an overlap between the two upws, as in patterns 3 and 4 in figure 2. from a set of these sentences, some of which may lack specific upws and some of which may contain duplicates of the same upw, a consensus sentence is constructed. this is a multiple alignment problem and an approximate procedure was devised by combining pairwise alignments. the optimal pairwise alignment can be obtained by the following dynamic programming algorithm, which is similar to the rna secondary structure prediction algorithm (waterman and smith, 1978; kanehisa and goad, 1982): $$s_{i,j} = w(p_i, p_j) + \max_{k<i,\; l<j} \left[ s_{k,l} - \mathrm{g.p.}(i,j,k,l) \right]$$ where $s_{i,j}$ is the score up to the $i$th pattern $p_i$ and the $j$th pattern $p_j$, g.p. is the gap penalty and $w$ is the weight for a match of two patterns. the resulting consensus pattern is represented by the order of upws with the upper- and lower-bound numbers of residues separating two successive upws (figure 2). 
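the first-step screening of unique peptide words described above can be sketched as follows. this is a minimal python illustration, not the authors' program: the function name `unique_peptide_words`, the dictionary-of-lists input and the toy sequences are assumptions made for the example, and the conservation test is taken to be n_s >= f * m, so that f = 1 means every member contains the word. the second and third steps (consensus ordering and substitution patterns) are not modelled.

```python
from collections import defaultdict


def unique_peptide_words(groups, k_values=(4, 5, 6), f=0.8):
    """Return {group label: sorted words} that are unique to, and conserved in, a group.

    groups   -- dict mapping a group (superfamily) label to its member sequences
    k_values -- word lengths to tally (tetra-, penta- and hexapeptides by default)
    f        -- conservation threshold: fraction of members that must contain the word
    """
    # word -> group label -> number of member sequences containing the word
    counts = defaultdict(lambda: defaultdict(int))
    for label, seqs in groups.items():
        for seq in seqs:
            words = {seq[i:i + k] for k in k_values for i in range(len(seq) - k + 1)}
            for word in words:  # a sequence is counted once per word
                counts[word][label] += 1

    result = defaultdict(list)
    for word, per_group in counts.items():
        if len(per_group) != 1:  # seen in another group, so n_s != n_t
            continue
        (label, n_s), = per_group.items()
        if n_s >= f * len(groups[label]):  # conserved: n_s >= f * m
            result[label].append(word)
    return {label: sorted(words) for label, words in result.items()}


# Hypothetical toy data; the labels only echo the superfamily numbers in the text.
toy_groups = {
    "sf95": ["AQWYWK", "CQWYWD"],
    "sf96": ["GWHFVK", "GWHFVR", "MMMMMM"],
}
strict = unique_peptide_words(toy_groups, k_values=(4,), f=1.0)
relaxed = unique_peptide_words(toy_groups, k_values=(4,), f=0.6)
```

on this toy input the strict setting keeps only qwyw for the first group, while the relaxed threshold also admits the two tetrapeptides shared by two of the three members of the second group, which mirrors the whfv case in the figure caption.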
the consensus pattern obtained in the previous step is represented by the blocks of amino acid patterns, which we call motif blocks, separated by the upper- and lower-bound numbers of residues in the spacer region as follows: <motif block1> [min_spacer, max_spacer] <motif block2>. as shown in figure 3, this consensus is used again in the last step to compare each sequence in the group, to identify substitution patterns and to determine whether each block is conserved in all sequences. in practice, it is first decided whether a particular block exists or not, given the minimum fraction of matched residues, r, that constitute a block. then, all substitution patterns are recorded. in the representation of our motif library, the plus sign designates that the block is conserved in all members of the group, while the minus sign indicates that some members lack the block. substitution patterns are enclosed in braces. figure 2. an illustration of how the sequence motif is constructed from unique peptide words. first, the locations of unique peptide words in a given superfamily are examined for all member sequences. then the consensus ordering of unique peptide words is obtained by a dynamic programming algorithm. the pir1 database release 26.0 contains 7235 sequences, totalling 2 221 416 residues, classified into 2350 superfamilies. only the relatively large superfamilies, those containing at least a set minimum number of member sequences, were considered. when the minimum value for the size of a superfamily in release 26.0 was defined as three or five members, there were 521 or 283 superfamilies respectively. as summarized in table i, our procedure identified sequence motifs that characterized >50% of these superfamilies when the degree of conservation, f, was set at 80 or 70%. the motif library constructed with the minimum superfamily size of five members and f = 80% contained 145 sequence motifs (table i). 
out of the 145 motifs, 35 were characterized by single blocks while the rest contained multiple blocks, as shown in table ii. a complete listing of the 121 motifs containing <10 blocks is shown in the appendix. substitution patterns are obtained when r = 80%. as each new release of the pir1 database is produced, the motif library can be reconstructed by this automatic procedure. however, a long computation time is required because of the calculation of the many hexapeptide patterns in the initial screening of the upws. when the libraries shown in table i were constructed without hexapeptide patterns, ~5% of the superfamilies could not be identified. this was a relatively small loss compared with the gain in computation time. superfamily assignment by sequence motifs. a procedure for superfamily assignment was established utilizing our motif library, as follows: (i) begin the search using the first motif block. the criterion for the existence of a motif block is given by the parameter r, which specifies the minimum fraction of matched residues; (ii) if a motif block is found, check if the next motif block exists after the specified spacer length; and (iii) if a motif block is not found, skip this and continue searching for the next block. the search fails if no motif block is found. in the above procedure a sequence is considered assigned to a superfamily if any of the motif blocks match. no distinction is made between the conserved (+) and nonconserved (-) blocks. table iii(a) shows the results of this procedure when applied to the pir1 database release 26.0, which is the training data set used for constructing the motif library. when the block detection parameter r = 100%, no entries were falsely assigned (false positives), but 140 entries could not be detected (false negatives) as belonging to one of the 145 superfamilies. at the level of r = 80% there were 70 false positives and 79 false negatives. 
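the assignment steps (i)-(iii) above can be sketched in python as follows. this is a hedged illustration under stated assumptions rather than the authors' implementation: the names `find_block` and `match_motif` are hypothetical, motif blocks are plain strings (substitution patterns in braces are ignored), the spacer is assumed to count the residues between the end of one block and the start of the next, and after a skipped block the search is assumed to resume from the end of the last matched block.

```python
def find_block(seq, block, r, start=0, end=None):
    """Return the first start position in [start, end] where at least a
    fraction r of the block residues match seq, or -1 if there is none."""
    last = len(seq) - len(block)
    if end is not None:
        last = min(last, end)
    for pos in range(start, last + 1):
        matched = sum(a == b for a, b in zip(block, seq[pos:pos + len(block)]))
        if matched >= r * len(block):
            return pos
    return -1


def match_motif(seq, blocks, spacers, r=0.8):
    """Search the blocks of one motif in order.

    blocks  -- list of block strings
    spacers -- spacers[i] = (min_spacer, max_spacer) between block i and block i + 1
    Returns one entry per block: its start position, or None if the block was skipped.
    """
    positions = []
    prev = None   # (start, length) of the block matched immediately before
    resume = 0    # assumed resume point when the preceding block was skipped
    for i, block in enumerate(blocks):
        if prev is not None:
            # step (ii): look for the next block within the spacer bounds
            lo, hi = spacers[i - 1]
            after = prev[0] + prev[1]
            pos = find_block(seq, block, r, after + lo, after + hi)
        else:
            # step (i), or step (iii) after a skipped block
            pos = find_block(seq, block, r, resume)
        if pos == -1:
            positions.append(None)
            prev = None
        else:
            positions.append(pos)
            prev = (pos, len(block))
            resume = pos + len(block)
    return positions


def is_assigned(positions):
    """A sequence is assigned to the superfamily if any block matched."""
    return any(p is not None for p in positions)


# Hypothetical two-block motif and sequence, for illustration only.
hits = match_motif("MKGDSGGPLVAAACGSCWAFS", ["GDSGGP", "CGSCW"], [(3, 8)], r=1.0)
```

the modified procedure mentioned in the text, which stops when a conserved (+) block is missing, could be obtained by returning early when a block flagged as conserved yields no position; that flag is not modelled here.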
when false positives were examined in more detail, all resulted from single motif blocks containing substitution patterns. sequence motifs with multiple blocks or sequence motifs with single blocks without substitution patterns could be used safely for superfamily assignment. next, a test data set was prepared from release 29.0 of the pir1 database by identifying new entries added after release 26.0. there were cases where several entries in multiple superfamilies were combined into a single superfamily or entries in a single superfamily were split into different superfamilies. in such cases, the multiple superfamilies are considered to be related and assignment to a related superfamily is the correct answer. the results using this new data set are summarized in table iii(b). although the prediction ability (~68%) was not as great as had been expected, the search itself could be performed within a fraction of a second on a small workstation, which is two to three orders of magnitude faster than the fasta homology search (pearson and lipman, 1988). we modified the above procedure and stopped the search if any of the conserved (+) blocks were not found. the number of false positives could be decreased without affecting the number of false negatives in table iii(a) because this is how the conserved block was defined in the training set. however, this additional constraint has more effect on increasing the number of false negatives than decreasing the number of false positives in the test data set. if the motif library is to be used as an initial step in superfamily assignment, it is desirable to decrease the number of false negatives because false positives can easily be distinguished by sequence similarity in the subsequent step. there are still ~20% of false negatives in table iii(b), even with low values of r. it is possible to halve this by incorporating amino acid similarity scores, such as the pam matrix (dayhoff, 1978), when comparing motif blocks (data not shown). 
because the sequence motifs identified represent well conserved regions within a group of related proteins, they are likely to correspond to functionally important sites. table iv summarizes the percentages of biological sites, annotated in the pir1 database, which correspond to motif blocks identified by our procedure. table v is a listing of the single block sequence motifs that characterize 35 superfamilies, together with any known functional significance. our procedure identified known consensus patterns, or closely related derivatives, such as the active site sequence gdsgg which is known to be exclusive to the serine protease superfamily (dayhoff et al., 1983) . the sequence motifs were obtained strictly from 1-d sequence information, the superfamily classification based on sequence similarity and the amino acid pattern searches. among the 145 superfamilies with identified motifs, 21 superfamilies contained one or more member sequences with known 3-d structures; seven were characterized by single block motifs and 14 by multiple block motifs. using the coordinate data from the brookhaven protein data bank (bernstein et al, 1977) , it has been determined that multiple motif blocks come closer together in the 3-d structure. typical examples are: l-lactate dehydrogenase (sf31; see appendix for actual motifs), phosphoglycerate kinase (sf229), phospholipase a2 (sf281), neutral proteinase (sf385), carbonate dehydratase (sf472) and triose-phosphate isomerase (sf499). figure 4 shows a stereo drawing of phospholipase a2 with two motif blocks at the active site. the correlation between conserved sequence patterns and exon structures has also been examined. a popular view suggests that introns existed in ancestral genes and have been removed under the exon shuffling mechanism (holland and blake, 1987) where an exon forms a structural or functional unit of a protein. therefore, it was expected that the identified motif blocks may correspond to exon units. 
as shown in table iv, however, quite a few introns were found to split functionally important motif blocks. figure 5 shows typical examples where exon boundaries appear within the motif blocks. it is also noted that the intron positions around the motif block cgscw of the papain (cysteine protease) superfamily (ishidoh et al., 1989) and around the motif block gdsggp of the trypsin (serine protease) superfamily (rogers, 1985) are not fixed within the respective member sequences. these observations appear to support the concept of intron insertions (rogers, 1989), although not all introns examined here may fall into this category. information about the functional properties of expressed protein products is often the main concern when dna sequences are determined. the method presented in this paper is an attempt towards fully computerized interpretations of the sequence data. a collection of sequence motifs with associated biological meanings in evolutionary, functional and structural aspects may be considered a dictionary for such purposes. at the same time, the motif search approach is expected to solve the speed and sensitivity problems in the current homology search approaches. because motifs represent more organized information, concentrated and extracted from primary databases, the search against a motif library is much faster than the search against a sequence database. it is also possible to incorporate various types of motif in the library, not only those to identify membership of a superfamily, but also other sequence patterns which are too weak to be detected by standard database search methods. until now, sequence motifs have been found by manually examining a set of related sequences, although there have been a few attempts to automate the procedure (staden, 1989; smith and smith, 1990; smith et al., 1990). the essence of our automatic method is the concept of uniqueness. for a protein with 100 residues there are $20^{100}$ possible amino acid sequences. 
in nature, however, the repertoire of real amino acid sequences appears to be quite limited in comparison to this theoretical number. the protein sequences sequenced to date amount to 10 million residues, three times larger than $20^5$ or 3.2 million pentapeptide patterns. in reality, ~40% of the possible pentapeptide patterns are not used in the known sequences. thus, actual proteins seem to have evolved from a limited set of amino acid sequences, conserving functionally important residues. this has been the working hypothesis in this study. as expected, motif blocks, constructed from unique peptide words, were found to be well correlated with functionally important sites of protein molecules. in addition, separate blocks tend to be close together in space to form an active site. for the motif library to be more useful, it is necessary to increase the number of identified superfamilies, i.e. to reduce the number of no opinions (70-80%) in table iii. one approach is to use lower levels of conservation, f, as shown in table i. another is to relax the condition of uniqueness which was strictly required in this analysis. a few exceptions can be allowed in other superfamilies and/or patterns could be identified that are unique to multiple superfamilies. in our preliminary analyses of the latter case, the pattern ygdtds was found in two superfamilies (dna-directed dna polymerases of adenovirus and herpes virus) which share very little sequence homology. the possibility of combining multiple superfamilies based on short sequence motifs is thus inferred. the pattern hpdkgg was found exclusively in the three superfamilies: large t antigen, middle t antigen and small t antigen of polyoma and related viruses. however, this pattern was actually located in the exon shared by the three antigens. dictionary of sequence motifs characterizing superfamilies prosite: a dictionary of protein sites and patterns. embl, release 4 proc. natl acad. sci. 
usa received on february this work was supported by a grant-in-aid for scientific research on the priority area 'genome informatics' from the ministry of education, science and culture, japan key: cord-256278-jvfjf7aw authors: feng, jie; hu, yong; wan, ping; zhang, aibing; zhao, weizhong title: new method for comparing dna primary sequences based on a discrimination measure date: 2010-10-21 journal: journal of theoretical biology doi: 10.1016/j.jtbi.2010.07.040 sha: doc_id: 256278 cord_uid: jvfjf7aw abstract we introduce a new approach to compare dna primary sequences. the core of our method is a new measure of pairwise distances among sequences. using the primitive discrimination substrings of sequences s and q, a discrimination measure dm(s, q) is defined for their similarity analysis. the proposed method does not require multiple alignments and is fully automatic. to illustrate its utility, we construct phylogenetic trees on two independent data sets. the results indicate that the method is efficient and powerful. with the completion of the sequencing of the genomes of human and other species, the analysis of genomic sequences is becoming a very important task in bioinformatics. comparison of primary sequences of different dna strands remains the most important aspect of sequence analysis. so far, most comparison methods are based on string alignment (pearson and lipman, 1988; lake, 1994): a distance function is used to represent insertion, deletion, and substitution of letters in the compared strings. using the distance function, one can compare dna primary sequences and resolve questions about the homology of macromolecules. however, this approach is not easy to use for long sequences, since it is realized with the aid of dynamic programming, which is slow due to the large number of computational steps. in the past two decades, alignment-free sequence comparison (vinga and almeida, 2003) has been actively pursued. 
some new methods have been derived with a variety of theoretical foundations. one category of these methods is based on the statistics of word frequency within a dna sequence (sitnikova and zharkikh, 1993; karlin and burge, 1995; wu et al., 1997, 2001; stuart et al., 2002; qi et al., 2004). the core idea is that the more similar two sequences are, the greater the number of factors they share. the earliest publication using frequency statistics of k-words for sequence comparison dates from 1986 (blaisdell, 1986). three years later, blaisdell (1989) proved that the dissimilarity values observed by using distance measures based on word frequencies are directly related to the ones requiring sequence alignment. in recent years, many researchers have employed k-words and the markov model to obtain information about biological sequences (pham and zuegg, 2004; pham, 2007; kantorovitz et al., 2007; helden, 2004; dai et al., 2008). another category does not require resolving the sequence into fixed word length segments. it can be further divided into three groups. in the first group, researchers represent dna sequences by curves (hamori and ruskin, 1983; nandy, 1994; randic et al., 2003a; zhang et al., 2003; liao, 2005; li et al., 2006; qi et al., 2007; yu et al., 2009), numerical sequences (he and wang, 2002), or matrices (randic, 2000; randic et al., 2001). according to the representation, some numerical characterizations are selected as invariants of the sequence for comparisons of dna primary sequences. the advantage of these methods is that they provide a simple way of viewing, sorting and comparing various gene structures. but how to obtain suitable invariants to characterize dna sequences and compare them is still a question that needs attention. the second group corresponds to iterated maps. jeffrey (1990) proposed the chaos game representation (cgr) as a scale-independent representation for genomic sequences. 
the algorithm exploited iterative function systems to map nucleotide sequences into a continuous space. since then, alignment-free methods based on cgr have aroused much interest in the field of computational biology. further studies by almeida et al. (2001) showed that cgr is a generalized markov chain probability table which can accommodate non-integer orders, and that cgr is a powerful sequence modelling tool because of its computational efficiency and scale-independence (almeida and vinga, 2002). such alignment-free methods have been successfully applied to sequence comparison, phylogeny, detection of horizontal transfers, detection of oligonucleotides of interest and meta-genomic studies (deschavanne et al., 1999; pride et al., 2003; sandberg et al., 2003; teeling et al., 2004; chapus et al., 2005; wang et al., 2005; dufraigne et al., 2005; joseph and sasikumar, 2006). the third group is based on text compression techniques (chen et al., 2004; cilibrasi et al., 2004). if one sequence is significantly compressible given the information contained in the other sequence, the two sequences are considered to be close. there are also some important methods which are based on compression algorithms but do not actually apply the compression, such as lempel-ziv complexity and the burrows-wheeler transform (otu and sayood, 2003; mantaci et al., 2007, 2008; yang et al., 2010). in this paper, we propose a new sequence distance for the similarity analysis of dna sequences. based on the properties of primitive discrimination substrings, we construct a discrimination measure (dm) between every two sequences. furthermore, as an application, two data sets (β-globin genes and coronavirus genomes) are prepared and tested to assess the validity of the method. the results demonstrate that the new method is powerful and efficient. dna sequences consist of four nucleotides: a (adenine), g (guanine), c (cytosine), and t (thymine). 
a dna sequence of length n can be viewed as a linear sequence of n symbols from the finite alphabet $\mathcal{A} = \{a, c, g, t\}$. let s and q be sequences defined over $\mathcal{A}$, let $l(s)$ be the length of s, let $s(i)$ denote the ith element of s and let $s(i,j)$ be the substring of s composed of the elements of s between positions i and j (inclusive). definition 1. $s(i,j)$ is called a discrimination substring (ds) that distinguishes s from q if $s(i,j) \notin q$, that is, if $s(i,j)$ does not occur as a substring of q. in particular, if $s(i,j)$ does not include any other ds distinguishing s from q, we call $s(i,j)$ a primitive discrimination substring (pds) that distinguishes s from q. the set of pdss that distinguish s from q is denoted by $\mathcal{D}(s,q)$. similarly, $\mathcal{D}(q,s)$ denotes the set of pdss that distinguish q from s. note that every sequence has its own identity, hence $\mathcal{D}(s,q)$ is usually different from $\mathcal{D}(q,s)$. for example, for s = acctac and q = gtgact, we obtain $\mathcal{D}(s,q) = \{cc, ta\}$ and $\mathcal{D}(q,s) = \{gt, tg, ga, act\}$. suppose $u \in \mathcal{D}(s,q)$ and $l(u) = k$; then $u(1,k-1) \in q$ (otherwise $u(1,k-1) \in \mathcal{D}(s,q)$, which conflicts with the minimality of u). therefore, the larger k is, the more elements s and q have in common and, correspondingly, the smaller the degree of discrimination that s distinguishes from q is. on the other hand, if the number of appearances of u in sequence s is t, then the smaller t is, the smaller the degree of discrimination that s distinguishes from q is. from the above description, we construct the following discrimination measure that one sequence distinguishes from another sequence. definition 2. $dm(s_1 \to s_2)$ denotes the discrimination measure that $s_1$ distinguishes from $s_2$, in which $v \in \mathcal{D}(q,s)$, $l(v) = k'$ and the number of appearances of $v$ in sequence q is $t'$. definition 3. the discrimination measure of sequences s and q is $$dm(s,q) = \sqrt{dm(s \to q)^2 + dm(q \to s)^2}.$$ for the function dm to be a distance, it must satisfy (a) $dm(x,y) > 0$ for $x \neq y$; (b) $dm(x,x) = 0$; (c) $dm(x,y) = dm(y,x)$ (symmetric); and (d) $dm(x,y) \leq dm(x,z) + dm(z,y)$ (triangle inequality). 
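the sets of primitive discrimination substrings in definition 1 can be enumerated by brute force, as in the python sketch below. it is an illustration under stated assumptions, not the authors' algorithm: the function names and the `min_len` parameter are hypothetical. read literally, definition 1 makes the single nucleotide g a discrimination substring of q = gtgact with respect to s = acctac, because g never occurs in s; the sets listed in the example are reproduced when only substrings of length at least two are considered, so `min_len` exposes that choice. the directed measure itself is not computed, because its formula is not given in the text above; the sketch only returns each pds together with its number of occurrences t, and the length k is simply the length of the key.

```python
def substrings_of(text, min_len=1):
    """All distinct substrings of text with length >= min_len."""
    return {
        text[i:j]
        for i in range(len(text))
        for j in range(i + min_len, len(text) + 1)
    }


def primitive_discrimination_substrings(s, q, min_len=1):
    """Return {pds: t} for the PDSs that distinguish s from q.

    t is the number of (possibly overlapping) occurrences of the PDS in s.
    The enumeration is quadratic in the sequence length, so it is only
    suitable for short illustrative inputs.
    """
    # discrimination substrings: substrings of s that never occur in q
    ds = substrings_of(s, min_len) - substrings_of(q, min_len)
    pds = {}
    for u in ds:
        # primitive: no proper substring of u is itself a discrimination substring
        proper = substrings_of(u, min_len) - {u}
        if proper.isdisjoint(ds):
            pds[u] = sum(s.startswith(u, i) for i in range(len(s) - len(u) + 1))
    return pds


s, q = "acctac", "gtgact"
d_sq = primitive_discrimination_substrings(s, q, min_len=2)
d_qs = primitive_discrimination_substrings(q, s, min_len=2)
d_qs_literal = primitive_discrimination_substrings(q, s, min_len=1)
```

with `min_len=2` the keys of `d_sq` and `d_qs` coincide with the two sets given in the example; with `min_len=1` the second set collapses to g and act. for genome-scale inputs a suffix-based index would be needed instead of explicit substring sets.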
apparently, dm satisfies distance conditions (a)-(c). it is not obvious that it also satisfies (d). the following proposition answers this. proposition 1. dm(x,y) satisfies the triangle inequality, that is, $dm(x,z) \leq dm(x,y) + dm(y,z)$. proof. suppose $s$ is an arbitrary element of $\mathcal{D}(x,z)$. if $s$ is also contained in $\mathcal{D}(x,y)$, clearly we can obtain that $dm(x \to z) \leq dm(x \to y) + dm(y \to z)$. if there exists an element $t \in \mathcal{D}(x,z)$ that is not contained in $\mathcal{D}(x,y)$, then we can derive $t \in \mathcal{D}(y,z)$; therefore the inequality $dm(x \to z) \leq dm(x \to y) + dm(y \to z)$ still holds. similarly, we can prove that $dm(z \to x) \leq dm(y \to x) + dm(z \to y)$. writing $c = dm(x \to y)$, $d = dm(y \to x)$, $e = dm(y \to z)$ and $f = dm(z \to y)$, it is sufficient to prove the following inequality: $$\sqrt{(c+e)^2 + (d+f)^2} \leq \sqrt{c^2 + d^2} + \sqrt{e^2 + f^2}.$$ this is equivalent to, by squaring both sides of the above inequality, $$ce + df \leq \sqrt{(c^2 + d^2)(e^2 + f^2)}.$$ to prove this inequality, we just need to prove $(ce + df)^2 \leq (c^2 + d^2)(e^2 + f^2)$, i.e. $2cedf \leq e^2 d^2 + c^2 f^2$. this inequality holds because $e^2 d^2 + c^2 f^2 - 2cedf = (ed - cf)^2 \geq 0$. therefore, $$\sqrt{dm(x \to z)^2 + dm(z \to x)^2} \leq \sqrt{dm(x \to y)^2 + dm(y \to x)^2} + \sqrt{dm(y \to z)^2 + dm(z \to y)^2};$$ hence dm(x,y) satisfies the triangle inequality. in this section, we apply the discrimination measure to analyze two sets of dna primary sequences. the similarities among these species are computed by calculating the discrimination measure between every two sequences. the smaller the discrimination measure is, the more similar the species are. that is to say, the discrimination measures of evolutionarily closely related species are smaller, while those of evolutionarily disparate species are larger. fig. 1 illustrates the basic processes of the dm algorithm. the first set we select includes 10 β-globin genes, whose similarity has been studied by many researchers using their first exon sequences (randic et al., 2003b; liu and wang, 2005). here we will analyze these species using their complete β-globin genes. 
table 1 presents their names, embl accession numbers, locations and lengths. in table 2, we present the similarity/dissimilarity matrix computed by our new method for the full dna sequences of the β-globin gene from the 10 species listed in table 1. observing table 2, we note that the most similar species pairs are human-gorilla, human-chimpanzee and gorilla-chimpanzee, as expected from their evolutionary relationship. at the same time, we find that gallus and opossum are the most remote from the other species, which coincides with the fact that gallus is the only nonmammalian species among these 10 species and opossum is the most remote species from the remaining mammals. by further study of the values in the table, we can gain more information about their similarity. another use of the similarity/dissimilarity matrix is to construct a phylogenetic tree. the quality of the constructed tree may show whether the matrix is good and therefore whether the method of abstracting information from dna sequences is efficient. once a distance matrix has been calculated, it is straightforward to generate a phylogenetic tree using the nj method or the upgma method in the phylip package (http://evolution.genetics.washington.edu/phylip.html). in fig. 2, we show the phylogenetic tree of the 10 β-globin gene sequences based on the distance matrix dm, using the nj method. the tree is drawn using the drawgram program in the phylip package. from this figure, we observe that (1) gallus is clearly separated from the rest, which coincides with the real biological phenomenon; (2) human, gorilla, chimpanzee and lemur are placed closer to bovine and goat than to mouse and rat, which is in complete agreement with cao et al. (1998), confirming the outgroup status of rodents relative to ferungulates and primates. next, we consider inferring the phylogenetic relationships of coronaviruses with the complete coronavirus genomes. 
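the step from a distance matrix to a tree, which the text delegates to the phylip package, can be illustrated with a small stand-alone upgma routine in python. this is a sketch for intuition only and is not phylip's implementation: the function name `upgma`, the list-of-lists matrix input, the toy distances and the three-decimal branch lengths are all assumptions made for the example. the nj method used for the β-globin tree is a different algorithm and is not shown.

```python
def upgma(names, matrix):
    """Build a UPGMA tree from a symmetric distance matrix and return it in Newick format.

    names  -- list of taxon labels
    matrix -- matrix[i][j] is the distance between names[i] and names[j]
    """
    # cluster id -> (newick subtree, number of leaves, height of the cluster)
    clusters = {i: (name, 1, 0.0) for i, name in enumerate(names)}
    # pairwise distances keyed by (smaller id, larger id)
    dist = {
        (i, j): matrix[i][j]
        for i in range(len(names))
        for j in range(i + 1, len(names))
    }
    next_id = len(names)
    while len(clusters) > 1:
        # join the closest pair of clusters
        (a, b), d_ab = min(dist.items(), key=lambda item: item[1])
        tree_a, n_a, h_a = clusters.pop(a)
        tree_b, n_b, h_b = clusters.pop(b)
        height = d_ab / 2.0
        tree = f"({tree_a}:{height - h_a:.3f},{tree_b}:{height - h_b:.3f})"
        # distance to every remaining cluster: size-weighted average (UPGMA rule)
        updated = {}
        for c in clusters:
            d_ac = dist[(min(a, c), max(a, c))]
            d_bc = dist[(min(b, c), max(b, c))]
            updated[(c, next_id)] = (n_a * d_ac + n_b * d_bc) / (n_a + n_b)
        dist = {pair: d for pair, d in dist.items() if a not in pair and b not in pair}
        dist.update(updated)
        clusters[next_id] = (tree, n_a + n_b, height)
        next_id += 1
    (tree, _, _), = clusters.values()
    return tree + ";"


# Hypothetical distances between four taxa, not values from table 2.
toy_names = ["A", "B", "C", "D"]
toy_matrix = [
    [0, 2, 8, 8],
    [2, 0, 8, 8],
    [8, 8, 0, 4],
    [8, 8, 4, 0],
]
newick = upgma(toy_names, toy_matrix)
```

the returned newick string can be passed to any tree viewer; upgma assumes a constant rate of change along all branches, which is one reason a distance matrix may give different topologies under upgma and nj.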
the 24 complete coronavirus genomes used in this paper were downloaded from genbank, of which 12 are sars-covs and 12 are from other groups of coronaviruses. the name, accession number, abbreviation, and genome length for the 24 genomes are listed in table 3 . according to the existing taxonomic groups, sequences 1-3 form group i, and sequences 4-11 belong to group ii, while sequence 12 is the only member of group iii. previous work showed that sars-covs (sequences 13-24) are not closely related to any of the previously characterized coronaviruses and form a distinct group iv. in fig. 3 , we present the phylogenetic tree belonging to 24 species based on the distance matrix dm, using upgma method. the tree is viewed using the drawgram program. as shown in fig. 3 , four groups of coronaviruses can be seen from it: (1) the group i coronaviruses, including tgev, pedv and hcov-229e tend to cluster together; (2) bcov, bcovl, bcovm, bcovq, mhv, mhv2, mhvm, and mhvp, which belong to group ii, are grouped in a monophyletic clade; (3) ibv, belonging to group iii, is situated at an independent branch; (4) the sars-covs from group iv are grouped in a separate branch, which can be distinguished easily from other three groups of coronaviruses. the tree constructed based on dm algorithm is quite consistent with the results obtained by other researchers (zheng et al., 2005; song et al., 2005; liu et al., 2007; li et al., 2008) . the emphasis of the present work is to provide a new method to analyze dna sequences. from the above applications, we can see that our method is feasible for comparing dna sequences and deducing their similarity relationship. in this paper, we propose a new method for the similarity analysis of dna sequences. it is a simple method that yields results reasonably and rapidly. our algorithm is not necessarily an improvement as compared to some existing methods, but an alternative for the similarity analysis of dna sequences. 
the new approach does not require sequence alignment or graphical representation and, besides, it is fully automatic. the whole operation process utilizes the entire information contained in the dna sequences and does not require any human intervention. the application of the dm algorithm to the sets of β-globin genes and coronavirus genomes demonstrates its utility. this method will also be useful to researchers who are interested in evolutionary analysis. analysis of genomic sequences by chaos game representation universal sequence map (usm) of arbitrary discrete sequences computing distribution of scale independent motifs in biological sequences biological sequences as pictures: a generic two dimensional solution for iterated maps a measure of similarity of sets of sequences not requiring sequence alignment effectiveness of measures requiring and not requiring prior sequence alignment for estimating the dissimilarities of natural sequences conflict among individual mitochondrial proteins in resolving the phylogeny of eutherian orders exploration of phylogenetic data using a global sequence analysis method shared information and program plagiarism detection algorithmic clustering of music based on string compression markov model plus k-word distributions: a synergy that produces novel statistical measures for sequence comparison genomic signature: characterization and classification of species assessed by chaos game representation of sequences detection and characterization of horizontal transfers in prokaryotes using genomic signature h curves, a novel method of representation of nucleotides series especially suited for long dna sequences characteristic sequences for dna primary sequence metrics for comparing regulatory sequences on the basis of pattern counts chaos game representation of gene structure chaos game representation for comparison of whole genomes a statistical method for alignment free comparison of regulatory sequences dinucleotide relative 
abundance extremes: a genomic signature reconstructing evolutionary trees from dna and protein sequences: paralinear distances directed graphs of dna sequences and their numerical characterization 2-d graphical representation of protein sequences and its application to coronavirus phylogeny an information based sequence distance and its application to whole mitochondrial genome phylogeny a 2d graphical representation of dna sequence a relative similarity measure for the similarity analysis of dna sequences characteristic distribution of l-tuple for dna primary sequence an extension of the burrows-wheeler transform distance measures for biological sequences: some recent approaches a new graphical representation and analysis of dna sequence structure a new sequence distance measure for phylogenetic tree construction improved tools for biological sequence comparison spectral distortion measures for biological sequence comparisons and database searching a probabilistic measure for alignment-free sequence comparison evolutionary implications of microbial genome tetranucleotide frequency biases whole proteome prokaryote phylogeny without sequence alignment: a k-string composition approach new 3d graphical representation of dna sequence based on dual nucleotides on the similarty of dna primary sequences on the characterization of dna primary sequences by triplet of nucleic acid bases novel 2-d graphical representation of dna sequences and their numerical characterization analysis of similarity/ dissimilarity of dna sequences based on novel 2-d graphical representation quantifying the speciesspecificity in genomic signatures, synonymous codon choice, amino acid usage and g +c content statistical analysis of l-tuple frequencies in eubacteria and organells cross-host evolution of severe acute respiratory syndrome coronavirus in palm civet and human integrated gene and species phylogenies from unaligned whole genome protein sequences application of tetranucleotide frequencies 
for the assignment of genomic fragments alignment-free sequence comparison-a review the spectrum of genomic signatures: from dinucleotides to chaos game representation a measure of dna sequence dissimilarity based on mahalanobis distance between frequencies of words statistical measures of dna dissimilarity under markov chain models of base composition the burrows-wheeler similarity distribution between biological sequences based on burrows-wheeler transform tn curve: a novel 3d graphical representation of dna sequence based on trinucleotides and its applications the z curve database: a graphic representation of genome sequences coronavirus phylogeny based on a geometric approach we thank all the anonymous referees for their valuable suggestions and support. key: cord-010161-bcuec2fz authors: matson, david o. title: iv, 6. calicivirus rna recombination date: 2004-09-14 journal: perspect med virol doi: 10.1016/s0168-7069(03)09032-3 sha: doc_id: 10161 cord_uid: bcuec2fz rna recombination apparently contributed to the evolution of cvs. nucleic acid sequence homology or identity and similar rna secondary structure of cvs and non-cvs may provide a locus for recombination within cvs or with non-cvs should co-infections of the same cell occur. natural recombinants have been demonstrated among other enteric viruses, including picornaviridae (kirkegaard and baltimore, 1986; furione et al., 1993), astroviridae (walter et al., 2001), and possibly rotaviruses (e.g., desselberger, 1996; suzuki et al., 1998), augmenting the natural diversity of these pathogens and complicating viral gastroenteritis prevention strategies based upon traditional vaccines. such is the case for cvs and astroviridae, whose recombinant strains may be a common portion of naturally circulating strains. 
the taxonomic, and perhaps biologic, limits of recombination are defined by the suggested recombination of nanovirus and cv, viruses from hosts of different biologic orders; the relationship of picornaviruses and cvs, viruses in different families, as recombination partners; and the intra-generic recombination between different clades of nlvs. in this review, i will discuss evidence for the occurrence of rna recombination in caliciviridae, both within and outside the family. constraints on recombination provided by the genomic diversity of caliciviruses (cvs), as well as implications of recombination on the natural diversity of cv strains and the clinical and biologic significance of rna recombination, also will be considered. first, i will review some features of cvs that affect understanding of recombination. the cv genome is a positive-sense, single-stranded, polyadenylated rna molecule of about 7500 nucleotides in length. cvs fall into four genera that differ in their genomic organization (green et al., 2000a) (fig. 1). norwalk-like viruses (nlvs) have three open reading frames (orfs). orf1 encodes a polyprotein cleaved during replication into a set of nonstructural proteins, orf2 encodes the capsid protein, and orf3 encodes a protein that appears to be a minor structural protein. where studied, cvs have been shown to synthesize a positive-sense subgenomic rna that begins at the 5' end of the capsid gene and that is 3' co-terminal with the genome (meyers et al., 1991; neill and mengeling, 1988; sosnovtsev and green, section iv, chapter 2 of this book). vesiviruses differ from nlvs in having a longer genome that in some vesiviruses (e.g., pan-1; rinehart-kim et al., 1999), but not others (e.g., feline cv; carter et al., 1992), includes a longer orf1 with an additional predicted protein at its n-terminus. orf2 of vesiviruses is longer than that of nlvs, with the extra nucleotides at the 5' end of orf2. 
this extra sequence encodes a protein fragment that must be post-translationally cleaved to agree with experimental data of vesivirus virion structure (prasad et al., 1994). orf3 of vesiviruses is about one-half the size of that of nlvs (~120 amino acids vs. 250-275 amino acids, respectively). in lagoviruses and sapporo-like viruses (slvs), the genes that are in orf1 and orf2 of nlvs and vesiviruses are fused into one longer orf1. a gene comparable to that of orf3 of nlvs also is present. an orf in another frame at the 5' end of the capsid gene occurs among slvs, but not in all slv strains (liu et al., 1999; jiang et al., 1997). the antigenic determinants (neutralization epitopes) that induce immunity against cvs presumably are located on the surface of the virion capsid. this capsid is composed of 180 copies of the capsid gene product, paired into 90 dimers (prasad et al., 1994; prasad et al., 1999). despite the existence of just one capsid protein, cvs exhibit extensive antigenic diversity. in the best-characterized genus, vesivirus, at least 40 distinct serotypes (neutralization types) exist, not including feline cvs and closely related strains, which among themselves are so diverse antigenically that definition of serotypes has been problematic (lauritzen et al., 1997; hohdatsu et al., 1999; smith, 2000). the distinct vesivirus serotypes are certainly determined by differences in nucleotide sequence of the capsid gene, resulting in differences in surface epitopes. the nucleotide differences sufficient to change the serotype are unknown, but are likely to occur in a few distinct regions of the capsid gene (neill, 1992; rinehart-kim et al., 1999; neill et al., 2000). when many capsid nucleotide sequences from different cv strains are simultaneously compared in phylogenetic analyses, the sequences within a genus fall into statistically significant clades (green et al., 2000b). the biologic significance of these distinct clades is unknown. 
it is clear that such clades are related to differences in capsid gene sequences; sequence differences are less marked in the rna polymerase gene: when rna polymerase region sequences are analyzed in phylogenetic analyses, statistically significant differences similar to those observed among capsid gene sequences do not occur . it is possible that separate capsid sequence clades within a genus indicate separate serotypes, but, even for vesivirus capsid sequences, an insufficient number of strains have been analyzed to associate specific sequence differences with differences in serotype. with the description of statistically significant phylogenetic clades within cv genera, data were available to recognize strains that might be natural recombinants within cvs. two examples are the well-characterized argentine strain 320 (arg320) and snow mountain virus (smv), one of the prototype cvs, recognized to be recombinants when the rna polymerase and capsid regions of these strains were characterized (hardy et al., 1997; jiang et al., 1999) (fig. 2) . at the time of publication, recombination was more certain for arg320, because the sequence was derived from a single cdna insert spanning the ~3.0 kb at the 3' end of the genome, including the end of orf1 and all of orf2, orf3, and the 3' non-coding region. in arg320, the change of relative sequence identity occurred at the orf1/capsid gene junction, indicating that the recombination occurred there. this site also was suggested (see below) to be the break-and-rejoin site for recombination between cvs and picornaviruses. for arg320, the orf1 sequence was closest to that of lordsdale virus, among sequenced nlvs, and the capsid and orf3 sequences were closest to those of mexico virus. a similar change of relative sequence identity also occurred in smv, when partial polymerase and capsid sequences were compared to reference mexico and melksham viruses. 
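the "change of relative sequence identity" used above to flag arg320 and smv can be illustrated with a minimal sketch. it assumes pre-aligned, equal-length region sequences and uses made-up toy strings and hypothetical names (`ref_a`, `ref_b`, `looks_recombinant`); these are not the real lordsdale, mexico or melksham sequences, and published analyses rely on phylogenetic and statistical methods rather than this simple comparison.

```python
def percent_identity(a: str, b: str) -> float:
    """Percent of positions at which two pre-aligned, equal-length sequences match."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    return 100.0 * sum(x == y for x, y in zip(a, b)) / len(a)


def closest_reference(query: str, references: dict) -> str:
    """Name of the reference sequence with the highest identity to the query."""
    return max(references, key=lambda name: percent_identity(query, references[name]))


def looks_recombinant(query_regions: dict, reference_regions: dict) -> bool:
    """True if the closest reference strain differs between genome regions.

    query_regions:      {"polymerase": seq, "capsid": seq}
    reference_regions:  {strain_name: {"polymerase": seq, "capsid": seq}}
    """
    closest = set()
    for region, seq in query_regions.items():
        refs = {name: regs[region] for name, regs in reference_regions.items()}
        closest.add(closest_reference(seq, refs))
    return len(closest) > 1


# Toy data (hypothetical): polymerase resembles ref_a, capsid resembles ref_b.
query = {"polymerase": "ACGTACGTAC", "capsid": "TTGACCGGTA"}
refs = {
    "ref_a": {"polymerase": "ACGTACGTAC", "capsid": "GGGGGGGGGG"},
    "ref_b": {"polymerase": "TTTTTTTTTT", "capsid": "TTGACCGGTA"},
}
print(looks_recombinant(query, refs))
```

a switch of the closest reference between regions is only a screening signal; as the text notes for smv, separately generated amplicons leave open the possibility of mixed strain sources.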
while smv was likely also to be a recombinant virus, the capsid and rna polymerase region amplicons of smv were generated separately and that fact did not exclude the possibility of different sources of strains. xi jiang has confirmed the recombinant status of smv by sequencing a single cdna derived from a single rt-pcr amplicon (x jiang, personal communication). generation of recombinants within cvs requires biologic and molecular attributes of cvs. outbreaks caused by multiple cv strains and co-infection by different hucv strains occur (matson et al., 1995; gray et al., 1997; reuter et al., 2002). infection of single cells simultaneously by two cvs implies absence of immune or molecular interference. cv rna also must have an attribute that permits/favors recombination, such as a site where errors in procession of rna polymerase can occur. the subgenomic rna is the most likely molecule to participate in recombination as noted above. the highly conserved 5' end sequence of the genome and at the 5' end of the capsid gene in nlvs is an obvious common target for cv rna polymerase, for genomic and subgenomic rna synthesis. 
[figure caption fragment, beginning and end not recovered: ... and of 40 nt near the 5' end of that strain's capsid gene (id="b" sequence for this fig.). each a or b sequence is from a cv with a known complete genome sequence, is on a single line, and is repeated in two columns. in the left-hand column, each a or b sequence is compared with the first 40 nt of the norwalk virus genome, i.e., the norwalk virus a sequence (jiang et al., 1993). in the second column, the a or b sequence is compared with the first 40 nt (a sequence) of a prototype sequence for that genus. within a genus, the a sequences are listed first and the b sequences given next. "-" indicates the nt at that site is identical to that in the comparison sequence. for ebhsv, the "*" indicates a residue i inserted into the 5' noncoding ...] 
the sequence data indicated that recombination in strain arg320 occurred at the orf1/capsid gene junction where high sequence identity exists between the putative parent clades. the genomic sequence of 10 nlv strains has been determined. a comparison of sequence at the 5' genomic sequence with sequence near the 5' end of orf2 shows a high degree of sequence identity (fig. 3) . no other region of an nlv strain genome shares this degree of identity, even closely, with the 5' end of the genome. in addition, sequence identity comparable to that shown between the nlv 5' end and near the 5' end of the nlv capsid may occur among cvs of a single genus from a single host, once enough strains are sequenced. the "copy choice" model has been preferred for recombination of single-stranded rna viruses, including picornaviruses and coronaviruses (kirkegaard and baltimore, 1986; makino et al., 1986; lai and cavanagh, 1997; nagy and simon, 1997) . in the copy-choice model, recombination occurs during rna replication when the viral rna polymerase switches templates from the rna derived from one strain (donor template) to the rna derived from a second strain (acceptor template), at a highly conserved genome region, without releasing the nascent strand (lai and cavanagh, 1997) . models for rna virus recombination have utilized two terminologies to describe the degree that features of the donor and acceptor templates are shared: homologous, aberrant homologous, and non-homologous types (lai and cavanagh, 1997) or sequence similarity-essential, similarity-assisted, and similarity-nonessential (nagy and simon, 1997) . the putative parent clades of intrageneric cv recombinants have a long region of identical sequence and predicted stable hairpin structures at the proposed recombination site, which supports the classification of these recombinants as homologous or (at least) similarity-assisted. 
interaction like that between genomic and subgenomic cv rnas could occur by the same mechanisms for two genomic 5' ends, but the outcome of such recombination events would be hard to predict. furthermore, if a virion contains genomic and subgenomic rnas, recombination could occur in a generation after initial co-infection. evidence for recombination of cvs depends upon sequence comparisons. upon sequencing a portion of a feline cv strain f9, neill (1990) observed that (what later was designated) orf1 contained significant sequence identity with picornaviruses. this significant identity was concentrated around certain amino acid motifs within orf1 that are homologous to those within the non-structural region of picornaviruses, encoding, in order, 2c, 3c, and 3d genes. the order of these motifs and the approximate number of nucleotides between them were the same in both virus families (fig. 4). the capsid gene of cvs also is homologous to the vp1 to vp4 capsid proteins of picornaviruses to the extent of a shared ppg amino acid motif in a relatively conserved 5' portion of the capsid gene(s), formation of capsomeres having polypeptide β-pleated sheets as a core structural element, and formation of a spherical virion capsid by the protein(s) (prasad et al., 1999). these findings led to the hypothesis that at some point in time cvs and picornaviruses were/are "recombination partners" (dinulos and matson, 1994). 
[figure caption fragment, beginning not recovered: ... 1), which lies 3' to the nonstructural genes, and is marked by a "ppg" motif that signals a relatively conserved region between the families. in picornaviruses, this order of nonstructural-structural "gene cassettes" is reversed, with the order of motifs within the nonstructural peptide the same, and about the same number of nucleotides from each other in the genome. (box shadings as in fig. 1.)] 
in a recent report, gibbs and weiller (1999) suggested from sequence analyses that cvs (rna genome, mostly in vertebrates) may have recombined with nanovirus (ss dna genome, plant virus) to generate (a) circovirus(es) (fig. 5; gibbs and weiller, 1999). ... nanovirus dna. that this set of steps occurred is suggested by significant sequence identities of two regions of circoviruses, one including the rep (ligation initiation) gene of nanoviruses and the other 2c-like sequences closest to those of cvs. however, a reverse-transcriptase initiation site is not known in the cv genome. the possibility that recombination occurred in invertebrates is not excluded, given the existence of viruses in insects with close sequence identity to cvs (govan et al., 2000). the differences in genome organization among cv genera imply different constraints on how rna recombination might have occurred. for example, if the different genera are derived from a single "parental" genomic structure, different events must have occurred to generate the diversity of genome structures exhibited by the different cv genera, and even within genera, for some genes are absent and others present. alternatively, if, as discussed above, cvs are "recombination partners" with (an)other virus family(ies), then "convergent evolution" might explain the shared genomic features of cvs, despite multiple "parental" genomic structures. recombinants extend our knowledge of the genetic diversity within cvs. they also place constraints on methods that "genotype" cvs. if the rna recombination of arg320 and smv is a common phenomenon, genotyping would be more difficult. for example, many reports of cv genotyping have been based upon sequence of the rna polymerase region, due to its relatively high sequence conservation and the relative ease of designing rt-pcr primers. in contrast, fewer capsid genes have been characterized (see also jiang, section iv, chapter 4 of this book). 
the viral capsid protein is responsible for virion antigenicity and probably for inducing immunity. genotyping of cvs based upon the rna polymerase sequences clearly is not the best choice if recombination at the orf1/capsid gene junction is common. in addition, it remains unclear whether additional recombination sites exist. recombination in nlvs at the orf1-orf2 junction has been described upon the characterization of this genomic region for relatively few strains. thus, one might discover other types of recombinants as more strains are characterized. one recombinant nlv, arg320, was first recovered from ill children and adults in argentina, the united states and the netherlands (jiang et al., 2000; m koopmans, personal communication). smv (morens et al., 1979) and many very similar strains have been recovered from outbreaks of gastroenteritis worldwide. many smv-like nlvs have been characterized at the genomic level only in the rna polymerase genome region. each of these strains is a potential smv-like recombinant, like the prototype, awaiting sufficient characterization of the capsid sequence to draw this conclusion. the possible widespread occurrence of recombinants in symptomatic persons suggests that they readily infect the host(s) and are easily transmitted, that recombination does not necessarily ablate virulence, and that recombinants are genetically and ecologically stable. perhaps the most striking feature of arg320 and smv is that they and their associated illness were otherwise unremarkable. their recombinant status was recognized only because their genomes were initially characterized in both the rna polymerase and capsid regions. also, the two potential parental strains for each of arg320 and smv are within the range of genetic diversity of strains currently co-circulating. therefore, the recombination event could have occurred recently, but not necessarily during the infection of the child from whom arg320 was recovered. 
on the other hand, it would not be difficult to imagine that many cv strains currently co-circulating could have derived from remote recombination events in the past. recombination may permit cvs to escape host immunity quickly, analogous to antigenic shifts in influenza viruses, but by a different molecular mechanism. recombination during calicivirus replication may be common or rare; it is possible to envision the generation of many non-viable or attenuated recombinants. the orf1 polyprotein genes could persist in virus with a new capsid selected for by the host's immunity. viable recombinants could be a model for laboratory manipulation of capsids (e.g., neill et al., 2000), including study of packaging constraints and antigenicity. in the two natural recombinants described above, orf3, encoding a minor structural protein, segregated with the capsid gene in the recombinants. whether recombination can occur with an orf3 derived from another strain is unknown. rna recombination apparently contributed to the evolution of cvs. nucleic acid sequence homology or identity and similar rna secondary structure of cvs and non-cvs may provide a locus for recombination within cvs or with non-cvs should co-infections of the same cell occur. natural recombinants have been demonstrated among other enteric viruses, including picornaviridae (kirkegaard and baltimore, 1986; furione et al., 1993), astroviridae (walter et al., 2001), and possibly rotaviruses (e.g., desselberger, 1996; suzuki et al., 1998), augmenting the natural diversity of these pathogens and complicating viral gastroenteritis prevention strategies based upon traditional vaccines. such is the case for cvs and astroviridae, whose recombinant strains may be a common portion of naturally circulating strains. 
the taxonomic --and perhaps biologic --limits of recombination are defined by the suggested recombination of nanovirus and cv, viruses from hosts of different biologic orders; the relationship of picornaviruses and cvs, viruses in different families, as recombination partners; and the intra-generic recombination between different clades of nlvs. a phylogenetic analysis of the caliciviruses the complete nucleotide sequence of a feline calicivirus genome rearrangements of rotaviruses recent developments with human caliciviruses polioviruses with natural recombinant genomes isolated from vaccine-associated paralytic poliomyelitis evidence that a plant virus switched hosts to infect a vertebrate and then recombined with a vertebrate-invecting virus analysis of the complete genome sequence of acute bee paralysis virus shows that it belongs to the novel group of insect-infecting rna viruses mixed genogroup srsv infection among a party of canoeists exposed to contaminated recreational water capsid protein diversity among norwalk-like viruses caliciviridae human calicivirus genogroup ii capsid sequence diversity revealed by analyses of the prototype snow mountain agent neutralizing features of commercially available feline calicivirus (fcv) vaccine immune sera against fcv field isolates sapporo-like human caliciviruses are genetically and antigenically diverse sequence and genomic organization of norwalk virus characterization of a novel human calicivirus that may be a naturally occurring recombinant diagnosis of human caliciviruses by use of enzyme immunoassays the mechanism of rna recombination in poliovirus the molecular biology of coronaviruses serological analysis of feline calicivirus isolates from the united states and united kingdom molecular characterization of a bovine enteric calicivirus: relationship to the norwalk-like viruses high-frequency rna recombination of murine coronaviruses enteric viral pathogens as causes of outbreaks of diarrhea among children 
attending day care centers during one year of observation a waterborne outbreak of gastroenteritis with secondary person-to-person spread. association with a viral agent genomic and subgenomic rnas of rabbit hemorrhagic disease virus are both protein-linked and packaged into particles new insights into the mechanisms of rna recombination nucleotide sequence of a region of the feline calicivirus genome which encodes picornavirus-like rna-dependent rna polymerase, cysteine protease and 2c polypeptides nucleotide sequence of the capsid protein gene of two serotypes of san miguel sea lion virus: identification of conserved and non-conserved amino acid sequences among calicivirus capsid proteins further characterization of the virus-specific rnas in feline calicivirus infected cells recovery and altered neutralization specificities of chimeric viruses containing capsid protein domain exchanges from antigenically distinct strains of feline calicivirus x-ray crystallographic structure of the norwalk virus capsid three-dimensional structure of the primate calicivirus molecular epidemiology of human calicivirus gastroenteritis outbreaks in hungary complete nucleotide sequence and genomic organization of a primate calicivirus virus cycles in aquatic mammals, poikilotherms, and invertebrates identification and genomic mapping of the orf3 and vpg proteins in feline calicivirus virions intragenic recombinations in rotaviruses molecular characterization of a novel recombinant strain of human astrovirus associated with gastroenteritis in children molecular mechanisms of variation in influenza viruses thank you to my friend xi jiang for a special 10-year collaboration. i thank tamas berke for continued insights. 
key: cord-017584-9rx4jlw8 authors: kim, kwangsoo; ryoo, hong seo title: selecting genotyping oligo probes via logical analysis of data date: 2007 journal: advances in artificial intelligence doi: 10.1007/978-3-540-72665-4_8 sha: doc_id: 17584 cord_uid: 9rx4jlw8 based on the general framework of logical analysis of data, we develop a probe design method for selecting short oligo probes for genotyping applications in this paper. when extensively tested on genomic sequences downloaded from the los alamos national laboratory and the national center for biotechnology information websites in various monospecific and polyspecific in silico experimental settings, the proposed probe design method selected a small number of oligo probes of length 7 or 8 nucleotides that perfectly classified all unseen testing sequences. these results illustrate well the utility of the proposed method in genotyping applications. a microarray or a dna chip is a small glass or silica surface bearing dna probes. probes are single stranded reverse transcribed mrnas, each located at a specific spot of the chip for hybridization with its watson-crick complementary sequence in a target to form the double helix [1]. microarrays currently use two forms of probes, namely, oligonucleotide (shortly, oligo) and cdna, and have prevalently been used in the analysis of gene expression levels, which measures the amount of gene expression in a cell by observing the hybridization of mrna to different probes, each targeting a specific gene. with the ability to identify a specific target in a biological sample, microarrays are also well-suited for detecting biological agents for genetic and chronic disease [2, 3, 4, 5]. furthermore, as viral pathogens can be detected at the molecular and genomic level much before the onset of physical symptoms in a patient, the microarray technology can be used for an early detection of patients infected with viral pathogens [6, 7, 8]. 
the success of microarrays depends on the quality of probes that are tethered on the chip. having an optimized set of probes is beneficial for two reasons. one, the background hybridization is minimized, hence true gene expression levels can be more accurately determined [9] . the other, as the number of oligos needed per gene is minimized, the cost of each microarray is minimized or the number of genes on each chip is increased, yielding oligo fingerprinting a much faster and more cost-efficient technique [9, 10] . short probes consisting of 15-25 nucleotides (nt) are used in genotyping applications [1] . having short optimal probes means a high genotyping accuracy in terms of both sensitivity and specificity [6, 9] and can play a key role in genotyping applications. from the perspective of numerical optimization, genomic data present an unprecedented challenge for supervised learning approaches for a number of reasons. first, genomic data are long sequences over the nucleic acid alphabet σ = {a,c,g,t}. second, for example, the complexity of viral flora, owing to constantly evolving viral serotypes, requires a supervised learning theory to be trained on a large collection of target and non-target samples. that is, a typical training set contains a large number of large-scale samples. third, a supervised learning framework usually requires a systematic pairing or differencing between each target and non-target samples during the course of training a decision rule [10, 11, 12, 13] . adding to these, the nature of data classification is difficult [14] . based on the general framework of logical analysis of data (lad) from [15] , we develop in this paper a probe design method for selecting short oligo probes of length l nt, where l ∈ [6, 10] . 
to list some advantages of selecting oligo probes by the proposed method, first, the method selects probes via sequential solution of a small number of compact set covering (sc) instances, which offers a great advantage from a computational point of view. to be more specific, consider classification of two types of data and suppose that a training set is comprised of $m^+$ target and $m^-$ non-target sequences. the size of the sc training instances solved by the proposed method is minimum of $m^+$ and $m^-$ orders of magnitude smaller than the optimization learning models used in [10, 11, 12]. second, the method uses the sequence information only and selects probes via optimization based on principles of probability and statistics. that is, the probability of an $l$-mer (oligo of length $l$) appearing in a single sequence by chance is $(0.25)^l$, hence the probability of an $l$-mer appearing in multiple samples of one type but in none or only a few of the sequences of the other type by chance alone is extremely small. third, the proposed method does not rely on any extra tool, such as blastn [16], a local sequence alignment search tool that is commonly used for probe selection [6, 8, 17], or the existence of pre-selected representative probes [6]. this makes the method truly stand-alone and free of problems that may possibly be caused by limitations associated with external factors. last, with an array of efficient (meta-)heuristic solution procedures for sc, the proposed method is readily implementable for an efficient selection of oligo probes. 
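the chance argument in the second advantage can be made concrete with a small calculation. this is a sketch under stated assumptions, not the authors' derivation: bases are taken as uniform and independent, $(0.25)^l$ is read as the probability of a given $l$-mer at one fixed position, overlapping windows and different sequences are treated as independent, and the function names and the example sizes (10,000 nt, 20 sequences per class) are hypothetical.

```python
def per_position_probability(l: int) -> float:
    """Probability that a fixed l-mer matches at one given position (uniform i.i.d. bases)."""
    return 0.25 ** l


def prob_occurs_somewhere(l: int, seq_len: int) -> float:
    """Approximate probability that a fixed l-mer occurs at least once in a
    random sequence of length seq_len (windows treated as independent)."""
    windows = seq_len - l + 1
    if windows <= 0:
        return 0.0
    return 1.0 - (1.0 - per_position_probability(l)) ** windows


def chance_perfect_split(l: int, seq_len: int, m_pos: int, m_neg: int) -> float:
    """Probability that a fixed l-mer occurs in all m_pos target sequences and in
    none of the m_neg non-target sequences purely by chance."""
    q = prob_occurs_somewhere(l, seq_len)
    return (q ** m_pos) * ((1.0 - q) ** m_neg)


for l in range(6, 11):
    print(l, per_position_probability(l), chance_perfect_split(l, 10_000, 20, 20))
```

under these assumptions a single occurrence in a long genome is not rare for short oligos, but a clean split across many target and non-target sequences is, which is the property the probe selection relies on.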
as for the organization of this paper, we develop an effective method for selecting short oligo probes in section 2 (for reasons of space, we omit proofs for the mathematical results in this section) and extensively test the proposed probe design method in various in silico genotyping experiments in section 3, using viral genomic sequences from the los alamos national laboratory and the national center for biotechnology information websites. the task of classifying more than two types of data can be accomplished by sequential classifications of two types of + and − data (see [18, 19, 20] and section 3 below). without loss of generality, therefore, we present the material below in the context of binary classification. the backbone of the proposed procedure is lad. a typical implementation of lad analyzes data on hand via four sequential stages of data binarization, support feature selection, pattern generation and classification rule formation. being a boolean logic-based method, lad first converts all non-binary data into equivalent binary observations. a + (−) 'pattern' in lad is defined as a conjunction of one or more binary attributes or their negations that distinguishes one or more + (−) type observations from all − (+) observations. the number of attributes used in a pattern is called the 'degree' of the pattern. as seen from the definition, patterns hold the structural information hidden in data. after patterns are generated, they are aggregated into a partially-defined boolean discriminant function/rule to generalize the discovered knowledge to classify new observations. referring readers to [13, 15, 21] for more background in lad, we design a lad-based method below for efficiently analyzing large-scale genomic data. let there be $m^+$ and $m^-$ sample observations of type + (target) and − (non-target), respectively. 
a dna sequence is a sequence of nucleic acids a, c, g and t, so the training sequences need to be converted into boolean sequences of 0 and 1 before lad can be applied. toward this end, we first choose an integer value for $l$, usually $l \in [6, 10]$ (see section 3), generate all $4^l$ possible $l$-mers over the four nucleic acid letters and then number them consecutively from 1 to $4^l$ by a mapping scheme. next, each $l$-mer is selected in turn and every training sample is fingerprinted with the oligo for its presence or absence. that is, with oligo $j$, we scan each sequence $p_i$, $i \in S^+ \cup S^-$, from the beginning of the sequence, shifting to the right by one base at a time, and stamp $p_{ij} = 1$ if oligo $j$ is present in sequence $i$, and $p_{ij} = 0$ otherwise. after this, the oligos that appear in all or none of the training sequences can be deleted from further consideration. we re-number the surviving $l$-mers consecutively from 1 to $n$ and replace the original training sequences described in the nucleic acid alphabet by their boolean representations. let $N = \{1, \ldots, n\}$. the data are now described by $n$ attributes $a_j \in \{0, 1\}$, $j \in N$. for observation $p_i$, $i \in S^\bullet$, $\bullet \in \{+, -\}$, let $p_{ij}$ denote the binary value the $j$-th attribute takes in this observation. denote by $l_j$ the literal of binary attribute $a_j$. then, $l_j = a_j$ ($l_j = \bar{a}_j$) instructs to take (negate) the value of $a_j$ in all sequences. a term $t$ is a conjunction of literals. given a term $t$, let $N_t \subseteq N$ denote the index set of the literals included in the term. then, note here that $N_t$ of a $\bullet$ pattern identifies probes that collectively distinguish one or more $\bullet$ sequences from the sequences of the other type. let us introduce $n$ additional features $a_{n+j}$, $j \in N$, and use $a_{n+j}$ to negate $a_j$. let $N = \{1, \ldots, 2n\}$ and let us introduce a binary decision variable $x_j$ for $a_j$, $j \in N$, to determine whether to include $l_j$ in a pattern. 
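the binarization stage described above can be sketched as follows. this is an illustrative implementation, not the authors' code: the function names are hypothetical, the $l$-mers are numbered in lexicographic order over "ACGT" as one possible mapping scheme, and enumerating all $4^l$ oligos explicitly is only practical for small $l$.

```python
from itertools import product


def all_lmers(l: int) -> list:
    """All 4**l oligos of length l, in lexicographic order over A, C, G, T."""
    return ["".join(p) for p in product("ACGT", repeat=l)]


def fingerprint(sequences: list, l: int):
    """Return (lmers, rows) where rows[i][j] = 1 if l-mer j occurs in sequence i."""
    lmers = all_lmers(l)
    rows = []
    for seq in sequences:
        # Slide a window of width l one base at a time and record what is seen.
        present = {seq[i:i + l] for i in range(len(seq) - l + 1)}
        rows.append([1 if m in present else 0 for m in lmers])
    return lmers, rows


def drop_uninformative(lmers: list, rows: list):
    """Remove l-mers present in all or in none of the sequences, then re-index."""
    keep = [j for j in range(len(lmers))
            if 0 < sum(r[j] for r in rows) < len(rows)]
    return [lmers[j] for j in keep], [[r[j] for j in keep] for r in rows]


# Toy example with two short sequences and l = 2.
lmers, rows = fingerprint(["ACGT", "ACGA"], 2)
kept, reduced = drop_uninformative(lmers, rows)
print(kept, reduced)
```

the surviving columns play the role of the attributes $a_j$, $j = 1, \ldots, n$; appending the complement of each column gives the $n$ additional negated features.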
[15] formulated a compact mixed integer and linear programming (milp) model below with respect to a reference sample consider the following. we note here that genomic data are large-scale in nature. furthermore, owing to constantly evolving viral serotypes, the complexity of viral flora is high, and this requires large numbers of target and non-target viral samples to be used for selecting optimal genotyping probes. adding to these the difficulties associated with numerical solution of milp, we see that (milp-2.i • ) above presents no practical way of selecting genotyping probes. with the need to develop a more efficient pattern generation scheme, we select a reference sequence for k ∈ s• and j ∈ n. next, we set for l ∈ s • and j ∈ n. now, consider the set covering model where c j (j ∈ n ) are positive real numbers. let (x, y) denote a feasible solution of (sc • i ). then, forms a • lad pattern. although smaller than the milp counterpart by only one constraint and one integer variable, (sc • i ) has a much simpler structure and is defined only in terms of 0-1 variables. in addition, it can exploit any of sc heuristic procedures developed so far (see, for example, [22] and references therein) for its efficient solution, hence is much preferred. note that (sc • i ) is defined by m + + m − − 1 cover inequalities and n + m • − 1 binary variables. also, recall that n is large for genomic sequences and the analysis of viral sequences requires large numbers of target and non-target sequences, that is, m + and m − are also large numbers. to develop a more compact sc-based probe selection model, we select a reference sequence p i , i ∈ s • , • ∈ {+, −}, and set the values of a (i,k) j for k ∈ s• and j ∈ n via (1). consider the following sc model: where c j 's are positive reals. theorem 2. let x denote a feasible solution of (sc-pg • i ). then, p generated on x via (2) forms a • lad pattern. below, we use (sc-pg • i ) to design one simple oligo probe selection procedure. 
let p^• denote the set of • patterns generated so far. in this section, we extensively test the proposed probe design method for the classification of viral disease agents in an in silico setting, using genomic sequences obtained from the los alamos national laboratory (lanl) and the national center for biotechnology information (ncbi). table 1 summarizes the number and the length (the minimum, average ± 1 standard deviation and maximum lengths) of each type of the genomic data that were used in our experiments. in analyzing data in an experiment, we first decided on the oligo length by calculating the smallest integer l such that 4^l is larger than or equal to the average of the lengths of the target and non-target sequences of the experiment. then, 4^l candidate oligos were generated to fingerprint and binarize the data. note here that if a constraint in (sc-pg^•_i) has all zero coefficients, then the sc instance has no feasible solution. this case arises when the reference sequence p_i, i ∈ s^•, and a sequence p_j of the other type have identical 0-1 fingerprints, which is a contradiction. supervised learning methodologies, including lad, presume the existence of a classification function, that is, that each unique sequence in the training set belongs to exactly one of the two classes. when the data under analysis are indeed contradiction-free, contradiction-free 0-1 clones of the data can always be obtained by using longer oligos for data fingerprinting and binarization. therefore, whenever identical fingerprints were generated for data of different types, we incremented the value of l by 1 and repeated the data binarization stage until the binary representations of the data became contradiction-free. next, procedure sc-pg was applied to generate patterns, hence probes. in applying procedure sc-pg in these in silico experiments, we selected a minimal set of oligo probes by setting c_j = 1 for all j ∈ N.
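the rule used above for picking the initial oligo length can be written out directly. the sketch below is illustrative only; the function name `min_oligo_length` and the example lengths are hypothetical, not values taken from table 1.

```python
def min_oligo_length(average_length):
    """Smallest integer l such that 4**l >= average_length."""
    l = 1
    while 4 ** l < average_length:
        l += 1
    return l


# hypothetical average sequence lengths, in bases
print(min_oligo_length(1500))   # 6, since 4**5 = 1024 < 1500 <= 4**6 = 4096
print(min_oligo_length(29000))  # 8, since 4**7 = 16384 < 29000 <= 4**8 = 65536
```

if the resulting fingerprints still contain a contradiction (a target and a non-target sequence with identical 0-1 rows), the text increments l by 1 and binarizes again.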
for solving the unicost (sc-pg^•_i) instances generated, we used the textbook greedy heuristic [23] for ease of implementation. denote by p^+_1, . . . , p^+_(n+) and p^−_1, . . . , p^−_(n−) the positive and negative patterns, respectively, generated via procedure sc-pg. in classifying unseen + (target) and − (non-target) sequences, we use three decision rules. specifically, for the polyspecific genotyping experiments (in section 3.1 and experiments 2 and 3 in section 3.2), we form the standard lad classification rule δ [13], referred to as rule (3), where ω^•_i denotes the number of • training sequences covered by p^•_i. we assign class + (−) to a new sequence p if δ(p) > 0 (δ(p) < 0). we fail to classify sequence p if δ(p) = 0. for monospecific genotyping in experiment 1 in section 3.2, we form a decision rule δ_k, referred to as rule (4), where p^k_1, . . . , p^k_(n_k) are the probe(s) selected for virus (sub-)type k, and assign p to class k if δ_k(p) > 0 while δ_i(p) = 0 for all i = 1, . . . , m, i ≠ k. when δ_k(p) > 0 for two or more virus types or δ_k(p) = 0 for all k, we fail to assign a class to sequence p. in each of the experiments in this section, we tested the proposed oligo probe selection method in 20 independent hold-out experiments, each with a randomly selected 90% of the target and of the non-target data forming a training set of sequences and the remaining 10% of the target and of the non-target sequences forming the testing data. more specifically, after a training set of data was formed, we binarized the training data and selected optimal oligo probes on them via procedure sc-pg. next, a classification rule was formed by one of (3), (4) and (5) above and then used for classifying the corresponding testing sequences. these steps were repeated 20 times to obtain the average testing performance and other relevant information of the experiment. the computational platform used for these experiments was an intel 2.66ghz pentium linux pc with 512mb of memory.
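as an illustration of the greedy heuristic mentioned above, the following sketch builds one pattern for a single reference sequence by solving a unicost set covering instance greedily. it is a simplified reading of (sc-pg^•_i), not the authors' code: equation (1) is not reproduced in this text, so the sketch assumes that attribute j covers the constraint for sequence k of the other type whenever the two 0-1 fingerprints differ in position j. the function name `greedy_pattern` and the toy data are hypothetical.

```python
def greedy_pattern(reference, others):
    """Greedily choose attribute indices that separate `reference`
    from every 0/1 row in `others` (unicost set covering)."""
    n = len(reference)
    # covers[j] = rows of the other type that attribute j tells apart from the reference
    covers = [{k for k, row in enumerate(others) if row[j] != reference[j]}
              for j in range(n)]
    uncovered = set(range(len(others)))
    chosen = []
    while uncovered:
        # pick the attribute covering the most still-uncovered rows
        best = max(range(n), key=lambda j: len(covers[j] & uncovered))
        gain = covers[best] & uncovered
        if not gain:
            # some row has the same fingerprint as the reference: a contradiction
            raise ValueError("reference is indistinguishable from a row in `others`")
        chosen.append(best)
        uncovered -= gain
    return chosen


# toy example: one target fingerprint against three non-target fingerprints
reference = [1, 0, 1]
others = [[0, 0, 1], [1, 1, 1], [0, 1, 0]]
print(greedy_pattern(reference, others))  # [0, 1]
```

the chosen indices correspond to the literals of the pattern: the reference value of each selected attribute (taken as is, or negated when the reference has a 0) must hold for a sequence to be covered. the `ValueError` branch mirrors the infeasible case described in the text, where a constraint has all zero coefficients.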
(table 1, surviving row: types (4, 5, 6, 7, 8, 9); 64 sequences; minimum length 1341; average length 1434±25; maximum length 1467.)

the infection with hpv is the main cause of cervical cancer, the second most common cancer in women worldwide [24, 25]. there are more than 80 identified types of hpv and the genital hpv types are subdivided into high and low risk types: low risk hpv types are responsible for most common sexually transmitted viral infections while high risk hpv types are a crucial etiological factor for the development of cervical cancer [26]. we applied the proposed probe design method to the 72 hpv sequences downloaded from lanl with their classification found in table 3 of [27]. the selected probes were used to form a decision rule by (3) and tested for their classification capability. results from this polyspecific probe selection experiment are provided in table 2. in this table and also in the table found in the following subsection, the target (+) and the non-target (−) virus types of the experiments are first specified. then, the tables provide two pieces of information on the candidate oligos, namely, the length l and the average and the standard deviation of the number of features generated and used in the 20 runs of each experiment for data binarization and for pattern generation. provided next in the tables is the information on the number of probes selected in the format 'the average ± 1 standard deviation' and information on the lad patterns generated. finally, the testing performance of the probes selected is provided in the last column of the tables, summarized in the format 'the average ± 1 standard deviation' of the percentage of the correct classifications of the unseen sequences. briefly summarizing, the proposed probe design method selected probes on the hpv data in a few cpu seconds that tested 90.6% accurate in classifying the unseen hpv samples. for comparison, the same hpv dataset was used in [2] and [27] for the classification of hpv by high and low risk types.
in brief, the probe design methods of [2] and [27] required several cpu hours of computation and selected probes that obtained 85.6% and 81.1% correct classification rates, respectively. before moving on, we note that the sequences belonging to the target and the non-target groups in this experiment all have different hpv subtypes (see table 3 in [27]). the combination of all target and non-target sequences being different from one another and the presence of noise in the data (the classification errors) gave rise to selecting a relatively large number of polyspecific probes in this experiment. the proposed probe design method was tested on genomic viral sequences from ncbi for selecting monospecific and polyspecific probes for screening for sars and ai in a number of different binary and multicategory experimental settings and performed superbly on all counts. we describe individual experiments below and summarize results from these experiments in table 3.

(table 3 notes. *: in format average ± standard deviation. †: percentage of correct classifications of testing/unseen data.)

experiment 1. sars virus vs. coronavirus. sars virus is phylogenetically most closely related to group 2 coronavirus [28]. 105 sars sequences and 39 coronavirus samples were used to select 1 monospecific probe for screening for sars. used in classification rule (4), the sars probe and one probe selected for coronavirus together perfectly classified all testing sequences.

experiment 2. this experiment simulates a sars pandemic where suspected patients with sars-like symptoms are screened for the disease. we used the 105 sars virus sequences and 108 samples of other influenza virus types (the 'other virus' in table 1) in this experiment and selected polyspecific probes. used in classification rule (3), these probes collectively classified all testing sequences correctly.

experiment 3.
classification of lethal ai virus h5 & h9 and other influenza virus h subtypes. ai virus h5 and h9 subtypes cause a most fatal form of the disease [29], and they were separated from the other h subtypes of influenza virus in this experiment. 241 h5 and h9 target sequences and 1010 other h subtype sequences were used to select polyspecific probes for detecting ai virus h5 and h9 subtypes from the rest. in classification rule (3), the selected probes collectively classified all testing sequences correctly.

experiment 4. the statement "monospecific neuraminidase (na) subtype probes were insufficiently diverse to allow confident na subtype assignment" from [6] motivated us to design this experiment on multicategory and monospecific classification of influenza virus by n subtypes. we used the three influenza virus n subtypes with 30 or more samples in table 1 and selected monospecific probes for their classification. tested in classification rule (5), the selected probes performed perfectly in classifying all testing sequences. note that only a small number of monospecific probes were selected and proved 'needed' in this experiment.
trends in microarray analysis genetic mining of dna sequence structures for effective classification of the risk types of human papillomavirus (hpv) discovery and analysis of inflammatory disease-related genes using cdna microarrays possibility of using dna chip technology for diagnosis of human papillomavirus classification of multiple cancer types by multicategory support vector machines using gene expression data molecular detection and identification of influenza viruses by oligonucleotide microarray hybridization dna-chip technology and infectious diseases microarray-based detection and genotyping of viral pathogens selection of optimal dna oligos for gene expression arrays probe selection algorithms with applications in the analysis of microbial communities fast large scale oligonucleotide selection using the longest common factor approach optimal robust non-unique probe selection using integer linear programming an implementation of logical analysis of data on the complexity of polyhedral separability milp approach to pattern generation in logical analysis of data basic local alignment search tool selection of oligonucleotide probes for protein coding sequences support vector networks pattern recognition techniques. 
crane statistical learning theory partially defined boolean functions and cause-effect relationships a heuristic method for the set covering problem integer and combinatorial optimization for the international agency for research on cancer multicenter cervical cancer study group: epidemiologic classification of human papillomavirus types associated with cervical cancer the causal relation between human papillomavirus and cervical cancer the role of human papillomavirus in screening for cervical cancer classification of the risk types of human papillomavirus by decision trees unique and conserved features of genome and proteome of sars-coronavirus, an early split-off from the coronavirus group 2 lineage transmission of h7n7 avian influenza a virus to human beings during a large outbreak in commercial poultry farms in the netherlands key: cord-255194-4i9fc0r7 authors: djikeng, appolinaire; halpin, rebecca; kuzmickas, ryan; depasse, jay; feldblyum, jeremy; sengamalay, naomi; afonso, claudio; zhang, xinsheng; anderson, norman g; ghedin, elodie; spiro, david j title: viral genome sequencing by random priming methods date: 2008-01-07 journal: bmc genomics doi: 10.1186/1471-2164-9-5 sha: doc_id: 255194 cord_uid: 4i9fc0r7 background: most emerging health threats are of zoonotic origin. for the overwhelming majority, their causative agents are rna viruses which include but are not limited to hiv, influenza, sars, ebola, dengue, and hantavirus. of increasing importance therefore is a better understanding of global viral diversity to enable better surveillance and prediction of pandemic threats; this will require rapid and flexible methods for complete viral genome sequencing. results: we have adapted the sispa methodology [1-3] to genome sequencing of rna and dna viruses. 
we have demonstrated the utility of the method on various types and sources of viruses, obtaining near complete genome sequence of viruses ranging in size from 3,000–15,000 bp with a median depth of coverage of 14.33. we used this technique to generate full viral genome sequence in the presence of host contaminants, using viral preparations from cell culture supernatant, allantoic fluid and fecal matter. conclusion: the method described is of great utility in generating whole genome assemblies for viruses with little or no available sequence information, viruses from greatly divergent families, previously uncharacterized viruses, or to more fully describe mixed viral infections. the emergence of highly pathogenic viral agents from zoonotic reservoirs has energized a wave of research into viral ecology, viral discovery [4] [5] [6] [7] and a parallel drive to develop large datasets of complete viral genomes for the study of viral evolution and pandemic prediction [8, 9]. viral discovery has been aided by the development of sequence independent methodologies for the generation of genomic data [10]. the most prominent of these methodologies include representational difference analysis (rda) and sequence independent single primer amplification (sispa) with several variations. the sispa method, first developed by reyes and kim [11], entails the directional ligation of an asymmetric primer at either end of a blunt-ended dna molecule. following several cycles of denaturation, annealing and amplification, minute amounts of the initial dna are enriched and then cloned, sequenced and analyzed. several modifications of the sispa method have so far been implemented including random-pcr (rpcr) [12]. the rpcr method combines reverse transcription primed with an oligonucleotide made up of random hexamers tagged with a known sequence which is subsequently used as a primer-binding extension sequence.
this initial modification was first used to construct a whole cdna library from low amounts of viral rna. a more recent modification, the dnase-sispa technique [1, 2, 5] , includes steps to detect both rna and dna sequences. combining sample filtration through a 0.22 micrometer column and a dnase i digestion step led to the identification of viruses from clinical samples. the dnase-sispa technique has been used for the detection of novel bovine and human viruses from screens of clinical samples [1, 2, 13] . other groups have used the protocol for the characterization of common epitopes in enterovirus [14] , for the identification of a novel human coronavirus [15] and for viral discovery in the plasma of hiv infected patients [16] . in addition to its utility for viral discovery and viral surveillance, the dnase-sispa method has utility in obtaining full genome sequence from uncharacterized viral isolates or viral isolates from highly divergent families. in this study, we demonstrate the utility of the sispa method and its use as a rapid and cost effective method for generating full genome coverage of a wide range of viral types from several sources. optimization of the sispa method for whole genome sequencing given the success of earlier efforts for the identification of novel viral nucleic acids using sispa, we sought to adapt and optimize this method as a general and cost effective technique for large scale de novo viral genome sequencing (figures 1 and 2 ). an rnase treatment step was added to the sispa protocol to reduce contaminating exogenous rnas such as ribosomal rnas. in the case of polya-tailed viruses, we perform reverse transcription using a combination of random (fr26rv-n) and poly t tagged (fr40rv-t) primers in order to increase the coverage of the 3' end ( figure 2 ). 
additionally, in order to capture 5' ends of viral rna, a random hexamer primer tagged with a conserved sequence at the 5' end was added to the klenow reaction (figure 2 shows a 5' oligo specific for rhinoviruses). we have successfully used the sispa method on viral samples from different viral types. in this paper we discuss seven representative samples (table 1). we have found that the method works consistently on dsdna, ssdna, ssrna positive and ssrna negative viruses. we have also found that the method can result in complete genome sequence of viruses ranging in size from 3,000-15,000 bp in a single experimental procedure. figure 3 shows the sequence coverage obtained for three viruses: positive ssrna phage ms2, positive ssrna rhinovirus and negative ssrna newcastle disease virus (ndv). figure 4a shows an analysis of sequence coverage for the viruses examined in this study. on average, four contigs were generated per experiment, ranging in size from 248 nt to 4495 nt with a median contig size of 1395 nt. the contigs had high sequence redundancy, with a median depth of coverage of 14.33, varying from 11.18 for turkey astrovirus (ta) to a high of 40.29 for ms2. one parameter that is taken into consideration when designing an efficient protocol for construction of a sequence library is the number of independent colonies needed to obtain sequence coverage of a given reference genome. experiments were conducted using m13 (a 6 kb genome), ndv (a 15 kb genome), and lambda phage (48.5 kb) to compare the level of coverage obtained by bidirectional sequencing of 96, 192, and 288 clones (figure 4b). for m13, 94% genome coverage was achieved from sequencing one 96 well block of clones, and 97% genome coverage was obtained from two 96 well blocks. for ndv, 89.7%, 97.4% and 97.7% sequence coverage were obtained from one, two or three 96 well blocks respectively.
in contrast to m13 and ndv, the coverage for lambda was 26.7%, 42.9%, and 52.4% after one, two and three 96 well blocks were sequenced. the efficiency of the sispa method as a tool for obtaining full genome coverage was analyzed using the lander and waterman model [17], which estimates the number of gaps present as a function of sequence number and genome size. table 2 compares the expected coverage and redundancy (depth of coverage) as predicted by the lander-waterman model with the observed genome coverage and redundancy. with the exception of lambda phage, observed coverage and redundancy approach expected coverage and redundancy. however, when taking into account the scaled difference, as described by wendl [18], we see a dramatically increased "shortfall" between actual and expected coverage as more clones are sequenced. for example, in the case of ndv, which has a genome size of 15 kb, the scaled difference d between the expected coverage and the observed coverage (see equation description in methods section) at the different levels of sequence redundancy is 48.3 for the sequencing of a plate of 96 clones, 464.4 for two plates and 5477.4 for three plates. the sispa method works efficiently on viruses purified from a number of sources and by several methods. enterobacteriophages m13, ms2, and lambda were isolated from bacterial growth media and plasma after concentration by density gradient centrifugation. woodchuck hepatitis virus was purified from plasma by cesium chloride gradient centrifugation. human rhinovirus 16, purchased as a cell culture supernatant from atcc, was subjected to a low speed spin to remove cellular debris. turkey astrovirus was isolated from fecal material collected from turkey poults showing clinical signs of diarrhea. the intestinal fecal content was diluted in pbs and centrifuged at 14,000 k before filtration and nuclease treatment. newcastle disease virus rna was purified from allantoic fluids derived from inoculated eggs.
to determine the number of viral particles necessary to generate full genome sequences, we conducted dilution series with viruses whose titer was determined by plaque assays. the results of these experiments demonstrate that the sispa method is very efficient as a genome sequencing method for samples with greater than 10^6 viral particles per rt-pcr reaction (figure 5). below 10^6 particles, the specific viral signal is overwhelmed by competition with non-specific or host sequences and is rarely detected from sequencing two blocks (192) of colonies.

figure 1. overview of the strategy. viral particles are separated from host contaminants using centrifugation and filtration. viral particles are treated with dnase i to remove contaminating nucleic acids. random priming is used to generate 500-1000 bp amplicons which are size-selected and cloned. colonies are picked and sequenced. sequence is trimmed and assembled. contigs are closed using sequence-specific primers. (panel labels: construction of a library of amplicons and sequencing of randomly selected clones; sequence analysis and reconstruction of full or partial genome sequences.)

our initial results indicated low sequence coverage at the 3' and 5' ends of most viral genomes. in order to address this problem in viruses with polya tails the fr40rv-t primer (figure 2) is added to the rt reaction. this increases the number of cdnas produced at the 3' end of the genome, and results in a much greater depth of coverage at the 3' end. the polyt containing primer is added to the rt reaction at a concentration 200 fold lower than the random primer in order to reduce competition with the random primer. we used human rhinoviruses to develop the methodology for improving the coverage of the 5' end. we took advantage of a conserved region from nucleotide 1 to nucleotide 10 in the 5' untranslated region.
the conserved primer was used in the klenow step of the sispa protocol to enrich for the presence of amplicons from the 5' end. when used in combination with the 3' primer, we have been able to obtain full rhinovirus genome coverage in a 192 clone experiment (data not shown). one inherent difficulty of a method that relies on random reverse transcription and pcr to generate amplicons for sequencing is the likelihood of detecting contaminant sequences as well as sequences of interest. although filtration and nuclease treatment do reduce the presence of nucleic acids from whole cells and host chromosomes, contaminating rna species will inevitably remain and thus be amplified (table 3). to determine the presence of contaminant sequences in the clone population, all generated sequences were subjected to a blastn search against the ncbi (non-redundant) database. a cutoff e value of 10^-25 was used to identify viral sequences which matched the reference genome. non-specific sequences (i.e., those that did not match the input viral isolate) were identified as mammalian, avian, bacterial, etc., if their best hit was below a cutoff value of 10^-10. if no blast results were found below the 10^-10 cutoff value, the sequences were not given a specific designation. in experiments resulting in nearly complete genome sequences, contaminant sequences ranged from 3-40%. the nature of the contaminant sequence depended on the initial viral host and included mammalian, avian, bacterial, fungal, viral and unknown sequences. in the case of rhinoviruses, which were purified from hela cell culture, the majority of contaminant was derived from human or mycobacterial nucleic acids. newcastle (ndv) and astrovirus (ta), which were purified from chicken egg allantoic fluid and turkey feces, respectively, were contaminated primarily with nucleic acids of avian origin. table 3 shows the results of blast analyses of two samples, ta and hrv16.

figure 2. outline of the sispa method.
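the two-threshold labelling of reads described above can be summarized as a small decision function. this is only a sketch of the logic stated in the text: the function name `classify_read`, its arguments and the example e values are hypothetical, and parsing of actual blastn output is deliberately left out.

```python
VIRAL_CUTOFF = 1e-25        # e value cutoff for a match to the reference genome
CONTAMINANT_CUTOFF = 1e-10  # e value cutoff for naming a non-specific source


def classify_read(reference_evalue=None, best_hit_evalue=None, best_hit_source=None):
    """Label one read from its blastn e values.

    reference_evalue: e value of the hit to the input virus, or None if no hit.
    best_hit_evalue / best_hit_source: e value and source category
    (e.g. 'avian') of the best database hit, or None if no hit.
    """
    if reference_evalue is not None and reference_evalue < VIRAL_CUTOFF:
        return "viral"
    if best_hit_evalue is not None and best_hit_evalue < CONTAMINANT_CUTOFF:
        return best_hit_source
    return "none"  # no hit below the weaker cutoff: no designation given


print(classify_read(reference_evalue=1e-40))                                  # viral
print(classify_read(best_hit_evalue=1e-15, best_hit_source="avian"))          # avian
print(classify_read(reference_evalue=1e-5, best_hit_evalue=1e-3,
                    best_hit_source="bacterial"))                             # none
```

the strict comparison (`<`) follows the wording "below a cutoff value"; whether a hit exactly at the cutoff should count is not specified in the text.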
the work presented here demonstrates the utility of the random genome sequencing method for the generation of viral sequence from positive strand ssrna (human rhinovirus, turkey astrovirus) and negative strand ssrna viruses (newcastle disease virus), ssdna (enterobacteriophage m13) and dsdna viruses (woodchuck hepatitis virus and lambda phage). in addition, using the dnase i-sispa technique we were able to amplify sufficient target material for sequencing from various sources, including cell culture isolates and field isolates which have not been purified by ultracentrifugation. although ultracentrifugation is an efficient procedure to purify viruses, it is not practical for processing samples of relatively low viral titer in a small volume or for high throughput processing of viral samples for genomic sequencing. genome coverage and redundancy for viral samples from 3-15 kb approach the ideal values as predicted by the lander-waterman model [17]. however, as the sequence number increases, the efficiency of the method as measured by the scaled difference [18] decreases dramatically. thus, while the number of gaps declines as more clones are sequenced, the efficiency is reduced (i.e. there is more 'loss'). remaining gaps and areas of 1× coverage may be due to regions of secondary structure, hydrolysis of the rna template or cloning bias. additionally, at-rich regions may inhibit the annealing of random primers during the rt, klenow or pcr step. we routinely pick a total of 192 clones (or two 96 well blocks) per viral sample for bidirectional sequencing as this represents the most affordable sequence coverage to efficiency ratio. while significant coverage is obtained from a single experiment, final genome assembly requires varying levels of targeted rt-pcrs to close the genome (figure 3). the 3' end of the virus generally has the lowest coverage in any use of this protocol.
in theory, given the directionality of the reverse transcriptase (3' to 5') and assuming an equal distribution of binding sites for the random primer, the 5' end of any viral genome will get higher depth of sequence coverage than the 3' end. we have found that addition of a tagged oligo dt primer significantly reduces this problem for viruses with polya tails (most positive ssrna viruses), but this remains a limitation for other virus genome types. the 5' end of most viruses has also proved difficult to complete and we have found that the addition to the rt reaction of degenerate oligos based on conserved 5' sequences can increase coverage. however, we have not been able to develop a universally applicable method for obtaining complete 5' coverage. we strongly anticipate that specific adaptations of the sispa method to conserved regions of different viruses will demonstrate its versatility in a wide range of viral genome sequencing initiatives. limitations to the method include the need for samples containing a minimum of 10^6 particles (in the original 1 ml or 0.2 ml samples). moreover, because the capsid structure renders the viral genomes nuclease-resistant, this protocol requires encapsidated viral genomes to allow the removal of most extra-viral contaminants. the viral nucleic acids in samples whose capsid structures have been disrupted cannot be separated from contaminants, and therefore cannot be efficiently amplified by sispa. in the experiments discussed in this paper dnase i was used to reduce host contaminant. for samples with high levels of host nucleic acid contaminant, we have used 5 µg of rnase a to treat 500 µl of filtered virus for 1 hr. we have found that rnase a treatment eliminates the majority of host rna derived sequence contaminant in these cases. the sispa method is particularly useful for obtaining genome sequence from rna viruses.
because most sequencing methods for rna viruses depend on rt-pcr with primers designed from pre-existing sequence data, the utility of this protocol is particularly evident for highly variable or degenerate viral families or for viruses with little available sequence information. in addition, the sispa method will be useful for uncharacterized viruses as no prior sequence information is required.

figure 5. relationship between initial virus particle number, genome coverage and percent non-specific sequences generated by sispa. ms2 viruses were diluted to 10^8, 10^6, 10^4, and 10^2 particles per sispa dnase i reaction. the sum of the total lengths of edited contigs for each dilution was calculated as percent of the total reference genome length. non-specific sequences were determined as those sequences which did not match the reference genome with a cutoff value less than 10^-25.

table 2 note. observed coverage and redundancy were compared with the expected coverage and redundancy as predicted by the lander-waterman model for the total number of sequences in each assembly.

table 3 note. sequences were analyzed against a non redundant database using a blastn algorithm. viral specific sequences were identified as matching the reference genome with a blastn cutoff below 10^-25. non-specific (non-viral contaminant) sequences were identified if they had a cutoff value below 10^-10, while none means that no blastn results were found below the 10^-10 cutoff value.

viral rna and dna were prepared following the guidelines provided by [1, 2]. the extracted rna was processed for random reverse transcription as previously described [1, 2] using the fr26rv-n primer (5' gcc gga gct ctg cag ata tcn nnn nn 3') at a concentration of 1 µm.
in addition, fr40rv-t (5' gcc gga gct ctg cag ata tc (t)20 3') was added at a concentration of 5 nm to specifically amplify the 3' end of positive strand viruses. after the first cdna synthesis, the double stranded cdna was synthesized by klenow reaction in the presence of random primers. in order to amplify 5' ends of rhinoviruses the following primer was added to the klenow reaction at a concentration of 10-20 nm (5' gcc gga gct ctg cag ata tc tta aaa ctg g 3'). pcr amplification used high fidelity taq gold dna polymerase (abi) with the fr20rv primer (5' gcc gga gct ctg cag ata tc 3'). pcr amplicons were a-tailed with datp and 5 units of low fidelity dna polymerase (invitrogen) at 72°c for 30 minutes. a-tailed pcr amplicons were analyzed in a 1% agarose gel and fragments between 500 and 1000 nt were gel purified. amplicons were ligated en masse into the topo ta cloning vector (invitrogen) and transformed into competent one shot topo top 10 bacterial cells (invitrogen). for dna viruses, the purified viral dna was denatured and complementary strands synthesized by klenow reaction as indicated for ds-cdna from first strand cdna. clones were plated on lb/amp/xgal agar, and individual colonies were picked for sequencing. the clones were sequenced bidirectionally using the m13 primers from the topo ta vector. we routinely sequenced a total of 192 or more clones per library. sequencing reactions were performed at the joint technology center (an affiliate of the j craig venter institute: jcvi) on an abi 3730 xl sequencing system using big dye terminator chemistry (applied biosystems). in the lander and waterman analysis of genome coverage [17], g = size (bp) of the reference genome, l = sequence length (bp) and n = number of sequences; redundancy represents the depth of sequence coverage and coverage represents the fraction of the genome covered by sequence data. the ideal redundancy is r = (l × n)/g and the ideal coverage is 1 − e^(−r) [17]. observed coverage = sum of the length of all contigs/g.
observed redundancy = the average of total sequence length (length of all sequence reads in a contig including gaps)/contig length. both observed coverage and observed redundancy are experimentally derived values. the average sequence read size for the experiments described was 507.83 +/- 47.16 bp. the loss of coverage due to various biases is represented as the difference between the ideal coverage and the actual coverage. to allow quantitative comparison, this 'shortfall' difference is scaled by the standard deviation of the coverage probability distribution as given by wendl [18]. following wendl, we use the moments of the vacancy (which is the complement of the coverage) to calculate the standard deviation. where α is the ratio of the read length and the genome length and n is the number of reads. the second moment is given as: where ρ, the redundancy, is defined to be equal to nα. the expression for the variance is then: the standard deviation is then: the ideal coverage is given by: using the standard deviation for the vacancy in place of that for the coverage, the correctly-scaled difference d between ideal coverage and the actual coverage a is: note that for large n the mean vacancy converges to exp(-ρ), allowing the following simplified approximation of s: sequence reads were trimmed to remove amplicon primer sequence as well as low quality sequence, and assembled.
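the lander-waterman quantities defined above (ideal redundancy r = ln/g, ideal coverage = 1 - e^(-r), and observed coverage as summed contig length over g) can be sketched in a few lines of python. this is only an illustrative sketch: the function names and the numeric values in the usage lines are hypothetical and are not taken from the study.

```python
import math


def ideal_redundancy(read_length_bp, n_reads, genome_size_bp):
    """Ideal redundancy R = L * N / G (mean depth of sequence coverage)."""
    return read_length_bp * n_reads / genome_size_bp


def ideal_coverage(redundancy):
    """Ideal fraction of the genome covered: 1 - exp(-R)."""
    return 1.0 - math.exp(-redundancy)


def observed_coverage(contig_lengths_bp, genome_size_bp):
    """Observed coverage: summed contig length divided by genome size."""
    return sum(contig_lengths_bp) / genome_size_bp


# Hypothetical inputs, chosen only to make the arithmetic easy to follow.
r = ideal_redundancy(read_length_bp=500, n_reads=30, genome_size_bp=7500)
shortfall = ideal_coverage(r) - observed_coverage([3000, 1500], 7500)
print(r, ideal_coverage(r), shortfall)
```

the unscaled shortfall computed on the last lines is the plain difference between ideal and observed coverage; the scaling by the standard deviation of the vacancy described in the text is not reproduced here.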
a virus discovery method incorporating dnase treatment and its application to the identification of two bovine parvovirus species cloning of a human parvovirus by molecular screening of respiratory tract samples rna viral community in human feces: prevalence of plant pathogenic viruses f: the marine viromes of four oceanic regions method for discovering novel dna viruses in blood using viral particle selection and shotgun sequencing environmental genome shotgun sequencing of the sargasso sea metagenomic analysis of coastal rna virus communities wholegenome analysis of human influenza a virus reveals multiple persistent lineages and reassortment among recent h3n2 viruses large-scale sequencing of human influenza reveals the dynamic nature of viral genome evolution virus discovery by sequence-independent genome amplification. reviews in medical virology sequence-independent, single-primer amplification (sispa) of complex dna populations. molecular and cellular probes a random-pcr method (rpcr) to construct whole cdna library from low amounts of rna identification of a third human polyomavirus identification of enteroviruses by using monoclonal antibodies against a putative common epitope identification of a new human coronavirus new dna viruses identified in patients with acute viral infection syndrome genomic mapping by fingerprinting random clones: a mathematical analysis occupancy modeling of coverage distribution for whole genome shotgun dna sequencing financial support for this project was provided by the institute for genomic research/j. craig venter institute. our thanks to claire fraser, eric eisenstadt and stephen liggett for their advice and support. the following complete genomes were used as reference genomes for the viruses discussed in this study: the author(s) declares that there are no competing interests. ad participated in drafting the article, experimental design, and data analysis and carried out molecular studies. 
jd participated in data analysis. rh, rk, jf and ns participated in experimental design and carried out molecular studies. ca, xz and ng provided materials used in the study. eg participated in experimental planning and drafting the manuscript. ds conceived of and coordinated the study, and participated in data analysis and drafting the manuscript. key: cord-264135-s2u76pvk authors: patel, amrutlal k.; pandit, ramesh j.; thakkar, jalpa r.; hinsu, ankit t.; pandey, vinod c.; pal, joy k.; prajapati, kantilal s.; jakhesara, subhash j.; joshi, chaitanya g. title: complete genome sequence analysis of chicken astrovirus isolate from india date: 2016-12-23 journal: vet res commun doi: 10.1007/s11259-016-9673-6 sha: doc_id: 264135 cord_uid: s2u76pvk objective: chicken astroviruses have been known to cause severe disease in chickens leading to increased mortality and “white chicks” condition. here we aim to characterize the causative agent of visceral gout suspected for astrovirus infection in broiler breeder chickens. methods: total rna isolated from allantoic fluid of spf embryo passaged with infected chicken sample was sequenced by whole genome shotgun sequencing using ion-torrent pgm platform. the sequence was analysed for the presence of coding and non-coding features, its similarity with reported isolates and epitope analysis of capsid structural protein. results: the consensus length of 7513 bp genome sequence of indian isolate of chicken astrovirus was obtained after assembly of 14,121 high quality reads. the genome was comprised of 13 bp 5′-utr, three open reading frames (orfs) including orf1a encoding serine protease, orf1b encoding rna dependent rna polymerase (rdrp) and orf2 encoding capsid protein, and 298 bp of 3′-utr which harboured two corona virus stem loop ii like “s2m” motifs and a poly a stretch of 19 nucleotides. 
the genetic analysis of castv/india/anand/2016 suggested highest sequence similarity of 86.94% with the chicken astrovirus isolate castv/ga2011 followed by 84.76% with castv/4175 and 74.48% with castv/poland/g059/2014 isolates. the capsid structural protein of castv/india/anand/2016 showed 84.67% similarity with chicken astrovirus isolate castv/ga2011, 81.06% with castv/4175 and 41.18% with castv/poland/g059/2014 isolates. however, the capsid protein sequence showed a high degree of sequence identity at nucleotide level (98.64-99.32%) and at amino acid level (97.74-98.69%) with reported sequences of indian isolates, suggesting their common origin and limited sequence divergence. the epitope analysis by svmtrip identified two unique epitopes in our isolate, seven shared epitopes among indian isolates and two shared epitopes among all isolates except the poland isolate, which carried all distinct epitopes. electronic supplementary material: the online version of this article (doi:10.1007/s11259-016-9673-6) contains supplementary material, which is available to authorized users. poultry meat production is increasing globally day by day as it is an easily manageable animal protein source for human consumption compared to others. however, viral diseases incur heavy economic losses to the poultry industry. among the different viruses infecting birds, astroviruses are small round viruses, characterized based on their morphology (caul and appleton 1982). astrovirus was first observed in humans during 1975 in faeces of an infant suffering from gastroenteritis (madeley and cosgrove 1975). they are broadly categorized into two genera, mammoastrovirus infecting mammals and aviastrovirus infecting avian species. both of these genera belong to the family astroviridae. astroviruses can infect
a variety of hosts including human (finkbeiner et al. 2008; madeley and cosgrove 1975), cattle (bouzalas et al. 2014; li et al. 2013; schlottau et al. 2016), sheep (jonassen et al. 2003; reuter et al. 2012), cat (hoshino et al. 1981; lau et al. 2013), dog (martella et al. 2012; takano et al. 2015) and deer (smits et al. 2010). mammoastroviruses are mostly associated with gastroenteritis in the host. however, aviastroviruses can cause different diseases in different hosts. the members of genus aviastrovirus mainly infect turkey, chicken and duck. turkey astrovirus (tastv) causes poultry enteritis mortality syndrome (pems) or poultry enteritis syndrome (pes) (mcnulty et al. 1980; mor et al. 2011) while in duck, astroviruses (dastv) are associated with hepatitis (asplin 1965; gough et al. 1984, 1985). in chickens, two astrovirus species, avian nephritis virus (anv) (imada et al. 2000; shirai et al. 1991) and chicken astrovirus (castv) (baxendale and mebatsion 2004), have been reported. initially castv was regarded as an enterovirus causing growth retardation (schultz-cherry et al. 2001; todd et al. 2009). later on it was also found to be associated with gout (bulbule et al. 2013) and hatchability problems (smyth et al. 2013) in broiler chickens. recently castv has been linked with 'white chicks', a disease characterized by weakness and white plumage of hatched chicks (smyth et al. 2013) and leading to increased mortality in chicks and embryos; it has also been reported in poland and brazil (nunez et al. 2016). chicken astroviruses are non-enveloped, 25-35 nm in diameter and contain a non-segmented positive sense ssrna genome of 6.8 to 7.9 kb in length (matsui and greenberg 2001; mendez et al. 2007). whole genome sequencing and other studies have revealed that the basic structure and molecular mechanism are almost similar for all the astroviruses sequenced till date (koci et al. 2000).
initially there was no authentic diagnostic method available for astrovirus which relied mostly on electron microscopy (madeley and cosgrove 1975; mcnulty et al. 1980) or immunoassay (baxendale and mebatsion 2004) . however, advances in the molecular biology have led to the development of some easy techniques for detection of viruses. astroviruses can also be diagnosed using rt-pcr (pantin-jackwood et al. 2008; smyth et al. 2009; todd et al. 2009 ). lee et al. (2013) suggested that a recombinant capsid can be used for diagnosis and vaccination. till now the genome of only three chicken astroviruses castv/4175 (directly submitted to ncbi), castv/ga2011 (kang et al. 2012) and castv/poland/g059/2014 (sajewicz-krukowska and domanska-blicharz 2016) have been reported. in this study, we sequenced and characterized whole genome of chicken astrovirus isolated from infected broiler chicks from western part of india. a broiler breeder farm at anand, gujarat, india was facing the problem of visceral gout in the cobb-400 commercial chicks from last three batches of the parents. the outbreaks were noticed only in the initial hatches of the parents aging 28 to 32 weeks of age. these parents were vaccinated with infectious bronchitis nephropathic vaccine strains. the fertility and hatchabilities were normal but the chicks started showing lameness with spiking mortality from 4th day onward. mortality continued for 5-7 days and ranged from 5 to 25%. hatches falling during winter season i.e. november to february had high mortality. dead chicks showed increased amount of abdominal fluid, pale and greyish kidneys with dilated tubules filled with urates. chalky white deposits of urates were found on serosal surfaces of pericardium, liver capsule, air sacs, joint capsules and mucosal surfaces of proventriculus, trachea etc. affected chick sample was submitted to hester biosciences limited for diagnosis of infection during the period of february 2016. 
the spleen, kidney and lung tissue samples from the freshly dead birds were collected and triturated in pbs to make a 10% solution, then passed through a 0.22 μm syringe filter and inoculated in 10 day embryonated spf eggs through the allantoic cavity route. after three blind passages, the embryos started dying 5 to 6 days post inoculation and showed haemorrhagic lesions on the body surface. the allantoic fluid from the dead embryo was collected and used for total rna isolation for subsequent analysis by whole genome sequencing. total rna was extracted using trizol reagent (invitrogen, carlsbad, ca, usa) and treated with rnase free dnase i (qiagen, hilden, germany) to remove any dna contamination. total rna thus obtained was subjected to rnaseq library preparation as per the ion total rnaseq kit v2 (life technologies, carlsbad, ca, usa). the sequencing was carried out on ion torrent pgm using a 316 chip (life technologies, carlsbad, ca, usa). the sequencing reads generated from the sequencer were mapped to the gallus gallus genome (wgs: insdc: aadn00000000.4) to remove the host sequences. remaining sequences were subjected to quality filtering (q > 25) using prinseq v0.20.4 and good quality sequences were de novo assembled using gs de novo assembler. the assembled genome sequence was searched for blastn similarity against the nr/nt database. the genome sequence was further analysed for prediction of putative open reading frames (orf) by the orf finder tool (stothard 2000) and by manual curation for the analysis of the ribosomal frameshift signal (rfs) as reported for other astroviruses (koci et al. 2000). the prediction of the stem loop structure of the rfs was performed by the rnafold web server (http://rna.tbi.univie.ac.at/cgi-bin/rnafold.cgi). noncoding rna sequences were inferred using similarity search against the rfam database (nawrocki et al. 2015).
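to make the orf prediction step more concrete, the following python sketch scans the three forward reading frames of a sequence for stretches that begin with atg and end at the first in-frame stop codon. it is a simplified illustration under stated assumptions, not the orf finder tool used in the study: the function name `find_orfs`, the `min_codons` threshold and the toy sequence are hypothetical, and the sketch does not model the ribosomal frameshift that had to be curated manually.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=30):
    """Return (start, end, frame) tuples for forward-strand ORFs.

    Coordinates are 0-based, `end` is exclusive and includes the stop codon.
    `min_codons` counts codons before the stop codon (start codon included).
    """
    seq = seq.upper().replace("U", "T")  # accept RNA or DNA alphabets
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first in-frame start codon opens the ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None  # look for the next ORF in this frame
    return orfs


# Toy sequence (hypothetical): one short ORF in frame 2.
print(find_orfs("CCATGAAATTTTAGGG", min_codons=2))
```

a real analysis would also scan the reverse complement where relevant and would treat orf1b separately, since the text notes that its expression involves a frameshift signal.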
nearly complete genome sequences of 11 astroviruses (table 1) were downloaded from the ncbi and analysed for the presence of orfs using the orffinder tool and manual curation of the rfs start site. multiple sequence alignment was performed using the clustal omega webserver (http://www.ebi.ac.uk/tools/msa/clustalo/) to analyse the percent identity with other genomes at nucleotide and amino acid level. phylogenetic analysis of genomes and predicted proteins was performed in mega6 (tamura et al. 2013) after building the alignment by the clustalw algorithm and subsequent tree generation using the neighbour joining method (saitou and nei 1987) with 1000 bootstrap replicates. nucleotide sequences of orf2 encoding capsid protein of reported indian isolates were retrieved from the ncbi nr database and analysed for nucleotide and amino acid similarity and phylogeny as described earlier. we used svmtrip (yao et al. 2012), which predicts linear antigen epitopes based on a support vector machine to integrate tri-peptide similarity and propensity. the capsid protein sequence was used for epitope prediction. we compared epitopes of three castv isolates from india [castv/india/anand/2016, vrdc|castv|nz|vhinp-4 (accession no. agb58310.1) and south region (accession no. aic32814.1)], two from usa (castv/ga2011 and castv/4175), one from uk (accession no. afk92952.1) and one from poland (castv/poland/g059/2014). allantoic fluid collected from spf embryos inoculated for three blind passages of the field sample was subjected to total rna isolation and next generation sequencing using the ion torrent platform, which resulted in a total of 108,109 reads with an average read length of 180 bases. after removing the host specific and low quality reads, a total of 14,121 reads were used for assembly. the genome was assembled into a consensus length of 7513 bp which was identified as chicken astrovirus upon nucleotide blast analysis. the full length genome sequence was deposited to genbank under the accession number ky038163.
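the percent identity comparisons described in the methods can be illustrated with a small python sketch that takes two rows of an existing alignment and counts matching columns. this is a minimal sketch under an explicit assumption: columns containing a gap in either sequence are skipped, which may differ from the convention used by the alignment server in the study. the function name and toy sequences are hypothetical.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent identity between two rows of an alignment.

    Assumption: columns with a gap ('-') in either row are ignored.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    matches = 0
    compared = 0
    for x, y in zip(aligned_a.upper(), aligned_b.upper()):
        if x == "-" or y == "-":
            continue  # skip gap columns
        compared += 1
        if x == y:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


# Toy aligned rows (hypothetical): 4 matches over 5 gap-free columns.
print(percent_identity("ACGT-A", "ACCTGA"))
```

the same function works for amino acid alignments, since it only compares characters column by column.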
there was an untranslated region (utr) of 13 bp at 5′ end and of 298 bp at 3′ end with a poly-a tail stretch of 19 nucleotides. the genome was composed of a (30%, 2307 nt), t (29% 2034 nt), g (23%, 1769 nt) and c (18% 1403 nt) with a gc content of 42.2%. genome sequence analysis of identified chicken astrovirus isolate for the presence of coding and non-coding features the genome encoded three orfs (orf1a, orf1b and orf2) with each orf encoding for single protein coding gene and a partial overlap of orf1b with orf1a (fig. 1a) . manual curation of the ribosomal frameshift signal (position nt 3415) revealed presence of atg start codon preceding the slippery heptameric sequence (aaaaaac) (fig. 1b) followed by sequences forming the stem loop structure as predicted by rnafold analysis (fig. 1c) . among the analysed genomes, all four chicken astroviruses and duck astrovirus sl5 isolate were found to possess their own atg start codon for orf1b whereas it was found absent in other astroviruses (fig. 1b) . analysis of 3′-utr by rfam analysis revealed presence of two non-coding rnas similar to "s2m" rna family (accession number: rf00164) at positions 7192-7234 (e value: 1.5e −14 ) and 7287-7329 (e value: 8.1e −14 ). analysis of nucleotide similarity with other full length astroviruses genomes ( phylogenetic analysis of the astrovirus genomes suggested formation of separate cluster of chicken astroviruses and placed castv/india/anand/2016 nearest to the castv/4175 isolate (fig. 2) . similarly, phylogeny based on amino acids sequence of serine protease (fig. 3a) , rdrp (fig. 3b ) and capsid protein (fig. 3c) showed close clustering among the chicken astroviruses except capsid protein of castv/poland/ g059/2014 isolate which was clustered with the dastv/sl5 isolate. among the indian isolates, orf2 nucleotide sequence of the castv/india/anand/2016 placed nearest to the vrdc/castv/nz/vhinp-4 isolate (fig. 
4a) whereas based on the amino acid sequence, it was placed nearest to the vrdc/castv/nz/vhinp-7 isolate (fig. 4b) . b-cell epitope analysis of capsid structural protein of identified chicken astrovirus isolate a total of 9-10 epitopes were predicted using svmtrip using the capsid protein sequence of the astroviruses. epitope analysis revealed two unique epitopes in case of the castv/ india/anand/2016. epitopes present at the positions 151-170 and 176-195 were common to all of the viruses analysed. based on the epitopes predicted, castv/poland/ g059/2014 was found to be unique as it has not shared any epitopes with other viruses. comparison of the three indian castvs revealed that seven epitopes were common to all the three and a epitope 649-668 was differing by one amino acid substitution in the castv/india/anand/2016 (supplementary table 1 ). the poultry viral diseases have great economic impact on poultry industries worldwide as it leads to high mortality. chicken astrovirus causes severe disease especially in young chickens (bulbule et al. 2013; li et al. 2013; nunez et al. 2016; schultz-cherry et al. 2001; smyth et al. 2013; todd et al. 2009 ). though culturing methods has been described for astroviruses (baxendale and mebatsion 2004; nunez et al. 2015) , the isolation of virus is somewhat difficult due to its poor growth in the culture (smyth et al. 2010 ). next generation sequencing technology is sophisticated as there is no need to isolate or culture the organism. hence in the present study we directly isolated rna from allantoic fluid of chicken embryo inoculated with clinical sample and sequenced on ion torrent pgm platform. the viral genome of castv/india/anand/2016 was assembled into 7513 bp which is comparable with the size of other published astrovirus genomes (chen et al. 2012; finkbeiner et al. 2008; strain et al. 2008) . 
the genome showed presence of three orfs encoding for serine protease, rdrp and capsid protein as reported for other astroviruses. the rdrp of most astroviruses do not have its own start codon. kang et al. (2012) reported that rdrp of chicken astrovirus ga2011 has its own start codon. we observed presence of atg start codon for rdrp among all the reported chicken astroviruses and dastv/sl5 isolate whereas other duck astrovirus and avian nephritis viruses were found to lack atg start codon. the rna family analysis by rfam suggested presence of two motifs matching to corona virus stem loop ii (s2m) motif in the 3′-utr as reported for other astroviruses (jonassen et al. 1998; . although, exact function is still not uncovered, the presence of these "s2m" motifs is believed to influence gene expression through rna-interference mechanism (tengs et al. 2013) based on the nucleotide similarity, the virus was found to be closest to the chicken astrovirus isolate ga2011 (kang et al. 2012 ) followed by castv/4175 (direct submission) with other chicken astrovirus capsid protein but higher identity with the duck astroviruses may be due a recombination event and shared ancestors with the duck astroviruses. among the avian astroviruses, chicken astroviruses were found to share higher identity with the duck astroviruses (chen et al. 2012; fu et al. 2009; liu et al. 2014 ) as compared to the turkey astroviruses (koci et al. 2000; strain et al. 2008 ) and the avian nephritis virus isolates (imada et al. 2000; zhao et al. 2011a, b) . we next analysed the sequence similarity of capsid structural protein of the castv/india/anand/2016 with the capsid protein coding sequence and amino acids sequence of reported sequence of the indian astrovirus isolates which revealed about 98-99% sequence identity at nucleotide level and about 97-98% sequence identity at amino acids level suggesting limited structural divergence and their common origin in the indian isolates reported till date. 
phylogenetic analysis of the genome sequences as well as the protein sequences showed clustering of the castv/ india/anand/2016 nearest to that of castv/4175 and castv/ga2011 and all four chicken astrovirus formed separate cluster except capsid protein of the castv/poland/g059/ 2014 isolate which was clustered along with the duck astroviruses. the clustering of castv/poland/g059/2014 with the duck astrovirus isolate sl5 suggest possible recombination between these isolates (sajewicz-krukowska and domanska-blicharz 2016). based on nucleotide sequence of genomes and amino acids sequence of serine protease and rdrp, chicken astroviruses were placed closed to the duck astroviruses compared to turkey astroviruses or avian nephritis viruses. however, based on capsid protein the turkey astroviruses were phylogenetically placed between different isolates of the duck astroviruses as well as near to the castv/poland/g059/2014 isolate. these observations suggest that the capsid protein of turkey, duck and chicken astroviruses evolved through possible recombination between the astroviruses of different avian species and suggests that the turkey and duck may play an important role in epidemiology of avian astroviruses similar to that of influenza viruses (alexander 2000) . among the indian isolates, phylogenetic analysis of capsid protein showed placement of the castv/ india/anand/2016 between two north zone isolates, however very limited sequence divergence was seen among the reported indian isolates suggesting their recent emergence and common origin. epitope analysis of the capsid protein sequence revealed two unique epitopes in our isolate whereas 7 epitopes were found to be shared among the indian astrovirus isolates. except castv/poland/g059/2014 isolate which contained all unique epitopes, other isolates shared two common epitopes. 
our analysis suggests that the vaccine design using the indian astrovirus isolate may provide cross protection against prevailing isolates in india. further, epitope mapping would be useful to design safe and effective vaccine against divergent astroviruses (ahmad et al. 2016; soria-guerra et al. 2015) . in summary, whole genome analysis of indian astrovirus isolate by next generation sequencing technology determined full length genome of the chicken astrovirus isolate for the first time in india. the present study provided genetic relatedness of the circulating indian isolate with the other reported nearly complete genome sequences of avian astroviruses. the analysis of capsid protein sequence of reported chicken astroviruses from india revealed limited structural divergence suggesting their common ancestral origin and recent emergence. considering high sequence identity of capsid structural protein among prevailing strains in the india, the castv/india/anand/2016 isolate could serve as the potential source for its further development as a vaccine candidate. the identification of unique and shared epitopes among different astroviruses will be helpful in designing effective epitope based vaccine formulation. conflict of interest the authors declare that they have no conflict of interest. fig. 
4 phylogenetic relatedness of chicken astrovirus isolate castv/india/anand/2016 orf2 coding sequences (a) and orf2 encoded capsid protein (b) with reported indian isolates based on neighbour-joining method with b-cell epitope mapping for the design of vaccines and effective diagnostics a review of avian influenza in different bird species duck hepatitis: vaccination against two serological types the isolation and characterisation of astroviruses from chickens neurotropic astrovirus in cattle with nonsuppurative encephalitis in europe role of chicken astrovirus as a causative agent of gout in commercial broilers in india the electron microscopical and physical characteristics of small round human fecal viruses: an interim scheme for classification complete genome sequence of a duck astrovirus discovered in eastern china complete genome sequence of a highly divergent astrovirus isolated from a child with acute diarrhea complete sequence of a duck astrovirus associated with fatal hepatitis in ducklings astrovirus-like particles associated with hepatitis in ducklings an outbreak of duck hepatitis type ii in commercial ducks detection of astroviruses in feces of a cat with diarrhea avian nephritis virus (anv) as a new member of the family astroviridae and construction of infectious anv cdna a common rna motif in the 3′ end of the genomes of astroviruses, avian infectious bronchitis virus and an equine rhinovirus complete genomic sequences of astroviruses from sheep and turkey: comparison with related viruses determination of the full length sequence of a chicken astrovirus suggests a different replication mechanism molecular characterization of an avian astrovirus complete genome sequence of a novel feline astrovirus from a domestic cat in hong kong chicken astrovirus capsid proteins produced by recombinant baculoviruses: potential use for diagnosis and vaccina divergent astrovirus associated with neurologic disease in cattle complete sequence of a novel duck astrovirus 
viruses in infantile gastroenteritis enteric disease in dogs naturally infected by a novel canine astrovirus detection of astroviruses in turkey faeces by direct electron microscopy association of the astrovirus structural protein vp90 with membranes plays a role in virus morphogenesis the role of type-2 turkey astrovirus in poult enteritis syndrome rfam 12.0: updates to the rna families database isolation of chicken astrovirus from specific pathogen-free chicken embryonated eggs detection and molecular characterization of chicken astrovirus associated with chicks that have an unusual condition known as "white chicks" in brazil enteric viruses detected by molecular methods in commercial chicken and turkey flocks in the united states between identification of a novel astrovirus in domestic sheep in hungary the neighbor-joining method: a new method for reconstructing phylogenetic trees nearly full-length genome sequence of a novel astrovirus isolated from chickens with 'white chicks' condition astrovirus-induced "white chicks" condition -field observation, virus detection and preliminary characterization detection of a novel bovine astrovirus in a cow with encephalitis inactivation of an astrovirus associated with poult enteritis mortality syndrome pathogenicity and antigenicity of avian nephritis isolates identification and characterization of deer astroviruses detection of chicken astrovirus by reverse transcriptase-polymerase chain reaction development and evaluation of real-time taqman(r) rt-pcr assays for the detection of avian nephritis virus and chicken astrovirus in chickens chicken astrovirus detected in hatchability problems associated with 'white chicks an overview of bioinformatics tools for epitope prediction: implications on vaccine development the sequence manipulation suite: javascript programs for analyzing and formatting protein and dna sequences genomic analysis of closely related astroviruses detection of canine astrovirus in dogs with diarrhea in 
japan mega6: molecular evolutionary genetics analysis version 6.0 a mobile genetic element with unknown function found in distantly related viruses a seroprevalence investigation of chicken astrovirus infections svmtrip: a method to predict antigenic epitopes using support vector machine to integrate tripeptide similarity and propensity sequence analyses of the representative chineseprevalent strain of avian nephritis virus in healthy chicken flocks complete sequence and genetic characterization of pigeon avian nephritis virus, a member of the family astroviridae acknowledgements authors are thankful to mr. rajiv gandhi, managing director and ceo, hester biosciences limited, ahmedabad, india and anand agricultural university, anand, india for providing the facility to carry out the research work.compliance with ethical standards key: cord-025610-7vouj8pp authors: latif, seemab; bashir, sarmad; agha, mir muntasar ali; latif, rabia title: backward-forward sequence generative network for multiple lexical constraints date: 2020-05-06 journal: artificial intelligence applications and innovations doi: 10.1007/978-3-030-49186-4_4 sha: doc_id: 25610 cord_uid: 7vouj8pp advancements in long short term memory (lstm) networks have shown remarkable success in various natural language generation (nlg) tasks. however, generating sequence from pre-specified lexical constraints is a new, challenging and less researched area in nlg. lexical constraints take the form of words in the language model’s output to create fluent and meaningful sequences. furthermore, most of the previous approaches cater this problem by allowing the inclusion of pre-specified lexical constraints during the decoding process, which increases the decoding complexity exponentially or linearly with the number of constraints. moreover, some of the previous approaches can only deal with single constraint. additionally, most of the previous approaches only deal with single constraints. 
in this paper, we propose a novel neural probabilistic architecture based on a backward-forward language model and a word embedding substitution method that can cater to multiple lexical constraints for generating quality sequences. experiments show that our proposed architecture outperforms previous methods in terms of intrinsic evaluation. recently, language models based on recurrent neural networks (rnns) and their variants such as long short term memory networks (lstms) and gated recurrent units (grus) have shown promising results in generating high quality text sequences, especially when the input and output are of variable length. rnn based language models (lm) have the ability to capture the sequential nature of language, be it for words, characters or whole sentences. this allows them to outperform other language models in sequence prediction and classification tasks. to learn the distributed representation of data efficiently by rnns, multiple methods have been proposed such as word embeddings. these mainly include the continuous bag-of-words (cbow) and skip-gram (sg) models [10, 12]. the cbow model predicts the vector of the word at the current time step, given the preceding and succeeding context word vectors. the sg model takes the opposite approach, predicting the context word vectors from the target word vector, but shares the same architecture. existing methods for incorporating constraints in the output sentences or generating lexically constrained sentences have multiple limitations. mou et al. [13] proposed variants of the backward-forward generation approach which cannot handle out-of-vocabulary (oov) words and only generate sentences with a single lexical constraint. similarly, liu et al. [8] proposed a synchronous training approach to generate lexically constrained sequences with generative adversarial networks (gans). moreover, various lexically constrained decoding methods have been proposed for constrained sequence generation through the extension of beam search to allow the inclusion of constraints [1, 6].
such lexical constrained decoding methods do not examine what specific words need to be included at the start of generation, but try to force specific words at each time step during the generation process at a cost of high computational complexity [14] . the remainder of this paper is organized as follows. we review the related work in sect. 2. section 3 describes our proposed architecture and sect. 4 explains the dataset, experimental setup, comparison models and evaluation criteria. section 5 gives in detail result analysis, finding and discussions about future directions. finally, sect. 6 concludes the paper. in general, the purpose of lm is to capture the regularities of a language as well as its morphological and distributional properties. lm aims to compute the probability of a word sequence in order to estimate the maximum likelihood of an upcoming word to be predicted in the sequence. lm learns the distributed representation of words to interpret semantic and syntactic relations between the sequence of words. in past, rnn has shown progressive success in language modeling over traditional methods based on statistical counts. the ability of rnn language model (rnnlm) to learn long term contextual dependency and capturing inherited sequential nature of language makes it better than other traditional methods [11] . particularly in sentence generation task, rnnlm performed well because of its capability of learning highly complicated structures of language. rnnlm makes maximum a posteriori (map) estimation for predicting words in a sentence [17] . mou et al. first proposed multiple variants of backward and forward (b/f) language models based on grus for constrained sentence generation [13] . for training the b/f language models, sentences were split by choosing a word randomly. this resulted in the positional information of words getting smoothed out while generating sentences, and thus they lose the positional information of the word. 
this method of choosing a split word badly influences the joint probability estimation of a sentence. liu et al. proposed an algorithmic framework dubbed backward and forward generative adversarial networks (bfgan) for constrained sentence generation [8]. bfgan comprises three modules: a discriminator, and lstm based backward and forward generators with an attention mechanism. the purpose of the discriminator is to distinguish real sentences from machine-generated constrained sentences and to guide the joint training of the backward and forward generators by assigning them reward signals. the backward generator takes a lexical constraint as input, which can be a word, phrase or fragment, and generates the first half of the sentence backwards. the forward generator takes the half sentence generated by the backward generator as input and completes the sentence with the aim of fooling the discriminator. the training sentences for the backward generator are prepared by randomly splitting sentences, and the framework can only handle sentence generation with a single lexical constraint. another line of work tackles the problem of constrained sentence generation by sampling sentences from the search space. su et al. proposed a gibbs sampling method, based on the markov chain monte carlo (mcmc) method, for decoding constrained sentences [16]. the proposed approach consists of a discriminator and a pure language model conditioned on a bi-directional rnn. the discriminator takes on the job of calculating the probability that a sentence satisfies the constraints. the gibbs method samples a set of random variables $x_1, \dots, x_n$ from a joint distribution; here the variables are the words that make up a sentence. the shortcoming of gibbs sampling is that it cannot change the length of sentences and hence cannot solve complicated tasks such as directly generating sentences from constraints established in advance. miao et al.
extended gibbs sampling by introducing metropolis-hastings for constrained sentence generation (cgmh) [9]. the proposed method samples directly from the sentence space by defining local operations such as word replacement, insertion and deletion. hokamp et al. proposed the grid beam search (gbs) algorithm, an extension of beam search, for incorporating specified lexical constraints in the output sequences [6]. in the neural machine translation (nmt) task, the algorithm ensures that a hypothesis meets all specified constraints before it can be considered complete. to generalize image captioning models to out-of-domain images containing novel scenes or objects, anderson et al. proposed a constrained beam search (cbs) decoding method, which utilizes a finite-state machine (fsm) [1]. the search algorithm is capable of forcing certain image tags into the output sequences by recognizing valid sequences with the fsm. table 1 summarizes techniques for generating constrained sequences. it is evident that many of the architectures are designed for specific scenarios and have high computational complexity. given these performance gaps and the inability to handle multiple constraints efficiently, a new method is needed. therefore, we propose a neural probabilistic backward-forward architecture that can generate high quality sequences, with a word embedding substitution method to satisfy multiple constraints. to begin with, we state the problem of constrained sequence generation as follows: given the constraint(s) $c$ as input, the proposed b/f lm needs to generate a fluent sequence $s = w_1, \dots, w_v, \dots, w_m$ maximizing the conditional probability $p(s \mid c)$. for this purpose, we need to select a split word in a sequence $s$ to train the proposed b/f lm.
since a sequence conveys an expression, the verb, as a part-of-speech (pos) category, plays a vital role in setting the subject of a sequence into motion and offers more clarification about the sequence. in this section, we first discuss the general seq2seq model for sequence generation. after that, we discuss our proposed architecture for constrained sequence generation. conventionally, rnnlms for text generation are trained to maximize the likelihood of a word $w_t$ or character $c_t$ at time step $t$, given the context of previous observations in the sequence. this type of learning technique for generating sequences is known as teacher forcing [4]. in this learning technique, the input to the recurrent neural probabilistic language model is of fixed size. the training objective is to predict only the next token, given the context of previous observations, until a special stop sign is generated or a specific constraint is satisfied in the sequence. in traditional seq2seq models, which cannot satisfy lexical constraints, the joint probability of a sentence $y = y_1, y_2, \dots, y_m$ for a given input sentence $x = x_1, x_2, \dots, x_n$ is decomposed as $p(y \mid x) = \prod_{t=1}^{m} p(y_t \mid y_1, \dots, y_{t-1}, x)$. thus, the output sentence $y$ is predicted from $y_1$ to $y_m$ in sequence, either by a greedy or a beam decoder. this decomposition follows from the sequential nature of natural language. our proposed approach consists of a neural probabilistic architecture that is an ensemble of two lstm based b/f lms for generating lexically constrained sequences, which captures the statistical properties of text sequences effectively. in order to generate coherent sequences from multiple given constraints, we first generate the sequence from the verb constraint $w_v$ through the b/f lm, and then satisfy the other given constraints by the word embedding substitution method during inference. the predicted verb $v$ splits the sequence into two sub-sequences, a backward sub-sequence $w_{v-1}, \dots, w_1$ and a forward sub-sequence $w_{v+1}, \dots, w_m$. if $m$ denotes the number of words in a sequence $s$, i.e.
$s = w_1, \dots, w_v, \dots, w_m$, then the joint conditional probability of the remaining $m$ words, given the lexical constraint $w_v$ and training parameters $\theta$, can be calculated as: where $p^{bw}_{\theta}$ and $p^{fw}_{\theta}$ denote the probabilities of the sub-sequences generated by the backward and forward language models. the sub-sequences are generated asynchronously i.e. we first generate the half sequence s v conditioned on backward sequence s 1 : w v . therefore, following the spirit of ensemble models that are trained separately, the joint probability factors in eq. 2 become where $1 \le j \le v - 1$. the backward lm decodes the output in reverse order, from $w_{v-1}, w_{v-2}$ to $w_1$, and this output is reversed again as input to the forward language model for decoding the complete sequence. consequently, $s_{1:v}$ is equal to $w_1, \dots, w_v$. for learning the sequences, we used lstm networks in the proposed architecture. lstm networks have the capability of capturing sequential data effectively: the network transforms a sequence of input word vectors $x = x_1, \dots, x_n$ into a sequence of hidden states $h = h_1, \dots, h_t$ by maintaining a history of inputs at each hidden state. the lstm cell depends on a gating mechanism for information processing. the hidden state $h_t$ of the lstm network at time step $t$ depends on the previous state $h_{t-1}$ and the current input word vector $x_t$. in our scenario of generating variable-length text sequences, the probability of an output word $w_{out}$ from both language models is calculated as: where $w^{bw}_{out}$ and $w^{fw}_{out}$ are shared across all time steps in their respective lstm models and project the hidden state vector $h_t$ into a vector of the same size as the target vocabulary, in order to generate a sequence of outputs $y_t = w_{v-t}, \dots, w_1$ for the backward language model and $y_t = w_{v+t}, \dots, w_m$ for the forward language model.
the softmax function in the final layer of the lstm network is applied to each word vector to calculate the probability distribution over the vocabulary of distinct words. in order to satisfy the given lexical constraints $c$ other than the verb constraint $w_v$, we use a lexical substitution method based on word embeddings. the sg model embeds both target words and their context in the same dimensional space. in this space, the vector representations of words are drawn closer together when they co-occur more frequently in the learning corpus. thus, the cosine distance between them can be viewed as a target-to-target distributional similarity measure. our method relies on the natural assumption that a good lexical constraint substitution for a target word instance $w$ in a generated sequence $s = w_1, \dots, w_v, \dots, w_m$ needs to be consistent with the given sequence and lexically similar to the target word instance $w$. during inference, we compute the cosine similarity [2] of the given input constraint $c$ with every word $w$ in a sequence $s$ generated by the proposed b/f lm. after that, we replace the closest matching word $w$ (least cosine distance) in the sequence $s$ with the constraint $c$. step 3 of fig. 1 illustrates the concept. for this purpose, we created word embeddings with fasttext. in this section, we introduce our experimental design, comprising the preparation of the dataset for training and testing, the experimental configuration, the comparison architectures and the evaluation criteria. there are many benchmark datasets for evaluating pure lms, consisting of seq2seq networks for text classification and generative models, but there is no benchmark corpus specifically for the evaluation of constrained sequence generation based on statistical language models. so far, we have used the stanford natural language inference (snli) [3] dataset for training and evaluation of the proposed architecture.
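the word embedding substitution step described above can be sketched in python as follows. this is a minimal illustration and not the authors' implementation: the toy three-dimensional vectors stand in for fasttext embeddings, and the function names and the `protected` argument (used here to keep the verb constraint from being overwritten) are assumptions introduced for the example.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def substitute_constraint(sequence, constraint, embeddings, protected=()):
    """Replace the word of `sequence` closest to `constraint` with the constraint.

    `protected` lists words that must not be replaced (an assumption of this
    sketch, e.g. the verb the sequence was generated from).
    """
    candidates = [i for i, w in enumerate(sequence) if w not in protected]
    best = max(
        candidates,
        key=lambda i: cosine_similarity(embeddings[sequence[i]], embeddings[constraint]),
    )
    return sequence[:best] + [constraint] + sequence[best + 1:]

# Hypothetical toy embeddings; real ones would come from a trained embedding model.
toy_embeddings = {
    "a": (1.0, 0.0, 0.0),
    "man": (0.0, 1.0, 0.0),
    "rides": (0.0, 0.0, 1.0),
    "horse": (0.1, 0.9, 0.4),
    "pony": (0.0, 0.8, 0.6),
}
generated = ["a", "man", "rides", "a", "horse"]
result = substitute_constraint(generated, "pony", toy_embeddings, protected=("rides",))
print(result)
```

with several constraints, the same function could be applied once per constraint, adding each inserted constraint to `protected` so that later substitutions do not overwrite it; whether the original system does this is not stated in the text.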
as we target the domain of generating sequences from lexical constraints, we extracted unlabeled sequences with a minimum of 3 and a maximum of 25 tokens, resulting in 451k sequences for training the proposed architecture. since the proposed architecture is an ensemble of backward and forward lms, the following steps were carried out to prepare training sequences for the backward lm:
- annotate the tokens with their lexical categories using pos tagging.
- split the sentences on the verb category instead of random splitting.
- break up sentences with more than one verb into multiple sequences.
- after splitting the sequence on the verb category, invert the half sequences.
for the forward language model, the dataset contains complete sequences for training the network. note that the backward language model requires only the half sequences up to the verb token for training. we follow the work of bojanowski et al. [2] to create dense representations of the words in the dataset. a word vector is built from the character n-grams appearing in the word, so that the scoring function $s$ takes into consideration the internal structure of words, which is ignored by conventional skip-gram models [10]. the model represents each word $w$ as a bag of character n-grams, adding special boundary symbols at the beginning and end of words to distinguish prefixes and suffixes from other character sequences. in addition to its character n-grams, the word $w$ itself is also included in its set of n-grams for learning the representation of each word. for example, taking the word 'apple' with n = 3, it is represented by the character n-grams <ap, app, ppl, ple, le> and the special sequence <apple>. suppose we have a dictionary of n-grams of size $G$. given a word $w$, let $\mathcal{G}_w \subset \{1, \dots, G\}$ be the set of n-grams appearing in $w$. a vector $z_g$ represents each n-gram $g$, and a word $w$ is therefore represented by the sum of the vectors of its n-grams.
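the n-gram decomposition described above can be sketched in a few lines of python. the sketch assumes `<` and `>` as the boundary symbols, as in fasttext [2]; the function names and the small n-gram vector table are hypothetical and only serve the illustration.

```python
def char_ngrams(word, n=3):
    """Character n-grams of `word`, with boundary symbols, plus the whole word."""
    wrapped = "<" + word + ">"
    grams = [wrapped[i:i + n] for i in range(len(wrapped) - n + 1)]
    # The full word (with boundaries) is kept as its own special sequence.
    if wrapped not in grams:
        grams.append(wrapped)
    return grams

def word_vector(word, ngram_vectors, n=3):
    """Sum the vectors of the word's n-grams; n-grams without a vector are skipped."""
    dim = len(next(iter(ngram_vectors.values())))
    total = [0.0] * dim
    for gram in char_ngrams(word, n):
        vec = ngram_vectors.get(gram)
        if vec is not None:
            total = [t + x for t, x in zip(total, vec)]
    return total

print(char_ngrams("apple"))
```

because a word is assembled from its n-grams, a word never seen during training can still receive a vector as long as some of its n-grams are known, which is what makes the representation usable for oov words.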
in this regard, the scoring function of a word $w$ with its surrounding set of word indices $c$ is calculated from these n-gram vectors. this extension of the skip-gram model for creating word embeddings allows representations to be shared across words, thus enabling reliable representation learning for rare or out-of-vocabulary (oov) words. we used this extension, fasttext's sg model, to learn such representations for both the backward and forward language models on their respective data sets. to train the fasttext model, the word embedding dimension was set to 300. the min count value was set to 2, meaning that all words with a frequency lower than 2 were ignored while learning the word representations. the window size was set to 5, defining the maximum distance between the current and predicted word within a sequence. the workers parameter was set to 16, the number of worker threads used for faster training of the fasttext sg model. the number of epochs was set to 30 iterations over the whole data set. we performed different experiments on the test set to find the best hyperparameters and evaluate the change in model performance. table 2 shows the different experimental configurations and the change in performance w.r.t. the perplexity metric. in the proposed architecture, we obtained the best results by employing 2 lstm layers in both the backward and forward language models. both lstm networks were trained with the adam algorithm [7] for stochastic optimization. during training, the parameters were adjusted using the adam optimizer to minimize the training loss function, also known as misclassification rate. as the optimization objective, we used the categorical cross-entropy loss between the actual $y$ and predicted $\hat{y}$ word probability distributions [5]. to capture the regularities accurately and prevent overfitting, we appended a dropout layer after every lstm layer in both networks.
the idea of the dropout layer is to randomly drop units, along with their connections, during training, thus preventing units from co-adapting too much. dropping units leads to major improvements over other regularization methods [15]. the number of epochs was set to 50 and the mini-batch size to 128 in both networks. both the backward and forward models were trained on an nvidia gtx 1080 ti gpu. the lstm based networks were developed in keras. training took approximately 17 h per model with this implementation and the optimal hyper-parameter configuration. we compared our proposed methodology with the state-of-the-art sampling method cgmh [9] for satisfying multiple constraints in a sequence. we also evaluated our methodology of verb based split generation against the different variants of [13], which can only handle a single lexical constraint. we used an intrinsic evaluation metric, which allows the quality of an lm to be determined without it being embedded in a particular application. the most conventional intrinsic evaluation metric is perplexity (ppl). the ppl of a language model on a test set $W = w_1, w_2, \dots, w_m$ is the inverse probability of $W$, normalized by the number of words: $\mathrm{ppl}(W) = p(w_1, w_2, \dots, w_m)^{-1/m}$. for the intrinsic evaluation of our proposed methodology, we first make comparisons with variants such as the separate b/f and asynchronous b/f language models proposed by [13]. as mentioned earlier, in our proposed methodology the given word is the verb constraint $w_v$ through which we decode the complete sequence, whereas in the b/f variants the complete sequence is decoded from a randomly chosen split word. we calculated ppl with both a verb and a random constraint as input to decode the complete sequences. table 3 presents the comparison in terms of ppl, where a higher probability of a sequence results in a lower perplexity, which is better.
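to make the relation between sequence probability and perplexity concrete, the following python sketch computes ppl from the per-word probabilities a language model assigns to a test sequence. the function name and the example probabilities are hypothetical; the computation is the standard inverse probability normalized by the number of words, carried out in log space.

```python
import math

def perplexity(token_probs):
    """Perplexity of a sequence given the probability assigned to each word."""
    m = len(token_probs)
    # Summing log-probabilities avoids numerical underflow on long sequences.
    log_prob = sum(math.log(p) for p in token_probs)
    return math.exp(-log_prob / m)

# Hypothetical per-word probabilities for two candidate sequences of equal length.
confident = [0.5, 0.5, 0.5]
uncertain = [0.1, 0.1, 0.1]
print(perplexity(confident), perplexity(uncertain))
```

the sequence with the higher word probabilities receives the lower perplexity, which is why lower values in the comparison tables indicate better models.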
the separate b/f variant yields worse sequences with a huge perplexity score, because both b/f lms were forced to generate their outputs separately from the input constraint, and the outputs were concatenated after decoding. this is due to the fact that the forward lm does not have the context of the half sequence decoded by the backward lm. our proposed approach is most similar to the asynchronous b/f lm, but technically very different, as we satisfy multiple constraints while the asynchronous approach can deal with only a single constraint. the results clearly show that decoding a sequence from a specific verb constraint can make use of the positional information of words in a sequence, which is smoothed out when we generate a sequence from a random constraint. table 4 shows the comparison of our proposed approach for catering to multiple constraints with cgmh [9]. our proposed approach shows lower perplexity than the cgmh sampling method for sentence generation from 1 to 3 keywords/constraints, while with 4 constraints as input cgmh shows a slightly better result than our approach of generating a sequence from the verb constraint and, during inference, replacing the words in the sequence that have the closest embedding similarity. the decoding complexity of cgmh increases linearly with the number of constraints, while there is no such factor in our approach for catering to multiple constraints. there is always a trade-off between the fluency of a sequence and decoding complexity. in practice, the downside of the cgmh sampling method is that it is unclear which sampling step size is best for the proposal distribution. to validate our proposed architecture for generating sequences, we performed a series of experiments. the results of the intrinsic evaluation confirm that our proposed approach for sequence generation given constraint(s) outperforms previous methods. splitting and generating a sequence on the verb constraint makes use of positional information, which is smoothed out when a sequence is broken down at a random word.
we observe that even decoding a sequence from a random word as input to the proposed b/f lm performs better when the backward lm is trained on half sequences up to the verb. moreover, in future we would like to explore constraint-to-target context similarity, indicating syntagmatic compatibility, to improve the word embedding substitution method. introducing an attention mechanism as context vectors for constraints would be an interesting extension of the proposed architecture. in this paper, we have proposed a novel method, dubbed the neural probabilistic backward-forward language model and word embedding substitution method, to address the issue of lexically constrained sequence generation. our proposed system can generate constrained sequences given multiple lexical constraints as input. to the best of our knowledge, this is the first time that multiple constraints have been handled through an lstm based backward-forward lm and word embedding substitution. the proposed method contains a backward language model based on an lstm network, which learns the half representation of a sentence up to the verb split word, and a forward language model, also an lstm network, which learns the complete representation of a sequence. moreover, the word embedding substitution method satisfies other constraints by substituting target words in the sequence with the given constraints, based on similar context in an embedding space.
[1] guided open vocabulary image captioning with constrained beam search
[2] enriching word vectors with subword information
[3] a large annotated corpus for learning natural language inference
[4] a comparison of mlp, rnn and esn in determining harmonic contributions from nonlinear loads
[5] a tutorial on the cross-entropy method
[6] lexically constrained decoding for sequence generation using grid beam search
[7] adam: a method for stochastic optimization
[8] bfgan: backward and forward generative adversarial networks for lexically constrained sentence generation
[9] cgmh: constrained sentence generation by metropolis-hastings sampling
[10] efficient estimation of word representations in vector space
[11] recurrent neural network based language model
[12] distributed representations of words and phrases and their compositionality
[13] backward and forward language modeling for constrained sentence generation
[14] fast lexically constrained decoding with dynamic beam allocation for neural machine translation
[15] dropout: a simple way to prevent neural networks from overfitting
[16] incorporating discriminator in sentence generation: a gibbs sampling method
[17] sequence to sequence learning with neural networks
key: cord-001537-i34vmfpp authors: lima, francisco esmaile de sales; cibulski, samuel paulo; dos santos, helton fernandes; teixeira, thais fumaco; varela, ana paula muterle; roehe, paulo michel; delwart, eric; franco, ana cláudia title: genomic characterization of novel circular ssdna viruses from insectivorous bats in southern brazil date: 2015-02-17 journal: plos one doi: 10.1371/journal.pone.0118070 sha: doc_id: 1537 cord_uid: i34vmfpp circoviruses are highly prevalent porcine and avian pathogens. in recent years, novel circular ssdna genomes have been detected in a variety of fecal and environmental samples using deep sequencing approaches. in this study the identification of genomes of novel circoviruses and cycloviruses in feces of insectivorous bats is reported.
pan-reactive primers targeting the conserved rep region of circoviruses and cycloviruses were used to screen dna from bat fecal samples. using this approach, partial rep sequences were detected which formed five phylogenetic groups distributed among the circovirus and the recently proposed cyclovirus genera of the circoviridae. further analysis using inverse pcr and sanger sequencing led to the characterization of four new putative members of the family circoviridae with genome sizes ranging from 1,608 to 1,790 nt, two inversely arranged orfs, and canonical nonamer sequences atop a stem loop. viruses of the circoviridae family are known to infect a wide range of vertebrates. the virions consist of naked nucleocapsids of about 20 nm in diameter, with a circular single stranded dna (ssdna) genome of approximately 2.0 kb [1]. they have an ambisense genome organization containing two major, inversely arranged open reading frames (orfs), which encode the replicase (rep) and capsid (cap) proteins and are separated by a 3' intergenic region (igr) between the stop codons and a 5' igr between the start codons [2]. some circoviruses are major pathogens of pigs [3] [4] [5], e.g. porcine circovirus 2 (pcv2), which causes either asymptomatic infections or clearly apparent disease which may be responsible for significant economic losses [6] [7] [8] [9] [10]. in birds, avian circoviruses, within the genus gyrovirus, have been identified in a broad range of avian species; one of them, chicken anemia virus (cav), is a major cause of disease, associated with lymphoid depletion, immunosuppression and developmental abnormalities [11] [12] [13] [14] [15]. according to ictv document 2014.006a-gv, it has been proposed to move the genus gyrovirus from the family circoviridae to the family anelloviridae, because recent metagenomic studies on gyroviruses show a very high sequence divergence when compared to other circovirus members.
recent metagenomic approaches, high-throughput sequencing techniques and degenerate pcrs have led to the identification of small circular dna genomes in fecal samples of wild mammals, in insects and in environmental samples [2, 16, 17, 18]. some of the newly described circular genomes are similar to those of circoviruses, but phylogenetically different from the previously known avian and porcine circoviruses [18]. their distinct nucleotide/amino acid composition and genome organization led authors to propose the creation of a new genus within the circoviridae, named cyclovirus. in comparison to members of the genus circovirus, both the rep and cap genes of cycloviruses are smaller, with a shorter or no 3' igr between the stop codons of the two major orfs and a longer 5' igr between the start codons of the two major orfs [2]. sequences related to circoviruses have been identified based on the detection of the conserved rep region involved in rolling circle replication (rcr) [19]. cyclovirus genomes have been detected in samples from wild animals, in human feces and cerebrospinal fluid, and in muscle tissue of farm animals such as chickens, cows, sheep, goats and camels [20, 21]. currently, eight different species of cycloviruses have been detected in winged-insect populations, highlighting that they circulate in a wide host range and possess high genetic diversity [18] [19] [20] [22] [23] [24]. so far, classification within the genus circovirus assigns circoviruses sharing >75% genome-wide nucleotide identity and >70% amino acid identity in the capsid (cap) protein to the same species. although there are no species demarcation criteria for the genus cyclovirus, the taxonomic classification for the family circoviridae considers viruses sharing >60% in their cap amino acid identity level as belonging to distinct genera [19]. in the present article, the detection of ssdna genomes from bat fecal samples is reported.
genome segments were amplified by consensus/degenerate pcr. whole genome sequencing and phylogenetic analyses of the sequences obtained revealed that four of the sequences represent viral genomes of new members of the family circoviridae. permission for this work on protected bats was granted by the health monitoring (cevs-centro estadual de vigilância em saúde) of the brazilian federal state of rio grande do sul. the study did not involve any direct manipulation of bats and relied entirely on the collection of fecal samples from the attic floor. all experiments were performed in compliance with the european convention for the protection of vertebrate animals used for experimental and other scientific purposes (european treaty series-no. 170 revised 2005) and the procedures of the brazilian college of animal experimentation (cobea). we had the owner's permission to access the attic for the purposes of this study. for future surveys in porto alegre, the health monitoring (cevs) will be contacted to obtain the required permissions. a maternity roost of bats was identified in the summer of 2012 in the attic of a private residence in the central area of the municipality of porto alegre, rio grande do sul, southern brazil. the colony was estimated to harbor about 500 insectivorous bats of two species, velvety free-tailed bats (molossus molossus) and brazilian free-tailed bats (tadarida brasiliensis) [25]. species identification was confirmed by dna extraction from fecal pellets, followed by amplification and sequencing of the mitochondrial cytochrome b (cytb) gene as described [26]. one hundred fecal samples were collected from the attic floor as follows: a plastic film was spread on the floor of the attic compartment and fresh droppings were collected with clean disposable forks on the following night. each sample consisted of a pool of 5 fecal droppings; samples were immediately sent to the laboratory and stored at -80°c.
the samples were then thawed, resuspended in 1 ml of hank's balanced salt solution (hbss), vortexed and centrifuged at 10,000 x g for 5 min. the supernatants were then transferred to fresh tubes, filtered through 0.45 μm pore-size syringe filters (fisher scientific, pittsburgh, pa) and submitted to dna extraction. total fecal dna was extracted from 400 μl of the supernatants with phenol-chloroform (invitrogen) [27]. the extracted dna was eluted in 50 μl of te (tris-hydrochloride buffer, ph 8.0, containing 1.0 mm edta), treated with 20 μg/ml of rnase a (invitrogen) and stored at -80°c. subsequently, samples were submitted to amplification in a nested pcr targeting the rep gene of circoviruses/cycloviruses with the following degenerate primers: cv-f1 (5'-ggiayiccicayyticargg-3'), cv-r1 (5'-awccaiccrtaraartcrtc-3'), cv-f2 (5'-ggiayiccicayyticarggitt-3'), and cv-r2 (5'-tgytgytcrtaiccrtcccacca-3') [2]. briefly, the nested pcr was performed as follows: the first reaction was performed in a 25 μl volume containing 20 to 50 ng of sample dna, 1 mm mgcl2, 0.2 μm of each primer (cv-f1 and cv-r1), 1.5 u taq dna polymerase (invitrogen), 10% pcr buffer and 0.6 mm dntps. the cycling conditions were: 5 min at 95°c; 40 cycles of 1 min at 95°c, 1 min at 52°c and 1 min at 72°c; and a final incubation at 72°c for 10 min. for the second (nested) reaction, the 25 μl mix components were: 1 μl of the first reaction product, 1 mm mgcl2, 0.2 μm of each primer (cv-f2 and cv-r2), 1.5 u taq dna polymerase (invitrogen), 10% pcr buffer and 0.6 mm dntps. the cycling conditions were: 5 min at 95°c; 40 cycles of 1 min at 95°c, 1 min at 56°c and 1 min at 72°c; and a final incubation at 72°c for 10 min. products with a size of approximately 400 bp were purified and directly sequenced using primer cv-r2. to confirm the sequences, each product was sequenced three times.
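to make the degenerate primer notation above concrete, the following python sketch counts how many distinct oligonucleotides a degenerate primer stands for, using the first-round primers cv-f1 and cv-r1. it relies on the standard iupac ambiguity codes; treating inosine (i) as a single synthesized base that does not add to the degeneracy is an assumption of this sketch, as are the function and variable names.

```python
# IUPAC nucleotide ambiguity codes and the bases each one stands for.
IUPAC = {
    "a": "a", "c": "c", "g": "g", "t": "t",
    "r": "ag", "y": "ct", "s": "cg", "w": "at", "k": "gt", "m": "ac",
    "b": "cgt", "d": "agt", "h": "act", "v": "acg", "n": "acgt",
}

def degeneracy(primer):
    """Number of distinct sequences represented by a degenerate primer."""
    count = 1
    for base in primer.lower():
        if base == "i":
            # Assumption: inosine is one synthesized base, not a mixture.
            continue
        count *= len(IUPAC[base])
    return count

first_round_primers = {
    "cv-f1": "ggiayiccicayyticargg",
    "cv-r1": "awccaiccrtaraartcrtc",
}
for name, seq in first_round_primers.items():
    print(name, len(seq), degeneracy(seq))
```

a higher degeneracy widens the range of rep variants a primer can anneal to, at the price of a lower effective concentration of each individual oligonucleotide in the reaction.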
samples were sequenced with the big dye terminator cycle sequencing ready reaction (applied biosystems, uk) in an abi-prism 3100 genetic analyzer (abi, foster city, ca), according to the manufacturer's protocol. sequences similar to the rep gene sequences of circovirus-like genomes were aligned to design new sets of primers for inverse pcr (ipcr). the ipcr was carried out in a 25 μl reaction mixture optimized with platinum taq hi-fi (invitrogen™) (cycling conditions are available upon request) and the primer sequences as follows: . standard precautions were taken to avoid contamination and negative controls were added to each batch of reactions. five microliters of the pcr products were electrophoresed in 0.7% agarose gels and the products visualized under uv light after staining with ethidium bromide. amplicons with sizes ranging from 1 to 2 kb were purified and cloned using the pcr 2.1-topo cloning kit (invitrogen™). three insert-containing plasmids of each clone were sequenced with m13 forward and reverse primers as described above. the full-length genome sequences were constructed by "genome walking" using the geneious software (version 7.1.3). putative orfs were identified with the aid of orf finder (ncbi; http://www.ncbi.nlm.nih.gov/gorf/gorf.html). sequence analyses were performed with the blastx software (http://www.ncbi.nlm.nih.gov/blast/). nucleotide sequences were aligned and compared to sequences of human, animal and sewage-associated members of the circoviridae available in the genbank database using clustalw [28]. the alignments were optimized with the bioedit sequence alignment editor program version 7.0.9 [29]. the hairpin and stem-loop structures were identified in mfold [30]. phylogenetic analysis was carried out in mega5 [31]. the confidence of each branch in the phylogeny was estimated with bootstrap values calculated from 2000 replicates.
for the purpose of this work, the samples were named bat circovirus porto alegre (batcv poa), followed by the number of the cluster to which each one was assigned. amplicons of the expected size (about 400 bp) were obtained from 24 of the 100 (24%) fecal samples screened. the amplified dna was sequenced directly. the nucleotide sequences corresponding to part of the rep gene were determined and submitted to genbank (km401658-km401681). blastx analysis showed that these partial rep sequences have an amino acid identity of 10-76% with those of known circoviruses and 87-100% among themselves. a phylogenetic tree was constructed based on the alignment of the deduced amino acid sequences detected here with representative circovirus and cyclovirus sequences (fig. 1). as shown in the tree, five main groups were observed, with clusters ii (4 sequences), vi (3 sequences) and vii (2 sequences) falling into the clade of cycloviruses, in contrast to clusters i (13 sequences) and v (2 sequences), which formed groups distinct and distant from those formed by circoviruses and cycloviruses. the arbitrary division of these sequences into clusters was carried out to analyze their genomic features, assuming that, according to the criteria used for circovirus diversity analysis, sequences with more than 20% divergence are classified as individual viral species [32]. accordingly, we could infer the detection of five potential new species in the bat samples (3 cycloviruses and 2 circoviruses). complete sequencing of virus dna from clusters i and v could not be achieved, probably due to the gc-rich content of the 3' igr, even though pcr amplification before sequencing was attempted, without much success, with varying concentrations of dmso and/or in the presence of 50% 7-deaza-gtp and 50% dgtp (new england biolabs), as performed by rijsewijk et al. [33].
the two predicted orfs, rep and cap, are present and inversely arranged in all sequences, as shown in fig. 2 . the predicted cap protein sequences consist of 197-231 amino acids and share an amino acid identity of 24-76% with the known cycloviruses/circoviruses and 15.5-88.8% among themselves (tables 1 and 2) . the predicted rep protein sequences range from 232 to 280 amino acids and have an amino acid identity of 9.2-44.4% among themselves (tables 1 and 2) . stem-loop structures were found in all 4 bat circular genomes. they have a conserved nonanucleotide motif located at the 5' igr (nantattac) and are considered to be responsible for initiating the rolling-circle replication of circoviruses [18, 34] . as shown in table 3 , all four batcv poa also contain a conserved nonamer sequence in the loop region of the 5' igr, different from the conserved cyclovirus and circovirus nonanucleotide motif sequence, but similar to the loop motif of cycloviruses found in bat, human and chimpanzee feces (batcv poa ii, v, vi) and slightly modified from those of cyclovirus and circovirus (batcv poa i) [2, 17, 18, 20] . the predicted protein sequences encoded by orf2 (cap) and orf1 (rep) of batcv i-vi genomes were used for phylogenetic analysis with representative and recently discovered circoviruses/cycloviruses; pepper golden mosaic virus was used as outgroup, as it is somewhat related to other members of the circoviridae family (fig. 3a, 3b and 3c ). as shown in the trees, batcv poa/2012/ii and vi fell into the cyclovirus clade already identified in chickens, chimps, bats, goats, humans and dragonflies [2, 17, 18, 20, 22] . in the analysis of the cap-encoding region (fig. 3a) , batcv poa/2012/ii was related to a cyclovirus detected in muscle tissues of a goat from pakistan through degenerate/consensus pcr [20] , and batcv poa/2012/vi was more closely related to a dragonfly cyclovirus detected through viral metagenomics [22] .
however, when both genomes were analyzed according to the conserved rep-encoding region, they formed a monophyletic clade (fig. 3b) . on the other hand, batcv poa/2012/i and v fell outside the circovirus and cyclovirus clades and, along with bat circovirus-like virus tm6 and batcv-sc703 [17, 18] , are not yet related to any genus of the circoviridae family. this was confirmed by the alignments of the whole genomes, which produced a similar tree topology (see fig. 3c ). these sequences are closer to sequences detected in guano and fecal samples collected from bats in the united states and china through metagenomic approaches, suggesting that these viruses have the same host origin, likely bats [17, 18] . however, no classification has yet been established for these sequences. in this work we report the discovery of 4 novel circular ssdna genomic sequences from feces of insectivorous bats from brazil. in recent years, many genomes of circoviruses, cycloviruses and rep-containing circular dna viruses have been characterized in mammals, birds, insects and environmental samples [19] , bringing to light a high level of genetic diversity among these viruses [19, 35] . according to our results, two genomes belong to the genus cyclovirus (batcv poa ii and vi). these genomes contain two major orfs arranged in opposite directions and present, in the loop region of the 5' igr of the rep orf, the cyclovirus-conserved nonanucleotide motif (5'-taatactat-3') (table 3) . batcv poa i and v have their cap located on the positive strand and the larger rep located on the minus strand, as expected for circoviruses, but this pattern was not present in batcv poa ii and vi, as shown in table 1 .
the phylogenetic analysis based on the alignments of the complete rep and cap proteins confirms that batcv poa/ii and vi cluster into the genus cyclovirus along with the clade of chinese cyclovirus sequences detected in bat feces [18] , sharing less than 65% identity at the cap/rep amino acid level. batcv poa i and v had low amino acid identity in cap (<20%) and rep (<10%) with the two other sequences detected in bat feces in this study and with known circoviruses/cycloviruses (table 2) . consequently, they formed a distinct clade along with other bat-sourced sequences, expanding the view of diversity in these new ssdna viruses, which are divergent enough at the sequence level that they could very likely be part of a different genus. in our study, we detected cyclovirus and circovirus related sequences at a frequency of 24% in the examined samples. however, due to methodological limitations and the restricted location and variety of bat species, we were not able to extrapolate our results to epidemiological data (such as incidence and prevalence) or to determine to which bat species the ssdna positive samples belonged. as performed by ge et al. in china [36] , further investigation is needed to determine the prevalence of circoviruses in other brazilian bat species. nevertheless, it is clear that such a study is worthwhile for understanding the great diversity of circoviruses found worldwide. our study was based on the phylogenetic analysis and comparison of the sequences recovered. the finding of known insect viruses in bat feces simply reflects the diet of these insectivorous bats, which play an important role in preying on insects. viral dna detection in bat feces does not allow one to differentiate between viral replication in bats and simple passage of ingested food through the digestive tract [20, 35] .
to date, few members of the circovirus genus have been related to severe clinical conditions in animals, notably pcv2 and some of the avian circoviruses [5] . even with the recent discovery of many cycloviruses, circovirus-like or rep-like sequences in a variety of mammalian tissues and feces, including human fecal samples [20, 36, 37] , there is no syndrome yet associated with these viruses. nevertheless, the recent identification of a new cyclovirus in vietnamese and malawian patients with acute central nervous system infection of unknown etiology raises the possibility of disease association, yet to be proven [38, 39] , although possibly with limited geographic distribution [38] . in this work, two more circular dna genomes were characterized which did not fall within the circovirus/cyclovirus clades, grouping instead distantly with tm6 and batcv-sc703 [17, 18] , both also from bat feces. these new genomes have in common the presence, in the rep n-terminus, of the same motifs associated with rolling circle replication (ftlnn, tphlqgy) and dntp-binding (gxgks), as well as the conserved amino acid motifs in the carboxy half of rep associated with 2c helicase function (wwdgy and dryp) [19] . the n-terminal regions of the cap proteins of batcv poa i and v are highly basic and arginine-rich, as is typical of circovirus capsid proteins, with arginine residues making up 36% and 42% (genomes i and v, respectively) of the first 50 aa, in contrast to tm6 (28%) and sc703 (26%). they are also distinguishable by their cap and rep sizes (data not shown), as well as by the low amino acid identity of both proteins: relative to tm6 and sc703, batcv poa i and v show <45% identity for rep and <35% for cap. based on these genome characteristics, even though they are clustered in a separate clade, not yet characterized, they are new viral species.
upon the discovery of other sequences grouping with these genomes, it will be of interest to propose to the international committee on taxonomy of viruses (ictv) the creation of a new genus within circoviridae. here we report the detection, after whole-genome characterization, of four novel circular ssdnas within the family circoviridae from bat feces. so far, it is not clear whether these newly detected ssdnas have an important role in pathogenesis. in addition to bioinformatics analysis, future investigations must include attempts at virus isolation to confirm host origin, which will shed some light on the relationships between these circular dna viruses and bats.
conceived and designed the experiments: fesl spc pmr. performed the experiments: fesl spc hfs tft apmv. analyzed the data: spc ed. contributed reagents/materials/analysis tools: pmr acf. wrote the paper: fesl spc pmr acf ed.
virus taxonomy: classification and nomenclature of viruses
multiple diverse circoviruses infect farm animals and are commonly found in human and chimpanzee feces
pathogenesis of postweaning multisystemic wasting syndrome caused by porcine circovirus 2: an immune riddle
isolation of circovirus from lesions of pigs with postweaning multisystemic wasting syndrome
insights into the evolutionary history of an emerging livestock pathogen: porcine circovirus 2
porcine circoviruses: a review
quantification of porcine circovirus type 2 (pcv2) dna in serum and tonsillar, nasal, tracheo-bronchial, urinary and faecal swabs of pigs with and without postweaning multisystemic wasting syndrome (pmws)
a review of porcine circovirus 2-associated syndromes and diseases
porcine circovirus type 2 associated disease: update on current terminology, clinical manifestations, pathogenesis, diagnosis, and intervention strategies
recent advances in the epidemiology, diagnosis and control of diseases caused by porcine circovirus type 2
psittacine beak and feather disease virus nucleotide sequence analysis and its relationship to porcine circovirus, plant circoviruses, and chicken anaemia virus
cloning and sequencing of duck circovirus (ducv)
genome sequence determinations and analyses of novel circoviruses from goose and pigeon
circoviruses: immunosuppressive threats to avian species: a review
identification of a novel circovirus in australian ravens (corvus coronoides) with feather disease
frequent detection of highly diverse variants of cardiovirus, cosavirus, bocavirus, and circovirus in sewage samples collected in the united states
bat guano virome: predominance of dietary viruses from insects and plants plus novel mammalian viruses
genetic diversity of novel circular ssdna viruses in bats in china
a field guide to eukaryotic circular single-stranded dna viruses: insights gained from metagenomics
possible cross-species transmission of circoviruses and cycloviruses among farm animals
identification of a new cyclovirus in cerebrospinal fluid of patients with acute central nervous system infections
dragonfly cyclovirus, a novel single-stranded dna virus discovered in dragonflies (odonata: anisoptera)
novel cyclovirus discovered in the florida woods cockroach eurycotis floridana (walker)
high global diversity of cycloviruses amongst dragonflies
detection of alphacoronavirus in velvety free-tailed bats (molossus molossus) and brazilian free-tailed bats (tadarida brasiliensis) from urban area of southern brazil
genomic characterization of severe acute respiratory syndrome-related coronavirus in european bats and classification of coronaviruses based on partial rna-dependent rna polymerase gene sequences
molecular cloning: a laboratory manual: cold spring harbor laboratory press
clustal w and clustal x version 2.0
bioedit: a user-friendly biological sequence alignment editor and analysis program for windows 95/98/nt; 1999
"well-determined" regions in rna secondary structure prediction: analysis of small subunit ribosomal rna
mega5: molecular evolutionary genetics analysis using maximum likelihood, evolutionary distance, and maximum parsimony methods
virus taxonomy, ixth report of the international committee for the taxonomy of viruses
discovery of a genome of a distant relative of chicken anemia virus reveals a new member of the genus gyrovirus
rolling-circle replication of an animal circovirus genome in a theta-replicating bacterial plasmid in escherichia coli
rapidly expanding genetic diversity and host range of the circoviridae viral family and other rep encoding small circular ssdna genomes
evaluation of the in vivo radiosensitizing activity of etanidazole using tumor-bearing chick embryo
host effect on the genetic diversification of beet necrotic yellow vein virus single-plant populations
limited geographic distribution of the novel cyclovirus
novel cyclovirus in human cerebrospinal fluid
key: cord-002473-2kpxhzbe authors: das, jayanta kumar; pal choudhury, pabitra title: chemical property based sequence characterization of ppca and its homolog proteins ppcb-e: a mathematical approach date: 2017-03-31 journal: plos one doi: 10.1371/journal.pone.0175031 sha: doc_id: 2473 cord_uid: 2kpxhzbe
periplasmic c7 type cytochrome a (ppca) protein is found in geobacter sulfurreducens along with its four other homologs (ppcb-e). from the crystal structure viewpoint, the observation emerges that ppca protein can bind with deoxycholate (dxca), while its other homologs do not. but the reason behind this is yet to be established with certainty from primary protein sequence information. this study is based primarily on primary protein sequence analysis through the chemical basis of embedded amino acids. firstly, we look for the chemical group specific score of amino acids. along with this, we have developed a new methodology for the phylogenetic analysis based on chemical group dissimilarities of amino acids.
this new methodology is applied to the cytochrome c7 family members and pinpoints how a particular sequence differs from the others. secondly, we build a graph theoretic model using amino acid sequences, which is also applied to the cytochrome c7 family members, and some unique characteristics and their domains are highlighted. thirdly, we search for unique patterns as subsequences which are common among the group or specific to an individual member. in all the cases, we are able to show some distinct features that distinguish ppca from its other homologs and point towards its binding with deoxycholate. similarly, some notable features of the structurally dissimilar protein ppcd compared to the other homologs are also brought out. further, the five members of the cytochrome family, being homolog proteins, must have some common significant features, which are also enumerated in this study.
amino acids play a vital role in determining protein structure and function. but it is informative to know how the functionality of a group of proteins changes as amino acid patterns change from one protein to another. it is hard and mostly time consuming to identify the uniqueness of proteins and their functionality from wet lab experiments while working with the complete sequence. in this regard, several techniques have been developed for the analysis of primary protein sequence that help the biochemist to work with only a specific domain instead of the whole sequence, which reduces the experiment time. geobacter sulfurreducens is one of the predominant metal and sulphur reducing bacteria [1] . the organism geobacter sulfurreducens is known to act as an electron donor and participate in redox reactions [2] . periplasmic c7 type cytochrome a (ppca) protein along with its four additional homologs (ppcb-e: ppcb, ppcc, ppcd, ppce) are identified in the geobacter sulfurreducens genome [3] [4] [5] [6] .
altogether, the five proteins are highly conserved around "heme iv" but are not identical, and mostly differ in two hemes, "heme i" and "heme iii" [4] . each of these two regions is known to interact with its own redox partner. deoxycholic acid (conjugate base deoxycholate), also known as cholanoic acid, is one of the secondary bile acids, which are metabolic byproducts of intestinal bacteria, used in the medical field and for the isolation of membrane associated proteins [7, 8] . among the five members of the cytochrome c7 family, only ppca can interact with deoxycholate (dxca) while its other homologs cannot. it is observed that only a few residues are involved in the interaction with dxca [4, 6, 9] . it would be worthwhile to find the reason for such a striking difference in recognizing a single compound from the amino acid sequence viewpoint. further, one can also look for the reason for the structural dissimilarity of ppcd compared to the other homologs [5] . in the literature, in-silico techniques have been used to tackle various problems through the analysis of dna, rna and protein sequences in the bioinformatics field. in particular, authors have searched for protein blocks which are highly similar and conserved among a sub-group or the entire family [10] [11] [12] [13] . there are twenty standard naturally occurring amino acids, whose diversity gives rise to complexity in the sequences and which have some group specific susceptibility. various reduced alphabet methods have been established which can perform much better under certain conditions [14] [15] [16] [17] . sequence similarity is the most widely used and reliable strategy for characterizing newly determined sequences [18] [19] [20] [21] . finding functional/structural similarity among homolog sequences with low sequence similarity is a major challenge in bioinformatics.
to tackle this problem, several methods have been introduced that can identify homolog proteins which are distantly related in evolutionary terms [22] [23] [24] [25] . in the microrna field, a new technique for identifying microrna precursors has been developed that emphasizes the different data distributions of negative samples [26] . further, phylogenetic analysis has also been studied from different viewpoints to find the evolutionary relationship among various species [27] [28] [29] . some authors have used statistical tools for sequence alignment, alignment-free sequence comparison and phylogenetic trees [30] [31] [32] [33] . although every amino acid has individual activity, the group specific function of amino acids is also evident. methods have been introduced for the 2d graphical representation of dna/rna or protein sequences [34] [35] [36] [37] [38] [39] [40] , based on individual scores and position wise graphical representation. so, in this field, a new methodology that yields distinct findings is always welcome. combining various features for dna, rna and protein sequences, a web server called pse-in-one (http://bioinformatics.hitsz.edu.cn/pse-in-one/home/) has been developed [41] which is user friendly and can be modified by users themselves. recently, the authors have classified the twenty standard amino acids into eight chemical groups and have found some group and/or family specific conserved patterns which are involved in some functional role, especially in motor protein family members [17] . in this study, the previously defined method [17] of reduced alphabets is applied to the cytochrome c7 family protein members. we introduce a new method of phylogenetic analysis based on chemical group dissimilarity of amino acids. in addition, we build the graph from the primary protein sequence.
in designing the graph, we have designated the various chemical groups of amino acids as the vertices of the graph. the primary protein sequence is read as consecutive ordered pairs, serially from the first amino acid to the end of the sequence, and each ordered pair is a directed edge between two nodes, where the nodes of the graph are the chemical groups of the amino acids. the graph is drawn for every individual protein sequence and we look for various unique edges/cycles among the entire family members. any unique finding from the graph may be hypothesized as having a significant functional role in the primary protein sequence, because the variation in the graph is directly affected by the amino acid residues in some specific domain where a change of chemical group has taken place. we highlight all the significant points which differ from one sequence to another. further, working with reduced alphabets and designing the graph involve less complexity and allow easy visualization even when working with larger sequences.
order pair directed graph
a directed graph $G = (V, E)$ is a graph which consists of a set of vertices denoted by $V = \{v_1, v_2, \ldots, v_i\}$, and a set of connected edges denoted by $E = \{e_{1,1}, e_{1,2}, \ldots, e_{i,j}\}$, where an edge $e_{i,j}$ exists if the corresponding two vertices $v_i$ and $v_j$ are connected and the direction of the edge is from the vertex $v_i$ to the vertex $v_j$. from the graph, various graph theoretic properties like edge connectivity, cycles, graph isomorphism etc. can be investigated to differentiate the graphs. given an arbitrary amino acid sequence, it is first transformed into a numerical sequence as described previously, where amino acids are categorized into eight chemical groups according to the side chain/chemical nature of the amino acids [17] . the transformation is done using the following rules (eq 1) as per the classification.
if a particular amino acid is read as $a_i$, then the corresponding transformed group is $g_k$ and the numerical value $k$ is defined by the following eq (1).
$$
k = \begin{cases}
1 & \text{if } a_i \in \{\mathrm{d}, \mathrm{e}\} \\
2 & \text{if } a_i \in \{\mathrm{r}, \mathrm{h}, \mathrm{k}\} \\
3 & \text{if } a_i \in \{\mathrm{y}, \mathrm{f}, \mathrm{w}\} \\
4 & \text{if } a_i \in \{\mathrm{i}, \mathrm{l}, \mathrm{v}, \mathrm{a}, \mathrm{g}\} \\
5 & \text{if } a_i \in \{\mathrm{p}\} \\
6 & \text{if } a_i \in \{\mathrm{m}, \mathrm{c}\} \\
7 & \text{if } a_i \in \{\mathrm{s}, \mathrm{t}\} \\
8 & \text{if } a_i \in \{\mathrm{q}, \mathrm{n}\}
\end{cases} \qquad (1)
$$
here, $g_1, g_2, \ldots, g_8$ are the acidic, basic, aromatic, aliphatic, cyclic, sulfur containing, hydroxyl containing and acidic amide groups respectively [17] . the eight numerical values are considered as the vertices of the graph $G$, i.e. $v_i \in \{1, 2, \ldots, 8\}$. algorithm 1 is used to generate the directed graph from the primary protein sequence using matlab16b software. here, we obtain the graph which is the order pair digraph, because an edge is constructed through the pair (source node, target node) which is obtained from the consecutive order pair list of amino acids in the primary protein sequence. so given an arbitrary amino acid sequence, we can find an order pair directed graph having at most eight vertices/nodes.
output: an adjacency matrix and the corresponding order pair directed graph.
define a null matrix (m) of size 8 by 8;
define a 1-d array (t) of size l;
find x as the chemical group number of $a_i$ using eq (1);
the phylogenetic tree is an acyclic graph showing the evolutionary relationship among various biological species based on their genetic closeness. although various phylogenetic tree methods have already been studied, methods based on the chemical nature of amino acids have, to our knowledge, not yet been explored in the literature. our method of phylogenetic tree formation uses the dissimilarity matrix which is obtained for every pair of sequences on the basis of the chemical group specific score of amino acids. so this method is completely alignment free and requires less computational complexity. firstly, we calculate the percentage of occurrence of amino acids from each chemical group using eq (2) . if there are $n$ sequences which are denoted as $s_1, s_2, \ldots, s_n$, then the corresponding lengths of the sequences are denoted as $l_1, l_2, \ldots, l_n$.
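the surviving steps of algorithm 1 can be sketched in python as a hedged illustration, not as the authors' matlab implementation. the group membership table is an assumption reconstructed from the patterns quoted in this article (for example hkah read as 2242) and from the classification in [17] ; the function names `to_groups`, `order_pair_digraph` and `edges` are hypothetical, and storing edge counts rather than 0/1 flags in the matrix is also an assumption.

```python
# Chemical groups 1..8; membership is an assumption reconstructed from the
# article's patterns (e.g. "hkah" -> 2242) and the classification in ref. [17].
GROUPS = {
    1: "DE",      # acidic
    2: "RHK",     # basic
    3: "YFW",     # aromatic
    4: "ILVAG",   # aliphatic
    5: "P",       # cyclic
    6: "MC",      # sulfur containing
    7: "ST",      # hydroxyl containing
    8: "QN",      # acidic amide
}
GROUP_OF = {aa: k for k, members in GROUPS.items() for aa in members}


def to_groups(sequence):
    """Map a protein sequence to its list of chemical group numbers (1..8)."""
    return [GROUP_OF[aa] for aa in sequence.upper()]


def order_pair_digraph(sequence):
    """Return an 8x8 matrix m where m[x-1][y-1] counts how often group x is
    immediately followed by group y in the sequence (self-loops included)."""
    t = to_groups(sequence)
    m = [[0] * 8 for _ in range(8)]
    for x, y in zip(t, t[1:]):      # consecutive ordered pairs
        m[x - 1][y - 1] += 1
    return m


def edges(m):
    """Set of directed edges (source, target) that occur at least once."""
    return {(i + 1, j + 1) for i in range(8) for j in range(8) if m[i][j]}


if __name__ == "__main__":
    m = order_pair_digraph("HKAH")   # toy fragment, groups 2 2 4 2
    print(sorted(edges(m)))
```

comparing `edges(...)` between two sequences with ordinary set difference gives the edges present in one digraph and absent from the other, which is the kind of comparison the edge tables in the results rely on.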
a particular sequence $s_i$ is read as $s_i = s_i^1 s_i^2 \ldots s_i^{l_i}$; for the sequence $s_1$, the first amino acid is read as $s_1^1$, the second amino acid is read as $s_1^2$, and so on. for each group $g_k$ and a particular sequence $s_i$, we count the total number of amino acids $s_i(t_k)$ and the score per hundred $s_i(g_k)$ using the following eqs (2) and (3) respectively.
$$ s_i(t_k) = \left|\{\, p : 1 \le p \le l_i,\ s_i^p \in g_k \,\}\right| \qquad (2) $$
$$ s_i(g_k) = \frac{s_i(t_k)}{l_i} \times 100 \qquad (3) $$
for example, if the primary protein sequence length is 80 aa, out of which 20 aa are from the acidic group, i.e. $g_1$, then the score per hundred of the acidic group is $\left(\frac{20}{80} \times 100\right) = 25\%$. secondly, we compute the dissimilarity for every possible pair of sequences. the dissimilarity of two sequences $s_i$ and $s_j$ is denoted as $d(s_i, s_j)$. for each group $g_k$, we take the absolute value of the difference between the percentage scores of the two sequences; this is done for all eight chemical groups and all the values are added, as in eq (4).
$$ d(s_i, s_j) = \sum_{k=1}^{8} \left| s_i(g_k) - s_j(g_k) \right| \qquad (4) $$
finally, we get the dissimilarity matrix $D$ of size $n$ by $n$ as shown below.
$$ D(n, n) = \begin{pmatrix} d(s_1, s_1) & \cdots & d(s_1, s_n) \\ \vdots & \ddots & \vdots \\ d(s_n, s_1) & \cdots & d(s_n, s_n) \end{pmatrix} $$
to draw the phylogenetic tree, we use the nearest distance (single linkage) method. the pairwise distances are the entries of the obtained dissimilarity matrix and the whole procedure is written in matlab 2016b software. five homologous triheme cytochromes (ppca-e) are identified in the g. sulfurreducens periplasm, and gene knockout studies revealed their involvement in fe(iii) and u(vi) extracellular reduction [1, 2] . cytochromes have been thoroughly studied in laboratory experiments because of their small size (about 90 amino acids). table 1 shows the gene name, accession number, protein name and length (#amino acids). the primary protein sequences are collected from http://www.uniprot.org/.
sequence identity and the phylogenetic tree
firstly, our analysis is directed to measure the primary protein sequence for every member.
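the scoring and dissimilarity steps of the method section, followed by nearest distance (single linkage) merging, can be sketched in python as follows. this is a minimal illustration under stated assumptions, not the authors' matlab code: the group table is reconstructed from the article and [17] , the toy sequences are hypothetical and are not the ppca-e sequences, and `group_scores`, `dissimilarity`, `dissimilarity_matrix` and `single_linkage` are names introduced here.

```python
# Group membership is an assumption reconstructed from the article and [17].
GROUPS = {1: "DE", 2: "RHK", 3: "YFW", 4: "ILVAG",
          5: "P", 6: "MC", 7: "ST", 8: "QN"}
GROUP_OF = {aa: k for k, members in GROUPS.items() for aa in members}


def group_scores(sequence):
    """Percentage of residues falling in each of the eight groups."""
    counts = [0] * 8
    for aa in sequence.upper():
        counts[GROUP_OF[aa] - 1] += 1            # per-group residue count
    return [100.0 * c / len(sequence) for c in counts]


def dissimilarity(seq_a, seq_b):
    """Sum over the eight groups of absolute score differences."""
    a, b = group_scores(seq_a), group_scores(seq_b)
    return sum(abs(x - y) for x, y in zip(a, b))


def dissimilarity_matrix(sequences):
    """n-by-n matrix of pairwise dissimilarities."""
    return [[dissimilarity(p, q) for q in sequences] for p in sequences]


def single_linkage(names, d):
    """Merge clusters by nearest distance; return the merges in order as
    (members of cluster 1, members of cluster 2, distance)."""
    clusters = [(i,) for i in range(len(names))]
    merges = []
    while len(clusters) > 1:
        best = None
        for p in range(len(clusters)):
            for q in range(p + 1, len(clusters)):
                dist = min(d[i][j] for i in clusters[p] for j in clusters[q])
                if best is None or dist < best[0]:
                    best = (dist, p, q)
        dist, p, q = best
        merges.append((tuple(names[i] for i in clusters[p]),
                       tuple(names[i] for i in clusters[q]), dist))
        merged = clusters[p] + clusters[q]
        clusters = [c for k, c in enumerate(clusters) if k not in (p, q)]
        clusters.append(merged)
    return merges


if __name__ == "__main__":
    names = ["a", "b", "c"]                      # hypothetical toy sequences
    seqs = ["DDKK", "DKKK", "AAAA"]
    for merge in single_linkage(names, dissimilarity_matrix(seqs)):
        print(merge)
```

under this definition the dissimilarity of two sequences lies between 0 (identical group composition) and 200 (no group in common), and no alignment is needed at any step.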
we obtain the percentage identity matrix of every pair of sequences, exported from clustalw. it is observed that the sequences are at least 47% similar. the maximum similarity is 76%, which is found between ppca and ppcb. the ppca sequence shows a minimum of 50% similarity (with ppce) and a maximum of 76% similarity (with ppcb), so we are not able to differentiate ppca from its other homologs using the similarity percentage. secondly, we count the rate of occurrence (frequency) of every individual amino acid in the respective five sequences, which is shown in table 3 . then, we look for the chemical group specific frequency for every sequence, shown in table 4 , using eq (3). now, we obtain the dissimilarity score of every possible pair of sequences (using eq (4)). for example, comparing seq. no. 1 and seq. no. 2, we get a difference of 2.1978 (10.9890-8.7912) for the acidic group, 4.3956 (27.4725-23.0769) for the basic group, and so on (from table 4 ). the total score after summing the eight groups is 17.5824, which measures the dissimilarity percentage of the said two sequences. similar results for all other pairs are shown in table 5 . this table shows the biological distances between each pair of sequences. from this pairwise distance matrix, the phylogenetic tree is constructed as shown in fig 1, as described in the method section. based on the phylogenetic tree of the five members, we find that ppca and ppcd, and ppcb and ppce, are the closest pairs with regard to the frequency of amino acids of the respective eight chemical groups. from fig 1 it is not obvious that ppca differs from the other homologs, but if we go through the dissimilarity matrix (table 5) , we find some variations.
here, it is observed that ppca differs by a minimum of 16.5313% (with ppcd), whereas among the other homologs the minimum dissimilarity, 11.8535%, is found between ppcd and ppcc. therefore, among all the pairs, the high dissimilarity of ppca shows its uniqueness compared to its homologs. if we have a closer look at the list of amino acids, it is observed that the amino acids d, e, h, k, f, i, l, v, a, g, p, m, c, t are present in all the sequences. other amino acids are not common to all the member sequences. therefore, on the basis of chemical groups, all the amino acids from the acidic, aliphatic, cyclic and hydroxyl containing groups are present. it is observed that the percentages of the acidic, basic and hydroxyl containing groups distinctly differ when ppca is compared with the other homologs. further, it is observed that only one proline (p) from the cyclic group is present in ppcd, while in the other homologs proline (p) is present at least 3 times. another important observation is that the amino acid tryptophan (w) from the aromatic group is present only in the ppcd sequence. for every member of the cytochrome c7 family, we draw an order pair directed graph using algorithm 1; these are shown in fig 2. there are a maximum of eight possible nodes and various directed edges among the nodes. we highlight the connected edges that show uniqueness, especially between ppca and its homolog members and, separately, between ppcd and the other members, as well as the edges common to all members. details of the edge connectivity information for ppca and its homologs are shown in table 6 . we say two nodes (direction is from row to column) are connected or present if the cell symbol is 1, not present if the cell symbol is 0, and common to all the members if the cell symbol is *. an edge between two nodes (in order) is basically a pattern (two distinct nodes, or two distinct amino acids from two different chemical groups) of length 2.
fig 2. https://doi.org/10.1371/journal.pone.0175031.g002
table 6. existence of unique edges: comparison between ppca and the ppcb-e group, obtained from the directed graph (fig 2) .
we find two particular edges: one edge (82) is present only in the ppca sequence (approx. residues 41-42, s1 table) and is not found in the other member sequences, and one edge (73) is present in the ppcb-e sequences (approx. residues 25-36, s1 table) but not in the ppca sequence. considering all the members, we find many edges which are common to all. further, ppcd is structurally dissimilar among the homologs [4] . looking into the order pair directed graph, we find only one variation: the edge (54), from node 5 to node 4, is present in the ppca-c and ppce sequences but is not observed in ppcd (table 7) . at this transition, located at approximately residues 54-55 (s1 table) , the amino acid changes from proline (p) to glycine (g) in ppca-c and ppce, whereas in ppcd the transition is from glycine (g) to glycine (g). again, the existence of edges between any two nodes, either common to all or specific to an individual member, has some significant role in the primary protein sequences, because node to node connectivity marks the point of change from one chemical group to another in the primary protein sequence, and this could be an effective characteristic for the structural or functional variation of proteins. although only a few residues are responsible for interacting with dxca, the neighbouring residues must have a role in their unique characteristics. so the identification of subdomains involved in different unique cycles is worth mentioning in this regard. here, we have calculated the various cycles of length $c_l$ ($3 \le c_l \le 6$), group specific and individual member specific, which are shown in s2 table.
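as a hedged illustration of how cycles of length 3 to 6 could be enumerated in an order pair digraph, the python sketch below performs a depth-first search over the edge set. it is not the authors' procedure; `edges_from_groups` and `simple_cycles` are hypothetical names, the digit strings in the usage example are toy inputs rather than the ppca-e sequences, and reporting each cycle starting at its smallest node is a convention chosen here (so the cycle 2 → 3 → 6 → 2 is reported as (2, 3, 6)).

```python
def edges_from_groups(digits):
    """Directed edges (x, y) for each consecutive pair in a string of
    chemical group numbers such as "2362"."""
    t = [int(c) for c in digits]
    return set(zip(t, t[1:]))


def simple_cycles(edge_set, min_len=3, max_len=6):
    """Enumerate simple directed cycles with min_len..max_len edges.
    Each cycle is reported once, as a tuple starting at its smallest node."""
    succ = {}
    for u, v in edge_set:
        if u != v:                               # self-loops cannot lie on a simple cycle
            succ.setdefault(u, set()).add(v)
    found = set()

    def extend(path):
        start, last = path[0], path[-1]
        for nxt in succ.get(last, ()):
            if nxt == start and len(path) >= min_len:
                found.add(tuple(path))           # closing edge completes the cycle
            elif nxt > start and nxt not in path and len(path) < max_len:
                extend(path + [nxt])             # only visit nodes larger than start

    for start in sorted(succ):
        extend([start])
    return found


if __name__ == "__main__":
    print(simple_cycles(edges_from_groups("2362")))
    print(simple_cycles(edges_from_groups("7216457")))
```

with at most eight nodes the search space is tiny, so an exhaustive search like this is adequate; cycles unique to one protein can then be obtained by set difference between the results for different sequences.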
for example, the cycle 7216457 of length 6 consists of the directed edges 7 → 2 → 1 → 6 → 4 → 5 → 7. a particular subdomain is responsible for completing this cycle. interestingly, we find various unique cycles for ppca, ppcd and ppcb-e. so there are some unique cycles which are present for ppca but not for its homolog proteins, and vice versa. there are some unique cycles which are present in ppcd, but no unique cycle is present for ppca-c and ppce. the sub-domains for some of the unique cycles of length 3, 4 and 5 are highlighted in fig 3(a) for ppca and fig 3(b) for ppcd. in fig 3(a) , the cycle (2362) of length 3 has its sub-domain within residues 13 to 48, where the numerical sequence runs 36...62. one can see the corresponding amino acid residues in s1 table. for some cycles, different sub-domains are possible, because some edges repeat at different positions of the sequence and can be counted for the same cycle. similarly, on varying the cycle length, we get different sub-domains or amino acid residues. these sub-domain findings might be of immense help to biochemists for understanding the physicochemical nature and the unique activity of various proteins.
table 7. existence of unique edges: comparison between ppcd and the ppca-c, ppce group, obtained from the directed graph (fig 2) .
we take all five sequences of the ppca-e members and obtain the alignment from clustalw2. the alignment is shown in fig 4. we mark the various conserved blocks as r1, r2. . .r16. highlighted regions enclosed in rectangles are chemically conserved, and regions that are only highlighted are conserved at the level of individual amino acids. we find two highly conserved regions, r13 and r22, which have some variations.
the first region (r13) is a block of 4 residues (hkk/rh or 2222) among the members ppcb-e, where all the amino acids are from the basic group, but in ppca this block is hkah or 2242, i.e. the k/r in the 3rd position is replaced by the aliphatic amino acid alanine (a). the second region (r22) is gchk/e or 4622/1, where the amino acid in the 4th position is from either the basic or the acidic group, i.e. both fall under the charged group. if we look into the ppca sequence, some dissimilarities are found in the "heme i" region [3] [4] [5]. the two consecutive amino acids between regions r14 and r15 in ppca are kk (from the basic group), but for ppcb-e only one of the two amino acids is from the basic group. previously it was observed that ppcd is structurally dissimilar [5], and the authors showed that there is an addition of the amino acid threonine (t) in the ppcd sequence after the r15 region in fig 4. but from the figure we can see that there is another insertion, of the amino acid valine (v), in the region of r8 and r9. besides, various patterns which are present in ppca but not in ppcb-e, and vice versa, are shown in table 8 in bold. for the pattern "624621", which is located in the combined regions r21 and r22 ("heme iii" region), there is a change of amino acid: threonine (t) for ppcd and lysine (k) for the others. apart from these, we find an amino acid deletion for both ppcd and ppce before the "heme iii" region. further, on combining the regions r6, r7 and r8 (pattern "44441"), the change for the ppca sequence is phenylalanine (f), which is from the aromatic group whereas the other sequences are from the aliphatic group, and the change for the ppcd sequence is histidine (h), which is from the basic group whereas the other sequences are also from the aliphatic group. again, in the region between r17 and r18, ppcd contains the amino acid methionine (m) from the sulfur-containing group while the other homologs contain phenylalanine (f) from the aromatic group.
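the effect of the reduced alphabet on block comparison can be illustrated with a small python sketch. it is not the authors' code: the function names are hypothetical, and the group map below is deliberately partial, containing only the assignments that can be read off the examples quoted in the text (basic h, k, r as 2; acidic e as 1; aliphatic a, g as 4; proline as 5; cysteine as 6). the complete eight-group table is the one in s1 table and is not reproduced here.

```python
# Partial amino-acid -> chemical-group map, inferred from the examples in the text
# (hkah -> 2242, the g/c/h/e/k block -> 4622 or 4621, proline -> glycine as edge 54).
# Assumption: the remaining residues and groups follow s1 table, which is not shown here.
GROUP_OF = {
    "e": "1",                      # acidic
    "h": "2", "k": "2", "r": "2",  # basic
    "a": "4", "g": "4",            # aliphatic
    "p": "5",                      # proline
    "c": "6",                      # cysteine
}


def encode(block):
    """Translate an amino-acid block into its numerical (chemical-group) string."""
    return "".join(GROUP_OF[aa] for aa in block.lower())


def percent_identity(x, y):
    """Percentage of positions at which two equal-length strings agree."""
    if len(x) != len(y):
        raise ValueError("strings must be aligned to the same length")
    matches = sum(a == b for a, b in zip(x, y))
    return 100.0 * matches / len(x)


# k and r differ as residues but share the basic group; a does not
residue_level = percent_identity("hkkh", "hkrh")
group_level = percent_identity(encode("hkkh"), encode("hkrh"))
ppca_vs_homolog = percent_identity(encode("hkah"), encode("hkkh"))
```

in this toy comparison the substitution of k by r lowers identity at the residue level but not at the group level, whereas the k to a change in ppca remains visible after encoding, which is the kind of group-specific change the text highlights.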
altogether, group specific changes have significant role towards the binding with the dxca for ppca and the structural dissimilarity of ppcd. in this work, we have presented the sequence based characterization of cytochromes c7 family members. we specifically emphasize the distinguished features of ppca and ppcd compared to the other homologs. although the study suggests that percent identity among the five members varies between 46% and 75%, on the basis of chemical groups these are shown between 75% and 89%. we highlight some of the chemical groups and their percentage that can distinguish ppca and ppcd. the dissimilarity features of ppca may play significant role towards its binding with dxca. similar is the case that may happen for ppcd for its structural dissimilarity. our proposed graph theoretic model can easily show the instant change of amino acids from one group to the other in the sequences. further, the unique cycles for ppca and ppcd may expose their outstanding nature. and finally from the alignment graph, chemically conserved regions are highlighted. we observe some special patterns where amino acid(s) from some of the sequences are abruptly changed. all the cases will provide the features for ppca and ppcd that would explain their unique functionality and/or structural dissimilarity. it may be noted that there are some existing methodologies [11, 14, 16, 20, 22, 25, 30] which would reflect the sequence pattern information or key features of the observed sequence. many characteristics of the dna, rna and protein sequences can be found out from the web servers and standalone existing tools, one of the important web servers in this regard is defined in [41] . we look at the problem in a different manner, one dealing with embedded chemical properties of amino acids and various mathematical structures. in general, methodology defined in this article is very easy to implement to get the unique features of observed sequences. 
so, collectively, our methodology can be combined with machine learning algorithms to develop refined computational predictors. hence, the reduced amino acid alphabet technique, which has a mathematical basis and embeds the chemical properties of amino acids, will be very useful for protein homology detection.

supporting information
s1 table. amino acids and transformed numerical sequence based on eight chemical groups for the five c7 members. (pdf)
s2 table. unique cycles for ppca-e, ppca, ppcb-e, ppcd. these cycles are involved in various sub-domains, some of which are shown in fig 3. (pdf)

electricity production by geobacter sulfurreducens attached to electrodes geobacter sulfurreducens sp. nov., a hydrogen-and acetate-oxidizing dissimilatory metal-reducing microorganism. applied and environmental microbiology thermodynamic characterization of a triheme cytochrome family from geobacter sulfurreducens reveals mechanistic and functional diversity family of cytochrome c 7-type proteins from geobacter sulfurreducens: structure of one cytochrome c 7 at 1.45 å resolution structural characterization of a family of cytochromes c 7 involved in fe (iii) respiration by geobacter sulfurreducens structure of a novel dodecaheme cytochrome c from geobacter sulfurreducens reveals an extended 12nm protein with interacting hemes lipomas treated with subcutaneous deoxycholate injections guide to protein purification dissecting the functional role of key residues in triheme cytochrome ppca: a path to rational design of g.
sulfurreducens strains with enhanced electron transfer capabilities conservation within the myosin motor domain: implications for structure and function identification of common molecular subsequences selection of conserved blocks from multiple alignments for their use in phylogenetic analysis amino acid substitution matrices from protein blocks reduction of protein sequence complexity by residue grouping reduced amino acid alphabets exhibit an improved sensitivity and selectivity in fold assignment protein sequence analysis based on hydropathy profile of amino acids mathematical characterization of protein sequences using patterns as chemical group combinations of amino acids an introduction to sequence similarity ("homology") searching. current protocols in bioinformatics similarity/dissimilarity studies of protein sequences based on a new 2d graphical representation improved tools for biological sequence comparison analysis of similarity/dissimilarity of protein sequences combining evolutionary information extracted from frequency profiles with sequence-based kernels for protein remote homology detection application of learning to rank to protein remote homology detection. bioinformatics protein remote homology detection by combining chou's pseudo amino acid composition and profile-based protein representation a comprehensive review and comparison of different computational methods for protein remote homology detection imirna-ssf: improving the identification of microrna precursors by combining negative sets with different distributions phylogenetic analysis of protein sequence data using the randomized axelerated maximum likelihood (raxml) program. 
current protocols in molecular biology phylogenetic analysis of protein sequences based on conditional lz complexity analyzing and synthesizing phylogenies using tree alignment graphs a probabilistic measure for alignment-free sequence comparison simplification of protein sequence and alignment-free sequence analysis phylogenies and the comparative method progressive sequence alignment as a prerequisitetto correct phylogenetic trees graph theory with applications to engineering and computer science protein flexibility predictions using graph theory dictionary of protein secondary structure: pattern recognition of hydrogenbonded and geometrical features use of information discrepancy measure to compare protein secondary structures 2-d graphical representation of protein sequences and its application to coronavirus phylogeny a 2d graphical representation of protein sequence and its numerical characterization similarity/dissimilarity analysis of protein sequences based on a new spectrum-like graphical representation pse-in-one: a web server for generating various modes of pseudo components of dna, rna, and protein sequences we thank dr. pokkuluri, phani raj (argonne lab, usa) for the initial discussions of the problem. key: cord-000257-ampip7od authors: bagowski, christoph p; bruins, wouter; te velthuis, aartjan j.w title: the nature of protein domain evolution: shaping the interaction network date: 2010-08-17 journal: curr genomics doi: 10.2174/138920210791616725 sha: doc_id: 257 cord_uid: ampip7od the proteomes that make up the collection of proteins in contemporary organisms evolved through recombination and duplication of a limited set of domains. these protein domains are essentially the main components of globular proteins and are the most principal level at which protein function and protein interactions can be understood. 
an important aspect of domain evolution is their atomic structure and biochemical function, which are both specified by the information in the amino acid sequence. changes in this information may bring about new folds, functions and protein architectures. with the present and still increasing wealth of sequences and annotation data brought about by genomics, new evolutionary relationships are constantly being revealed, unknown structures modeled and phylogenies inferred. such investigations not only help predict the function of newly discovered proteins, but also assist in mapping unforeseen pathways of evolution and reveal crucial, co-evolving interand intra-molecular interactions. in turn this will help us describe how protein domains shaped cellular interaction networks and the dynamics with which they are regulated in the cell. additionally, these studies can be used for the design of new and optimized protein domains for therapy. in this review, we aim to describe the basic concepts of protein domain evolution and illustrate recent developments in molecular evolution that have provided valuable new insights in the field of comparative genomics and protein interaction networks. the protein universe is the collection of proteins of all biological species that exist or have once existed on earth [1] . our sampling and understanding of it began over half a century ago, when the first peptide and protein sequences were determined by sanger [2, 3] and, subsequently, the sequencing of rna and dna [4] [5] [6] . in the meantime, the genome projects of the last decade have uncovered an overwhelming amount of sequence data and researchers are now starting to address a series of fundamental questions that should shed light onto protein evolution processes [7] [8] [9] [10] . for instance, how many gene encoding sequences are present in one genome? how many sequences are repetitive and are these sequences similar in the various organisms on earth? 
which genes were involved in the large scale genome duplications that we see in animals? a comparison of sequences for evolutionary insight is best achieved by looking at the structural and functional (sub)units of proteins, the protein domains. by convention, domains are defined as conserved, functionally independent protein sequences, which bind or process ligands using a core structural motif [11] [12] [13] . examples of domain modes of actions in signaling cascades for instance, are to connect different components into a larger complex or to bind signaling-molecules [14, 15] . protein domains can usually fold independently, likely due to their relatively limited size, and are well known to behave as independent genetic elements within genomes [16, 17] . the sum of these features makes protein domains readily identifiable from raw nucleotide and amino acid sequences and many protein family resources (e.g., superfamily and smart [see table 1 ]) indeed fully rely on such sequence similarity and motif identifications [18, 19] . the algorithms that are used for domain identification are built around a set of simple assumptions that describe the process of evolution. in general, evolution is believed to form and mold genomes largely via three mechanisms, namely i) chemical changes through the incorporation of base analogs, the effects of radiation or random enzymatic errors by polymerases, ii) cellular repair processes that counter mutations, and iii) selection pressures that manifest themselves as the positive or negative influence that determines whether the mutation will be present in subsequent generations [20, 21] . by definition, each of these phenomena styles, reproductive strategies, or the lack of apparent polymerase-dependent proofreading such as in positivestranded rna viruses [22] [23] [24] [25] . consequently, substitution rates need therefore be calculated to correctly compare two or more sequences and hunt uncharted genomes for comparable domains. 
particularly this last strategy, using general rate matrices like blosum and pam, is an elegant example of how new protein functions can be discovered [26] [27] [28] [29] [30]. fast algorithms for pair-wise alignments can be found in the basic local alignment search tool (blast), whereas multiple sequence alignments (msas, fig. 1a), in which multiple sequences are compared simultaneously, are commonly created with, for example, clustalx and muscle (see table 1) [31] [32] [33] [34]. close relatives, sharing an overall sequence identity above, for example, 50% and a set of functional properties, can also be grouped into families and subfamilies. in turn, these families also share evolutionary relationships with other domains and together form so-called domain superfamilies [18, 35]. evolutionary distances between related domain sequences can easily be estimated from sequence alignments, provided that the correct rate assumptions are made. subsequently, these can be used to compute the phylogenies of the domains that share an evolutionary history. these often tree-like graphs (fig. 1b) depend heavily on rate variation models, such as molecular clocks or relaxed molecular clocks (e.g., maximum likelihood and bayesian estimation), which are calibrated with additional evidence such as fossils and may therefore also provide valuable information on aspects like divergence times and ancestral sequences [36] [37] [38]. commonly used phylogenetic analysis strategies are listed in table 1. a limitation of all inferred phylogenetic data is that it is directly dependent on the alignment and less so on the programs used to build the phylogenetic tree [39].

fig. (1a). it was computed using bayesian estimation and presents the best-supported topology for the alignment. numbers indicate % support by the two methods used, while # indicates gene duplication events in the common ancestor and * marks a species-specific duplication event. for computational details, please see [42].
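as a minimal illustration of how a distance follows from an alignment once a rate assumption is fixed, the python sketch below computes the proportion of differing sites p between two aligned sequences and the poisson-corrected distance d = -ln(1 - p). the poisson correction is chosen here only because it is the simplest textbook assumption (equal substitution rates at all sites and between all residues); it is not the model used by the programs in table 1, and the function names are hypothetical.

```python
import math


def p_distance(seq_a, seq_b, gap="-"):
    """Proportion of differing sites between two aligned sequences.

    Columns in which either sequence has a gap are ignored.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must come from the same alignment")
    columns = [(x, y) for x, y in zip(seq_a, seq_b) if x != gap and y != gap]
    if not columns:
        raise ValueError("no ungapped columns to compare")
    return sum(x != y for x, y in columns) / len(columns)


def poisson_distance(seq_a, seq_b):
    """Poisson-corrected distance d = -ln(1 - p), in substitutions per site.

    Assumption: substitutions are equally likely at every site; the correction
    accounts for multiple substitutions that hit the same site.
    """
    p = p_distance(seq_a, seq_b)
    if p >= 1.0:
        raise ValueError("sequences are saturated; distance is undefined")
    return -math.log(1.0 - p)


# toy usage: one difference in ten aligned sites
d = poisson_distance("ACDEFGHIKL", "ACDEFGHIKV")
```

for closely related sequences d is only slightly larger than p; the correction grows as the sequences diverge, which is why the choice of rate assumption matters most for distant domain relatives.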
one of the shortcomings of automated alignments may thus derive from the fact that they commonly employ a scoring and penalty procedure to find the best possible alignment, since these parameters vary from species to species [22, 23] , as mentioned above. careful inspection of alignments is therefore advisable, even though software has been developed that combines the alignment procedure and phylogenetic analysis iteratively in one single program [40] . although sequence and phylogenetic analysis provide a relatively straightforward way for looking at domain divergence, comparison of solved protein structures has shown that protein tertiary organizations are much more conserved (>50%) than their primary sequence (>5%) [41] . for this reason, protein structures and their models provide significantly more insight into the relations of protein domains and how domain families diverged [16] . for example, the inactive guanylate kinase (gk) domain present in the maguk family was shown to originate from an active form of the gk domain residing in ca2+ channel beta-subunits (cacnbs) through both sequence and structural comparison [42] . furthermore, identification of functionally or structurally related amino acid sites in a fold sheds light on the complex, co-evolutionary dynamics that took place during selection [43] . as described above, the evolution of a protein domain is generally the result of a combination of a series of random mutations and a selection constraint imposed on function, i.e., the interaction with a ligand. the interaction between protein and ligand can be imagined as disturbances of the protein's energy landscape, which in turn bring about specific, three-dimensional changes in the protein structure [44, 45] . binding energies however, need not be smoothly distributed over the protein's binding pocket as a limited number of amino acids may account for most of the free-energy change that occurs upon binding [45] [46] [47] . 
in these cases, new binding specificities (including loss of binding) may therefore arise through mutations at these hot spots. an example is a recent study of the pdz domain in which it was shown that only a selected set of residues, and in particular the first residue of α-helix 2 (αb1), directly confers binding to a set of c-terminal peptides [48]. the folding of a domain is essentially based on a complex network of sequential inter-molecular interactions in time [49]. this has of course significant implications for domain integrity, particularly if one assumes that the core of a protein domain is and has to be largely structurally conserved. indeed, even single mutations that arise in this area may easily derail the folding process, either because their free energy contribution influences residues in the direct vicinity or disturbs connections higher up in the intermolecular network [49]. it is therefore hypothesized that protein evolution took place at the periphery of the protein domain core, and that gradual changes via point mutations, insertions and deletions in surface loops brought about the evolutionary distance we see among proteins to date [21, 50, 51, 52]. however, distant sites also contribute to the thermodynamics of catalytic residues. this is achieved through a mechanism called energetic coupling, which is shaped by a continuous pathway of van der waals interactions that ultimately influences residues at the binding site with similar efficiency as the thermodynamic hotspots [53, 54]. indeed in such cases, evolutionary constraints are not placed on merely one amino acid in the binding pocket, but on two or more residues that can be shown to be statistically coupled in msas [54, 55]. in addition to contributions to binding, these principles also explain why the core of a domain structure will remain largely conserved, while at functionally related places residues can (rapidly) co-evolve with an overall neutral effect [56].
of course, these aspects of co-evolution are also of practical consequence for structure prediction and rational drug design [43] . through selective mutation, protein domains have been the tools of evolution to create an enormous and diverse assembly of proteins from likely an initially relatively limited set of domains. the combined data in genbank and other databases now covers over 200.000 species with at least 50 complete genomes and this greatly facilitates genome comparisons [57] [58] [59] . following such extensive comparisons, currently > 1700 domain superfamilies are recognized in the recent release of the structural classification of proteins (scop) [60] and it has become clear that many proteins consist of more than one domain [17, 61, 62] . indeed, it has been estimated that at least 70% of the domains is duplicated in prokaryotes, whereas this number may even be higher in eukaryotes, likely reaching up to 90% [35] . there are various mechanisms through which protein domain or whole proteins may have been duplicated. on the largest scale, whole genome duplication such as those seen in the vertebrate genomes duplicated whole gene families, including postsynaptic proteins, hormone receptors and muscle proteins, and thereby dramatically increased the domain content and expanded networks [42, 63, 64] . on the other end of the scale, domains and proteins have been duplicated through genetic mechanisms like exon-shuffling, retrotranspositions, recombination and horizontal gene transfer [65] [66] [67] . since the genetic forces, like exon-shuffling and genome duplication vary among species, the total number of domains and the types of domains present fluctuate per genome. interestingly, comparative analyses of genomes have shown that the number of unique domains encoded in organisms is generally proportional to its genome size [60, 68] . 
within genomes, the number of domains per gene, the so-called modularity, is related to genome size via a power-law, which is essentially the relation between the frequency f and an occurrence x raised to a scaling constant k (i.e., f(x) ∝ x^k) [69, 70]. a similar correlation is found when the multi-domain architecture is compared to the number of cell types that is present in an organism, i.e., the organism complexity, or when the number of domains in an abundant superfamily is plotted against genome size (fig. 2) [71, 72]. given the amount of domain duplication and apparent selection for specific multi-domain encoding genes in, for example, vertebrates, it may come as little surprise that not all domains have had the same tendency to recombine and distribute themselves over the genomes [68, 73]. in fact, some are highly abundant and can be found in many different multi-domain architectures, whereas others are abundant yet confined to a small sample of architectures or not abundant at all [68, 70]. is there any significant correlation between the propensity to distribute and the functional roles domains have in cellular pathways? some of the most abundant domains can be found in association with cellular signaling cascades and have been shown to accumulate non-linearly in relation to the overall number of domains encoded or the genome size [70]. additionally, the onset of the exponential expansion of the number of abundant and highly recombining domains has been linked to the appearance of multicellularity [70]. a recurring theme among these abundant domains is the function of protein-protein interaction, and it appears that these, usually globular, domains have been particularly selected for in more complex organisms [70].
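a power-law of the form f(x) = a · x^k becomes a straight line on log-log axes, log f = log a + k · log x, so the scaling constant k can be estimated as the slope of an ordinary least-squares line through the log-transformed data. the python sketch below shows this under stated assumptions: the function name and the prefactor a are introduced here for illustration, the data are synthetic, and a simple log-log regression is only one of several ways to fit a power-law.

```python
import math


def fit_power_law(xs, fs):
    """Estimate (k, a) in f(x) = a * x**k by least squares on log-log axes.

    All xs and fs must be strictly positive.
    """
    log_x = [math.log(x) for x in xs]
    log_f = [math.log(f) for f in fs]
    n = len(log_x)
    mean_x = sum(log_x) / n
    mean_f = sum(log_f) / n
    s_xx = sum((u - mean_x) ** 2 for u in log_x)
    s_xf = sum((u - mean_x) * (v - mean_f) for u, v in zip(log_x, log_f))
    k = s_xf / s_xx                      # slope = scaling constant
    a = math.exp(mean_f - k * mean_x)    # intercept = log of the prefactor
    return k, a


# synthetic data that follow f(x) = 3 * x**1.5 exactly
xs = [1.0, 2.0, 4.0, 8.0, 16.0]
fs = [3.0 * x ** 1.5 for x in xs]
k_hat, a_hat = fit_power_law(xs, fs)
```

with real genome data the points scatter around the line, and the quality of the fit (not computed here) should be checked before a power-law relation is claimed.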
this positive relation is underlined by the association of these abundant domains with disease such as cancer and gene essentiality as the highly interacting proteins that they are part of have central places in cascades and need to orchestrate a high number of molecular connections [74, 75] . their shape and coding regions, which usually lie within the boundaries of one or two exons, make them ideally suited for such a selection, since domains are most frequently gained through insertions at the n-or c-terminus and through exon shuffling [76] [77] [78] . from a mutational point of view, protein-protein interaction domains are different from other domains as well and this appears to be particularly true for the group of small, relatively promiscuous domains like sh3 and pdz. these domains are promiscuous in the sense that they both tend to physically interact with a large number of ligands [79, 80] and are prone to move through the genome to recombine with many other domains. it has been found that particularly these domains evolve more slowly than non-promiscuous domains [70] . this likely stems from the fact that they are required to participate in many different interactions, which makes selection pressures more stringent and the appearance of the branches on phylogenetic trees relatively short and more difficult to assess when co-evolutionary data in terms of other domains in the same gene family or expression patterns is limited [42, 63] . non-promiscuous domains on the other hand can quite easily evade the selection pressure by obtaining compensatory mutations either within themselves or their specific binding partner [70] . the overall phenomenon that the number of protein domains and their modularity increases as the genome expands has not been linked to a conclusive biological explanation yet. 
a rationale for the increase in interactions and functional subunits, however, may derive from the paradoxical absence of correlation between the number of genes encoded and organism complexity, the so-called g-value paradox [81] . there is indeed evidence that domains involved in the same functional pathway tend to converge in a single protein sequence, which would make pathways more controllable and reliable without the need for supplementary genes [73] . additionally, the number of different arrangements found in higher eukaryotes is, given the vast scale of unique domains present, relatively limited. this in turn implies that evolutionary constraints have played an important role in selecting the right domain combinations and the right order from n-to c-terminus in multi-domain proteins [13, 82] . in fact, the ordering and co-occurrence of domains was demonstrated to hold enough evolutionary information to construct a tree of life similar to those based on canonical sequence data [70] . furthermore, the increased use of alternative splicing and exon skipping in higher eukaryotes likely supplied a novel way of proteome diversification by restricting gene duplication and stimulating the formation of multi-domain proteins [83, 84] . in plants, however, the latter notion is not supported since both mono-and dicots show limited alternative splicing and a more extensive polyploidy [85] [86] [87] . it is clear that some of the above characteristics are underappreciated in the phylogenetic analysis of linear amino acid sequences. moreover, the effects of evolution extend even further than these aspects and entail transcriptional and translational regulation, intramolecular domain-domain interactions, gene modifications and post-translational protein modifications [88] [89] [90] [91] [92] [93] [94] [95] [96] . new methods are thus being developed to take into account that when sequences evolve, their close and distant functional relationships evolve in parallel. 
correlations of mutations have already been found between residues of different proteins [97, 98] and compensating mutational changes at an interaction interface were shown to rescue an unstable complex [99]. these observations are evidence for the current evolutionary models for the protein-protein interaction (ppi) networks that are being constructed through large-scale screens [100] [101] [102]. in these, a gene duplication or domain duplication (depending on the resolution of the network) implies the addition of a node, while the deletion of a gene or domain reduces the number of links in the network (fig. 3). in the next step, extensive network rewiring may take place, driven by the effect of node addition or node loss in the network (i.e., the duplicability or essentiality of a domain/protein) and mutations in the domain-interaction interface [67, 74, 103, 104, 105]. beyond mutations at the domain and protein level, regulation of protein expression provides another vital mechanism through which protein networks can evolve. microarray studies are now well under way to map genome-wide expression levels of related and non-related genes under a variety of conditions [91, 94, 95, 96]. for example, transcriptional comparisons have investigated aging [106] and pathogenicity [107]. unfortunately, given the highly variable nature of gene expression and the fact that different species may respond differently to external stimuli, such comparisons can only be performed under strictly controlled research conditions. to date most studies have therefore focused on the embryogenesis, metamorphosis, sex-dependency and mutation rates of subspecies [94, 108, 109, 110, 111]. other studies have revealed valuable information on promoter types and duplication events [91] [92] [93] [94].
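the node and link events of these network models can be made concrete with a small python sketch. it is a toy illustration only: the network is an undirected graph stored as a dictionary of partner sets, the function names are hypothetical, and the rule that a duplicate initially inherits all links of its parent is a common modelling assumption rather than something stated in the text.

```python
def duplicate_node(network, node, copy_name):
    """Duplication: add copy_name, which inherits the links of node (assumption)."""
    updated = {name: set(partners) for name, partners in network.items()}
    updated[copy_name] = set(network[node])
    for partner in network[node]:
        updated[partner].add(copy_name)
    return updated


def delete_node(network, node):
    """Deletion: remove node and every link that involved it."""
    return {name: partners - {node}
            for name, partners in network.items() if name != node}


def set_link(network, a, b, present):
    """Rewiring: gain (present=True) or loss (present=False) of the link a-b,
    e.g. after a mutation in a binding interface."""
    updated = {name: set(partners) for name, partners in network.items()}
    if present:
        updated[a].add(b)
        updated[b].add(a)
    else:
        updated[a].discard(b)
        updated[b].discard(a)
    return updated


def link_count(network):
    """Number of undirected links (no self-interactions assumed)."""
    return sum(len(partners) for partners in network.values()) // 2


# toy network: a interacts with b and c
net = {"a": {"b", "c"}, "b": {"a"}, "c": {"a"}}
after_duplication = duplicate_node(net, "a", "a2")               # a2 also binds b and c
after_divergence = set_link(after_duplication, "a2", "c", False)  # a2 loses one link
after_deletion = delete_node(after_divergence, "a")
```

each function returns a new network rather than modifying its input, so successive evolutionary steps can be compared side by side.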
to overcome the limitations mentioned in the previous paragraph, the analysis of co-expression data has been developed to supplement the direct comparison of individual gene expression changes [95]. in this procedure, a co-expression analysis of gene pairs within each species precedes the cross-comparison of the different organisms in the study. this approach thus primarily focuses on the similarities and differences of the orthologous genes within the network, is therefore ideally suited for the study of protein domain evolution, and has already revealed that species-specific parts of an expression network resulted from a merge of conserved and newly evolved modules [95, 112, 113].

fig. (3). evolutionary models for protein-protein interactions. the evolution of protein networks is tightly coupled to the addition or deletion of nodes. additionally, events that introduce mutations in binding interfaces of proteins may result in the addition or loss of links in the network. node addition may take place through e.g., domain duplication or horizontal gene transfer, while rewiring of the network is mediated by point mutations, alternative splice variants and changes in gene expression patterns.

finding evolutionary relationships between protein domains is mostly based on orthology and thus commonly performed on best sequence matches. identifying these and categorizing them depends largely on multiple sequence alignments and this will in most cases give good indications for function, fold and ultimately evolution. however, this approach usually discards apparent ambiguities that arise from species-specific variations (e.g., due to population size, metabolism or species-specific domain duplications or losses) and may therefore introduce significant biases [114]. biases may also derive from the method of alignment, the rate variation model used to infer the phylogeny, and the sample size used to build the alignment [39, 40, 115].
care should therefore be taken to not regard orthology as a one-to-one relationship, but as a family of homologous relations [91] , to select for appropriate analysis methods [39, 115] and extend comparative data to protein interactions and expression profiles [91] . indeed, as our wealth of biological information expands, our systems perspective will improve and provide us with an opportunity to reveal protein domain evolution at the level network organization and dynamics. large-scale expression studies are beginning to show us evolutionary correlations between gene expression levels and timings [94, 106, 107, 112, 116] , while others demonstrate spatial differences between paralogs or (partial) overlap between interaction partners [117] [118] [119] [120] . indeed, when we are able to map the spatiotemporal aspects of inter-and intra-molecular interactions we will begin to fully understand the versatile power of evolution that shaped the protein universe and life on earth [118] . phylogenetic continuum indicates "galaxies" in the protein universe: perliminary results on the natural group structures of proteins the chemistry of amino acids and proteins some peptides from insulin nucleotide sequence from the coat protein cistron of r17 bacteriophage rna use of dna polymerase i primed by a synthetic oligonucleotide to determine a nucleotide sequenc of phage fl dna dna sequencing with chain-terminating inhibitors the genome sequence of drosophila melanogaster flybase: genomes by the dozen initial sequencing and comparative analysis of the mouse genome insights into social insects from the genome of the honeybee apis mellifera selectivity and promiscuity in the interaction network mediated by protein recognition modules modular peptide recognition domains in eukaryotic signaling the multiplicity of domains in proteins the modular nature of apoptotic signaling proteins regulatory potential, phyletic distribution and evolution of ancient, intracellular 
smallmolecule-binding domains protein families and their evolution: a structural perspective the folding and evolution of multidomain proteins the superfamily database in 2007: families and functions smart: identification and annotation of domains from signalling and extracellular protein sequences comparative genomics: genome-wide analysis in metazoan eukaryotes distribution of indel lengths heterogeneity of nucleotide frequencies among evolutionary lineages and phylogenetic inference review of concepts, case studies and implications who do species vary in their rate of molecular evolution infidelity of sars-cov nsp14-exonuclease mutant virus replication is revealed by complete genome sequencing who's your neighbor? new computational approaches for functional genomics protein function in the post-genomic era the role of pattern databases in sequence analysis gene ontology: tool for the unification of biology unique and conserved features of genome and proteome of sarscoronavirus, an early split-off from the coronavirus group 2 lineage phylip version 3.63. 
deptartment of genetics gapped blast and psi-blast: a new generation of protein database search programs the clustal_x windows interface: flexible strategies for multiple sequence alignment aided by quality analysis tools comparison of methods for searching protein sequence databases an insight into domain combinations evolutionary trees from dna sequences: a maximum likelihood approach mrbayes: bayesian inference of phylogenetic trees mammalian evolution and biomedicine: new views from phylogeny multiple sequence alignment: in pursuit of homologous dna positions bayesian coestimation of phylogeny and sequence alignment the relation between the divergence of sequence and structure in proteins molecular evolution of the maguk family in metazoan genomes why should we care about molecular coevolution the propagation of binding interactions to remote sites in proteins: analysis of the binding of the monoclonal antibody d1.3 to lysozyme structural stability of binding sites: consequences for binding affinity and allosteric effects revealing the architecture of a k+ channel pore through mutant cycles with a peptide inhibitor structural plasticity in a remodeled protein-protein interface a specificity map for the pdz domain family the linkage between protein folding and functional cooperativity: two sides of the same coin? empirical and structural models for insertions and deletions in the divergent evolution of proteins analysis of insertions/deletions in protein structures structural similarity of loops in protein families: toward the understanding of protein evolution the effect of inhibitor binding on the structural stability and cooperativity of the hiv-1 protease evolutionary conserved pathways of energetic connectivity in protein families how frequent are correlated changes in families of protein sequences? 
an improved method for determining codon variability in a gene and its application to the rate of fixation of mutations in evolution evolution of vertebrate genes related to prion and shadoo proteins--clues from comparative genomic analysis evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes data growth and its impact on the scop database: new developments estimating the number of protein folds and families from complete genome data insights into the molecular evolution of the pdz-lim family and indentification of a novel conserved protein motif independent elaboration of steroid hormone signaling pathways in metazoans integration of horizontally transferred genes into regulatory interaction networks takes many million years prokaryotic evolution in light of gene transfer how the global structure of protein interaction networks evolves the impact of comparative genomics on our understanding of evolution modular genes with metazoan-specific domains have increased tissue specificity evolution of protein domain promiscuity in eukaryotes the structure of the protein universe and genome evolution modules, multidomain proteins and organismic complexity detecting protein function and protein-protein interaction from genome sequences lethality and centrality in protein networks comparative genomics of centrality and essentiality in three eukaryotic protein-interaction networks domain deletions and substitutions in the modular protein evolution genome evolution and the evolution of exon-shuffling-a review significant expansion of exon-bordering protein domains during animal proteome evolution thermodynamic basis for promiscuity and selectivity in protein-protein interactions: pdz domains, a case study promiscuous binding nature of sh3 domains to their target proteins expansion of genome coding regions by acquisition of new genes the geometry of domain combination in proteins different levels of alternative splicing among eukaryotes how did 
alternative splicing evolve? alternative splicing and gene duplication are inversely correlated evolutionary mechanisms polyploidy and genome evolution in plants comparative analysis indicates that alternative splicing in plants has a limited role in functional expansion of the proteome structural characterization of the intramolecular interaction between the sh3 and guanylate kinase domains of psd-95 identification of an intramolecular interaction between the sh3 and guanylate kinase domains of psd-95 interplay of pdz and protease domain of degp ensures efficient elimination of misfolded proteins comparative biology: beyond sequence analysis a genetic signature of interspecies variations in gene expression genome-wide scan reveals that genetic variation for transcriptional plasticity in yeast is biased towards multi-copy and dispensable genes identification of tightly regulated groups of genes during drosophila melanogaster embryogenesis a gene-coexpression network for global discovery of conserved genetic modules similarities and differences in genome-wide expression data of six organisms accurate prediction of proteinprotein interactions from sequence alignments using a bayesian method correlated mutations contain information about protein-protein interaction mutually compensatory mutations during evolution of the tetramerization domain of tumor suppressor p53 lead to impaired hetero-oligomerization functional organization of the yeast proteome by systematic analysis of protein complexes a human protein-protein interaction network: a resource for annotating the proteome protein function, connectivity, and duplicability in yeast evolution and topology in the yeast protein interaction network modularity and evolutionary constraint on proteins comparing genomic expression patterns across species identifies shared transcriptional profile in aging genome-wide functional analysis of pathogenicity genes in the rice blast fungus a mutation accumulation assay reveals a 
broad capacity for rapid evolution of gene expression evolution of gene expression in the drosophila melanogaster subgroup sexdependent gene expression and evolution of the drosophila transcriptome microarray analysis of drosophila development during metamorphosis conservation and coevolution in the scale-free human gene coexpression network conservation and evolution of gene coexpression networks in human and chimpanzee brains cross-species sequence comparisons: a review of methods and available resources impact of taxon sampling on the estimation of rates of evolution at sites comparative genomics beyond sequence-based alignments: rna structures in the encode regions comparative analysis of splice form-specific expression of lim kinases during zebrafish development towards cellular systems in 4d gene expression map of the arabidopsis shoot apical meristem stem cell niche a gene expression map of arabidopsis thaliana development key: cord-256608-ajzk86rq authors: van weezep, erik; kooi, engbert a.; van rijn, piet a. title: pcr diagnostics: in silico validation by an automated tool using freely available software programs date: 2019-05-13 journal: j virol methods doi: 10.1016/j.jviromet.2019.05.002 sha: doc_id: 256608 cord_uid: ajzk86rq pcr diagnostics are often the first line of laboratory diagnostics and are regularly designed to either differentiate between or detect all pathogen variants of a family, genus or species. the ideal pcr test detects all variants of the target pathogen, including newly discovered and emerging variants, while closely related pathogens and their variants should not be detected. this is challenging as pathogens show a high degree of genetic variation due to genetic drift, adaptation and evolution. therefore, frequent re-evaluation of pcr diagnostics is needed to monitor its usefulness. validation of pcr diagnostics recognizes three stages, in silico, in vitro and in vivo validation. 
in vitro and in vivo testing are usually costly, labour intensive and imply a risk of handling dangerous pathogens. in silico validation reduces this burden. in silico validation checks primers and probes by comparing their sequences with available nucleotide sequences. in recent years the amount of available sequences has dramatically increased by high throughput and deep sequencing projects. this makes in silico validation more informative, but also more computing intensive. to facilitate validation of pcr tests, a software tool named pcrv was developed. pcrv consists of a user friendly graphical user interface and coordinates the use of the software programs clustalw and ssearch in order to perform in silico validation of pcr tests of different formats. use of internal control sequences makes the analysis compliant to laboratory quality control systems. finally, pcrv generates a validation report that includes an overview as well as a list of detailed results. in-house developed, published and oie-recommended pcr tests were easily (re-) evaluated by use of pcrv. to demonstrate the power of pcrv, in silico validation of several pcr tests are shown and discussed. pathogens exhibit genetic variation as a result of genetic drift, adaptation and evolution, but also by random variation. since the late nineties of the 20 th century, due to the improved sequencing techniques and high throughput sequencing machines, the number of sequences submitted to databases like genbank ® has increased exponentially. this results in an enormous increase of identified variants and quasi-species as well as sequences of newly discovered pathogens from all over the world. 
a few examples are the discovery of coronaviruses causing severe acute respiratory syndrome (sars) and middle east respiratory syndrome (mers), nipah and hendra viruses, atypical pestiviruses, atypical and new serotypes of bluetongue virus, schmallenberg virus and new variants of avian influenza viruses (chua et al., 2000; demmler and ligon, 2003; drosten et al., 2003; hoffmann et al., 2012; hofmann et al., 2008; maan et al., 2011; marcacci et al., 2018; schirrmeier et al., 2004; van boheemen et al., 2012; wang, 2011; zientara et al., 2014) . currently, in many countries, the first line of pathogen detection is real-time pcr diagnostics. pcr tests can be highly sensitive and specific, and are often designed to detect all variants of a defined family, genus or species, while not detecting closely related pathogens. therefore, pcr targets must be unique and highly conserved. in addition, pcrv can be used for in silico validation of pcr assays that differentiate between lineages, serotypes or variants. nonetheless, false negative results can arise through genetic drift or the emergence of new variants, while false positive results can be caused by new variants of closely related pathogens. it is therefore important to frequently re-evaluate and, if necessary, redesign pcr tests, taking sequences of newly discovered pathogen variants into account. validation of pcr diagnostics should be organized in three stages: in silico, in vitro and in vivo validation. in silico validation takes an inventory of the sequences in a nucleotide database that match or do not match the pcr target sequence. matching sequences determine the in silico sensitivity (detection of all variants), while non-matching sequences support the in silico specificity (selective detection of variants of the respective group of pathogens). in vitro and in vivo validation include testing of cultured pathogens and of field samples of defined positive and negative status. 
in vitro and in vivo validation for all virus variants is practically impossible and extremely costly. moreover, not every pathogen variant has been cultured or isolated, and transport and handling of pathogens could raise safety issues. in contrast, sequences are rapidly becoming available through high throughput and deep sequencing, even without culturing of pathogens. therefore, in silico re-evaluation of validated pcr diagnostics is and will be an attractive alternative to obtain detailed insight into the detection of circulating and (re-)emerging virus variants, and should be executed frequently. it will, however, become an increasingly demanding task due to the rapid increase of available sequences and full genome sequences of numerous species. we developed a software tool named pcrv to facilitate in silico validation of pcr tests entirely based on freely available software programs. pcrv links these programs to automate the whole process, reduces labour, and generates a validation report that includes a brief summary as well as a list of detailed results. the software tool pcrv is written in the python programming language. pcrv consists of a user friendly graphical user interface and coordinates the use of the software programs clustalw2.1 (larkin et al., 2007; thompson et al., 2002) and ssearch (brenner et al., 1998; pearson, 1991; pearson et al., 2017) to perform in silico validation. pcrv is suitable to determine the in silico sensitivity (conservation of sequences) and in silico specificity (selectivity) of different pcr formats. to monitor the performance of pcrv, a set of flagged internal control sequences (fics) is randomly added to the sequence database. pcrv processes data, analyses results, and generates a validation report that includes a summarizing table as well as a list of detailed results for an easy check of potential false positives and false negatives. an overview of all actions executed by pcrv is shown in fig. 2 . 
the sequences of a target organism are downloaded from the national center for biotechnology information (ncbi) database (https://www.ncbi.nlm.nih.gov/nuccore/) by using the respective taxonomy id number as search query. this guarantees that all available sequences of the defined taxon in the database are downloaded. to generate a multiple sequence alignment (msa) of these sequences, a full genome sequence was selected as a reference sequence. genome segments of pathogens with a segmented genome were concatenated to serve as an artificial full length genome. if a full genome sequence was not available, a representative large sequence of the taxon was selected as a reference sequence. a prerequisite is that this partial sequence contained the full target of the pcr test being validated. in order to drastically reduce computing time, pairwise alignments were calculated for each downloaded sequence to the reference sequence by using software program clustalw 2.1 (larkin et al., 2007) . to correct for orientation errors in the database sequences, alignment in the reverse complement orientation was also attempted. a score was calculated using a scoring scheme as follows: match (+1), mismatch (-2), point deletion or gap (-3), every next adjacent point deletion (-2). the aligned orientation with the highest score was selected. to enable efficient alignment of large sequences, these large sequences were segmented in fragments of 10,000 nucleotides in length and individually aligned to the reference sequence and subsequently combined into one pairwise alignment. pcrv combined all individual pairwise alignments into one multiple sequence alignment (msa), including the pairwise alignments of primers and probes. the calculation of the msa was performed by a computer with an intel ® xeon(r) cpu e5-1650 v2 @ 3.50 ghz processor and 16 gb of internal computer memory. 
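the scoring scheme and the orientation check described above can be sketched in a few lines of python. this is a minimal illustration, not the pcrv source code: the function names are hypothetical, the `align` argument stands in for the call to clustalw 2.1, and reading "every next adjacent point deletion (-2)" as a gap-extension penalty within the same row is an assumption.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a nucleotide string."""
    return seq.translate(_COMPLEMENT)[::-1]


def alignment_score(ref_aln, seq_aln):
    """Score a gapped pairwise alignment: match +1, mismatch -2,
    first gap position -3, each directly adjacent gap position -2.
    (Treating adjacent gaps as an extension penalty is an assumption.)"""
    if len(ref_aln) != len(seq_aln):
        raise ValueError("aligned strings must have equal length")
    score = 0
    prev_gap_row = None  # row that carried a gap in the previous column
    for r, s in zip(ref_aln, seq_aln):
        if r == "-" or s == "-":
            gap_row = "ref" if r == "-" else "seq"
            score += -2 if prev_gap_row == gap_row else -3
            prev_gap_row = gap_row
        else:
            score += 1 if r == s else -2
            prev_gap_row = None
    return score


def best_orientation(ref, seq, align):
    """Align seq in both orientations and keep the higher-scoring one.
    `align` is a hypothetical callable returning (ref_aln, seq_aln)."""
    candidates = {
        "forward": seq,
        "reverse_complement": reverse_complement(seq),
    }
    scores = {
        name: alignment_score(*align(ref, candidate))
        for name, candidate in candidates.items()
    }
    best = max(scores, key=scores.get)
    return best, scores[best]
```

with a trivial stand-in aligner such as `lambda r, s: (r, s)` for equal-length strings, a database entry stored in the wrong orientation scores far lower in the forward direction than as its reverse complement, which is the situation the orientation check is meant to correct.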
the regions corresponding to primers and probes were selected from the msa to construct a conservation plot sorted by decreasing total number of mismatches. the in silico sensitivity was expressed as the percentage of hits with at most one mismatch per primer or probe. the entire nucleotide sequence database (compressed gzip file: nt.gz) was downloaded from the ncbi ftp-website (ftp://ftp.ncbi.nlm.nih.gov/blast/db/) using pcrv. the integrity of the download was confirmed by calculating the md5 checksum and comparing it with the checksum published on the ftp-website (file nt.gz.md5). pcrv applied several optimizations to the data stream during download to improve the analysis. nucleotide code 'n' was replaced by the meaningless code 'z', which prevents an unmanageable number of hits by the alignment search. the data stream was unpacked and subdivided into multiple fasta formatted text files. because the ncbi nucleotide database is too large to be analysed all at once, fasta files with a maximum size of 500 mb were sequentially numbered and stored. to increase the accuracy of the alignment search (see discussion), large sequences were fragmented into sequences of at most 3000 nucleotides with an overlap of 50 nucleotides to prevent the loss of hits of primer or probe sequences spanning the split site. fragmented sequences were tagged with a unique code allowing reconstruction of the original sequence. any nucleotide database in fasta format is compatible and could be added. flagged internal control sequences (fics) were added to enable validation of the alignment search. fics consisted of randomly generated sequences of 3000 nucleotides in length containing the primer and probe sequences of the pcr test being validated. primer and probe sequences were inserted in all possible combinations and orientations potentially initiating amplification (fig. 1). 
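as an illustration of how such a control sequence could be built, the sketch below embeds oligonucleotides with a chosen number of substitutions in a random background of 3000 nucleotides. it is a simplified assumption-laden example, not the pcrv implementation: the names `mutate` and `make_fics`, the fixed spacing and the seed are hypothetical, and only one arrangement is produced, in the order and orientation given, rather than all combinations.

```python
import random

BASES = "ACGT"


def mutate(oligo, n_mismatches, rng):
    """Return a copy of oligo with exactly n_mismatches substituted positions."""
    if n_mismatches > len(oligo):
        raise ValueError("more mismatches requested than positions available")
    chars = list(oligo)
    for pos in rng.sample(range(len(chars)), n_mismatches):
        # pick a base that differs from the original one at this position
        chars[pos] = rng.choice([b for b in BASES if b != chars[pos]])
    return "".join(chars)


def make_fics(oligos, n_mismatches, length=3000, spacing=100, seed=0):
    """Build one control sequence: random background with each oligo
    (carrying n_mismatches substitutions) inserted at fixed spacing."""
    rng = random.Random(seed)
    background = [rng.choice(BASES) for _ in range(length)]
    pos = spacing
    for oligo in oligos:
        if pos + len(oligo) > length:
            raise ValueError("oligos do not fit in the control sequence")
        variant = mutate(oligo, n_mismatches, rng)
        background[pos:pos + len(variant)] = variant
        pos += len(variant) + spacing
    return "".join(background)
```

calling `make_fics` once per mismatch level from 0 to 10 yields a graded series of controls; a search that stops returning the controls above a certain mismatch level reveals where its sensitivity ends.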
multiple copies of each combination were inserted with an increasing number of randomly introduced mismatches from 0-10 in each primer and probe sequence (fig. 1). in total, ten copies of each control sequence per number of mismatches were linearly spread in each 500 mb fasta file. an alignment search was performed with the default expectancy threshold value on all fasta files using primers and probes of the pcr test as search queries and the program ssearch available in the fasta sequence analysis package (brenner et al., 1998; pearson, 1991; pearson et al., 2017). pcrv produced a list of hits of the alignment search of all possible primer/probe combinations potentially leading to detectable amplicons. hits of fics were stored separately. the percentage of returned hits of control sequences with an increasing number of mismatches was indicative of the sensitivity and accuracy of the alignment search per 500 mb fasta file. the maximum number of returned mismatches in the control sequences was determined by use of the spearman-kärber method and demonstrated the validity of the computing process (wulff et al., 2012). an aborted search caused by an unknown error was visible by the incompleteness of returned fics. if the accuracy of the alignment search was not acceptable, the alignment search was repeated with a higher expectancy threshold value, which usually resulted in a longer analysis time. the specificity check was limited to a maximum of 5000 nucleotides in amplicon length and up to four mismatches per primer or probe. this limitation was however not applied to the fics in order to fully ascertain the validity of the executed alignment search. hits were interpreted as specific or nonspecific according to the taxonomy classified sequences as used to generate the msa. the in silico specificity is expressed as the percentage of specific hits of taxonomy classified sequences with a maximum of one mismatch per primer or probe, as these are considered to be detected with the respective pcr test.
fig. 1. a) fics consist of randomly generated sequences of 3000 nucleotides in length containing the primer and probe sequence of the pcr test being validated. multiple copies were inserted with an increasing number of randomly introduced mismatches from 0-10 in each primer and probe sequence. ten copies of each fics per number of mismatches were linearly spread in each 500 mb fasta file. b) overview of all eight possible combinations of positional orientations of forward primer (fwd), reverse primer (rev) and probe used as fics, which are all capable of initiating a (nonspecific) amplification reaction in combination with a detectable probe signal. combinations of primers and probes according to other pcr formats (e.g. nested pcr, pcr using hybridisation probes or hydrolysis probe) are also supported by pcrv but are not shown.
to demonstrate the suitability of our in-house developed software tool pcrv, we determined the in silico sensitivity and specificity of three pcr tests for west nile virus (wnv) recommended by the world organisation for animal health (oie) (eiden et al., 2010; johnson et al., 2001) . these wnv pcr tests represented three different formats: a real-time pcr test, a conventional pcr test and a nested pcr test (table 1) . available west nile virus nucleotide sequences were downloaded from the ncbi website using taxonomy id 11082 (search query ncbi:txid11082 on january 15th, 2019). in total, the download contained 20,964 wnv sequences. an msa was calculated using the full genome sequence with accession number nc_009942 as a reference sequence (borisevich et al., 2006) . primer and probe sequences were included in the alignment. the calculation of the msa with pcrv was completed in about 4.5 h. only 10-15% of the aligned sequences encompassed the locations of primers or probes of the selected oie-recommended wnv pcr tests. 
the regions corresponding to primers and probes were taken from the alignment in order to construct a conservation plot. detailed results were sorted according to the number of mismatches to easily select individual sequences with > 1 mismatch in order to check their origin (supplemented data a). note that sequences incorrectly classified as wnv, as well as synthetically derived sequences, should be discarded as these are irrelevant. results of the conservation plot were summarized according to the number of mismatches, up to a maximum of four mismatches per primer or probe (table 2). the overall in silico sensitivity of each pcr test was calculated and expressed as the percentage of sequences with a maximum of one mismatch per primer or probe. the real time pcr test for wnv showed the highest in silico sensitivity of 98.8% (83.3%+15.47%). the conventional and nested pcr tests showed an in silico sensitivity of 87.1% and 86.5%, respectively.
the entire nucleotide sequence database from the ncbi ftp-website was downloaded as a compressed gzip file (nt.gz) of 502 gb on january 7th, 2019. the download was valid according to the calculated md5 checksum [...] sequences. an alignment search with primer and probe sequences was performed with a cut-off expectation value e of 5000. the search per pcr test was completed in less than two hours. about 3.7-6.9 million individual primer and probe alignment hits were found and processed by pcrv as described (fig. 2) . fics were found homogeneously in all 371 database files, indicating that the alignment search was completed properly. fics for each pcr test were returned with a mean of 3.7-4.3 mismatches per primer or probe, demonstrating completeness and acceptable accuracy of the alignment search (table 2) .
[...] (johnson et al., 2001) , and the real time pcr test have been described (eiden et al., 2010) .
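the in silico sensitivity defined above (the percentage of aligned sequences with at most one mismatch in every primer and probe) can be computed from per-sequence mismatch counts as in the following sketch. the function name and the input layout, one dictionary of mismatch counts per aligned sequence, are assumptions made for illustration and do not reflect the data structures used inside pcrv.

```python
def in_silico_sensitivity(mismatch_rows, max_mismatches=1):
    """Percentage of sequences whose every primer/probe has at most
    max_mismatches mismatches. Each row maps an oligo name to its
    mismatch count for one aligned sequence (hypothetical layout)."""
    rows = list(mismatch_rows)
    if not rows:
        raise ValueError("no aligned sequences cover the pcr target")
    detected = sum(
        1 for row in rows
        if all(count <= max_mismatches for count in row.values())
    )
    return 100.0 * detected / len(rows)


# illustrative counts only, not data from the study
example = [
    {"fwd": 0, "rev": 0, "probe": 0},
    {"fwd": 1, "rev": 0, "probe": 1},
    {"fwd": 0, "rev": 2, "probe": 0},  # two mismatches in one primer
    {"fwd": 0, "rev": 1, "probe": 0},
]
print(in_silico_sensitivity(example))  # 75.0
```

note that the cut-off applies per primer or probe, so a sequence with one mismatch in each of the three oligonucleotides still counts as detected, while a sequence with two mismatches in a single primer does not.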
potential amplicons were interpreted as specific or non-specific according to the presence of their ncbi accession number in the list of sequences used for the in silico sensitivity check (table 2) . we noticed that the number of specific hits differed from the numbers scored by the in silico sensitivity check (table 2) . several reasons for this apparent inconsistency can be considered (see discussion). in summary, using wnv pcr tests as an example, pcrv easily determined the in silico sensitivity and specificity of these pcr tests of different formats in a highly automated manner. all results are included in the validation report generated by pcrv, such as a summarizing table of results, a conservation plot and a list of nonspecific hits. the summarizing table clearly demonstrates the differences in in silico sensitivity and specificity between these pcr tests (table 2). in addition, the detailed conservation plot (supplemented data a) and the detailed list of nonspecific hits with up to 4 mismatches per primer or probe (supplemented data b) support a manual check of individual sequences for correctness, background, submission details, and other information. validation of diagnostics by testing all variants of a target pathogen in cultured or field samples, named in vitro and in vivo validation, respectively, is hardly feasible. because of the availability of sequences of pathogens in databases, checking conservation and uniqueness of primer and probe sequences, so-called in silico validation, has become an attractive and reliable alternative to (re-)evaluate specificity and sensitivity of molecular diagnostics. exponential expansion of available sequences, genetic drift of pathogens, and discovery of new pathogens drive the need to frequently validate established pcr tests. this, however, will also become an increasingly significant effort. 
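the classification of potential amplicons by accession number can be sketched as follows. this is a hedged illustration rather than the pcrv algorithm: the function name and the input layout are hypothetical, and taking the denominator to be all hits within the mismatch cut-off is one possible reading of the definition given in the methods.

```python
def in_silico_specificity(hits, target_accessions, max_mismatches=1):
    """Percentage of considered hits that belong to the target taxon.
    `hits` yields (accession, worst_mismatch_count) pairs, where the
    count is the highest number of mismatches in any primer or probe.
    A hit is specific if its accession is in the taxonomy-based list."""
    considered = [acc for acc, worst in hits if worst <= max_mismatches]
    if not considered:
        return None  # nothing within the cut-off, specificity undefined
    specific = sum(1 for acc in considered if acc in target_accessions)
    return 100.0 * specific / len(considered)
```

because the classification rests entirely on membership in the downloaded taxon list, a sequence of the target virus that lacks a taxonomy id, or carries a wrong one, is counted as nonspecific; the nonspecific hits therefore deserve a manual check.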
we automated the in silico validation process by integrating freely available software programs into a single tool named pcrv. public databases such as ncbi, as well as other available databases and sequences formatted as single-sequence fasta files, are compatible with pcrv. pcrv generates a multiple sequence alignment (msa) using a selected reference sequence, which is preferably a full length genome but at least a large partial sequence encompassing the pcr target. software program clustalw2.1 (larkin et al., 2007) is used to calculate pairwise alignments of each sequence to the reference sequence, and subsequently an msa is generated from these pairwise alignments. this strategy drastically reduces calculation time, in particular for large numbers of sequences, because each sequence is aligned only once, to the reference, instead of to every other sequence. additionally, more than one reference sequence could be used to improve the generation of an msa in case of extreme variability among a group of pathogens. the msa is used to determine the in silico sensitivity, since it is less affected by mismatches in primers or probes than an alignment search (not shown). for example, sequences with numerous mismatches in one of the primers or probes will not be found by an alignment search using these primer or probe sequences as search queries. however, such sequences will be present in the msa, see conservation plots of wnv pcr tests. supplemented data a shows the summarised conservation plots (without accession numbers) of the three wnv pcr tests. pcrv generates a conservation plot listing all hits by decreasing number of mismatches. hits with the most mismatches need attention as these could lead to false negative pcr results. we calculated and defined the in silico sensitivity as the percentage of hits with a maximum of one mismatch per primer or probe, as these are assumed to be detected with the respective pcr test. 
the software program ssearch, which is available in the fasta sequence analysis package from the university of virginia (pearson, 1991), uses a calculated expectation value e in combination with a supplied threshold value to determine whether a hit is returned. the expectation value e depends on the number and length of sequences in the database. consequently, the e value of a search hit depends on the location of the found sequence in the database. large sequences are therefore segmented into fragments of at most 3000 nucleotides in length. this reduces the variability in sequence length, leading to a more homogeneous sensitivity of ssearch across the database, and improves the overall sensitivity of ssearch. the sensitivity of the well-known and commonly used blastn alignment search program was compared to that of ssearch (fig. 3) . ssearch returned 100% of the primers with up to six mismatches. in contrast, the percentage of returns with blastn was slightly less than 100% for three mismatches and rapidly declined with an increasing number of mismatches. we conclude that ssearch is much more accurate, and thus more suitable than blastn to determine the in silico specificity. we also noticed that blastn tends to find partial/fractional nucleotide alignment hits, which is not desirable for primers and probes.
fig. 3 . comparison of the accuracy of an alignment search performed by the blastn and the ssearch software programs. a test database of randomly generated nucleotide sequences was generated containing 10,000 sequences of 3000 nucleotides in length. 875 sequences contained a primer sequence of 24 nucleotides in length. each primer contained randomly 0-10 mismatches. the cut-off expectation value e used in both programs was 1000. the inserted primer with up to 2 mismatches was completely returned with blastn, whereas ssearch completely returned the primer with up to 6 mismatches. 
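the segmentation of large database sequences into overlapping fragments, together with the replacement of the code 'n' by 'z' described in the methods, can be sketched as below. this is an illustrative assumption, not the pcrv code: the function name and the `|frag_start=` tag format are hypothetical stand-ins for the unique code that allows reconstruction of the original sequence.

```python
def fragment(seq_id, sequence, size=3000, overlap=50):
    """Split a sequence into fragments of at most `size` nucleotides,
    each sharing `overlap` nucleotides with the previous fragment.
    Returns (tagged_id, fragment) pairs; the tag stores the start
    offset so the original sequence can be rebuilt (hypothetical tag)."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    # mask the ambiguity code so it cannot take part in search hits
    masked = sequence.replace("N", "Z").replace("n", "z")
    step = size - overlap
    fragments = []
    start = 0
    while True:
        fragments.append((f"{seq_id}|frag_start={start}", masked[start:start + size]))
        if start + size >= len(masked):
            break
        start += step
    return fragments
```

any primer or probe shorter than the overlap that spans a split site is still contained in full in one of the two neighbouring fragments, which is why an overlap of 50 nucleotides is sufficient for oligonucleotides such as the 24-mer used in fig. 3.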
in addition, pcrv using ssearch is suitable for use in a laboratory quality control system, since the search process is monitored per 500 mb fasta file for completeness and accuracy/sensitivity by returned hits of flagged internal control sequences (fics). an overview of this monitoring is added to the validation report. examples of incomplete or inaccurate alignment searches, and of searches with a low sensitivity, are presented (supplemented data c). in case alignment search results are not sufficient, the threshold value can be changed to increase the sensitivity, but the calculation time will also increase. here, we showed in silico validation results of wnv pcr tests of different formats as an example. pcrv was also used to validate real time pcr tests at wbvr (fig. 4) . ssearch quantifies hits for any combination of primers and probes potentially leading to detectable amplicons, see fig. 2 . this can result in more hits for the in silico specificity check by ssearch than for the in silico sensitivity check by clustalw 2.1. conversely, some sequences are counted by the sensitivity check only. for example, sequences partially overlapping with the pcr target sequence will not be found by the in silico specificity check, since this check only finds complete amplicons. further, ncbi only stores unique nucleotide sequences in its downloadable database export file "nt.gz". identical sequences are combined as one sequence whose name is a concatenation of all individual sequence names separated by the ascii code 1. pcrv does not recognize merged names as multiple sequences, resulting in fewer hits by ssearch. detailed analysis of in silico validation results enables a focus on specific test problems, as shown for the pcr test for peste-des-petits ruminants virus (pprv) of wbvr, which presumably does not detect pprv strain ghana2010 because of three mismatches in the probe sequence. indeed, the pcr target of this pprv strain was amplified but was not detected by the taqman probe (van rijn et al., 2018a) . 
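the merged sequence names mentioned above could be expanded again by splitting the fasta header on the ascii code 1 character, as in this sketch. the helper names are hypothetical, the accession numbers in the usage example are made up, and the sketch assumes that each merged entry starts with its accession followed by whitespace; pcrv itself, as stated, does not perform this step.

```python
def split_merged_header(header):
    """Split an nt-style fasta header in which identical sequences were
    merged, with the individual names separated by ASCII code 1."""
    parts = header.lstrip(">").split("\x01")
    return [part.strip() for part in parts if part.strip()]


def accessions(header):
    """Return the first whitespace-delimited token of every merged
    entry (assumed here to be the accession number)."""
    return [entry.split()[0] for entry in split_merged_header(header)]


# made-up header for illustration only
merged = ">AB000001.1 example virus isolate 1\x01AB000002.1 example virus isolate 2"
print(accessions(merged))  # ['AB000001.1', 'AB000002.1']
```

counting every accession returned this way as a separate hit would bring the numbers of the specificity check closer to those of the sensitivity check, where each downloaded sequence is aligned individually.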
we used pcrv to analyse oie-recommended and published pcr tests for other pathogens in order to select the best option for implementation in laboratory diagnostics. as part of preparedness for incursions, frequent in silico (re-)validation could also show the need for adaptation of operational pcr tests to emerging epidemics caused by new variants in other parts of the world. pcrv depends on compatible and reliable nucleotide databases. in particular, in silico validation by pcrv depends on submission of accurately determined sequences that are coded with the correct taxonomy id number. for example, classical swine fever virus (csfv) sequences that are taxonomy classified as bovine viral diarrhoea virus type 2 (bvdv ii) were consequently interpreted as false positives in the csfv pcr test and as false negatives in the bvdv pcr test. further, in our example of wnv pcr tests, five nonspecific hits appeared to be sequences without taxonomy id. still, these sequences are definitely wnv sequences, although 2 out of 5 nonspecific hits have been synthetically derived (supplemented data b). on the other hand, a more specific taxonomy classification or labelling of sequences in databases could be used for the development of pcr tests specific for subspecies, serotypes or lineages. considering the expected rapid expansion of available sequences, pcrv will be further improved by allowing incremental analyses in which only sequences submitted since the previous analysis are processed. this will keep the required analysis time manageable for in silico re-validation of pcr tests. the numbers of hits for the in silico sensitivity and specificity are not representative of the field situation but represent the sequences in the database. in other words, the percentages could be skewed by a small number of sequences in the database, or by a large number of very closely related sequences caused by a huge sequencing effort during one epidemic. 
submitted sequences are sometimes not trimmed for synthetic adaptors like pcr primers, causing misleading positive analysis results. synthetic or optimized genes of pathogens can lead to misleading negative pcrv results. synthetic and genetically modified sequences should be labelled as 'non-natural' in databases to prevent misleading results of in silico validation efforts. finally, negative pcrv results can be created on purpose by development of diva (differentiating infected from vaccinated) vaccine viruses with a deleted or mutated diva target, like ge deletion mutants of bovine herpes virus type 1 and pseudorabies virus (kaashoek et al., 1994; van oirschot et al., 1990), ns3 deletion mutants of bluetongue virus and african horse sickness virus (feenstra et al., 2014; van rijn et al., 2013, 2018b), and live-attenuated lumpy skin disease (lsd) vaccine (agianniotaki et al., 2017). viral pathogens belonging to the same taxon but showing an extreme variation in their sequence cannot be aggregated in one msa using one reference sequence. further, large scale genomic rearrangements, such as duplication, deletion, insertion, inversion, and translocation, are very common in genomes of bacterial pathogens, and will undoubtedly challenge the calculation of a msa, if this is even possible. currently, we are investigating alignment-free analysis methods to address these challenges. furthermore, we foresee the development of a next generation in silico tool, partially based on pcrv, to find highly conserved targets for new or confirmatory pcr tests.
fig. 4. overview of the in silico sensitivity and specificity of several real time pcr tests at wbvr as determined by pcrv. the in silico sensitivity of pcr tests is expressed as the percentage of hits with a maximum of one mismatch per primer or probe (squares, line). the in silico specificity is expressed as the percentage of specific hits with 0 mismatches (black) and 1 mismatch per primer or probe (grey). 
real time pcr tests are indicated: wnv; west-nile virus (eiden et al., 2010; johnson et al., 2001), btv; bluetongue virus (van rijn et al., 2012; 2013), pprv; peste des petits ruminants virus (van rijn et al., 2018a), ahsv_s4; african horse sickness virus segment 4 (van rijn et al., 2018b), ahsv_s5; african horse sickness virus segment 5 (van rijn et al., 2018b); in-house developed assays: rvfv; rift valley fever virus, sgpv; sheep-and-goat pox virus, ehdv-a; epizootic haemorrhagic disease virus test a, ehdv-b; epizootic haemorrhagic disease virus test b, eav; equine arteritis virus, eblv-1; european bat lyssa virus type 1, csfv; classical swine fever virus, asfv; african swine fever virus, prv-gb; pseudorabies virus glycoprotein gene gb, prv-ge; pseudorabies virus glycoprotein gene ge. results of pcrv could demonstrate the need to optimize or redesign a pcr test, as for ehdv-a and ahsv_s4. note: hits of non-natural sequences were not discarded. (for interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article). 
development and validation of a taqman probe-based real-time pcr method for the differentiation of wild type lumpy skin disease virus from vaccine virus strains
biological properties of chimeric west nile viruses
assessing sequence comparison methods with reliable structurally identified distant evolutionary relationships
nipah virus: a recently emergent deadly paramyxovirus
severe acute respiratory syndrome (sars): a review of the history, epidemiology, prevention, and concerns for the future
identification of a novel coronavirus in patients with severe acute respiratory syndrome
two new real-time quantitative reverse transcription polymerase chain reaction assays with unique target sites for the specific and sensitive detection of lineages 1 and 2 west nile virus strains
vp2-serotyped live-attenuated bluetongue virus without ns3/ns3a expression provides serotype-specific protection and enables diva
novel orthobunyavirus in cattle
genetic characterization of toggenburg orbivirus, a new bluetongue virus, from goats
detection of north american west nile virus in animal tissue by a reverse transcription-nested polymerase chain reaction assay
a conventionally attenuated glycoprotein e-negative strain of bovine herpesvirus type 1 is an efficacious and safe vaccine
clustal w and clustal x version 2.0
novel bluetongue virus serotype from kuwait
one after the other: a novel bluetongue virus strain related to toggenburg virus detected in the piedmont region
searching protein sequence libraries: comparison of the sensitivity and selectivity of the smith-waterman and fasta algorithms
query-seeded iterative sequence similarity searching improves selectivity 5-20-fold
genetic and antigenic characterization of an atypical pestivirus isolate, a putative member of a novel pestivirus species
identification of common molecular subsequences
comparative biosequence metrics
multiple sequence alignment using clustalw and clustalx
genomic characterization of a newly discovered coronavirus associated with acute respiratory distress syndrome in humans
marker vaccines, virus protein-specific antibody assays and the control of aujeszky's disease
sustained high-throughput polymerase chain reaction diagnostics during the european epidemic of bluetongue virus serotype 8
bluetongue virus with mutated genome segment 10 to differentiate infected from vaccinated animals: a genetic diva approach
recombinant newcastle disease viruses with targets for pcr diagnostics for rinderpest and peste des petits ruminants
diagnostic diva tests accompanying the disabled infectious single animal (disa) vaccine platform for african horse sickness
discovering novel zoonotic viruses
monte carlo simulation of the spearman-kaerber tcid50
novel bluetongue virus in goats

the authors are grateful to colleagues of wbvr, in particular to jan boonstra and rené van gennip, for fruitful discussions and suggestions. this research was financially supported by project wot-01-003-015 of the dutch ministry of agriculture, nature and food quality (lnv) (wbvr-project number 1600013-01). all authors declare no conflict of interest. supplementary material related to this article can be found, in the online version, at doi:https://doi.org/10.1016/j.jviromet.2019.05.002.

key: cord-203232-1nnqx1g9 authors: canturk, semih; singh, aman; st-amant, patrick; behrmann, jason title: machine-learning driven drug repurposing for covid-19 date: 2020-06-25 journal: nan doi: nan sha: doc_id: 203232 cord_uid: 1nnqx1g9 the integration of machine learning methods into bioinformatics provides particular benefits in identifying how therapeutics effective in one context might have utility in an unknown clinical context or against a novel pathology. we aim to discover the underlying associations between viral proteins and antiviral therapeutics that are effective against them by employing neural network models. 
using the national center for biotechnology information virus protein database and the drugvirus database, which provides a comprehensive report of broad-spectrum antiviral agents (bsaas) and viruses they inhibit, we trained ann models with virus protein sequences as inputs and antiviral agents deemed safe-in-humans as outputs. model training excluded sars-cov-2 proteins and included only phases ii, iii, iv and approved level drugs. using sequences for sars-cov-2 (the coronavirus that causes covid-19) as inputs to the trained models produces outputs of tentative safe-in-human antiviral candidates for treating covid-19. our results suggest multiple drug candidates, some of which complement recent findings from noteworthy clinical studies. our in-silico approach to drug repurposing has promise in identifying new drug candidates and treatments for other viruses. artificial intelligence (ai) technology is a recent addition to bioinformatics that shows much promise in streamlining the discovery of pharmacologically active compounds [1] . machine learning (ml) provides particular benefits in identifying how drugs effective in one context might have utility in an unknown clinical context or against a novel pathology [2] . the application of ml in biomedical research provides new means to conduct exploratory studies and high-throughput analyses using information already available. in addition to deriving more value from past research, researchers can develop ml tools in relatively short periods of time. past research now provides a sizable bank of information concerning drug-biomolecule interactions. using drug repurposing as an example, we can now train predictive algorithms to identify patterns in how antiviral compounds bind to proteins from diverse virus species. we aim to train an ml model so that when presented with the proteome of a novel virus, it will suggest antivirals based on the protein segments present in the proteome. 
the final output from the model is a best-fit prediction as to which known antivirals are likely to associate with those familiar protein segments. these benefits are of particular interest for the current covid-19 health crisis. the novelty of sars-cov-2 requires that we execute health interventions based on past observations. as we grapple with an unforeseen pandemic with no known treatments or vaccines, the potential for rapid innovation from ml is of utmost significance. the ability to conduct complex analyses with ml enables us to gain insights quickly that can help steer us in the right direction for future studies likely to produce fruitful results. we present here multiple models that produced a number of antiviral candidates for treating covid-19. out of our 12 top predicted drugs, 6 have shown positive results in recent findings based on cell culture results and clinical trials. these promising antivirals are lopinavir, ritonavir, ribavirin [3], cyclosporine [4], rapamycin [5], and nitazoxanide [6]. for the remaining predicted drugs, further research is needed to evaluate their effectiveness against sars-cov-2. we used two main data sources for this study. the first was the drugvirus database [7], which lists broad-spectrum antiviral agents (bsaas) and the associated viruses they inhibit. the database covers 83 viruses and 126 compounds, and provides information on the antiviral status of each compound-virus pair. these statuses fall into eight categories representing the progressive drug trial phases: cell cultures/co-cultures, primary cells/organoids, animal model, phases i-iv and approved. see appendix a for a more intuitive pivot table view of the database. the second database is the national center for biotechnology information (ncbi) virus portal [8]; as of april 2020, this database provides approximately 2.8 million amino-acid and 2 million nucleotide sequences from viruses with humans as hosts. 
each row of this database contains an amino acid sequence specimen from a study, as well as metadata that includes the associated virus species. in our work, we considered sequences only from the 83 virus species in the drugvirus database or their subspecies in order to be able to merge the two data sources successfully. we also constrained ourselves to amino-acid sequences only in the current iteration. the main reasons for this are two-fold: 1. amino-acid sequences are essentially derived from the dna sequences, which may encode overlapping information on different levels. in somewhat simplified terms, amino-acid sequences are the outputs of a layer of preprocessing on genetic material (in the form of dna/rna). 2. nucleotide triplets (codons) map to amino-acids, making amino-acid sequences much shorter and easier to extract features from, both in preprocessing and in the machine learning methods themselves. shorter sequences also mean the ml pipeline will be more resource-efficient, i.e. easier to train. the amino-acid sequences were downloaded as three datasets: hiv types 1 & 2 (1,192,754 sequences), influenza types a, b & c (644,483 sequences), and the "main" dataset for all other types including sars-cov-2 (785,624 sequences). each dataset came with two components. the "sequence" component is composed of accession ids and the amino-acid sequence itself, while the "metadata" component includes all other data (e.g. virus species, date specimen was taken, an identifier of the related study) as well as the accession id to enable merging the two components. the amount of research with a focus on influenza and hiv naturally leads to these viruses comprising most of the samples. in our experiments, we have excluded these viruses, and have worked only with dataset #3, though the other datasets can be integrated into the main one during the class balancing process, an idea we will discuss in section 4, future work. 
the first step of the preparation phase was to merge the "sequence" and "metadata" components into a single ncbi dataset based on sequence ids. afterwards, we mapped the "species" column in this main dataset to the virus name column in the drugvirus database. this step was required as these two columns that denote the virus species in the respective datasets did not match due to subspecies present in the sequence dataset and alternative naming of some viruses. afterwards, we processed the drugvirus dataset to a format suitable for merging with the ncbi data frame. every row of the drugvirus dataset consists of a single drug-virus pairing and their respective interaction/drug trial phase, meaning any given drug and virus appeared in multiple rows of the dataset. we derived a new drugvirus dataset that functioned as a dictionary where each unique virus was a key, and the interactions with antivirals encoded as a multi-label binary vector (1 if viable antiviral according to the original dataset, 0 if not) of length 126 (the number of antivirals) which corresponded to the value. we came up with three "versions" depending on how we decided an antiviral was a viable candidate to inhibit a virus. the criteria depended on drug trial phases: 1. in the first version, any interaction between a drug-virus pair is designated by a 1. this means drugs that did not go past cell cultures/co-cultures or primary cells/organoids testing are still considered viable candidates. 2. this second version expands upon the first stemming from our discovery that an attained trial phase in the database does not necessarily mean previous phases were also listed in the database. for example, we found that for a given virus, a given drug had undergone phase iii testing, designated by a 1, but phase i & ii were listed as 0s. this undermined our assumption that drug trials are hierarchical; though, in reality this is usually the case. 
this can be caused by missing data reporting or possibly skipped phases. we proceeded with the hierarchy assumption, and extended the database in (1) to account for the previous phases. this meant that in this second version, an approved drug has all phases designated with 1s, for example. keeping track of the 8 phases meant that the size of the database also grew by a factor of 8. 3. in the third version, we considered a drug-virus pair as viable only if it has attained phase ii or further drug trials, signifying that some success in human trials has been observed. in the results presented in section 3, our training database was based on this third version of the drugvirus database. the full dataset was then generated by merging this "new" version of the drugvirus dataset with the ncbi dataset. we then generated two versions of this full dataset: one that consists of all sars-cov-2 sequences and one that consists of all other viruses available. this enabled us to compare how successful our models are in a case when they have not been trained on the virus species at all and have to detect peptide substructures in the sequences to suggest antivirals. a sample of this final database (with some columns excluded for brevity) is available in appendix b. upon inspection of the data, we found that it was replete with duplicate or extremely similar virus sequences. to reduce this exploitability and pose a more challenging problem, we removed the duplicate sequences that belonged to the same species and had the exact same length. this reduced the size of the dataset by approximately 98%. the counts for each virus species before and after dropping duplicate viruses are available in appendix c1 and c2. our main database also contained a class imbalance in the number of times certain virus species appeared in the database. 
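the third version of the label dictionary described above can be sketched as follows. this is an illustrative reconstruction, not the authors' code: the function name, the tuple layout of the input and the placeholder compound "drug x" are assumptions, while the phase names and the mers-cov statuses of lopinavir and tilorone are taken from the text.

```python
# Trial phases in increasing order, as listed for the drugvirus database.
PHASES = [
    "cell cultures/co-cultures", "primary cells/organoids", "animal model",
    "phase i", "phase ii", "phase iii", "phase iv", "approved",
]


def build_labels(pairs, drugs, min_phase="phase ii"):
    """Map each virus to a binary vector over `drugs` (1 = viable antiviral).

    `pairs` holds (virus, drug, phase) tuples; a pair counts as viable only
    if its phase is `min_phase` or later.
    """
    cutoff = PHASES.index(min_phase)
    column = {drug: i for i, drug in enumerate(drugs)}
    labels = {}
    for virus, drug, phase in pairs:
        vector = labels.setdefault(virus, [0] * len(drugs))
        if PHASES.index(phase) >= cutoff:
            vector[column[drug]] = 1
    return labels


# "drug x" is a hypothetical compound that stopped at cell-culture testing.
drugs = ["drug x", "lopinavir", "tilorone"]
pairs = [
    ("mers-cov", "lopinavir", "phase iii"),
    ("mers-cov", "tilorone", "approved"),
    ("mers-cov", "drug x", "cell cultures/co-cultures"),
]
labels = build_labels(pairs, drugs)
print(labels["mers-cov"])  # [0, 1, 1]
```

passing `min_phase=PHASES[0]` reproduces the first version, in which any recorded interaction counts as viable.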
we oversampled rare viruses (e.g., west nile virus: 175 sequences), excluded the very rare species that compose less than 0.5% of the available unique samples in the dataset (e.g., andes virus: 4 sequences), and undersampled the common viruses (e.g., hepatitis c: 16,040 sequences). this produced a more modest database of 30,479 amino acid sequences, with each virus having samples in the 400-900 range (see appendix c3). we kept the size of the dataset small both to enable easier model training and validation in early iterations and to handle data imbalance more smoothly. the class imbalance problem also presented itself in the antiviral compounds. even with balanced virus classes, the number of times each drug occurred within the dataset varied, simply because some drugs apply to more viruses than others. to alleviate this, we computed class weights for each drug, which we then provided to the models in training. this enabled a fairer assessment and a more varied distribution of antivirals in predicted outputs. the final step of data processing involved generating the training and validation sets. we split the data in two different ways, resulting in two different experiments (see section 2.3, experiment setup for the full experiment pipeline). experiment i is based on a standard, randomized 80% training/20% validation split on the main dataset. for experiment ii, we split the data on virus species, meaning the models were forced to predict drugs for species they were not trained on and had to detect peptide substructures in the amino-acid sequences to suggest drugs. in this setup we also guaranteed that the sars-cov-2 sequences were always in the test set, in addition to three other viruses randomly picked from the dataset. we used a variant of this setup that trains on all virus sequences except sars-cov-2 and is validated on sars-cov-2 only to generate the results presented in section 3. 
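the species-level split used in experiment ii can be illustrated with a short sketch. it assumes each record is a dictionary with a "species" field; the function name, the seed handling and the placeholder species names are hypothetical and only show the idea of holding out whole species.

```python
import random


def split_by_species(records, always_test=("sars-cov-2",), n_extra=3, seed=0):
    """Hold out whole species so that no test species is seen during training."""
    species = sorted({r["species"] for r in records})
    candidates = [s for s in species if s not in always_test]
    rng = random.Random(seed)
    # The species in `always_test` plus `n_extra` randomly chosen ones are held out.
    extra = rng.sample(candidates, min(n_extra, len(candidates)))
    held_out = set(always_test) | set(extra)
    train = [r for r in records if r["species"] not in held_out]
    test = [r for r in records if r["species"] in held_out]
    return train, test, held_out


# Toy records with placeholder species names and two sequences per species.
names = ["sars-cov-2", "virus a", "virus b", "virus c", "virus d", "virus e"]
records = [{"species": s, "sequence": "MKV"} for s in names for _ in range(2)]
train, test, held_out = split_by_species(records)
print(len(train), len(test))  # 4 8
```

setting `n_extra=0` gives the variant that validates on sars-cov-2 only.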
a growing number of studies demonstrate the success of using artificial neural networks (ann) in evaluating biological sequences in drug repositioning and repurposing [9] [10]. previous work on training neural networks on nucleotide or amino-acid sequences has been successful with recurrent models such as gated recurrent units (gru), long short-term memory networks (lstm) and bidirectional lstms (bilstm), as well as 1d convolutions and 2d convolutional neural networks (cnn) [11] [12]. we have therefore focused on these network architectures, and conducted our experiments with an lstm with 1d convolutions and bidirectional layers as well as a cnn. the network architectures are explained briefly below. lstm and 1d convolutions: for the lstm, a character-level tokenizer was used to encode the fasta sequences into vectors consumable by the network. the sequences were then padded with zeros or cut off to a fixed length of 500 to maintain a fixed input size. the network architecture consisted of an embedding layer, followed by 1d convolution and bidirectional lstm layers (each followed by maxpooling), and two fully connected layers. a more detailed architecture diagram is available in appendix d. convolutional neural network (cnn): for the cnn, the input features were one-hot encoded based on the fasta alphabet/charset, which assisted in interpretability when examining the 2d input arrays as images. the inputs are also fixed at a length of 500, resulting in 500 x 28 images, where 28 is the number of elements in the fasta charset. the network architecture consists of four 2d convolutions with filter sizes of 1x28, 2x28, 3x28 and 5x28 respectively, which are maxpooled, concatenated and passed through a fully connected layer. a more detailed architecture diagram is available in appendix e. the experiments were run on a computer with a 2.7 ghz intel broadwell cpu (61 gb ram) and nvidia k80 gpu (12 gb). both models completed a 20-epoch experiment in 60-90 minutes. 
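the two input encodings described above can be sketched in plain python. the exact 28-symbol fasta charset is not listed in the text, so the alphabet below (26 letters plus "*" and "-") is an assumption, as are the function names and the choice to map unknown symbols to the padding value.

```python
import string

# Assumed 28-symbol alphabet: 26 letters plus '*' (stop) and '-' (gap).
CHARSET = string.ascii_uppercase + "*-"
CHAR_TO_ID = {c: i + 1 for i, c in enumerate(CHARSET)}  # 0 is reserved for padding
MAX_LEN = 500


def encode_ids(seq, max_len=MAX_LEN):
    """Character-level token ids, cut off or zero-padded to `max_len` (lstm input)."""
    # Symbols outside CHARSET are mapped to 0, the same value used for padding.
    ids = [CHAR_TO_ID.get(c, 0) for c in seq.upper()[:max_len]]
    return ids + [0] * (max_len - len(ids))


def one_hot(seq, max_len=MAX_LEN):
    """One-hot matrix of shape max_len x 28 (cnn input); padding rows are all zero."""
    rows = []
    for token in encode_ids(seq, max_len):
        row = [0] * len(CHARSET)
        if token:
            row[token - 1] = 1
        rows.append(row)
    return rows


ids = encode_ids("MKV")
image = one_hot("MKV")
print(len(ids), len(image), len(image[0]))  # 500 500 28
```

in a deep learning framework the same result would normally come from the framework's own tokenizer and padding utilities; the pure python version is used here only to make the shapes explicit.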
one to three training and evaluation runs were made for each setup during model and hyperparameter selection, and ten training and evaluation runs were done to produce the average metrics in section 3. the experiments start by determining the model to use and applying the appropriate preprocessing steps mentioned in section 2.2. we then proceed with determining the dataset to train and validate on. this part of the experiment setup is more extensively covered in section 2.1.4, train/test splitting. we used binary cross entropy (bce) loss with the adam optimizer, and precision, recall and f1-score as metrics, since accuracy tends to be an unreliable metric given the class imbalance and the sparse nature of our outputs. after training and validation, predictions were done on the validation set and the results were post-processed for interpretability. in post-processing, we applied a threshold to the sigmoid function outputs of the neural network, where we assigned each drug a probability of being a potential antiviral for a given amino acid sequence. after experimenting with different values, we settled on a threshold value of 0.2. postprocessing outputs a list of drugs that were selected along with the respective probabilities for the drugs being "effective" against the virus with the given amino acid sequence. for other hyperparameters involved as well as information on hyperparameter tuning, see appendix f. here we present the results for the two experiments described in section 2.1.4, train/test splitting. the figures and tables presented in this section are based on the lstm and cnn architectures described in section 2.2, which were trained with a batch size of 128 and learning rates of 0.01 and 0.001, respectively, for 20 epochs with the adam optimizer. in the regular setup, we performed an 80%/20% train-test split on our data of 30,479 sequences. 
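the post-processing step described above amounts to filtering the sigmoid outputs for one sequence. a minimal sketch follows; the function name, the placeholder drug names and the probabilities are hypothetical, and treating a probability exactly equal to the threshold as selected is an assumption.

```python
def select_drugs(probabilities, drug_names, threshold=0.2):
    """Return (drug, probability) pairs at or above `threshold`, most probable first."""
    picked = [(d, p) for d, p in zip(drug_names, probabilities) if p >= threshold]
    return sorted(picked, key=lambda pair: pair[1], reverse=True)


# Hypothetical sigmoid outputs for a single amino acid sequence.
drug_names = ["drug a", "drug b", "drug c", "drug d"]
probabilities = [0.05, 0.35, 0.2, 0.9]
selected = select_drugs(probabilities, drug_names)
print([d for d, _ in selected])  # ['drug d', 'drug b', 'drug c']
```

a low threshold such as 0.2 keeps drugs that the model flags often but with modest probability, at the cost of more false positives.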
the metrics for the best set of hyperparameters (based on validation set f1-score) for both the cnn and lstm architectures respectively are presented in table 1 . similarly, plots for the same set of models and hyperparameters over 20 epochs are presented in figures 1 and 2 . our models handled the task successfully, achieving 0.958 f1-score in a multi-label multi-class problem setting. this means that the models were able to match the virus species with the sequence substructures and appropriately assign the inhibiting antivirals with accuracy. these satisfactory results led to us implementing experiment ii. in experiment ii, the models predicted antiviral drugs for virus species they haven't been trained on. this meant the models were not able to recommend drugs by "recognizing" the virus from the sequence and therefore had to rely only on peptide substructures in the sequences to assign drugs. in the results presented below, the test set consists of sars-cov-2, herpes simplex virus 1, human astrovirus and ebola virus, whose sequences were removed from the training set. we see here that the cnn (and the lstm) had issues with convergence, and the accuracies are clearly below their counterparts in the regular setup, though this is certainly expected. we now turn to the actual predictions on the sequences and attempt to interpret them. upon examination of drug predictions for herpes simplex virus 1 (hsv-1), however, we see that our cnn was in fact quite successful. in table 3 and table 4 , count represents how many times each drug was flagged as potentially effective for hsv-1 sequences, and mean probability denotes the average confidence predicted over all instances of the drug. a sample of the outputs where these metrics are derived from is available in appendix g. antivirals used for phase ii and further trials for hsv-1 are highlighted in bold, meaning all six drugs in the database that are used for phase ii and further trials are predicted by our model. 
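the count and mean probability columns described above can be derived from the per-sequence outputs roughly as follows. this is a sketch under the assumption that each sequence yields a list of (drug, probability) pairs that passed the threshold; the function name and the toy values are hypothetical.

```python
from collections import defaultdict


def summarize(per_sequence_outputs):
    """Aggregate per-sequence predictions into (drug, count, mean probability) rows."""
    totals = defaultdict(lambda: [0, 0.0])  # drug -> [count, sum of probabilities]
    for output in per_sequence_outputs:
        for drug, prob in output:
            totals[drug][0] += 1
            totals[drug][1] += prob
    rows = [(drug, n, s / n) for drug, (n, s) in totals.items()]
    # Rank by how often a drug was flagged, then by its mean probability.
    return sorted(rows, key=lambda row: (-row[1], -row[2]))


# Hypothetical outputs for three sequences of one virus species.
outputs = [
    [("drug a", 0.9), ("drug b", 0.3)],
    [("drug a", 0.7)],
    [("drug b", 0.5), ("drug c", 0.25)],
]
table = summarize(outputs)
print([row[0] for row in table])  # ['drug a', 'drug b', 'drug c']
```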
three of the top five predictions are approved antivirals for hsv-1 and the only remaining one is predicted 11th among 126 antivirals. this high level of accuracy is remarkable given that our model has not been trained on hsv-1 sequences. predictions for sars-cov-2: with some variation between the two, both the lstm (table 4a) and the cnn (table 4b) seem to converge on a number of drugs: ritonavir, lopinavir (both phase iii for mers-cov), tilorone (approved for mers-cov) and brincidofovir are in the top five candidates in both, while valacyclovir, ganciclovir, rapamycin and cidofovir rank high up in both lists. most of the remaining drugs are present in both lists as well. the lstm is more conservative in its predictions than the cnn, and the overall counts for sars-cov-2 are significantly lower than for herpes simplex virus 1 for both, pointing to a comparative lack of confidence on the models' part in their predictions for sars-cov-2 sequences. a further step we took for the sars-cov-2 sequences was visualizing the layer activations in the zetane engine to validate that the model was processing the data at a fine-grained level. this was done in similar fashion to a study where integrated gradients were used to generate attributions on a neural network performing molecule classification [13]. the layer activations in both models showed that different antivirals activated different subsequences of a given sequence at the amino acid level, which supports our approach. the filter activations are available in appendix h. the preliminary results of our experiments show promise and merit further investigation. we note that our ml models predict that some antivirals that show promise as treatments against mers-cov may also be effective against sars-cov-2. these include the broad-spectrum antiviral tilorone [14] and the drug lopinavir [15], the latter of which is now in phase iv clinical trials to determine its efficacy against covid-19 [16]. 
such observations suggest with confidence that our models can recognize reliable patterns between particular antivirals and species of viruses containing homologous amino acid sequences in their proteome. additional observations that support our findings have come to light from a study in the lancet published shortly before this article [3] . this open-label, randomized, phase ii trial observed that the combined administration of the drugs interferon beta-1b, lopinavir, ritonavir and ribavirin provides an effective treatment of covid-19 in patients with mild to moderate symptoms. both of our models flagged three of the drugs in that trial (note that interferon was not part of our datasets). in terms of number of occurrences aka count, ritonavir, lopinavir and ribavirin were ranked 4th, 5th and 11th by the lstm, while the cnn model ranked them 3rd, 4th and 10th, respectively. other studies have also focused on the treatment of sars-cov-2 by drugs predicted in our experiments. wang et al. discovered that nitazoxanide (lstm rank 10th, cnn rank 8th) inhibited sars-cov-2 at a low-micromolar concentration [6] . gordon 12th) is known to be effective against diverse coronaviruses [4] . such observations are encouraging. they demonstrate that predictive models may have value in identifying potential therapeutics that merit priority for advanced clinical trials. they also add to growing observations that support using ml to streamline drug discovery. from that perspective, our models suggest that the broad spectrum antiviral tilorone, for instance, may be a top candidate for covid-19 clinical trials in the near future. other candidates highlighted by our results and may merit further studies are brincidofovir, foscarnet, artesunate, cidofovir, valacyclovir and ganciclovir. the antivirals identified here have some discrepancies with emerging research findings as well. for instance, our models did not highlight the widely available anti-parasitic ivermectin. 
one research study observed that ivermectin could inhibit the replication of sars-cov-2 in vitro [17] . another large-scale drug repositioning survey screened a library of nearly 12,000 drugs and identified six candidate antivirals for sars-cov-2: pikfyve kinase inhibitor apilimod, cysteine protease inhibitors mdl-28170, z lvg chn2, vby-825, and ono 5334, and the ccr1 antagonist mln-3897 [18] . it comes as no surprise that our models did not identify these compounds as our data sources did not contain them. future efforts to strengthen our ml models will thus require us to integrate a growing bank of novel data from emerging research findings into our ml pipeline. in terms of our machine learning models, better feature extraction can improve predictions drastically. this step involves improvements through better data engineering and working with domain experts who are familiar with applied bioinformatics to better understand the nature of our data and find ways to improve our data processing pipeline. some proposals for future work that could strengthen the performance of our machine learning process are as follows: 1. deeper interaction with domain experts and further lab testing would lead to a better understanding of the antivirals and the amino-acid sequences they target, leading to building better ml pipelines for drug repurposing. 2. better handling of duplicates can improve the quality of data available. the current approach (which is based on species and sequence length) can be improved through using string similarity measures such as dice coefficient, cosine similarity, levenshtein distance etc. 3. influenza and hiv datasets should be integrated into the data generation and processing pipeline to enhance available data. 4. vectorizers can be used to extract features as n-grams (small sequences of chars), which has attained success in similar problems [19] . 
other unsupervised learning methods such as singular value decomposition also may be applicable to our study [20] . we hope that the machine learning approaches and pipelines developed here may provide longterm benefit to public health. the fact that our results show much promise in streamlining drug discovery for sars-cov-2 motivates us to adapt our current models so we can conduct identical drug repurposing assessments for other known viruses. moreover, experimental data suggests that our approaches are generalizable to other viruses (see the hsv-1 example in section 3.2, experiment ii) -we are therefore confident that we could adapt our models to conduct equivalent studies during the next outbreak of a novel virus. this also means our methods can be used to repurpose existing drugs in order to find more potent treatments for known viruses. the direct beneficiaries of our findings are members of the clinical research community. using relatively few resources, ml-guided drug repurposing technology can help prioritize clinical investigations and streamline drug discovery. in addition to reducing costs and expediting clinical innovation, such efficiency gains may reduce the number of clinical trials -and thus human subjects used in risky research -needed to find effective treatments (this pertains to the ethical imperative to avoid harm when possible). also of importance is that in-silico analyses using machine learning provide yet another means to employ past research findings in new investigations. ml-guided drug repurposing thus provides means to obtain further value from knowledge on-hand; maximizing value in this case is laudable on many fronts, especially in terms of providing maximum benefit from publicly-funded research. the negative consequences that could arise should our models fail appear limited but are noteworthy nonetheless. 
note that our models aim only to indicate possible therapeutics that merit further clinical investigation in order to prove any antiviral activity against sars-cov-2. should our models fail by recommending spurious treatments, these incorrect predictions may divert limited time and resources towards frivolous investigations. it should also be noted that our methods aim primarily to serve as guidance for medical experts, and not as a be-all-end-all solution; any incorrect inferences made by our models are bound to be detected early by medical experts.
communicating any machine-learning predictions of tentative antiviral drugs from this study requires much caution. the current pandemic continues to demonstrate how fear, misinformation and a lack of knowledge about a novel communicable disease can encourage counterproductive health-seeking behaviour amongst the public. soon after the coronavirus became a widely understood threat, the internet was awash in false, sometimes downright harmful, information about preventing and treating covid-19. included within this misleading health information were premature claims by some prominent government officials that therapeutics like chloroquine and hydroxychloroquine might hold promise as repurposed drugs for covid-19. such unfounded advice caused avoidable poisonings among people self-medicating with chloroquine. subsequent clinical investigations demonstrated no notable benefit and potential adverse reactions to chloroquine when used to treat covid-19. such unfortunate events remind us that preliminary findings may be misinterpreted as conclusive treatments or as evidence to support inconclusive health claims.
the hyperparameters tested in our experiments are presented in section f. it is certainly possible to improve the accuracies of our experiments by covering more of the loss landscape through more extensive training (e.g. running longer experiments with smaller learning rates on more complex network architectures), especially for the results in experiment ii. however, due to performance constraints, the scope of hyperparameter tuning as well as the range of ann architectures experimented on were relatively limited, as we focused on the methodology as opposed to optimal performance in this study. it should be noted that much improvement is possible on this front, as pointed out in discussion and future work. additional notes regarding our observations during hyperparameter tuning are presented below.
• for the threshold, we wanted to predict eagerly, i.e. we considered false negatives more costly errors than false positives. a high threshold would mean the outputs would be composed only of the antivirals our models are very confident about per amino acid sequence. we deem this undesirable: while we do hope these outputs narrow the scope of antivirals to focus on, over-restricting could prevent antivirals that are predicted frequently yet with low probability from being detected. a low threshold such as 0.2 filtered the number of antivirals sufficiently, but also left enough breathing room for the domain experts to draw their own conclusions on a per-drug basis.
• while a larger sequence length cutoff was possible and not detrimental to the results, we deemed 500 a suitable trade-off in terms of performance versus accuracy, as many sequences do not reach lengths in the thousands to begin with.
• as mentioned, the number of epochs trained could be increased, as we did not see dramatic signs of overfitting at 20 epochs or further. however, a flattening of the metrics was evident around 20 epochs with the hyperparameters listed, which was therefore selected as a suitable stopping point.
table 10: a section of sample outputs for amino acid sequences and their associated antivirals.
post-processing outputs a list of drugs that were selected along with the respective probabilities of the drugs being "effective" against the virus with the given amino acid sequence.
[1] survey of machine learning techniques in drug discovery
[2] drug repositioning: a machine-learning approach through data integration
[3] triple combination of interferon beta-1b, lopinavir-ritonavir, and ribavirin in the treatment of patients admitted to hospital with covid-19: an open-label, randomised, phase 2 trial
[4] cyclosporin a inhibits the replication of diverse coronaviruses
[5] a sars-cov-2 protein interaction map reveals targets for drug repurposing
[6] remdesivir and chloroquine effectively inhibit the recently emerged novel coronavirus (2019-ncov) in vitro
[7] discovery and development of safe-in-man broad-spectrum antiviral agents
[8] ncbi viral genomes resource
[9] drug repurposing using deep embeddings of gene expression profiles
[10] deepdr: a network-based deep learning approach to in silico drug repositioning
[11] protein family classification with neural networks
[12] deepsf: deep convolutional neural network for mapping protein sequences to folds
[13] using attribution to decode binding mechanism in neural network models for chemistry
[14] tilorone: a broad-spectrum antiviral invented in the usa and commercialized in russia and beyond
[15] a systematic review of lopinavir therapy for sars coronavirus and mers coronavirus-a possible reference for coronavirus disease-19 treatment option
[16] corona virus drugs - a brief overview of past, present and future
[17] the fda-approved drug ivermectin inhibits the replication of sars-cov-2 in vitro
[18] a large-scale drug repositioning survey for sars-cov-2 antivirals
[19] near perfect protein multi-label classification with deep neural networks
[20] neural networks for full-scale protein sequence classification: sequence encoding with singular value decomposition
we would like to thank the administrators of the drugvirus and the ncbi virus portal for providing the datasets that are central to this study.
we appreciate comments on preliminary drafts of this manuscript from dr tariq daouda from the massachusetts general hospital, broad institute, harvard medical school. the authors declare they will not obtain any direct financial benefit from investigating and reporting on any given pharmaceutical compound. the following study is funded by the authors' employer, zetane systems, which produces software for ai technologies implemented in industrial and enterprise contexts.
c database profile
c.1 virus counts before dropping duplicate sequences
key: cord-264296-0x90yubt authors: sawmya, shashata; saha, arpita; tasnim, sadia; anjum, naser; toufikuzzaman, md.; rafid, ali haisam muhammad; rahman, mohammad saifur; rahman, m. sohel title: analyzing hcov genome sequences: applying machine intelligence and beyond date: 2020-06-03 journal: biorxiv doi: 10.1101/2020.06.03.131987 sha: doc_id: 264296 cord_uid: 0x90yubt
the covid-19 pandemic, caused by the sars-cov-2 strain of coronavirus, has affected millions of people all over the world and taken thousands of lives. it is of utmost importance that the character of this deadly virus be studied and its nature be analysed. we present here an analysis pipeline comprising a phylogenetic analysis of strains of this novel virus to track its evolutionary history among the countries, uncovering several interesting relationships, followed by a classification exercise to identify the virulence of the strains and the extraction of important features from its genetic material, which are used subsequently to predict mutation at those interesting sites using deep learning techniques. in a nutshell, we have prepared an analysis pipeline for hcov genome sequences leveraging the power of machine intelligence and uncovered what remained apparently shrouded by raw data.
covid-19 was declared a global health pandemic on march 11, 2020 [1]. it is the biggest public health concern of this century [22].
it has already surpassed the previous two coronavirus outbreaks, namely, those caused by severe acute respiratory syndrome coronavirus (sars-cov) and middle east respiratory syndrome coronavirus (mers-cov). the virus behind this pandemic is known as severe acute respiratory syndrome coronavirus 2, or sars-cov-2 for short. it is a single-stranded rna virus whose genome is typically 26,000 to 32,000 bases long [2]. the novel coronavirus is spherical in shape and has spike proteins protruding from its surface. these spikes assimilate into human cells, then undergo a structural change that allows the viral membrane to fuse with the cell membrane. the viral genes then enter the host cell and are copied within it, producing multiple new viruses [3]. as of mid-april, 2020, about 10,000 high-quality complete genome sequences were present in the gisaid initiative database [23], collected from clinicians and researchers from around the world. to understand the viral evolution and its nature of spread among the different countries, we present an analysis pipeline of the genome sequence leveraging the power of machine intelligence. this paper makes the following key contributions.
a. an alignment-free phylogenetic analysis is carried out with the goal of uncovering the evolutionary history of sars-cov-2. the resulting phylogenetic tree is able to highlight evolutionary relationships that can be explained by facts and figures and has further identified some mysterious relationships.
b. several machine learning and deep learning models are used to identify the virulence of the strains (i.e., to classify a virus strain as either severe or mild). additionally, from the classification pipeline, important features are identified as sites of interest (sois) in the virus strains for further analysis.
c. several cnn-rnn based models are used to predict mutations at specific sites of interest (sois) of the sars-cov-2 genome sequence, followed by further analyses of the same on several south-asian countries.
d. overall, we present an analysis pipeline that can be further utilized as well as extended and revised (a) to study where a newly discovered genome sequence lies in relation to its predecessors in different regions of the world; (b) to analyse its virulence with respect to the number of deaths its predecessors have caused in their respective countries and (c) to analyse the mutation at specific important sites of the viral genome.
figure 1: the whole analysis pipeline consists of three phases. in the first phase, the genome sequences are divided into subsets based on country and a phylogenetic tree is constructed considering only the "representative" sequences of each such subset using an alignment-free sequence comparison approach. in the second phase, we employ state-of-the-art classification algorithms, leveraging both traditional and deep learning pipelines, to learn to discriminate the viral strains of many countries as either mild or severe. we also identify the features that contributed the most as discriminant factors in the classification pipeline. finally, we use the identified features from the previous stage to predict the mutation of the interesting sites in the viral strain using a deep learning model.
figure 1 presents our overall analysis pipeline. below we present the details of the pipeline. we have collected 10179 hcov genome sequences up to 24 april, 2020 (the cut-off date) from the gisaid initiative dataset [23]. these are high quality complete viral genome sequences submitted by the scientists and scientific institutes of individual countries. we have also collected country-wise death statistics (up to the cut-off date) from the official site of who [6].
the label was assigned based on a threshold on deaths, namely the estimated median of the number of deaths in the data points. any genome sequence from a country having deaths below (above) the threshold was considered a mild (severe) strain, i.e., assigned a label 0 (1). a sample labelling is shown in the supplementary table 1. for completeness, we have also considered some other metrics for labeling purposes, albeit with unsatisfactory output (please see supplementary file for details). we divided the whole dataset into training and testing subsets in an 80/20 ratio with a balanced number of data points per class for the traditional machine learning pipeline; for the deep learning classification routine, we created training/validation/testing subsets in a 68/12/20 ratio.
figure 2: the viral genome sequences were divided into subsets of sequences based on country. for each subset, each viral genome sequence is converted into a vector representation and the pairwise euclidean distance was calculated among the vectors to create the distance matrix. as the matrix is very high-dimensional, we used principal component analysis to find the principal component matrix from the distance matrix. representative sequences were identified through k-means clustering on the pca matrix, and a phylogenetic tree was constructed from the representative sequence of each country.
we aim to identify and interpret the evolutionary relationships among the hcov genome sequences uploaded at gisaid from different regions around the globe (figure 2). to do that, we have used an alignment-free genome sequence comparison method as proposed in [5] and briefly described below. notably, we do not consider any alignment-based method since it is not computationally feasible for us to align thousands of viral sequences for analysis and clustering purposes [4]. at first the sequence set is divided into subsets of sequences based on the location. all sequences are converted into representative vectors in ℝ^18.
pairwise distance among vectors derived from the fast vector method [5] are computed using euclidean distance. due to the high dimensionality of the resulting distance matrix, we resort to principal component analysis (pca) technique [9] to reduce the dimension of the matrix. subsequently, we use k-means clustering [43] to identify the corresponding cluster centers. for the k-means clustering algorithm, we have used the implementation of [38] and used the default parameters except for the number of clusters which were set to 1 for determining the cluster center for each of the subsets. for each location-based cluster, the representative sequence (i.e., the "centroid" of the cluster) is then identified and used in the subsequent step of the pipeline. the evolutionary relationship among the representative sequences of different clusters (from section 2.2) has been estimated by constructing a phylogenetic tree. we have used the neighbor joining algorithm [37] for phylogenetic tree construction since it is more reliable [25] . we have used euclidean distance among the vectors, as described in the section 2.2, to prepare the distance matrix. while we predominantly have used the alignment-free method of [5] , in this stage, we have only 67 representative sequences and hence we have also attempted a few other alignment-free and alignment-based methods to estimate the phylogenetic tree; however, these didn't produce satisfactory results (more details are in supplementary file). for traditional machine learning, we use a pipeline similar to [12] (see figure 3 in supplementary file). we extracted three types of features from the genomic sequence of novel sars-cov-2. inspired by the recent works [12] [14] [64] [65] that focus only on sequences, we also extract only sequence-based features. these features are: position independent features, n-gapped dinucleotides and position specific features (see details in section 3 of supplementary file). 
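the three sequence-based feature families named above can be sketched in plain python as follows. the exact definitions used in the study are given in its supplementary file and are not reproduced here, so the definitions below (k-mer counts for position independent features, pairs of bases separated by a fixed gap for n-gapped dinucleotides, and a fixed-width window read with 1-based inclusive coordinates for position specific features) are assumptions made for illustration, as are all function names.

```python
from collections import Counter
from itertools import product

BASES = "ACGT"


def position_independent(seq, k=2):
    """Count each k-mer over the whole sequence, ignoring where it occurs."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    return {"".join(p): counts.get("".join(p), 0)
            for p in product(BASES, repeat=k)}


def gapped_dinucleotides(seq, gap=1):
    """Count base pairs (x, y) that have exactly `gap` bases between them."""
    counts = Counter(seq[i] + seq[i + gap + 1]
                     for i in range(len(seq) - gap - 1))
    return {a + b: counts.get(a + b, 0) for a in BASES for b in BASES}


def position_specific(seq, start, width=5):
    """Return (feature name, bases) for a window; `start` is assumed 1-based."""
    end = start + width - 1
    return f"pos_{start}_{end}", seq[start - 1:end]
```

for example, `position_specific(genome, 8445)` would yield a feature named `pos_8445_8449`, matching the naming pattern used for position specific features later in the paper.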
we use the gini value of the extremely randomized tree (extra tree) classifier [13] to rank the features. subsequently, only the features with a gini value greater than the mean of the gini values are selected for training a lightgbm classifier model [15] (with default parameters), which is evaluated with 10-fold cross validation. lightgbm is a highly efficient and fast gradient boosting framework which uses tree-based algorithms. we use shap values and univariate feature selection to compare the importance of the features. shap (shapley additive explanations) is a game theoretic approach which is used to explain the output of a model [44]. univariate feature selection works by selecting the best features based on univariate statistical tests [50]. we use selectkbest univariate feature selection to get the top k highest scoring features according to the anova f_classif feature scoring function [56].
we leverage the power of 3 different deep learning (dl) classification models, namely, vanilla cnn [7], alexnet [40] and inceptionnet [41]. we transform the raw viral genome sequences into two different representations, namely, k-mers spectral representation [7] and one hot vectorization [8], to feed those into the dl networks in a seamless manner. details of these representations are given in section 5.2 of the supplementary file. for k-mers spectral representation we experimented with different values of k (k = 3, 5, 7 for vanilla cnn and k = 3 and 5 only for the rest due to resource limitations). for one hot vectorization, we have trained inceptionnet for 150 epochs for both 3- and 5-mers and trained alexnet for 135, 100 and 100 epochs for 3-, 4- and 5-mers respectively.
we design a pipeline to predict mutation on specific sites (chosen in an earlier stage of the pipeline) in the sars-cov-2 genome (figure 4). we follow a protocol similar to that of [10] and adapt it to fit our setting as follows.
we divide all the available countries and the states of the usa into different time-steps by the date of the first reported incidence of sars-cov-2 infected patients of that location. thus, every resulting time-step represents a date (tk for cluster k) and contains the clusters of genome sequences of the countries/states. then the time series samples are generated by concatenating sites from different time-step one-by-one that represent the evolutionary path of the sars-cov-2 viral strain. for example, t1 is the very first date when the virus is discovered in china. so, the time-step 1 contains only one country, china. likewise, time-step t2 contains clusters for those countries where the virus is discovered on date t2 and so on. (check table 3 in supplementary file for more details). we generate 300000 time series sequences by concatenating genome sites from t1,t2,....,tn (in our case, n = 40) and then fed the samples to the model which consists of a convolutional one dimensional layer and a recurrent neural network layer [34] . we experiment with both pure lstm and bidirectional lstm as our rnn layer (see section 4.3 of supplementary file). the model has a dense layer of 4 neurons in the end which predicts the probability of the next base pair of the next time-step. so, in a nut-shell the model takes concatenated genome sequences from t1,t2,....,tn-1 as input and predicts the mutation for time tn. we further use our mutation prediction pipeline to identify and analyze possible parents of a mutated strain. for this particular analysis, we trained the models specifically for some south-asian countries, namely, bangladesh, india and pakistan. we only used the best performing model for this analysis and generated five time series samples. at the time of generating these samples, the country/location having the minimal euclidean distance was taken for each time-step. we have implemented our experiments mostly in python. 
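the sample-generation step of the mutation prediction pipeline described above can be sketched in python as follows. this is a minimal illustration under stated assumptions, not the authors' code: `make_samples` and `one_hot` are hypothetical names, each time-step is represented simply as a list of site strings, one strain per time-step is drawn at random (the study instead describes a minimal-euclidean-distance choice for its south-asian analysis), and the whole site at tn is kept as the prediction target.

```python
import random

BASES = "ACGT"


def one_hot(base):
    """Encode one base as a 4-vector, matching a 4-neuron output layer."""
    return [1 if base == b else 0 for b in BASES]


def make_samples(time_steps, n_samples, seed=0):
    """Build (history, target) pairs from site strings grouped by time-step.

    time_steps[k] holds the site strings observed at time-step t_(k+1).
    The history concatenates one site per step for t1..t(n-1); the target
    is the site drawn for the final step tn.
    """
    rng = random.Random(seed)
    samples = []
    for _ in range(n_samples):
        path = [rng.choice(step) for step in time_steps]  # one strain per step
        history = "".join(path[:-1])                      # model input
        target = path[-1]                                 # site to predict
        samples.append((history, target))
    return samples
```

a model of the kind described (a one-dimensional convolution followed by an lstm and a 4-neuron dense layer) would then consume the one-hot encoded history and be scored against the one-hot encoded target bases.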
we have used scikit-learn library [38] for clustering and plotting the graphs. for deep learning models, scikit-learn, tensorflow and keras neural network libraries are used and for lightgbm classifier, python lightgbm framework has been used. the phylogenetic trees are constructed using the dendropy library of python [57] keeping default parameters. we use the tree visualizer tools dendroscope [11] and evolview [24] for tree visualization and annotation. the experiments have been conducted in the following machines: a) clustering and phylogenetic analyses have been carried out in a machine with intel(r) core (tm) i7-6500u cpu @ 2.50ghz, ubuntu 19.04 os and 8 gb ram. b) experiments involving the deep learning pipelines (i.e., both classification and mutation prediction) have been conducted in the work-stations of galileo cloud computing platform [35] and the default gpu provided by the google colaboratory cloud computing platform [36] . c) the lightgbm classifier model was trained in a machine with intel core i5-4010u cpu @ 1.70ghz x 4, windows 10 os and 16 gb ram. all the codes and data (except for the genome sequences) of our pipeline can be found at the following link: https://github.com/pythonloader/analyzing-hcov-genome-sequence. the genome sequence data have been extracted from and are publicly available at gisaid [23] . we identify the representative sequence of each of the 67 countries as present in the gisaid dataset (upto cut-off date). the estimated phylogenetic tree constructed from the representative sequences is shown in figure 5 . in what follows, we will be referring to this tree as the sc2 (sars-cov-2) tree. the phylogenetic tree generated is expected to reveal the evolutionary relationship of the viral strains. however, with careful scrutiny we have some apparently unusual but interesting observations. 
for example, it is generally expected that countries sharing (open) borders (e.g., countries in europe) should be either neighbours or at least in the same clade in the tree. however, surprisingly, we do not see geographically adjacent countries in europe as neighbors in the tree; rather, we see, for example, that china and italy are immediate neighbors. it is to be noted that these two countries were also the first to be hit by the first pandemic wave. in addition, although the usa and canada share the longest un-militarized international border in the world, their representative strains do not appear as sister branches, as might have been expected. also, we notice that the usa, uk, canada, turkey and russia, which have a higher number of deaths than most of the other countries, are in the same clade.
all our classifiers are trained to learn whether a given strain is mild or severe. the classification accuracy of the lightgbm classifier (~91%) is superior to that of the deep learning classifiers (~84-89%), which, while somewhat surprising, is in line with the recent findings of [12]. it should be noted that lightgbm produced better results in significantly less time than the deep learning models for this dataset. the results of the classifier models are shown in figure 6. quantitative results aside, we have also applied our classifiers to the sequences that were deposited at gisaid after the cut-off date (i.e. april 18, 2020). since the cut-off date, the country-wise death statistics [6] have changed significantly, and this has pushed a few countries, particularly from asian regions, and several states of the united states of america to transition from the mild to the severe state (based on our predefined threshold). interestingly, our classifiers have been able to correctly predict the severity of the new strains submitted from these countries/states.
table 6 in the supplementary file shows a snapshot of a few such countries/states with the relevant information. we preliminarily identify the top 10 features of shap and selectkbest feature selection (with k=10). from these features, as sois, we have selected the features that are also biologically significant, i.e., cover different significant gene expression regions (figure 7). in particular, we have selected the position specific features pos_8445_8449, pos_19610_19614, pos_24065_24069 and pos_23825_23829 as the sois for the mutation prediction analyses down the pipeline. here, pos_x_y indicates the site from positions x to y of the virus strains. the reasons for selecting these features as sois are outlined below. according to gene expression studies [62] [63], our sois pos_8445_8449 and pos_19610_19614 encode two non-structural proteins, nsp3 and nsp11, respectively, while our other two sois, pos_24065_24069 and pos_23825_23829, correspond to the spike protein of sars-cov-2. nsp3 binds to viral rna, the nucleocapsid protein, as well as other viral proteins, and participates in polyprotein processing. it is an essential component of the replication/transcription complex [51]. so, a mutation in this protein is expected to affect the replication process of sars-cov-2 in host bodies. on the other hand, the spike protein sticks out from the envelope of the virion and plays a pivotal role in receptor host selectivity and cellular attachment. according to wan et al., there exists strong scientific evidence that the sars and sars-cov-2 spike proteins interact with angiotensin-converting enzyme 2 (ace2) [52]. a mutation in this protein is expected to have a significant impact on human-to-human transmission [53]. therefore, it is certainly interesting and useful to predict the mutation of such sois.
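to make the pos_x_y convention concrete, the short python sketch below parses an soi name and reads the corresponding site out of a genome string, then reports the sois at which two strains differ. the helper names are hypothetical, and the reading of x and y as 1-based inclusive positions is an assumption; the paper does not state the indexing convention.

```python
SOIS = ["pos_8445_8449", "pos_19610_19614",
        "pos_24065_24069", "pos_23825_23829"]


def parse_soi(name):
    """Split a name such as 'pos_8445_8449' into (start, end) integers."""
    _, start, end = name.split("_")
    return int(start), int(end)


def extract_soi(genome, name):
    """Return the bases at an SOI, assuming 1-based inclusive positions."""
    start, end = parse_soi(name)
    return genome[start - 1:end]


def soi_differences(genome_a, genome_b, names=SOIS):
    """Map each SOI name to (site_a, site_b) wherever the two strains differ."""
    diffs = {}
    for name in names:
        site_a, site_b = extract_soi(genome_a, name), extract_soi(genome_b, name)
        if site_a != site_b:
            diffs[name] = (site_a, site_b)
    return diffs
```

this simple slicing presumes that both genomes share one coordinate system; insertions or deletions upstream of a site would shift positions and would need to be handled before such a comparison.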
cnn-lstm and cnn-bidirectional lstm performed similarly across the different sois of the genome, registering 94.98% and 95% accuracy, respectively, considering all sois together. for detailed results please check table 7 and table 8 of the supplementary material. for the model involving only bangladesh, we applied the cnn-bidirectional lstm model (as this is the better performer of the two) and achieved almost 100% accuracy. then we analyzed the ancestors in the time series test samples and noticed that some of the states of the usa are present in these samples. these states are california, massachusetts, texas, new jersey and maryland. for india and pakistan, we got similar results for some sites, but for other sites the accuracy was not as high as for bangladesh (check table 9 of the supplementary file for details).
our analyses reveal a very close (evolutionary) relationship between the genome sequences of china and italy. also, similarity was found among the virus strains of the usa, germany, qatar and poland. these countries have similar numbers of deaths and, although not geographically adjacent (except for germany and poland), they have strong air connectivity among them. in fact, a number of interesting relationships can be inferred from the estimated phylogenetic tree as follows.
1. the first two cases in italy were chinese tourists [26]. this relationship is clearly portrayed in the sc2 tree, where the two strains appear to be immediate siblings.
2. poland's strain is in the same clade as that of germany, which can be explained by the fact that its strain (through poland's patient zero) came from germany [27].
3. taiwan is geographically very close to china. the virus was confirmed to have spread to taiwan on january 21, 2020, through a 55-year-old woman who had been teaching in wuhan, china [28]. the virus strains from these regions are also close together, as can be seen from the sc2 tree, about 6 branches apart.
a similar relationship can also be inferred from the tree between china and south korea: the strain of the virus in south korea is believed to have been transmitted from china, firstly through a 35-year-old chinese woman and secondly by a 55-year-old south korean national [29]. interestingly, from the sc2 tree it can also be deduced that the south korean strain is very close to that of taiwan and also near to the strain from china. the incident of a taiwanese woman being deported from south korea after refusing to stay at a quarantine facility is a probable explanation of how the south korean strain might have found its way to taiwan [46].
4. on march 2, 2020, the virus was confirmed to have reached portugal, when it was reported that a 33-year-old portuguese man working in spain tested positive for covid-19 after returning home [49]. subsequently, within a span of 9 days, 5 more cases were reported, all originating from spain [49] [61]. the fact that the first cases of covid-19 in portugal originated from spain is clearly captured in our sc2 tree.
5. the sc2 tree suggests that india's strain is closely related to that from china and also italy (around 4 branches) and that it is also connected to that from saudi arabia. these relationships can be explained as follows. a
7. turkey's first identified case was a man who was travelling in europe [33]. turkey also announced a huge number of cases and subsequent deaths, which originated from europe [47]. in our inferred relationship, we can see that the turkish representative strain is quite close to those of several central and western european countries like russia, iceland and ireland, which can be backed up by the two facts stated above.
8. it is visible from the sc2 tree that the strain of germany is very close to the strains of both poland and the usa.
it might be the case that community transmission occurred concurrently in both the usa and poland from germany, which hit the peak of the pandemic before both the usa and poland [42].
9. qatar has the second highest number of covid-19 patients in the middle-east [48]. the first case in qatar, reported on february 27, 2020, was a man working in iran [55]. qatar introduced a travel ban to and from germany and the usa as a precautionary measure in mid-march, quite a while after the first occurrence. qatar has 5 air-routes with germany and the usa, with more than 10 airlines operating on those routes [59] [60]. though the first case originated from iran, it might be the case that subsequent patients were found to have travelled from the aforementioned countries, as a result of which the travel ban was introduced. our estimated sc2 tree places qatar very close to both the usa and germany.
10. while we can certainly explain many of the relationships identified by the estimated sc2 tree above, there are some relationships which are not that apparent. one such example is the direct relationship between vietnam and greece. while there apparently exists no direct relationship, when we investigated further, we identified something interesting. patient zero of greece is believed to have been infected during her trip to the milan fashion week, which took place during february 18-24, 2020 [45]. interestingly, the first covid-19 patient in hanoi [16] left hanoi on february 15 to visit family members living in london, england, and three days later she traveled from london to milan. could she have been in contact with patient zero of greece, or with anyone else who had been infected by the latter, before returning to london on february 20? we can't be certain, but our inferred relationship between vietnam and greece certainly lends considerable legitimacy to that question.
11.
finally, we are unable to find any apparent explanation, from the reported news sources, for a few other strong relationships inferred by the tree (e.g., congo-iran, panama-malaysia, sweden-singapore, japan-australia, etc). this could be because of the inherent inaccuracies of the distance matrices as well as the limitations of the tree estimation algorithms: none of these algorithms are 100% accurate. from another angle, perhaps the tree did identify these relationships correctly, but the relevant incidents were not accurately identified or not documented.
in recent times, the number of deaths has been increasing rapidly in india. we have been closely following the change in the virus strains of india before and after the cut-off date. a genome sequence (epi_isl_435050) was collected on april 13, 2020 (before our cut-off date) from a patient in ahmedabad, gujarat, india. it was predicted to be a severe strain (with low confidence) even though at that time we trained the classifier to consider the indian sequences as mild. according to our evolutionary relationship, india is very close to both italy and china. so, we calculated the distance between this strain and the representative sequences of both italy and china. we then considered another strain (epi_isl_437447), which was collected from another patient from the same place in india on april 26, 2020 (after our cut-off date), and predicted its severity. the classifiers declared this isolate to be severe with very high confidence (about 98%). we repeated the distance calculation as before. interestingly, this isolate was found to be closer to the representative sequences of both italy and china than the previous, less severe one. this strongly suggests that there were some mutations that turned the indian sequences from mild or less severe to severe or highly severe, respectively.
also, the sequences from the us states of pennsylvania, maryland, indiana, illinois and florida that were collected on may 25, 2020 (about one month after our cut-off date) were analyzed and our classifiers could correctly capture the severity of the genome sequences (see table 4 in the supplementary file). we conduct an analysis to predict possible parents of the (mutated) virus strains of the south asian region (bangladesh, india and pakistan). our mutation prediction pipeline suggests that the strains of some states of the usa, namely, california, massachusetts, texas, new jersey and maryland could be the parents/ancestors of these south asian strains. now, the total deaths in these states up to june 1, 2020 are 4240, 6846, 1686, 11711 and 2532 respectively [58] and the strains thereof are also classified to be severe by our classification pipeline. it thus seems quite likely that the sars-cov-2 situation in these south-asian countries will worsen in near future. bangladesh, india and pakistan are ranked 88 th , 112 th and 122 nd in global health performance compared to the united states of america which is at the 37 th position [54] . in the majority of lower middle-income countries such as bangladesh, india and pakistan, available hospital beds are < 1 bed per 1000 population and icu beds are < 1 bed per 100,000 population [39] . additionally, an uncontrolled epidemic is predicted to have 6,000,220 deaths having a duration of nearly 200 days in the majority of these countries [39] . these predictions coupled with our findings call for stern actions (i.e., interventions) on part of these countries. 
bibliography: covid-19) outbreak situation genomic characterisation and epidemiology of 2019 novel coronavirus: implications for virus origins and receptor binding cryo-em structure of the 2019-ncov spike in the prefusion conformation alignment-free sequence comparison: benefits, applications, and tools a novel fast vector method for genetic sequence comparison who coronavirus disease (covid-19) dashboard a deep learning approach to dna sequence classification dna sequence classification by convolutional neural network principal component analysis and factor analysis. (n.d.). principal component analysis springer series in statistics tempel: time-series mutation prediction of influenza a viruses via attention-based recurrent neural networks dendroscope 3: an interactive tool for rooted phy-logenetic trees and networks crisprpred(seq): a sequence-based method for sgrna on target activity prediction using traditional machine learning extra tree forests for sub-acute ischemic stroke lesion segmentation in mr sequences isgpt: an optimized model to identify sub-golgi protein types using svm and random forest based feature selection lightgbm: a highly efficient gradient boosting decision tree vietnam confirms 17th covid-19 patient -vnexpress international india confirms its first coronavirus case kerala defeats coronavirus; india's three covid-19 the weather channel, the weather channel india's first coronavirus death is confirmed in karnataka coronavirus: india 'super spreader' quarantines 40,000 people 40,000 indians quarantined after 'super spreader' ignores government advice responding to covid-19 -a once-in-a-century pandemic? data, disease and diplomacy: gisaid's innovative contribution to global health evolview, an online tool for visualizing, annotating and managing phylogenetic trees why neighbor-joining works coronavirus, primi due casi in italia: sono due turisti cinesi koronawirus w lubuskiem. 44 godziny, dwa razy za wolno. 
daleko do laboratorium taiwan confirms 1st wuhan coronavirus case (update) austria's 2 coronavirus cases are italian citizens greece confirms first coronavirus case, a woman back from milan as coronavirus takes hold, greece worries about migrant camps turkey remains firm, calm as first coronavirus case confirmed human mitochondrial genome compression using machine learning techniques google colaboratory the neighbor-joining method: a new method for reconstructing phylogenetic trees scikit, scikitlearn.org/stable/modules/generated/sklearn.cluster.kmeans.html dynamic interventions to control covid-19 pandemic: a multivariate prediction modelling study comparing 16 worldwide countries imagenet classification with deep convolutional neural networks going deeper with convolutions europe's coronavirus numbers offer hope as us enters 'peak of terrible pandemic' algorithm as 136: a k-means clustering algorithm consistent individualized feature attribution for tree ensembles greece's 'patient zero' shares coronavirus experience (lead) taiwanese woman deported for refusing to stay at quarantine facility sağlık bakanı fahrettin koca: pozitif çıkan yeni vakalarımız var -türkiye haberleri flights from qatar, www.qatar.to/united-states/qatar-to-united-states ministra confirma primeiro caso positivo de coronavírus em portugal scikit, scikitlearn.org/stable/modules/feature_selection.html#univariate-feature-selection nsp3 of coronaviruses: structures and functions of a large multi-domain protein receptor recognition by the novel coronavirus from wuhan: an analysis based on decade-long structural studies of sars coronavirus role of changes in sars-cov-2 spike protein in the interaction with the human ace2 receptor: an in silico analysis measuring overall health system performance for 191 countries. global programme on evidence forhealth policy discussion paper no. 
30 qatar reports first case of coronavirus sklearn.feature_selection.f_classif ¶ dendropy: a python library for phylogenetic computing flights from qatar, www.qatar.to/germany/qatar-to-germany flights from qatar, www.qatar.to/united-states/qatar-to-united-states single-stranded rna genome of sars-cov2 sars-cov-2 (severe acute respiratory syndrome coronavirus 2) sequences antigenic: an improved prediction model of protective antigens dpp-pseaac: a dna-binding protein prediction model using chou's general pseaac key: cord-000473-jpow6iw1 authors: astrovskaya, irina; tork, bassam; mangul, serghei; westbrooks, kelly; măndoiu, ion; balfe, peter; zelikovsky, alex title: inferring viral quasispecies spectra from 454 pyrosequencing reads date: 2011-07-28 journal: bmc bioinformatics doi: 10.1186/1471-2105-12-s6-s1 sha: doc_id: 473 cord_uid: jpow6iw1 background: rna viruses infecting a host usually exist as a set of closely related sequences, referred to as quasispecies. the genomic diversity of viral quasispecies is a subject of great interest, particularly for chronic infections, since it can lead to resistance to existing therapies. high-throughput sequencing is a promising approach to characterizing viral diversity, but unfortunately standard assembly software was originally designed for single genome assembly and cannot be used to simultaneously assemble and estimate the abundance of multiple closely related quasispecies sequences. results: in this paper, we introduce a new viral spectrum assembler (vispa) method for quasispecies spectrum reconstruction and compare it with the state-of-the-art shorah tool on both simulated and real 454 pyrosequencing shotgun reads from hcv and hiv quasispecies. experimental results show that vispa outperforms shorah on simulated error-free reads, correctly assembling 10 out of 10 quasispecies and 29 sequences out of 40 quasispecies. 
while shorah has a significant advantage over vispa on reads simulated with sequencing errors due to its advanced error correction algorithm, vispa is better at assembling the simulated reads after they have been corrected by shorah. vispa also outperforms shorah on real 454 reads. indeed, 7 most frequent sequences reconstructed by vispa from a real hcv dataset are viable (do not contain internal stop codons), and the most frequent sequence was within 1% of the actual open reading frame obtained by cloning and sanger sequencing. in contrast, only one of the sequences reconstructed by shorah is viable. on a real hiv dataset, shorah correctly inferred only 2 quasispecies sequences with at most 4 mismatches whereas vispa correctly reconstructed 5 quasispecies with at most 2 mismatches, and 2 out of 5 sequences were inferred without any mismatches. vispa source code is available at http://alla.cs.gsu.edu/~software/vispa/vispa.html. conclusions: vispa enables accurate viral quasispecies spectrum reconstruction from 454 pyrosequencing reads. we are currently exploring extensions applicable to the analysis of high-throughput sequencing data from bacterial metagenomic samples and ecological samples of eukaryote populations.
many viruses (including sars, influenza, hbv, hcv, and hiv) encode their genome in rna rather than dna. unlike dna viruses, rna viruses lack the ability to detect and repair mistakes during replication [1] and, as a result, their mutation rate can be as high as 1 mutation per each 1,000-100,000 bases copied per replication cycle [2] . many of the mutations are well tolerated and passed down to descendants, producing a family of co-existing related variants of the original viral genome referred to as quasispecies, a concept that originally described a mutation-selection balance [3] [4] [5] [6] [7] . the diversity of viral sequences in an infected individual can cause the failure of vaccines and virus resistance to existing drug therapies [8] . therefore, there is a great interest in reconstructing genomic diversity of viral quasispecies. knowing sequences of the most virulent variants can help to design effective drugs [9, 10] and vaccines [11, 12] targeting particular viral variants in vivo.
briefly, the 454 pyrosequencing system shears the source genetic material into fragments of approximately 300-800 bases. millions of single-stranded fragments are sequenced by synthesizing their complementary strands. repeatedly, nucleotide reagents are flown over the fragments, one nucleotide (a, c, t, or g) at a time. light is emitted at a fragment location when the flown nucleotide base complements the first unpaired base of the fragment [13, 14] . multiple identical nucleotides may be incorporated in a single cycle, in which case the light intensity corresponds to the number of incorporated bases. however, the number of incorporated bases (referred to as the homopolymer length) cannot be estimated accurately for long homopolymers, which results in a relatively high percentage of insertion and deletion sequencing errors (respectively 65%-75% and 20%-30% of all sequencing errors [15, 16] ). the software provided by instrument manufacturers was originally designed to assemble all reads into a single genome sequence, and cannot be used for reconstructing quasispecies sequences. thus, in this paper we address the quasispecies spectrum reconstruction (qsr) problem: given a collection of 454 pyrosequencing reads generated from a viral sample, reconstruct the quasispecies spectrum, i.e., the set of sequences and the relative frequency of each sequence in the sample population. a major challenge in solving the qsr problem is that the quasispecies sequences are only slightly different from each other. the number of differences between quasispecies, and their distribution along the genome, vary significantly between virus species, as different species have different mutation rates and genomic architectures. in particular, due to the lower mutation rate and longer conserved regions, hcv quasispecies are harder to reconstruct than quasispecies of hbv and hiv.
additionally, the qsr problem is made difficult by the limited read length and relatively high error rate of high throughput sequencing data generated by current technologies. the qsr problem is related to several well-studied problems: de novo genome assembly [17] [18] [19] , haplotype assembly [20, 21] , population phasing [22] and metagenomics [23] . as noted above, de novo assembly methods are designed to reconstruct a single genome sequence, and are not well-suited for reconstructing a large number of closely related quasispecies sequences. haplotype assembly does seek to reconstruct two closely related haplotype sequences, but existing methods do not easily extend to the reconstruction of a large (and a priori unknown) number of sequences. computational methods developed for population phasing deal with large numbers of haplotypes, but rely on the availability of genotype data that conflates information about pairs of haplotypes. metagenomic samples do consist of sequencing reads generated from the genomes of a large number of species. however, differences between the genomes of these species are considerably larger than those between viral quasispecies. furthermore, existing tools for metagenomic data analysis focus on species identification, as reconstruction of complete genomic sequences would require much higher sequencing depth than that typically provided by current metagenomic datasets. in contrast, achieving high sequencing depth for viral samples is very inexpensive, owing to the short length of viral genomes. mapping based approaches to qsr are naturally preferred to de novo assembly since reference genomes are available (or easy to obtain) for viruses of interest, and viral genomes do not contain repeats. thus, it is not surprising that such approaches were adopted in the two pioneering works on the qsr problem [24, 25] . eriksson et al. 
[24] proposed a multi-step approach consisting of sequencing error correction via clustering, haplotype reconstruction via chain decomposition, and haplotype frequency estimation via expectation-maximization, with validation on hiv data. in westbrooks et al. [25] , the focus is on haplotype reconstruction via transitive reduction, overlap probability estimation and network flows, with application to simulated error-free hcv data. recently, the qsr software tool shorah was developed [26] and applied to hiv data [27] . another combinatorial method for qsr was also developed and applied to hiv and hbv data in [28] , with results similar to those of shorah.
our contributions in this paper are as follows:
• a novel qsr tool called viral spectrum assembler (vispa) taking into account sequencing errors at multiple steps,
• comparison of vispa with shorah on hcv synthetic data both with and without sequencing errors, and
• statistical and experimental validation of the two methods on real 454 pyrosequencing reads from hcv and hiv samples.
our method for inferring the quasispecies spectrum of a virus sample from 454 pyrosequencing reads consists of the following steps (see fig. 1 ):
• constructing the consensus virus genome sequence for the given sample and aligning the reads onto this consensus,
• preprocessing aligned reads to correct sequencing errors,
• constructing a transitively reduced read graph with vertices representing reads and edges representing overlaps between them,
• selecting paths in the read graph that correspond to the most probable quasispecies sequences, and assembling candidate sequences for selected paths by weighted consensus of reads, and
• estimating candidate sequence frequencies by em.
below we describe each step separately. we assume that a reference genome sequence of the particular virus strain is available (e.g., from ncbi [29] ).
since viral genomes do not have sizable repeats and the quasispecies sequences are usually close enough to the reference sequence, the majority of reads can typically be uniquely aligned onto the reference genome. however, a significant number of reads may remain unaligned due to differences between the reference genome and sequences in the viral sample. in order to recover as many of these reads as possible, we iteratively construct a consensus genome sequence from aligned reads. in particular, we first align 454 pyrosequencing reads to the reference sequence using the segemehl software [30] . then we extend the reference sequence with a placeholder i for each nucleotide inserted by at least one uniquely aligned read. similarly, we add a placeholder d to the read sequence for each reference nucleotide missing from the aligned read. then we perform sequential multiple alignment of the previously aligned reads against this extended reference sequence. finally, the consensus genome sequence is obtained by (1) replacing each nucleotide in the extended reference with the nucleotide or placeholder in the majority of the aligned reads and (2) removing all i and d placeholders, respectively corresponding to rare insertions and to deletions found in a majority of reads. reads may contain a small portion of unidentified nucleotides denoted by n's; we treat n as a special allele value matching any of the nucleotides a, c, t, g, as well as the placeholders i and d. iteratively, we replace the reference with the consensus and try to align the reads for which no acceptable alignment was previously found. our experiments on a dataset consisting of approximately 31,000 454 pyrosequencing reads generated from a 5.2kb-long hcv fragment (see data description in results and discussions) show that 85% of reads are uniquely aligned onto the reference sequence and an additional 9% of the reads are aligned onto the final consensus sequence.
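the majority-vote step of the consensus construction can be sketched as follows. this is only an illustration under simplifying assumptions, not the vispa implementation: reads are assumed to be already aligned and padded with the i and d placeholders, each read is given as a hypothetical (start, sequence) pair, n casts no vote because it matches any symbol, and uncovered columns are reported as n.

```python
from collections import Counter

def majority_consensus(aligned_reads, length):
    """Majority-vote consensus over reads aligned to an extended reference.

    aligned_reads: list of (start, sequence) pairs (hypothetical input format);
    sequences may contain the placeholders I and D and the wildcard N.
    """
    columns = [Counter() for _ in range(length)]
    for start, seq in aligned_reads:
        for offset, symbol in enumerate(seq):
            if symbol != "N":              # N matches anything, so it casts no vote
                columns[start + offset][symbol] += 1

    consensus = []
    for votes in columns:
        if not votes:
            consensus.append("N")          # assumption: uncovered column stays unknown
            continue
        symbol = votes.most_common(1)[0][0]
        if symbol not in ("I", "D"):       # placeholders are removed from the consensus
            consensus.append(symbol)
    return "".join(consensus)
```

a column whose majority symbol is i corresponds to an insertion seen in only a minority of reads, and a column whose majority symbol is d corresponds to a deletion seen in most reads, so both are dropped.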
reads that cannot be aligned onto the final consensus are removed from the further consideration. since aligned reads contain insertions and deletions, we use placeholders i and d to simplify position referencing among the reads. all placeholders are treated as additional allele values but they are removed from the final assembled sequences. first, we substitute each deletion in the aligned reads with placeholder d. deletion supported by a single read is replaced either with the allele value, which is present in all other reads overlapping this position, or with n, signifying an unknown value, otherwise. next, we fill with placeholder i each gap in a read corresponding to the insertions in the other reads. all insertions supported by a single read are removed from consideration. we begin with the definition of the read graph, introduced in [25] and independently in [24] , and then describe the adjustments that need to be made to read graph construction and edge weights to account for sequencing errors as well as the high mutation rate between quasispecies. the read graph g = (v, e) is a directed graph with vertices corresponding to reads aligned with the consensus sequence. for a read u, we denote by b(u), respectively e(u), the genomic coordinate at which the first, respectively the last, base of u gets aligned. a directed edge (u, v) connects read u to read v if a suffix of u overlaps with a prefix of v and they coincide across the overlap. two auxiliary vertices -a source s and a sink t are added such that s has edges into all reads with zero indegree and t has edges from all reads with zero outdegree. then each st-path corresponds to a possible candidate quasispecies sequence. the read graph is transitively reduced, i.e., each edge e = (u, v) is removed if there is a uv-path not including edge e. note that certain reads can be completely contained inside other reads. 
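the read graph definition above can be illustrated with a small sketch for the error-free case. the input format (begin coordinate plus sequence), the function name and the quadratic pairwise comparison are assumptions made for brevity; the auxiliary source and sink vertices are omitted.

```python
def build_read_graph(reads):
    """Return (superreads, edges) for error-free aligned reads.

    reads: iterable of (begin, sequence) pairs (hypothetical input format).
    edges: set of (i, j) index pairs into superreads, transitively reduced.
    """
    reads = sorted(set(reads))                     # drop exact duplicates

    def end(r):
        return r[0] + len(r[1]) - 1                # e(u): coordinate of the last base

    def agree(u, v):
        lo, hi = max(u[0], v[0]), min(end(u), end(v))
        return all(u[1][x - u[0]] == v[1][x - v[0]] for x in range(lo, hi + 1))

    def contained(v, u):
        return u[0] <= v[0] and end(v) <= end(u) and agree(u, v)

    # superreads are reads not contained in any other read
    superreads = [v for v in reads
                  if not any(u != v and contained(v, u) for u in reads)]

    # edge (u, v): a suffix of u overlaps a prefix of v and they coincide there
    succ = {i: set() for i in range(len(superreads))}
    for i, u in enumerate(superreads):
        for j, v in enumerate(superreads):
            if u[0] < v[0] <= end(u) < end(v) and agree(u, v):
                succ[i].add(j)

    def has_detour(i, j):
        """True if j is reachable from i without using the edge (i, j)."""
        stack, seen = [k for k in succ[i] if k != j], set()
        while stack:
            x = stack.pop()
            if x == j:
                return True
            if x not in seen:
                seen.add(x)
                stack.extend(succ[x])
        return False

    edges = {(i, j) for i in succ for j in succ[i] if not has_detour(i, j)}
    return superreads, edges
```

because every edge goes from a read with a smaller begin coordinate to one with a larger begin coordinate, the graph is acyclic, so testing each edge for a detour in the original graph yields the transitive reduction.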
let a superread refer to a read that is not contained in any other read and let the rest of the reads be called subreads. subreads are not used in the construction of the read graph, but are taken into account in the final assembly of candidate sequences and frequency estimation. since the number of different st-paths is exponential, we wish to generate a set of paths that have high probability to correspond to real quasispecies sequences. in order to estimate path probability, we independently estimate for each edge e the probability p(e) that it connects two reads from the same quasispecies, and then multiply estimated probabilities for all edges on the path. under the assumption of independence between edges, if we assign to each edge e a cost equal to $-\log(p(e)) = \log(1/p(e))$, then the minimum-cost st-path will have the maximum probability to represent a quasispecies sequence. for reads without errors, [25] estimated the probability $p_\Delta$ that two reads u and v connected by edge (u, v) belong to the same quasispecies in terms of $\Delta$, the overhang between reads u and v [25] , $N$ = #reads, $Q$ = #quasispecies, and $L$ = #starting positions. thus, in this case the cost of an edge with overhang $\Delta$ can be approximated by $\Delta \propto \log(1/p_\Delta)$. to account for sequencing errors, we adjust the construction of the read graph to allow for mismatches. we use three parameters: (1) n = #mismatches allowed between a read and a superread, (2) m = #mismatches allowed in the overlap between two adjacent reads, and (3) t = #mismatches expected between a read and a random quasispecies. the probability that two reads u and v with j mismatches within an overlap of length $o = e(u) - b(v)$ belong to the same quasispecies is then estimated using ε, the estimated 454 sequencing error rate. as in the case of error-free reads, defining the edge cost as the logarithm of the inverse of this probability ensures that st-paths with low cost correspond to the most likely quasispecies sequences.
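the equivalence between maximizing the product of edge probabilities and minimizing the sum of the costs log(1/p(e)) can be checked on a toy graph. the graph, its vertex names and its probabilities below are invented purely for illustration; paths are enumerated by brute force, which is only feasible for very small graphs.

```python
import math

def all_paths(graph, src, dst):
    """Enumerate all src-dst paths of a small acyclic graph."""
    if src == dst:
        yield [src]
        return
    for nxt in graph.get(src, {}):
        for rest in all_paths(graph, nxt, dst):
            yield [src] + rest

def path_probability(graph, path):
    # product of edge probabilities, assuming independence between edges
    return math.prod(graph[a][b] for a, b in zip(path, path[1:]))

def path_cost(graph, path):
    # sum of edge costs -log p(e) = log(1 / p(e))
    return sum(-math.log(graph[a][b]) for a, b in zip(path, path[1:]))

# hypothetical edge probabilities p(e)
toy_graph = {"s": {"a": 0.9, "b": 0.5}, "a": {"t": 0.4}, "b": {"t": 0.8}}
paths = list(all_paths(toy_graph, "s", "t"))
most_probable = max(paths, key=lambda p: path_probability(toy_graph, p))
cheapest = min(paths, key=lambda p: path_cost(toy_graph, p))
```

since the logarithm is monotone, `most_probable` and `cheapest` are the same path, and exponentiating the negated cost recovers the path probability.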
to generate a set of high-probability (low-cost) paths that are rich enough to explain observed reads, we compute for each vertex in the read graph the minimum cost st-path passing through it. finding these paths is computationally fast. indeed, we only need to compute two shortest-paths trees in g, one outgoing from s and one incoming into t; the shortest st-path passing through a vertex v is the concatenation of the shortest sv- and vt-paths. preliminary simulation experiments (see additional file 1) show that better candidate sets are generated when edge costs c defined by (1) and (2) are replaced by $e^c$. in fact, an even faster-growing dependency on c yields better candidate sets. the fastest-growing cost effectively turns the shortest path into a so-called max-bandwidth path, i.e., a path that minimizes the maximum edge cost for the entire path and for each subpath. so, vispa generates candidate paths using this strategy. when no mismatches are allowed in the construction of the read graph, finding the candidate sequence corresponding to an st-path is trivial, since by definition adjacent superreads coincide across their overlap. when mismatches are allowed, we first assemble a consensus sequence from superreads used by the st-path. this may not be the best choice, especially when the coverage with superreads is low. hence, we replace each initial candidate sequence with a weighted consensus sequence obtained using both superreads and subreads of the path, as described below. for each read r, we compute the probability that it belongs to a particular initial candidate sequence s from l and L, the lengths of the read and initial candidate sequence, respectively, k, the number of mismatches between the read and the initial candidate sequence s, and t/L, the estimated mutation rate. the final candidate sequence is then computed as the weighted consensus over all reads, where the weight of a read is the probability that it belongs to the sequence.
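the two-tree computation of the cheapest st-path through each vertex can be sketched as below. this is a minimal illustration with additive, non-negative edge costs and dijkstra's algorithm; it does not implement the exponential or max-bandwidth costs that vispa is reported to use, and the adjacency-dictionary format and function names are assumptions.

```python
import heapq

def dijkstra(adj, source):
    """Shortest-path tree from source; adj maps vertex -> {neighbour: cost >= 0}."""
    dist, parent = {source: 0.0}, {source: None}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue                               # stale heap entry
        for v, c in adj.get(u, {}).items():
            nd = d + c
            if nd < dist.get(v, float("inf")):
                dist[v], parent[v] = nd, u
                heapq.heappush(heap, (nd, v))
    return dist, parent

def best_path_through_each_vertex(adj, s, t):
    """Map each vertex v on some st-path to (cost, path) of the cheapest st-path via v."""
    reverse = {}
    for u, nbrs in adj.items():
        for v, c in nbrs.items():
            reverse.setdefault(v, {})[u] = c
    dist_s, par_s = dijkstra(adj, s)               # tree outgoing from s
    dist_t, par_t = dijkstra(reverse, t)           # tree incoming into t
    result = {}
    for v in dist_s:
        if v not in dist_t:
            continue                               # v cannot reach t
        prefix, x = [], v
        while x is not None:                       # walk back from v to s
            prefix.append(x)
            x = par_s[x]
        prefix.reverse()
        suffix, x = [], par_t[v]
        while x is not None:                       # walk forward from v to t
            suffix.append(x)
            x = par_t[x]
        result[v] = (dist_s[v] + dist_t[v], prefix + suffix)
    return result
```

only two shortest-path computations are needed regardless of the number of vertices, which is why this step is fast; distinct vertices frequently share the same best path, so duplicates would have to be removed afterwards.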
note that, unlike the case without mismatches, the same candidate sequence can be obtained from different candidate st-paths, so we remove duplicates at the end of this step. we assume that the reads, with their observed frequencies, were generated from a quasispecies population $Q$ as follows. first, a quasispecies sequence $q \in Q$ is randomly chosen according to its unknown frequency $f_q$. a read starting position is generated from the uniform distribution and then a read r is produced from quasispecies q with j sequencing errors. the probability of this event is calculated as $h_{q,r} = \binom{l}{j}(1-\varepsilon)^{l-j}\varepsilon^{j}$, where l is the read length and ε is the sequencing error rate. in our simulation studies we use the following read data sets. in order to perform cross-validation on the assembly method, we simulate reads from a 1739-bp long fragment of the e1e2 region of 44 hcv sequences [32] , with sequence frequencies generated according to a specific distribution. in our simulation experiments, we use a geometric distribution (the i-th sequence is a constant factor more frequent than the (i + 1)-th sequence) to create sample quasispecies populations with different numbers of sequences randomly selected from the above-mentioned set. we first simulate reads without sequencing errors: the length of a read follows a normal distribution with a particular mean value and variance 400, and the starting position follows the uniform distribution. this simplified model of read generation has two parameters: the number of reads, which varies from 20k up to 100k, and the average read length, which varies from 200bp up to 600bp. additionally, we simulate 454 pyrosequencing reads from 10 quasispecies sequences (following a geometric distribution of frequencies) out of the 44 hcv sequences [32] using flowsim [33] . we generated 30k reads with average length 350bp. the data set data1 was received from the hcv research group at the institute of biomedical research, university of birmingham.
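the read-generation model with probability h for a read with j errors, introduced earlier in this section, lends itself to a standard mixture-model em for the frequencies. the sketch below is an assumption about how such an em could look, not the authors' code: reads are hypothetical (start, sequence, observed count) triples with known alignment positions, the uniform start-position term is dropped because it is the same for all candidates, and a fixed number of iterations replaces a convergence test.

```python
from math import comb

def h(candidate, start, read, eps):
    """Probability of producing `read` from `candidate` with j sequencing errors."""
    l = len(read)
    j = sum(a != b for a, b in zip(read, candidate[start:start + l]))
    return comb(l, j) * (1 - eps) ** (l - j) * eps ** j

def em_frequencies(candidates, reads, eps=0.01, iterations=100):
    """Estimate candidate frequencies; reads are (start, sequence, count) triples."""
    f = [1.0 / len(candidates)] * len(candidates)      # uniform initial guess
    table = [[h(c, s, r, eps) for c in candidates] for s, r, _ in reads]
    for _ in range(iterations):
        new = [0.0] * len(candidates)
        for row, (_, _, count) in zip(table, reads):
            z = sum(fq * hq for fq, hq in zip(f, row))
            if z == 0.0:
                continue                               # read explained by no candidate
            for q, (fq, hq) in enumerate(zip(f, row)):
                new[q] += count * fq * hq / z          # e-step: expected read share
        total = sum(new)
        f = [x / total for x in new]                   # m-step: renormalize
    return f
```

for example, with two candidates that differ only in their second half, reads covering the shared first half are split in proportion to the current frequency estimates, while reads covering the second half are assigned almost entirely to the candidate they match.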
data1 contains 30,927 reads obtained from the 5.2kb-long fragment of hcv-1a genome (which is more than a half of the entire hcv genome). the average (aligned) read length is 292bp, but both the read length and the depth of position coverage vary significantly (see additional file 1 for details). the variability in coverage depth is due to a strong bias in the sequence start points, reflecting the secondary structure of the template dna or rna used to generate the initial pcr products. as a result, shorter reads are produced by gc-rich sequences. data1 is available upon request from the authors. the hiv dataset [27] contains 55,611 reads from a mixture of 10 different hiv-1 quasispecies over a 1.5kb-long region that includes the pol protease and part of the pol reverse transcriptase. the aligned read length varies from 35bp to 584bp, with an average of about 345bp (see additional file 1 for details). in contrast to [27] , we do not filter out reads with low-quality scores. in all our experimental validations, we compare the proposed algorithm vispa with the state-of-the-art tool shorah as well as with vispa on shorah-corrected reads (shorahreads + vispa). we say the quasispecies sequence is captured if one of the candidate sequences exactly matches it. we measure the quality of assembly by the portion of real quasispecies sequences captured by candidate sequences ($\text{sensitivity} = \frac{TP}{TP+FN}$) and by the portion of captured sequences among candidate sequences ($\frac{TP}{TP+FP}$). here, we see advantage of vispa over shorah. following [24] , we measure the prediction quality of frequency distribution with kullback-leibler divergence, or relative entropy. given two probability distributions, relative entropy measures the "distance" between them, or, in other words, the quality of approximation of one probability distribution by the other distribution.
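as a rough illustration, relative entropy between a true and an estimated frequency distribution can be computed as follows. the dictionary representation, the natural logarithm and the restriction of the sum to sequences with positive probability under both distributions are assumptions of this sketch; no renormalization over the shared sequences is performed.

```python
from math import log

def relative_entropy(p, q):
    """Kullback-Leibler divergence sum_i p(i) * log(p(i) / q(i)).

    p, q: dicts mapping a sequence identifier to its probability.
    Only identifiers with positive probability in both dicts contribute.
    """
    shared = [i for i in p if p[i] > 0 and q.get(i, 0) > 0]
    return sum(p[i] * log(p[i] / q[i]) for i in shared)
```

identical distributions give a value of zero, and the measure is not symmetric in its two arguments.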
formally, the relative entropy between the true distribution P and the approximation distribution Q is given by the formula $D(P \,\|\, Q) = \sum_{i \in I} P(i) \log \frac{P(i)}{Q(i)}$, where summation is over all reconstructed original sequences $I = \{i \mid P(i) > 0, Q(i) > 0\}$, i.e., over all original sequences that have a match (exact or with at most k mismatches) among assembled sequences. the relative entropy decreases as the average read length increases. this is expected, since sensitivity increases with the average read length and em predicts the underlying distribution more accurately. the vispa algorithm considerably outperforms shorah (see fig. 2 (right)). however, shorah has a significant advantage over vispa on read data simulated by flowsim, both in prediction power and in robustness of results (see table 1 ). indeed, shorah correctly infers 3 out of 10 real quasispecies sequences whereas vispa reconstructs only 1 sequence. additionally, the 10 most frequent assemblies inferred by shorah are more robust, repeating up to 45% of times on 10%-reduced data versus 1% of times for vispa's assemblies. this advantage can be explained by superior read correction in shorah. if vispa is used on shorah-corrected reads, the results drastically improve: 5 quasispecies sequences are inferred and exactly 95% of times are repeated on reduced data, confirming that vispa is better in assembling sequences (see table 1 ).
experimental validation on 454 pyrosequencing reads from hcv samples
we first discuss the choice of parameters of the read graph and candidate sequence assembly from st-paths. then we give statistical validation for the obtained 10 most frequent quasispecies sequences. we infer the quasispecies spectrum based on the read graphs constructed with various numbers n and m (numbers of mismatches allowed for superreads and overlaps corresponding to edges). we sort the estimated frequencies in descending order and count the number of sequences whose cumulative frequency is 85%, 90%, and 95%. fig.
3 reports these numbers as a percent of the total number of candidate sequences. there is an obvious drop in percentage for all three categories if we allow up to n = 6 mismatches to cluster reads and up to m = 15 mismatches to create edges. in this case, the constructed read graph has no isolated vertices. to refine assembled candidate sequences, we use all reads and parameter t varying from 80bp till 350bp, or, in the other words, mutation rate varying from 1.75% up to 8% per sequence (which is in the range observed in [34] ). out of 3207 max-bandwidth paths, we obtain as much as 938 distinct sequences (t = 80) and as low as 755 sequences (t = 350) for different values of t [80; 350]. the neighbor-joining tree for the most frequent 10 candidate sequences obtained by vispa and shorah (see fig. 4 ) reminds a neighbor-joining tree for hcv quasispecies evolution. additionally, the most frequent candidate sequence found by vispa is 99% identical to one of the actual orfs obtained by cloning the quasispecies. the quasispecies sequence is considered found if one of candidate sequences matches it exactly (k = 0) or with at most k (1 or 9) mismatches. all methods are run 100 times on 10% -reduced data. for the i-th (i = 1, .., 10) most frequent sequence assembled on the whole data, we record its reproducibility, i.e., percentage of runs when there is a match (exact or with at most k mismatches) among 10 most frequent sequences found on reduced data. "reproducibility: max" and "reproducibility: average" report respectively maximum and average of those percentages." figure 3 percentage of candidate sequences which cumulative frequency is 85%, 90%, and 95%. the values on x-axis corresponds to the number of allowed mismatches during read graph construction. n_m means that up to n mismatches are allowed in superreads and up to m mismatches are allowed in edges. 
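the counts behind figure 3 can be obtained by sorting the estimated frequencies and accumulating them until a threshold is reached. the function below is a minimal sketch with a hypothetical name; it accepts either normalized frequencies or raw counts, since it compares against a fraction of their total.

```python
def sequences_needed(frequencies, threshold):
    """Number of most frequent sequences whose cumulative frequency reaches threshold.

    frequencies: estimated frequencies or counts; threshold: fraction in (0, 1].
    """
    total = sum(frequencies)
    running = 0.0
    for count, f in enumerate(sorted(frequencies, reverse=True), start=1):
        running += f
        if running >= threshold * total:
            return count
    return len(frequencies)
```

dividing the returned count by the total number of candidate sequences gives the percentage reported for each threshold.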
viral sequences containing internal stop codons are not viable, since the entire hcv genome consists of a single coding region for a large polyprotein. the number of reconstructed viable sequences can therefore serve as an accuracy measure for quasispecies assembly. out of the 10 most frequent sequences reconstructed by vispa, only 3 are not viable, while shorah is able to reconstruct only one viable sequence. this sequence has 99.94% similarity with vispa's fourth most frequent assembly. both methods returned similar frequency estimates for this sequence: 0.017% (shorah) and 0.019% (vispa). both shorah and vispa (n = 6, m = 15) are run on eight 2.66 ghz cpus with 8m cache. they take around 40 minutes to assemble sequences and estimate their frequencies. a smaller value of n increases vispa's runtime, since its bottleneck (candidate sequence assembly) is proportional to the number of reads times the number of paths. indeed, a smaller value of n results in a larger number of superreads in the read graph and thus in a larger set of candidate paths. for example, vispa runs for 90 minutes with n = 2, m = 2.

the plot in fig. 5 shows validation results for the 10 most frequent quasispecies sequences (with respect to em estimates) assembled on data1 by shorah and vispa (n = 6, m = 15, and t = 120). we repeatedly (100 times) deleted a randomly chosen 10% of the reads and ran both methods on each reduced read instance to reconstruct the quasispecies spectrum. the plot reports the percentage of runs in which each of the 10 most frequent sequences assembled on data1 is reproduced among the 10 most frequent quasispecies inferred on the reduced instances with no mismatches (k = 0), or with k = 1, 2, 5 mismatches. for example, for k = 0 shorah repeatedly (35% of the time) reconstructs only the third most frequent sequence, while vispa reconstructs 7 sequences in at least 15% of runs, and the most frequent sequence is reconstructed in 40% of runs. this plot shows that the sequences found by vispa are well reproducible.

figure 4. the neighbor-joining phylogenetic tree for the 10 most frequent hcv quasispecies variants on a 5,205 bp-long fragment obtained by vispa and shorah. sequences are labeled with the software name and their rank among the 10 most frequent assembled sequences.

figure 5. percentage of runs in which the i-th most frequent sequence is reproduced among the 10 most frequent quasispecies assembled on the 10%-reduced set of reads. the i-th point on the x-axis corresponds to the i-th most frequent sequence assembled on 100% of the reads. no data are shown for sequences that are reproduced in less than 5% of runs.

in order to compare vispa and shorah, we run both methods on the hiv dataset used in the first experiment in [27]. as said above, we do not preprocess reads with respect to their 454 quality scores, which may explain the poorer performance of shorah. indeed, shorah correctly infers only 2 quasispecies sequences with at most 4 mismatches: one assembly has 3 mismatches with the real quasispecies sequence, and the other has 4 mismatches. vispa correctly reconstructs 5 quasispecies with at most 2 mismatches (3 of them among the 10 most frequent assemblies): two sequences are inferred without any mismatches (one is among the 10 most frequent assemblies), one assembly has 1 mismatch with the real quasispecies sequence (and it is among the 10 most frequent assemblies), and the remaining sequences have 2 mismatches (one is among the 10 most frequent assemblies). the assemblies correspond to viable protein sequences. if vispa is applied to shorah-corrected reads, it can successfully infer three real quasispecies without any mismatches.

in this paper, we have proposed and implemented vispa, a novel software tool for quasispecies spectrum reconstruction from high-throughput sequencing reads. the vispa assembler takes into account sequencing errors at multiple steps, including mapping-based read preprocessing, path selection based on maximum bandwidth, and candidate sequence assembly using probability-weighted consensus techniques.
sequencing errors are also taken into account in vispa's em-based estimation of quasispecies sequence frequencies. we have validated our method on simulated error-free reads, flowsim-simulated reads with sequencing errors, and real 454 pyrosequencing reads from hcv and hiv samples. we are currently exploring extensions of vispa to paired-end reads; the main difficulty is the selection of pair-aware candidate paths. we also foresee application of vispa's techniques to the analysis of high-throughput sequencing data from microbial communities [23] and ecological samples of eukaryote populations [35]. the vispa source code is available at http://alla.cs.gsu.edu/~software/vispa/vispa.html.

additional file 1: supplementary materials. the file contains the derivation of the edge cost formula (2) and the em algorithm, an example of read graph construction, and an analysis of 454 pyrosequencing data.

rna virus quasispecies: significance for viral disease and epidemiology mutation rates among rna viruses rna virus mutations and fitness for survival the quasispecies (extremely heterogeneous) nature of viral rna genome populations: biological relevance -a review the molecular quasi-species hepatitis c virus (hcv) circulates as a population of different but closely related genomes: quasispecies nature of hcv genome distribution rapid evolution of rna viruses rna virus populations as quasispecies. 
current topics in microbiology and immunology computational methods for the design of effective therapies against drug resistant hiv strains hiv-1 subtype b protease and reverse transcriptase amino acid covariation the rational design of an aids vaccine diversity considerations in hiv-1 vaccine selection pyrosequencing: an accurate detection platform for single nucleotide polymorphisms genome sequencing in microfabricated high-density picolitre reactors pyrobayes: an improved base caller for snp discovery in pyrosequences quality scores and snp detection in sequencing-by-synthesis systems short read fragment assembly of bacterial genomes building fragment assembly string graphs whole-genome sequencing and assembly with high-throughput, short-read technologies hapcut: an efficient and accurate algorithm for the haplotype assembly problem algorithmic strategies for the single nucleotide polymorphism haplotype assembly problem 2snp: scalable phasing based on 2-snp haplotypes environmental genome shotgun sequencing of the sargasso sea beerenwinkel n: viral population estimation using pyrosequencing hcv quasispecies assembly using network flows deep sequencing of a genetically heterogeneous sample: local haplotype reconstruction and read error correction error correction of nextgeneration sequencing data and reliable estimation of hiv quasispecies combinatorial analysis and algorithms for quasispecies reconstruction using next-generation sequencing fast mapping of short sequences with mismatches, insertions and deletions using index structures maximum likelihood from incomplete data via the em algorithm (with discussions) hepatitis c virus continuously escapes from neutralizing antibody and t-cell responses during chronic infection in vivo characteristics of 454 pyrosequencing data-enabling realistic simulation with flowsim the quasispecies nature and biological implications of the hepatitis c virus. 
infection robust haplotype reconstruction of eukaryotic read data with hapler inferring viral quasispecies spectra from 454 pyrosequencing reads

authors' contributions: ia designed algorithms, developed software, performed analysis and experiments, and wrote the paper. bt performed analysis and experiments. sm contributed to developing software. kw designed algorithms and developed software. im contributed to designing the algorithms and writing the paper. pb supplied the hcv data and contributed to performing the analysis. az designed the algorithms, wrote the paper and supervised the project. all authors have read and approved the final manuscript. the authors declare that they have no competing interests.

key: cord-023647-dlqs8ay9 authors: nan title: sequences and topology date: 2003-03-21 journal: curr opin struct biol doi: 10.1016/0959-440x(91)90051-t sha: doc_id: 23647 cord_uid: dlqs8ay9 nan
b/0chera bnfhys r~ commun a novel gene member of the human giycophorin-a and glycophorin-b genc fatuily -molecular cloning and expression the x-chromosome of monotremes shares a highly conserved region with the eutherlan and marsupial x-o~romosomes despite the absence of x-chromosome ittactt~tion c~lract~tion and or~= nl~tion of dna sequences adjacent to the evidence for a new fmily of evolutionarily conserved homeobox genes elellatltlll and albolabrin purified peptides from viper venoms --homologies with the rgds domain of flbrinogen and yon willebrand pactor measurement of $~tiv~-site homology between potato and l~bbit muscle alpha-glum phosphoryiases through use of a iane~r free energy relationship white 1~ weiss 1~ the neuroflbromatosis typed gene encodes a protein related to gap the dna damage-inducible gcne-dinl of saocbarom3q~ewcet~#.s/ae encodes a regulatory subunit of elbonucleotide reductase and is identical to gnr3 fhlgegprinting of ne~lr-homogeneous dna hgase-i and ligase-h from eh,m~n cells --similarity of their amp-binding domains control of m11na st~mlity in • chnoc~qg.~um, by 3'inverted ltepeats: effects of stem and loop mutations on degradation ofxtmba mlna/n vt~ nuc/e~ ac alternative messenger rna structures of the ciil-gene of bacteriophage ~. determine the rate of its tt'ansbttion initiation alternative mrna structures of the cm genc of bacta~ophage ~ detc:'mine the rate of its translation initiation. j mo/b~0/1989 a model fog iina editing in klnetopiastid mltochondrla --guide rna molecules transcribed from max/circle dna provide the edited information elements and coding sequences. j mol bio11989, 210:417-427 . chang c-y, ~ d-a, mohandas til chung b-c: stt~ctut~e, ~-quence, chromo~maal location, and evolution of the human fercedoxin gene family. dna cell b/o/1990, 9:205-212 key: cord-004862-yv76yvy5 authors: demers, g. william; matunis, michael j.; hardison, ross c. 
title: the l1 family of long interspersed repetitive dna in rabbits: sequence, copy number, conserved open reading frames, and similarity to keratin date: 1989 journal: j mol evol doi: 10.1007/bf02106177 sha: doc_id: 4862 cord_uid: yv76yvy5 the l1 family of long interspersed repetitive dna in the rabbit genome (l1oc) has been studied by determining the sequence of the five l1 repeats in the rabbit β-like globin gene cluster and by hybridization analysis of other l1 repeats in the genome. l1oc repeats have a common 3′ end that terminates in a poly a addition signal and an a-rich tract, but individual repeats have different 5′ ends, indicating a polar truncation from the 5′ end during their synthesis or propagation. as a result of the polar truncations, the 5′ end of l1oc is present in about 11,000 copies per haploid genome, whereas the 3′ end is present in at least 66,000 copies per haploid genome. one type of l1oc repeat has internal direct repeats of 78 bp in the 3′ untranslated region, whereas other l1oc repeats have only one copy of this sequence. the longest repeat sequenced, l1oc5, is 6.5 kb long, and genomic blot-hybridization data using probes from the 5′ end of l1oc5 indicate that a full length l1oc repeat is about 7.5 kb long, extending about 1 kb 5′ to the sequenced region. the l1oc5 sequence has long open reading frames (orfs) that correspond to orf-1 and orf-2 described in the mouse l1 sequence. in contrast to the overlapping reading frames seen for mouse l1, orf-1 and orf-2 are in the same reading frame in rabbit and human l1s, resulting in a dicistronic structure. the region between the likely stop codon for orf-1 and the proposed start codon for orf-2 is not conserved in interspecies comparisons, which is further evidence that this short region does not encode part of a protein. orf-1 appears to be a hybrid of sequences, of which the 3′ half is unique to and conserved in mammalian l1 repeats.
the 5′ half of orf-1 is not conserved between mammalian l1 repeats, but this segment of l1oc is related significantly to type ii cytoskeletal keratin. the repeated dna sequences that are dispersed throughout eukaryotic genomes have been divided into two classes (reviewed by weiner et al. 1986). both classes appear to transpose by an rna intermediate, and the insertion of either class of repeated dna generates short flanking direct repeats at the target site, hallmarks of transposition first recognized in prokaryotes. one class of repeated dna resembles retroviruses in that members of this class are flanked by long terminal repeats (baltimore 1985). this class includes the yeast ty-1 repeat, the drosophila copia repeat, and the human the1 repeat (paulson et al. 1985). another class of repeated sequences resembles processed pseudogenes and lacks long terminal repeats (ltrs). this second class of repeats has been termed retroposons (rogers 1983), nonviral retroposons (weiner et al. 1986), and non-ltr retrotransposons (xiong and eickbush 1988). in this paper, this second class of rna-transposed repeats will be called retroposons. two groups of retroposons have been identified based on their length: the short interspersed repeats, or sines, that are less than 500 bp long, and the long interspersed repeats, or lines, that are greater than 6000 bp long (singer 1982).
fig. 1. repetitive dna in the rabbit β-like globin gene cluster. the β-like globin genes ε, γ, δ, and β are shown as boxes along the 45-kb segment of cloned dna (lacy et al. 1979). transcription of the active genes is from left to right. the location and orientation of l1 repeats are shown by the filled arrows. the l1 repeats are named l1oc1-l1oc5 (demers et al. 1986). the location and orientation of c repeats, a rabbit sine, are shown by the open arrows.
although no precise sequence specificity has been observed at the insertion sites, sines and lines do have a regional preference for integration in the human genome, as shown by the enrichment of different chromosome bands for either lines or sines (korenberg and rykowski 1988). although several different sequences have been dispersed as sines in mammals (reviewed in weiner et al. 1986), only one sequence element, called l1, has been found to be dispersed as a line in mammals (reviewed in singer and skowronski 1985). the l1 sequence has been identified in a wide variety of species including primates (lerman et al. 1983), mice (brown and dover 1981; fanning 1982), rats (economou-pachnis et al. 1985; soares et al. 1985; d'ambrosio et al. 1986), dogs (katzir et al. 1985), cats (fanning and singer 1987), and rabbits (demers et al. 1986). genomic blot-hybridization analysis indicates that the l1 sequence is present in all mammalian species at a frequency of about 10^4-10^5 copies per haploid genome (burton et al. 1986). although the parent genes of sines are transcribed by rna polymerase iii, the l1 repeats appear to be derived from an rna polymerase ii transcript. the parent gene of l1 is proposed to be a protein-coding gene (reviewed in singer and skowronski 1985). long open reading frames (orfs) are found in the l1 sequences (manuelidis 1982; martin et al. 1984; potter 1984), and sequenced members from the mouse genome have two overlapping orfs of 1137 bp (orf-1) and 3900 bp (orf-2) (shehee et al. 1987). the orf-2 regions of primate and rabbit l1 are 65% similar, but the similarity ends abruptly at a conserved stop codon (demers et al. 1986). in previous studies on the l1 repeats from rabbits (l1oc, for line 1 from oryctolagus cuniculus), the b, e, and d repeats identified by shen and maniatis (1980) were shown to be parts of the l1oc repeat.
the sequence of one truncated l1 repeat and part of another repeat were presented as a composite sequence, and the orf (corresponding to orf-2) and 3' untranslated region were identified (demers et al. 1986). in this paper, the rabbit l1 repeats are characterized more thoroughly, and the similarities and differences of l1 sequences between species are explored further. interspecies comparisons reinforce the conclusion that the l1 repeat has two orfs that are conserved for their protein-coding capacity. however, the region between the two orfs is not conserved among species, and this observation is used to indicate possible start and stop codons for the orfs. orf-1 encodes a composite protein, and the 5' half of orf-1 from l1oc is related to type ii cytoskeletal keratin.
subcloning and sequencing of l1oc repeats. the sequenced members of the l1oc family were from the rabbit β-like globin gene cluster isolated by lacy et al. (1979). interspersed repetitive dna was identified by shen and maniatis (1980) by hybridization and heteroduplex mapping. the five l1 members (demers et al. 1986) were sequenced by dideoxynucleotide chain termination reactions (sanger et al. 1977) using subclones in m13 phages as templates (messing 1983).
analysis of dna sequences. sequence matches were first identified by dot plots generated by the computer program matrix (zweig 1984). this provides a graphical display of sequence similarity that plots matches (forward similarity) of 23 out of 30 bases. similar sequences were then aligned by the computer program nucaln (wilbur and lipman 1983) using the parameters k-tuple = 3, window size = 20, gap penalty = 7. the protein sequence databases at the protein identification resource (national biomedical research foundation) were searched using the fastp program (lipman and pearson 1985).
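the 23-out-of-30 dot-plot criterion described above can be sketched in python as follows. this is a minimal illustration and not the matrix program itself, whose implementation is not described here; the function name dot_plot_matches and its parameters are hypothetical, and only forward-strand matches are scored.

```python
def dot_plot_matches(seq_a, seq_b, window=30, min_matches=23):
    """Return (i, j) window start positions (0-based) at which seq_a and
    seq_b share at least `min_matches` identical bases out of `window`.

    Hypothetical helper for illustration only: a brute-force scan of the
    forward strand, with no claim to match the original program.
    """
    hits = []
    for i in range(len(seq_a) - window + 1):
        for j in range(len(seq_b) - window + 1):
            # Count identical bases at each offset within the two windows.
            matches = sum(
                1 for k in range(window) if seq_a[i + k] == seq_b[j + k]
            )
            if matches >= min_matches:
                hits.append((i, j))
    return hits
```

each returned pair corresponds to one dot in the plot; long diagonal runs of dots mark extended regions of similarity. the scan is quadratic in sequence length, so it is suited only to short illustrative inputs.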
the statistical significance of the similarities found by fastp was tested using the program rdf (national biomedical research foundation); this program scrambles the target sequence (revealed by fastp) into 20 shuffled sequences and computes the mean similarity score for the shuffled sequences with the test sequence (in this case, orf-1 of l1oc). the similarity score for the match between the true sequences is compared with the mean score for the shuffled sequences in terms of the number of standard deviations that separate them.
conditions as in the southern blot analysis. the ratio of the percentage of plaques that hybridized to the percentage of the rabbit genome in one λ clone gives the approximate copy number of the region. the average size of an insert in this λ library is 17 kb (maniatis et al. 1978). thus, the fraction of the rabbit genome per phage is 17 × 10^3 / (3 × 10^9), or 5.7 × 10^-4%. the fact that 96% of the phage in the library have rabbit dna (maniatis et al. 1978) was also taken into account.
rodent and human l1 sequences. the mouse l1 sequence, l1mda2, and the rat l1 sequence, l1rn or line3 (d'ambrosio et al. 1986), are randomly isolated l1 members from their respective genomes. the human l1 sequence, l1hs-tbg41, is located 3.3 kb 3' to the human β-globin gene (hattori et al. 1985). a consensus l1hs sequence (scott et al. 1987) was used in the analysis of orf-1 in fig. 8.
the interspersion of repetitive sequences among the rabbit β-like globin genes is shown in fig. 1. the genes ε and γ (formerly β4 and β3) are expressed in embryonic development (rohrbaugh and hardison 1983), δ (ψβ2) is an inactive pseudogene (lacy and maniatis 1980), and β (β1) is expressed in fetal and adult life (hardison et al. 1979; rohrbaugh et al. 1985). the 5' to 3' orientations of the proposed rna intermediates of the repetitive elements are indicated by the arrows in fig. 1; the a-rich tracts are at the 3' ends.
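the copy-number estimate described in the methods above (percentage of hybridizing plaques divided by the percentage of the genome carried by one clone) can be sketched as follows. the function names are hypothetical, and the way the 96% figure enters the calculation is an assumption: here it reduces the number of plaques counted as carrying rabbit dna, which the text does not spell out.

```python
def genome_fraction_per_clone(insert_bp=17e3, genome_bp=3e9):
    """Fraction of the haploid genome carried by one average clone."""
    return insert_bp / genome_bp


def estimate_copy_number(hybridizing_plaques, total_plaques,
                         insert_bp=17e3, genome_bp=3e9,
                         fraction_with_insert=0.96):
    """Approximate copies per haploid genome of the probed region.

    Assumption (not stated explicitly in the text): only
    `fraction_with_insert` of the screened plaques carry rabbit DNA,
    so the hybridizing fraction is taken relative to those plaques.
    """
    plaques_with_rabbit_dna = total_plaques * fraction_with_insert
    fraction_hybridizing = hybridizing_plaques / plaques_with_rabbit_dna
    return fraction_hybridizing / genome_fraction_per_clone(insert_bp, genome_bp)


# Fraction of the genome per phage, as a percentage (about 5.7e-4 %).
print(genome_fraction_per_clone() * 100)
```

a clone that carries several copies of the repeat still counts as a single hybridizing plaque, so an estimate of this kind is best read as a lower bound when the repeat is abundant.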
the sequences of the five l1oc repeats are presented in fig. 2. l1oc5 is adjacent to l1oc4 (fig. 1), so the last nucleotide in the l1oc5 sequence is followed by the first nucleotide in the l1oc4 sequence (fig. 2) in the sequence of the gene cluster (margot et al. 1989). the longest member of the rabbit l1 family in the β-like globin gene cluster is l1oc5. the next longest member is l1oc4; it has an internal deletion of 667 bp (fig. 2). this is clearly a deletion from l1oc4 and not an insertion in l1oc5 because a similar sequence is present in both mouse and human l1s (demers et al. 1986). l1oc5 will be the prototypical rabbit l1 for further analysis because it is the longest and has no extensive internal deletions. the 5' end of l1oc5 is also the end of the cloned region of the rabbit β-like globin gene cluster (see fig. 1). (fig. 3 caption, in part: repeats identified by shen and maniatis (1980) are shown at the bottom of the diagram.) only two of the individual repeats, l1oc4 and l1oc5, contain sequences for the orf region (demers et al. 1986). the other three repeats contain part or all of the 3' untranslated region. l1oc5 and l1oc1 have internal direct repeats of 78 bp in the 3' untranslated region. one copy of the repeat is at positions 6015-6092 and the other is at positions 6212-6289 (lower case letters in fig. 2). l1oc4 and l1oc3 have only one copy of this 78-bp sequence, and they do not contain the sequence between the 78-bp direct repeats (present in l1oc5 and l1oc1). thus, the class of l1oc repeats containing one copy of the 78-bp sequence could be derived from the class containing two copies by a deletion between the two 78-bp sequences. another example of a sequence rearrangement is the apparent insertion of 34 bp into l1oc4 between positions 5701 and 5702 of l1oc5. most members of the l1oc family are flanked by short direct repeats. l1oc1 and l1oc2 are flanked by direct repeats of 9 bp and 5 bp, respectively (fig. 2).
the flanking direct repeats differ for the two individual l1 repeats, showing that they are not part of the l1 sequence. such flanking direct repeats are often generated by insertion of transposable elements, presumably by repair of a staggered break at the target site. the flanking direct repeats for l1oc4 and l1oc5 cannot be identified with the available data. the 5' end of l1oc5 has not been cloned. because l1oc5 is juxtaposed to l1oc4, it is possible that l1oc5 may have inserted into l1oc4, in which case the 5' end of l1oc4 is also not available. the only other l1 member, l1oc3, does not have obvious flanking direct repeats generated by a duplication of the target site. the sequence gttaaaaaaa found just 3' to the polyadenylation site (positions 6438-6447) is also found upstream from l1oc3 (margot et al. 1989). however, because the sequence gtt(a)7 (or a slight variation of it) is also found in all of the other l1 sequences just 3' to the polyadenylation signal, it is likely not to have been generated by a target site duplication around l1oc3. this terminal repetition could be generated by insertion of a circular form of l1 by homologous recombination into a gtt(a)7 sequence at the target site. the structural features revealed by the alignment and comparison of the l1 members from the rabbit β-like globin gene cluster are summarized in fig. 3. the b, e, and d repeats identified by shen and maniatis (1980) are also aligned with their position in the l1oc sequence. the d repeat is confined to the 3' untranslated region, whereas the b repeat and most of the e repeat are from the orf region. l1oc1 begins immediately after the conserved translation stop codon. figure 3 also illustrates the internal sequence rearrangements described above. the diagram of l1oc repeats in fig. 3 shows that they are truncated at a variable distance from the 5' end of the longest elements.
this truncation from the 5' ends is common in the whole population of l1 repeats, as demonstrated by using four regions of l1oc5 as probes against the rabbit genomic dna library in a plaque hybridization assay. by counting the number of plaques that hybridized to a given probe, the approximate copy number of each region of the l1oc5 repeat was determined (see materials and methods). as shown in fig. 4, the 5'-most region of l1oc5 is represented about 11,000 times in the haploid genome of the rabbit, and regions of l1 located more 3' are found more frequently. the largest increase in copy number is seen in the region from positions 4351 to 6004 that includes the 3' untranslated region; this region is represented at least 66,000 times. however, the relationship between the length of the repeat and the copy number is not linear; only a gradual decrease in copy number is observed as probes going from position 4350 to position 1 are used (fig. 4). therefore, many of the l1 repeats detected with the probe from the 5' end may be full length, indicating that up to 17% of the population of l1oc repeats could be full length. this difference in copy number at the 5' and 3' ends of l1oc repeats is also observed when uncloned genomic dna is hybridized with the different l1oc probes (data not shown). thus, the lower copy number at the 5' end is not a result of underrepresentation in the cloned genomic library. because the 5' end of l1oc5 is at the end of the cloned portion of the rabbit β-like globin gene cluster, it is likely that the nucleotide sequence obtained from l1oc5 is not that of a full-length l1 repeat. therefore, cloned subfragments of l1oc5 were used as probes against southern (1975) blots of rabbit genomic dna to determine the average structure of full-length rabbit l1 repeats. discrete genomic restriction fragments detected with l1oc5 probes were mapped by two strategies.
the portion of l1oc contained within the genomic restriction fragment was determined by which probes from l1oc5 hybridized to the fragment, and then the genomic restriction fragment was aligned with conserved restriction sites found in the cloned l1oc dna. this analysis is presented in detail in demers (1987), and the portion relevant to the 5' end of l1oc is summarized in fig. 5. the longest restriction fragment extending 5' to the cloned end of l1oc5 is the psti 4.0-kb fragment that ends 1 kb 5' to the cloned region of l1oc5 (fig. 5). the scai 2.1-kb, sphi 1.9-kb, and xmni 3.7-kb genomic fragments all have 5' ends between the conserved psti site located outside l1oc5 and the 5' end of l1oc5 (fig. 5). these data indicate that full-length l1oc repeats will extend at least 1 kb further 5' than the sequenced portion of l1oc5. several clones from the rabbit genomic dna library are currently being studied in order to determine the 5' end of l1oc repeats. the sequence of the rabbit l1 repeat was compared with the sequences of the mouse and human l1 repeats by dot-plots and by sequence alignments. the dot-plot analyses in fig. 6 show that the internal sequence of l1oc is very similar to both l1md (mouse) and l1hs (human) over very long segments, whereas the 5' and 3' ends are not conserved between species. the internal region of sequence similarity of about 4.5 kb is divided into two parts, a short region of similarity of about 300 bp followed by a very long segment of similarity. the long segments of internal similarity are in the portion of l1 that encodes open reading frames (orfs). the orfs found in the l1oc5 sequence are shown in fig. 7, along with a comparison of the orfs from l1md. the mouse l1mda2 sequence contains two orfs, one of 1137 nucleotides (top strand, n frame in fig. 7, bottom panel) and one of 3900 nucleotides (top strand, n + 1 frame in fig. 7), that overlap by 14 nucleotides.
seven open reading blocks are in the rabbit l1oc5 sequence in frames n, n + 1, and n + 2 (fig. 7, top panel). the bar between the stop codon maps of each species shows the regions of similarity (fig. 6) as filled boxes. it is apparent that the regions of l1 that are similar between species contain extensive orfs, although the orfs at the 5' end are not similar between species.
fig. 9. sequence similarities in the orf-1 region. the l1oc orf-1 region is shown as a black box, numbered according to the codon positions in fig. 8. the orf-1 regions from l1md and l1hs are displayed as composite boxes. the darkness of the fill in each box is proportional to the extent of similarity to the l1oc sequence. the percent identity of the encoded amino acids, compared to the l1oc sequence, is given in the boxes. a box representing a portion of the type ii cytoskeletal keratin sequence is aligned with the segment of the l1oc sequence that matches it. the percent of amino acids identical to the l1oc orf-1 translated sequence is given in the boxes, and the amino acid positions in the keratin sequence are listed below the boxes. a gap penalty of -1 was assessed in calculating the percent identities.
rabbit l1 repeats have only two major orfs. although the data in fig. 7 show that l1oc5 has several orfs, they are probably derived from longer reading frames in the ancestral l1 sequence. the long region of similarity corresponds to orf-2 and the short region of similarity corresponds to the 3' portion of orf-1. the two orfs are overlapping in l1md, and it is of interest to determine whether this feature is conserved in l1 repeats from other species. also, orf-1 appears to be a hybrid sequence because it is well conserved between species in the 3' half but it is not well conserved in the 5' half. therefore, the sequence of orf-1 and the region between the orfs were aligned for the l1 repeats from rabbit, mouse, rat, and humans.
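stop-codon maps of the kind referred to above rest on scanning each of the three forward frames for stretches that contain no stop codon. a minimal sketch follows, assuming the standard stop codons taa, tag, and tga; the function name open_reading_blocks, the min_codons threshold, and the 0-based half-open coordinates are choices made for this illustration, not details taken from the paper.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code


def open_reading_blocks(seq, min_codons=1):
    """Return (frame, start, end) for each stop-free block of at least
    `min_codons` codons in the three forward frames of `seq`.

    Coordinates are 0-based and half-open; `end` is the first base of
    the terminating stop codon, or the end of the last complete codon
    when the block runs to the end of the sequence.
    """
    seq = seq.upper()
    blocks = []
    for frame in range(3):
        block_start = frame
        pos = frame
        while pos + 3 <= len(seq):
            if seq[pos:pos + 3] in STOP_CODONS:
                if (pos - block_start) // 3 >= min_codons:
                    blocks.append((frame, block_start, pos))
                block_start = pos + 3  # next block begins after the stop
            pos += 3
        # Close the block that runs to the end of the sequence.
        if (pos - block_start) // 3 >= min_codons:
            blocks.append((frame, block_start, pos))
    return blocks
```

the blocks found this way are open reading blocks rather than orfs in the strict sense, because no start codon is required; assigning a start codon is a separate step.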
figure 8 shows both the aligned nucleotide sequences and the predicted amino acid sequences. sequences that match well between species are in reverse text, whereas sequences that do not match well are in plain text. inspection of the aligned l1 sequences allows a tentative identification of the start and stop sites of the orfs. this analysis reveals that no overlap between reading frames is seen in rabbit and human l1 repeats. the end of orf-1 in l1md is the taa at positions 1163-1165 (boldface in fig. 8). the same sequence is found in the rat l1 sequence (l1rn), and in-phase terminators are found nearby in l1oc and l1hs (boldface taas in fig. 8). orf-2 in l1md begins in a different reading frame at position 1149, and thus it overlaps with orf-1 for 14 nucleotides. by aligning the sequences of the different l1s in the well-conserved orf-2 region, it is apparent that an atg is conserved in the rabbit and human sequences at positions 1235-1237. an in-frame atg two codons upstream was previously identified as the start of orfb in the l1rn sequence (d'ambrosio et al. 1986), and an atg is also in frame in the l1md sequence seven codons upstream. one can propose that the taa close to position 1163 is the end of orf-1 and the atg at positions 1235-1237 is the start of orf-2 in rabbit and human l1 repeats. in an independent analysis of several individual l1hs repeats, these same codons were assigned as the end of orf-1 and the start of orf-2 in the consensus l1hs sequence (scott et al. 1987). as shown in fig. 8, orf-2 is in the same reading frame as orf-1 in the l1oc and l1hs sequences. thus, the overlap in reading frames seen for l1md is not observed in l1oc and l1hs. orf-2 in l1rn is in a different reading frame than orf-1, but the l1rn sequence does have an atg proposed as the start of orf-2. thus, l1rn has overlapping reading frames, but the sequence in the overlap may not be used to encode a protein.
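the frame relationships described above can be checked with simple modular arithmetic. the sketch below treats the fig. 8 positions as ungapped nucleotide coordinates, which is an assumption made only for illustration, since the figure is an alignment and individual sequences may contain gaps; the function and constant names are hypothetical.

```python
def same_reading_frame(pos_a, pos_b):
    """True when two nucleotide positions fall in the same codon phase."""
    return (pos_b - pos_a) % 3 == 0


def overlap_length(downstream_orf_start, upstream_stop_start):
    """Bases of the downstream orf that precede the upstream stop codon."""
    return max(0, upstream_stop_start - downstream_orf_start)


ORF1_STOP = 1163                # first base of the taa that ends orf-1
ORF2_START_MOUSE = 1149         # start of orf-2 in l1md
ORF2_START_RABBIT_HUMAN = 1235  # first base of the conserved atg

print(same_reading_frame(ORF1_STOP, ORF2_START_RABBIT_HUMAN))  # True
print(same_reading_frame(ORF1_STOP, ORF2_START_MOUSE))         # False
print(overlap_length(ORF2_START_MOUSE, ORF1_STOP))             # 14
```

under that assumption, the atg at position 1235 lies 72 bases (24 codons) downstream of the taa and therefore in the same phase, whereas position 1149 lies 14 bases upstream and in a different phase, consistent with the 14-nucleotide overlap reported for l1md.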
the region between orf-1 and orf-2 is not conserved between mammalian species. the sequence between the taa that ends orf-1 and the atg proposed to be the start of orf-2 is in a region that is quite dissimilar between rabbit and mouse and between rabbit and human (plain text region between positions 1121 and 1240 in fig. 8). this is the region of no similarity previously seen in dot-plots (fig. 6). the sequence between the l1 orfs is also not conserved in comparisons between the human and rodent sequences (scott et al. 1987). because this region is not conserved, whereas the sequences before and after it are conserved, probably for their capacity to encode a protein, it is unlikely that the inter-orf region encodes a protein. this lack of conservation supports the proposed assignments for the start of orf-2 in l1oc and l1hs. the mouse l1 sequence is ata at positions 1235-1237; this same sequence is found in three sequenced members of the l1md family (shehee et al. 1987). therefore, the overlap between reading frames 1 and 2 is conserved in mouse l1s, but the overlap is not seen in the rabbit and human l1 sequences. the orf-1 sequence is a composite of conserved and nonconserved regions. as shown diagrammatically in fig. 9, codons 79-294 are highly related between species in different mammalian orders, and a long segment from codons 171 through 294 shows a 52-56% amino acid identity in these comparisons. a short region from codons 97 to 122 is not conserved, nor are the last 14 codons in the sequence, but in general the c-terminal two-thirds of orf-1 is conserved between orders. a search through the databanks at the protein identification resource (national biomedical research foundation) did not identify any known proteins (besides the l1 proteins) that are related to the c-terminal half of the orf-1 sequence.
(lipman and pearson 1985) is shown starting at amino acid position 1 of orf-1 from l1oc5 (fig.
8) and position 303 of the sequence of type ii cytoskeletal keratin of humans (johnson et al. 1985). the orf-1 sequence of rabbit l1 is labeled l1, and the type ii keratin sequence is labeled kii. identical amino acids are indicated by colons, and similar amino acids are indicated by periods. the following groups of amino acids are considered similar: p, a, g, s, and t (neutral or weakly hydrophobic); q, n, e, and d (acids and amides); h, k, and r (basic); l, i, v, and m (hydrophobic); f, y, and w (aromatic); and c.
in contrast, the n-terminal portion of orf-1 is not highly conserved between mammalian orders. this region shows almost no similarity between rabbit and human (sequence between nucleotide positions 3 and 476 in fig. 8; fig. 9), and the comparison between rabbit and mouse shows only a short segment of matching sequence at the 5' end (figs. 8 and 9). the dissimilarity of the sequences makes it difficult to assign a start point to orf-1. however, an atg is found in the rabbit, mouse, and rat sequences at positions 240-242 of fig. 8 (shown in boldface). an atg is found three codons downstream in the human l1 sequence. other atg codons are either immediately adjacent (mouse and rat) or are 20 codons upstream (rabbit, underlined in fig. 8). the atg at positions 240-242 has been tentatively assigned as the start of orf-1, and the codons in fig. 8 are numbered starting here. this is 71 codons into orf-1 as defined by loeb et al. (1986). although the n-terminal half of orf-1 differs among rabbits, mouse, and humans, it is similar between the two rodents, mouse and rat. this region surrounds a 66-bp tandemly repeated sequence in l1rn (soares et al. 1985; d'ambrosio et al. 1986) and contains several in-frame stop codons in l1rn (fig. 8). it is possible that the coding function of this region has been lost in l1rn. the n-terminal half of orf-1 from the rabbit l1 sequence is related to type ii cytoskeletal keratin.
protein sequence databanks were searched using the fastp program (lipman and pearson 1985), and a significant match was found with type ii cytoskeletal keratin. the region of l1oc orf-1 that matches with keratin, along with the percent amino acid identity, is shown in fig. 9, and the alignment with the human 67-kda type ii keratin is shown in fig. 10. the sequences align over a 156-amino acid region, with an average of 20.5% identity. the segment between amino acid positions 95 and 126 of l1oc orf-1 is most similar to type ii keratin; this segment contains identical amino acids at 32% of the positions. the similarity between the n-terminal half of orf-1 from l1oc and type ii cytoskeletal keratin is statistically significant. the sequence of the type ii keratin was scrambled into 20 different sequences and aligned with the orf-1 sequence to generate an average match score. the match score with the true keratin sequence is 13 standard deviations above the average match score with the scrambled sequences; a difference of 10 standard deviations in this test is an indicator of a significant evolutionary relationship (lipman and pearson 1985). although statistical significance does not establish biological significance, it is helpful to compare this match with that of a part of orf-2 with reverse transcriptases, whose similarity has been cited as significant in the past (hattori et al. 1986; loeb et al. 1986). the alignment between the l1md orf-2 sequence and the sequence of reverse transcriptase from moloney murine leukemia virus shows 17.5% amino acid identity, whereas the alignment between l1oc orf-1 and type ii keratin shows 20.5% identity. it is apparent that orf-1 of the rabbit l1 contains a region related in sequence to type ii cytoskeletal keratin. the propagation of l1 repeats probably has occurred independently in different mammalian genomes.
although the l1 repeats from lagomorphs, rodents, and primates are similar in size and sequence organization, the 5' and 3' ends are distinctive (summarized in fig. 11). also, the l1 repeats are located in different positions in orthologous regions of chromosomes, specifically the β-like globin gene cluster of rabbits and humans (margot et al. 1989) and mice (shehee et al. 1989). because the contemporary β-like globin gene clusters are descended from a preexisting gene cluster in the last common ancestor, the presence of l1 repeats at different positions in different species indicates that the l1 repeats have integrated independently into these gene clusters (and probably the whole genome) in each species. it is noteworthy, therefore, that the structure of the population of l1 repeats is quite similar in several mammals. most members of the l1 repeat family in rabbits (this paper), mouse (voliva et al. 1983), and monkeys (grimaldi et al. 1984) are truncated from the 5' end, resulting in a higher frequency in the genome of the 3' end of l1 (about 50,000 copies) than the 5' end (about 10,000 copies). this similarity in copy number suggests that the time of onset and the rate of propagation of l1 repeats are similar in the different species. the rabbit, mouse, and monkey l1 repeats also show a similar pattern for the increase in copy number in which the 5' regions increase gradually in copy number before a large increase in copy number at the very 3' end. this very large increase in copy number in the 3' region could indicate a strong stop for reverse transcriptase during the conversion of the l1 transcript to a dna copy. given this frequency of polar truncations of l1 in rabbits, humans, and mice, it is striking that most of the l1 repeats in rats are full length (d'ambrosio et al. 1986).
some aspect of the mechanism for synthesis and propagation of the l1s is apparently different in rats, e.g., to allow more full length reverse transcripts or to select for these in the integration process. full length l1 transcripts have been observed in teratocarcinoma cells (skowronski and singer 1985). given the assignments of start and stop codons proposed in this paper, transcripts of the l1 repeat of rabbits and humans have the characteristics of a dicistronic rna. polycistronic mrnas are common in bacteria, and a polycistronic arrangement of genes is found in the genomes of some rna viruses that infect animals and plants, e.g., togaviruses, coronaviruses, and tobacco mosaic virus. in contrast, most mrnas from eukaryotic cellular genes are monocistronic. regardless of whether the orfs are overlapping, as in l1md, or are part of a dicistronic rna, as in l1oc and l1hs, the structure of the l1 repeats resembles dna copies of viral genomes more than conventional cellular transcription units. this suggests that the ancestor to l1 repeats in fact may be some type of animal virus rather than a normal cellular gene, as is often proposed (reviewed in weiner et al. 1986). a viral ancestor with a wide host range would provide an explanation for the independent, and perhaps simultaneous, entry of the l1 element into different mammalian genomes. the orfs in the l1 repeat appear to encode hybrids of different types of proteins (fig. 11). orf-1 can be divided into two parts, the n-terminal portion that is not well conserved between species and the c-terminal portion that is well conserved. in the rabbit l1 repeat, a sequence similar to keratin has been fused to the conserved c-terminal portion of orf-1. although orf-2 is conserved in l1s from different orders of mammals, it also seems to be a hybrid of sequences related to several proteins (fig. 11). the middle portion of orf-2 is related to reverse transcriptase (hattori et al. 1986; loeb et al. 1986).
different parts of the c-terminal region are related to transferrin (hattori et al. 1986) and to nucleic acid binding proteins with the cysteine structural motif, such as the binding proteins derived from retroviral gag genes (fanning and singer 1987). the cysteine structural motif is related to the zinc fingers characterized in tfiiia and other nucleic acid binding proteins (fanning and singer 1987). this pastiche of similarities suggests that the l1 element is a fusion of several different sequences, some of which are derived from cellular genes, possibly by a viral vector. another fusion event may account for the variation in sizes and sequences of the 3' untranslated regions of l1 repeats in different mammals. the 3' untranslated regions of orthologous globin genes in mammals have retained obvious sequence similarities over the course of eutherian evolution (e.g., hardies et al. 1984; hardison 1984), so it is puzzling that no sequence similarity is seen in the 3' untranslated region of l1 repeats in comparisons between mammals (fig. 11). perhaps the conserved coding region was fused to a different 3' untranslated sequence in each species. it is noteworthy that the 5' end of l1ocl begins immediately after the conserved termination codon that ends orf-2, suggesting that the sequence corresponding to the 3' untranslated region of l1oc may exist as a distinct repetitive element in the rabbit genome in addition to its presence in the l1 sequence. if so, this would be an additional factor in explaining the large increase in copy number of l1 repeats in this region. a similar situation has been observed in drosophila melanogaster, in which suffix, an element repeated about 300 times in the genome, is almost identical to the sequence of the 3' untranslated region (but not the coding region) of the f element that is present about 70 times in the genome (dinocera and casari 1987).
the mammalian l1 repeats show a clear similarity to the ingi repeat in the protozoan trypanosoma brucei (kimmel et al. 1987), the i factor of the i-r system of hybrid dysgenesis in d. melanogaster (fawcett et al. 1986), f elements in d. melanogaster (dinocera and casari 1987), and the r1bm (xiong and eickbush 1988) and r2bm (burke et al. 1987) insertion sequences in some rrna genes of bombyx mori (fig. 11). the similarity has been recognized only in the region proposed to encode reverse transcriptase, and these sequences are more similar among themselves than to retroviral reverse transcriptases (dinocera and casari 1987; xiong and eickbush 1988). the mammalian l1s and these protozoan and insect repeats share other structural features, such as the absence of long terminal repeats, the presence of at least two orfs (orf-2 containing sequences similar to reverse transcriptase and either orf-1 or orf-2 encoding a cysteine motif), a length from 5 to 7.5 kb, and a 3' untranslated region with a sequence similar to aataaa close to the 3' end. the dicistronic structure proposed for l1oc and l1hs may also be present in the i factor, the f element, and the r1bm repeat (fawcett et al. 1986; dinocera and casari 1987; xiong and eickbush 1988). each type of repeated element also has some distinctive features, e.g., the specific insertion sites for r1bm and r2bm in the rrna genes and the absence of a-rich tracts at the 3' ends of some of the insect repeats. however, at least parts of these repeats in mammals, insects, and a parasitic protozoan appear to be evolutionarily related. if this type of repeat is restricted to these groups of organisms, it may indicate that the genetic information was transferred among parasites, their mammalian hosts, and insect vectors (kimmel et al. 1987). a viral progenitor, suggested by the dicistronic arrangement shown in this paper, would provide a means for the horizontal transmission of the l1 sequences.
retroviruses and retrotransposons: the role of reverse transcription in shaping the eukaryotic genome
screening λgt recombinant clones by hybridization to single plaques in situ
organization and evolutionary progress of a dispersed repetitive family of sequences in widely separated rodent genomes
the site-specific ribosomal insertion element type ii of bombyx mori (r2bm) contains the coding sequence for a reverse transcriptase-like enzyme
conservation throughout mammalia and extensive protein encoding capacity of the highly repeated dna l1
genomic sequencing
structure of the highly repeated, long interspersed dna family (line or l1rn) of the rat
long interspersed l1 repeats in rabbit dna are homologous to l1 repeats of rodents and primates in an open-reading-frame region
related polypeptides are encoded by drosophila f elements, i factors, and mammalian l1 repeats
insertion of long interspersed repeated elements at the igh (immunoglobulin heavy chain) and mlvi-2 (moloney leukemia virus integration 2) loci of rats
characterization of a highly repetitive family of dna sequences in the mouse
the line-1 dna sequences in four mammalian orders predict proteins that conserve homologies to retrovirus proteins
transposable elements controlling i-r hybrid dysgenesis in d. melanogaster are similar to mammalian lines
defining the beginning and end of kpni family segments
evolution of the mammalian β-globin gene cluster
comparison of the β-like globin gene families of rabbits and humans indicates that the gene cluster 5'-ε-γ-δ-β-3' predates the mammalian radiation
efstratiadis a (1979) the structure and transcription of four linked rabbit β-like globin genes
sequence analysis of a kpni family member near the 3' end of human β-globin gene
l1 family of repetitive dna sequences in primates may be derived from a sequence encoding a reverse transcriptase-related protein
structure of a gene for the human epidermal 67-kda keratin
"retroposon" insertion into the cellular oncogene c-myc in canine transmissible venereal tumor
ingi, a 5.2-kb dispersed sequence element from trypanosoma brucei that carries half of a smaller mobile element at either end and has homology with mammalian lines
human genome organization: alu, lines, and the molecular structure of metaphase chromosome bands
the nucleotide sequence of a rabbit β-globin pseudogene
linkage arrangement of four rabbit β-like globin genes
kpni family of long interspersed repeated dna sequences in primates: polymorphism of family members and evidence for transcription
rapid and sensitive protein similarity searches
the sequence of a large l1md element reveals a tandemly repeated 5' end and several features found in retrotransposons
the isolation of structural genes from libraries of eucaryotic dna
nucleotide sequence definition of a major human repeated dna, the hindiii 1.9 kb family
complete nucleotide sequence of the rabbit β-like globin gene cluster: analysis of intergenic sequences and comparison with human β-like globin gene cluster
a large interspersed repeat found in mouse dna contains a long open reading frame that evolves as if it encodes a protein
new m13 vectors for cloning
a transposon-like element in human dna
rearranged sequence of a human kpni element
labeling deoxyribonucleic acid to high specific activity in vitro by nick translation with dna polymerase i
retroposons defined
analysis of rabbit β-like globin gene transcripts during development
transcription unit of rabbit β1-globin gene
dna sequencing with chain-terminating inhibitors
origin of the human l1 elements: proposed progenitor genes deduced from a consensus dna sequence
determination of a functional ancestral sequence and definition of the 5' end of a-type mouse l1 elements
the nucleotide sequence of the balb/c mouse β-globin complex
the organization of repetitive sequences in a cluster of rabbit β-like globin genes
sines and lines: highly repeated short and long interspersed sequences in mammalian genomes
making sense out of lines: long interspersed repeat sequences in mammalian genomes
expression of a cytoplasmic line-1 transcript is regulated in a human teratocarcinoma cell line
rat line 1: the origin and evolution of a family of long interspersed middle repetitive dna elements
detection of specific sequences among dna fragments separated by gel electrophoresis
the l1md long interspersed repeat family in the mouse: almost all examples are truncated at one end
roposons: genes, pseudogenes, and transposable elements generated by the reverse flow of genetic information
rapid similarity searches of nucleic acid and protein data banks
the site-specific ribosomal dna insertion element r1bm belongs to a class of non-long-terminal-repeat retrotransposons
analysis of large nucleic acid dot matrices on small computers
acknowledgments.
we key: cord-025948-6dsx7pey authors: maitra, arindam; sarkar, mamta chawla; raheja, harsha; biswas, nidhan k; chakraborti, sohini; singh, animesh kumar; ghosh, shekhar; sarkar, sumanta; patra, subrata; mondal, rajiv kumar; ghosh, trinath; chatterjee, ananya; banu, hasina; majumdar, agniva; chinnaswamy, sreedhar; srinivasan, narayanaswamy; dutta, shanta; das, saumitra title: mutations in sars-cov-2 viral rna identified in eastern india: possible implications for the ongoing outbreak in india and impact on viral structure and host susceptibility date: 2020-06-04 journal: j biosci doi: 10.1007/s12038-020-00046-1 sha: doc_id: 25948 cord_uid: 6dsx7pey direct massively parallel sequencing of sars-cov-2 genome was undertaken from nasopharyngeal and oropharyngeal swab samples of infected individuals in eastern india. seven of the isolates belonged to the a2a clade, while one belonged to the b4 clade. specific mutations, characteristic of the a2a clade, were also detected, which included the p323l in rna-dependent rna polymerase and d614g in the spike glycoprotein. further, our data revealed emergence of novel subclones harbouring nonsynonymous mutations, viz. g1124v in spike (s) protein, r203k, and g204r in the nucleocapsid (n) protein. the n protein mutations reside in the sr-rich region involved in viral capsid formation and the s protein mutation is in the s(2) domain, which is involved in triggering viral fusion with the host cell membrane. interesting correlation was observed between these mutations and travel or contact history of covid-19 positive cases. consequent alterations of mirna binding and structure were also predicted for these mutations. more importantly, the possible implications of mutation d614g (in s(d) domain) and g1124v (in s(2) subunit) on the structural stability of s protein have also been discussed. results report for the first time a bird’s eye view on the accumulation of mutations in sars-cov-2 genome in eastern india. 
electronic supplementary material: the online version of this article (10.1007/s12038-020-00046-1) contains supplementary material, which is available to authorized users. sars-cov-2 is the causative agent of the current pandemic of novel coronavirus disease, which has infected millions of people and is responsible for more than 200,000 deaths worldwide in a span of just 4 months. the virus has a positive-sense, single-stranded rna genome, which is around 30 kb in length. the genome codes for four structural and multiple non-structural proteins (astuti and ysrafil 2020). while the structural proteins form the capsid and envelope of the virus, the non-structural proteins are involved in various steps of the viral life cycle such as replication, translation, packaging and release (lai and cavanagh 1997). although at a slower rate, mutations are emerging in the sars-cov-2 genome which might modulate viral transmission, replication efficiency and virulence in different regions of the world (jia et al. 2020; pachetti et al. 2020). the genome sequence data has revealed that sars-cov-2 is a member of the genus betacoronavirus and belongs to the subgenus sarbecovirus, which includes sars-cov, while mers-cov belongs to a separate subgenus, merbecovirus (lu et al. 2020; wu et al. 2020; zhu et al. 2020). sars-cov-2 is approximately 79% similar to sars-cov at the nucleotide sequence level. epidemiological data suggests that sars-cov-2 spread widely from the city of wuhan in china (chinazzi et al. 2020) after its zoonotic transmission originating from bats via malayan pangolins. global sequence and epidemiological data reveals that since its emergence, sars-cov-2 has spread rapidly to all parts of the globe, facilitated by its ability to use the human ace2 receptor for cellular entry (hoffmann et al. 2020). the accumulating mutations in the sars-cov-2 genome have resulted in the evolution of 11 clades, out of which the ancestral clade o originated in wuhan.
since the first report of sequence of sars-cov-2 from india, there have been multiple sequence submissions in global initiative on sharing all influenza data (gisaid, https://www.gisaid.org/). extensive sequencing of the viral genome from different regions in india is required urgently. this will provide information on the prevalence of various viral clades and any regional differences therein, which might lead to improved understanding of the transmission patterns, tracking of the outbreak and formulation of effective containment measures. the mutation data might provide important clues for development of efficient vaccines, antiviral drugs and diagnostic assays. we have initiated a study on sequencing of sars-cov-2 genome from swab samples obtained from infected individuals from different regions of west bengal in eastern india and report here the first nine sequences and the results of analysis of the sequence data with respect to other sequences reported from the country until date. we have detected unique mutations in the rna-dependent rna polymerase (rdrp), spike (s) and nucleocapsid (n) coding viral genes. it appears that the mutation in nucleocapsid gene might lead to alterations in local structure of the n protein. also the putative sites of mirna binding could be affected, which might have major consequences. the possible implications of the mutations have been discussed, which will provide important insights for functional validation to understand the molecular basis of differential disease severity. the regional virus research & diagnostic laboratory (vrdl) in indian council of medical research-national institute of cholera and enteric diseases (icmr-niced) is a government-designated laboratory for providing laboratory diagnosis for sars-cov-2 (covid19) in eastern india. 
nasopharyngeal and oropharyngeal swabs in viral transport media (vtm) (himedia labs, india), collected from suspect cases with acute respiratory symptoms, travel history to affected countries, or contact with confirmed covid-19 cases, were referred to the laboratory for diagnosis. the test reports were provided to the health authorities for initiating treatment and quarantine measures. residual deidentified samples positive for sars-cov-2 were used for rna isolation and sequencing in accordance with the ethics guidelines of the govt. of india. extraction of viral rna from the clinical sample (200 µl) was performed using the qiaamp viral rna mini kit as per the manufacturer's protocol (qiagen, germany). the extracted rna was tested for sars-cov-2 (covid-19) by real time reverse transcription pcr (qrt-pcr) (abi 7500, applied biosystems, usa) using the protocol provided by niv-pune, india (https://www.icmr.gov.in/pdf/covid/labs/1_sop_for_first_line_screening_assay_for_2019_ncov.pdf; https://www.icmr.gov.in/pdf/covid/labs/2_sop_for_confirmatory_assay_for_2019_ncov.pdf). briefly, first line screening was done for the envelope (e) gene and rnase p (internal control). clinical samples positive for the e gene (ct ≤ 35.0) were subjected to a confirmatory test with primers specific for rdrp and hku orf (hku-orf1-nsp14). positive control and no template control were run for all genes. a specimen was considered confirmed positive for sars-cov-2 if the reaction growth curves crossed the threshold line within 35 cycles (ct cut-off ≤ 35.0) for the e gene and for both rdrp and orf, or for the e gene and either rdrp or orf. rna isolated from nasopharyngeal and oropharyngeal swabs was depleted of ribosomal rna using the ribo-zero rrna removal kit (illumina, usa). the residual rna was then converted to double stranded cdna and sequencing libraries were prepared using the truseq stranded total rna library preparation kit (illumina inc, usa) according to the manufacturer's instructions.
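the confirmation rule described above (e gene within the ct cut-off, plus at least one of the two confirmatory targets) can be written as a small decision function. this is only an illustrative sketch: the function and parameter names are hypothetical, and a target that was not tested or never crossed the threshold is represented here as `None`, which is an assumption rather than something the protocol states.

```python
CT_CUTOFF = 35.0  # cut-off stated in the text


def is_confirmed_positive(ct_e, ct_rdrp=None, ct_orf=None, cutoff=CT_CUTOFF):
    """Apply the two-step rule: E gene screening, then RdRp/ORF confirmation.

    Each argument is a Ct value, or None when the curve never crossed
    the threshold (hypothetical convention for this sketch).
    """
    def crossed(ct):
        return ct is not None and ct <= cutoff

    # E gene must be positive, and at least one confirmatory target as well.
    return crossed(ct_e) and (crossed(ct_rdrp) or crossed(ct_orf))


# Made-up example values, not data from the study.
print(is_confirmed_positive(28.1, ct_rdrp=30.4))            # E and RdRp positive
print(is_confirmed_positive(36.2, ct_rdrp=30.0, ct_orf=31)) # E gene above cut-off
```

the rule that both confirmatory targets being positive also counts is covered automatically, because the check only requires at least one of them.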
the sequencing libraries were checked using high sensitivity d1000 screentape in the 2200 tapestation system (agilent technologies, usa) and quantified by real time pcr using a library quantitation kit (kapa biosystems, usa). the libraries were sequenced using miseq reagent kit v3 in the miseq system (illumina inc, usa) to generate 2x100 bp paired end sequencing reads. for viral genome amplification in samples which did not generate sufficient viral reads, the rna samples were converted to double stranded cdna and amplified using the qiaseq sars-cov-2 primer panel (qiagen gmbh, germany) according to the manufacturer's instructions. the multiplexed amplicon pools were then converted to sequencing libraries by enzymatic fragmentation, end repair and ligation to adapters. the sequencing libraries were checked and quantified as above and sequenced using miseq reagent kit v2 nano in the miseq system (illumina inc, usa) to generate 2x150 bp paired end sequencing reads. the sequencing reads obtained in the shotgun rna-seq experiment were mapped to the reference viral sequence, variants were detected and a consensus sequence for each sample was built using dragen rna pathogen detection software (version 9) in basespace (illumina inc, usa). for amplified whole genome sequencing, the viral sequences were assembled using clc genomics workbench v20.0.3 (qiagen gmbh, germany). in both cases, the severe acute respiratory syndrome coronavirus 2 isolate wuhan-hu-1 genome (accession nc_045512.2) was used as the reference sequence. each variant call generated in either pipeline was manually verified in the integrative genomics viewer igv v7.8.2 (robinson et al. 2017). clustal omega was used to display the mutations in the context of the sequence alignments. bioedit software (v7.2) was used to extract the cds from the consensus sequence and to check codon usage. nucleotide to amino acid conversion was done in the emboss transeq online tool (madeira et al. 2019).
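the nucleotide to amino acid conversion step can be illustrated with a minimal translation routine using the standard genetic code. this is a sketch of the general technique, not the emboss transeq implementation; the function name and the choice to emit "X" for codons containing ambiguous bases are assumptions made for the example.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(cds):
    """Translate a coding sequence in frame 1; '*' marks a stop codon."""
    seq = cds.upper().replace("U", "T")
    usable = len(seq) - len(seq) % 3  # ignore a trailing partial codon
    # Codons with ambiguous bases (e.g. N) are reported as 'X'.
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, usable, 3))


# A single-base change in the second codon position turns Asp into Gly.
print(translate("GAT"), "->", translate("GGT"))
```

comparing the translation of a reference codon with that of the mutated codon, as in the usage line, is the simplest way to decide whether a nucleotide change is synonymous or nonsynonymous.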
to generate the clustering patterns of the viral sequences from west bengal, a subset of representative virus sequence data (n = 310) was downloaded from the gisaid global database (supplementary table 1). only complete genomes (entries longer than 29,000 base pairs) with high coverage (entries with less than 1% n's and less than 0.05% amino acid mutations) were used in the analysis; low coverage entries (more than 5% n's) were excluded. all of the sequences were aligned using mafft (multiple alignment using fast fourier transform). we used the nextstrain pipeline to process the sequence data. nextstrain with the augur pipeline was used to build the phylogenetic tree based on the iq-tree method, which is a fast and effective stochastic algorithm to infer phylogenetic trees by maximum likelihood. the sequences 'wuhan-hu-1/2019' and 'wuhan/wh01/2019' were used to root the tree. the tree was refined using raxml (randomized axelerated maximum likelihood). a web-based visualization program, auspice, was then used to present and interact with the phylogenetic and phylogeographic data. we investigated the potential mirna binding sites in the region coding for n protein that was found to be mutated in our samples. starmir (http://sfold.wadsworth.org/cgibin/starmirtest2.pl) software was used for this purpose. the whole human mature mirna library was obtained from the mirbase database. the query sequence was taken from 50 nt upstream to 50 nt downstream of the site of mutation. the mirnas which bind to the mutation site through their seed sequence were shortlisted. a change in bases can prevent the binding of certain mirnas and support the binding of others. therefore, mirna binding was checked for both the original and the mutated site. we checked the levels of these mirnas in cancer conditions around the upper respiratory tract in the dbdemc2 database (https://www.picb.ac.cn/dbdemc/).
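the seed-based shortlisting described above can be sketched as follows: take a window of 50 nt on either side of the mutated bases, look for sites complementary to the mirna seed, and keep only mirnas whose seed site overlaps the mutation. this is a simplified illustration and not the starmir algorithm, which also models target structure and binding energy. the seed is assumed here to be mirna nucleotides 2 to 8 with perfect watson-crick pairing, and all function names and sequences are made up for the example.

```python
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def reverse_complement(rna):
    return "".join(COMPLEMENT[base] for base in reversed(rna))


def flank_window(genome, start, end, flank=50):
    """Window around mutated bases start..end (1-based, inclusive).

    Returns the window and the 0-based genome index where it begins.
    """
    lo = max(0, start - 1 - flank)
    hi = min(len(genome), end + flank)
    return genome[lo:hi], lo


def seed_sites(target, mirna):
    """0-based start positions in target that pair with miRNA nt 2-8."""
    site = reverse_complement(mirna[1:8])
    hits, i = [], target.find(site)
    while i != -1:
        hits.append(i)
        i = target.find(site, i + 1)
    return hits


def seed_hits_mutation(genome, mirna, start, end, flank=50):
    """True if any 7-nt seed site in the window overlaps bases start..end."""
    window, offset = flank_window(genome, start, end, flank)
    for i in seed_sites(window, mirna):
        site_lo = offset + i + 1   # 1-based first base of the site
        site_hi = site_lo + 6      # 1-based last base of the site
        if site_lo <= end and site_hi >= start:
            return True
    return False


# Toy sequences: the seed site CUACCUC covers positions 61-67 of the target.
toy_mirna = "UGAGGUAGUAGGUUGUAUAGUU"
original = "A" * 60 + "CUACCUC" + "A" * 60
mutated = original[:62] + "GGU" + original[65:]   # change bases 63-65
print(seed_hits_mutation(original, toy_mirna, 63, 65))
print(seed_hits_mutation(mutated, toy_mirna, 63, 65))
```

running the same check on the original and the mutated sequence, as in the last two lines, mirrors the idea that a base change can abolish one seed match while possibly creating another.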
the tissueatlas database (https://ccbweb.cs.uni-saarland.de/tissueatlas/) was used to analyse the presence and correlation of mirnas in body fluids. all patients were diagnosed positive for sars-cov-2 rna by real time pcr as described above. five of the patients suffered from fever, while seven patients exhibited some symptoms of infection like sore throat, cough with sputum, running nose or breathlessness. one patient suffered from acute respiratory distress syndrome (ards). two patients did not exhibit any symptom (table 1) . five individuals had contact with covid-19 patients in particular; both s2 and s3 had contact with the same patient (table 1). one individual had history of international travel while another had history of domestic travel. the shotgun rna-seq data resulted in high coverage (greater than 100x median depth of coverage) of complete genome sequences of the sars-cov-2 in five samples (s2, s3, s5, s6 and s11) in which greater than 96% of the viral genome was covered at greater than 5x and greater than 99% of the viral genome was covered at greater than 1x. a negative correlation was found between viral load (represented by the threshold cycle or ct value of the rna samples in the real time pcr based diagnostic assay) and the number of reads mapped to the viral genome in the rna-seq library. even with 9 samples, the pearson correlation coefficient was found to be -0.63 (p value = 0.036) (table 1). in particular, it was observed that samples with ct values greater than 25 mostly resulted in generation of low counts of viral sequence reads leading to less than 15x median depth of coverage of the viral genome. in the remaining four samples (s1, s8, s10 and s12), the median depth of coverage was less than 15x and hence the viral genome sequencing was achieved after amplification of the viral genome by a multiplex pcr approach. all the nine sequences have been submitted in the global initiative on sharing all influenza data (gisaid) database. 
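the pearson correlation between ct value and mapped read count reported above can be computed as in the sketch below, which also derives the usual t statistic for a correlation with n - 2 degrees of freedom. the function names are hypothetical and the input lists in the usage lines are made-up numbers, not the study data (the per-sample values are in table 1 of the paper).

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def t_statistic(r, n):
    """t statistic for testing r against zero, with n - 2 degrees of freedom."""
    return r * math.sqrt((n - 2) / (1 - r * r))


# Made-up illustration: higher Ct (lower viral load) paired with fewer reads.
ct_values = [18.0, 21.5, 24.0, 27.5, 31.0]
viral_reads = [90000, 60000, 42000, 9000, 1500]
r = pearson_r(ct_values, viral_reads)
print(round(r, 3), round(t_statistic(r, len(ct_values)), 3))
```

a negative coefficient here corresponds to the pattern described in the text, where samples with higher ct values yield fewer viral reads.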
phylogenetic tree analysis of the sequences, along with other complete viral genome sequences submitted from india in gisaid, revealed that seven of these sequences belonged to the a2a clade while only one sequence belonged to clade b4 (figure 1 and table 1). we were unable to classify one of the nine sequences, s1, into any clade due to low sequence coverage. to understand the transmission histories of these nine sars-cov-2 isolates from west bengal, we aligned these sequences with more than 6000 global sequences, including thirty sequences submitted in gisaid from india (at the time of our analysis), to identify specific mutations that occur at the highest level of the tip in a branch leading to the specific subtype. the predicted origin of the transmitted subtype in each case was identified with 98-100% confidence from the branch in which our samples were located in the phylogenetic tree (table 1). the list of mutations detected in the sequences from the nine samples is provided in table 2. seven sequences harboured the important signature mutations of the a2a clade. these consisted of the 14408 c/t mutation resulting in a change of p323l in the rdrp and the 23403 a/g mutation resulting in a change of d614g in the spike glycoprotein of the virus. in addition to these, the 24933 g/t mutation in the gene coding for the spike glycoprotein (g1124v) was detected in s2 and s3, and the triple base mutation 28881-28883 ggg/aac in the gene coding for the nucleocapsid, resulting in two consecutive amino acid changes r203k and g204r, was detected in s2, s3 and s5. while the 24933 g/t s gene mutation was unique to these samples and could not be found in any other sequence from india or the rest of the world, the nucleocapsid mutations could be detected in only three other sequences from india (figure 2). out of these, two sequences were obtained from individuals with contact history of a covid-19 patient who had travelled from italy.
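the link between a genomic coordinate and the amino acid change it produces can be checked with simple arithmetic: subtract the start of the coding sequence, divide by three for the residue number, and take the remainder for the position within the codon. the sketch below does this for the spike and nucleocapsid mutations named above (the nucleocapsid bases being 28881 to 28883). the gene start coordinates are an assumption taken from the public annotation of reference nc_045512.2 and should be verified against that record; the helper names are hypothetical. rdrp is left out because its numbering depends on the ribosomal frameshift in orf1ab.

```python
# Assumed 1-based CDS start positions in NC_045512.2 (verify against the record).
CDS_START = {"S": 21563, "N": 28274}


def residue_number(gene, genome_pos):
    """1-based amino acid number of the codon containing genome_pos."""
    offset = genome_pos - CDS_START[gene]
    if offset < 0:
        raise ValueError("position lies upstream of the coding sequence")
    return offset // 3 + 1


def codon_position(gene, genome_pos):
    """Position of the base within its codon (1, 2 or 3)."""
    return (genome_pos - CDS_START[gene]) % 3 + 1


for gene, pos in [("S", 23403), ("S", 24933), ("N", 28881), ("N", 28882), ("N", 28883)]:
    print(gene, pos, residue_number(gene, pos), codon_position(gene, pos))
```

under these assumed coordinates, 23403 and 24933 fall in spike codons 614 and 1124, and 28881 to 28883 span nucleocapsid codons 203 and 204, which agrees with the d614g, g1124v, r203k and g204r assignments in the text.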
interestingly, two out of the three sequences harbouring these mutations obtained by us were from kolkata, from individuals with contact history with one covid-19 patient who had travelled from london (uk). the third sequence was obtained from a covid-19 patient from darjeeling, india who had a history of travel from chennai, india. these mutations have been found in 16% of sars-cov-2 sequences reported world-wide from countries like uk, netherlands, iceland, belgium, portugal, usa, australia, brazil, etc. the rdrp (nsp12) gene of sars-cov-2 codes for the rna-dependent rna polymerase and is vital for the replication machinery of the virus. we detected a total of six mutations in this gene in the nine samples, out of which four were nonsynonymous, including the a2a clade specific 14408 c/t (rdrp: p323l) mutation. two individuals, s11 and s12, harboured viral genome sequences that shared a unique 13730 c/t (a88v) mutation which was not found in any other sequence reported from india or the rest of the world. one individual, s10, whose viral sequence belonged to the b4 clade, harboured 3 mutations in rdrp, which appear to be clade specific, out of which 2 were nonsynonymous. to study the functional relevance of the mutations, we investigated the alteration in mirna binding in the nucleocapsid coding region, predicted to be caused by the 28881-3 ggg/aac mutations. we found seven mirnas which bind to the original sequence and three which bind the mutated sequence exclusively (table 3 and figure 3). the number of bases in the sequence (ggg/aac) which bind the seed sequence of each mirna was also identified. the strength of the mirna binding prediction is reflected by the ΔG value given in figure 3: the lower the value, the stronger the binding. the values are comparable to those of some experimentally validated mirna bindings; for example, mir-122 binding to hcv rna has a ΔG value of -18.3 kcal/mol for the s1 binding site and -22.6 kcal/mol for the s2 binding site (data not shown).
the values of ΔG obtained for the mirnas binding to the n protein coding region are comparable to these values, suggesting their relevance in in vivo conditions.

[figure 2 caption, partially recovered: only two samples from west bengal (s11 and s12) harbour this mutation. (d) 14323 c/t, 14326 c/a and 14331 t/c mutations in the rdrp gene in clustal omega. only one sample from west bengal (s10) harbours these mutations.]

we checked the levels of these mirnas in cancer conditions around the upper respiratory tract in the dbdemc2 database. we found that mir-24-1-5p and mir-299-5p were downregulated in most of the cancers. mir-24-2-5p was found to be upregulated in esophageal cancer (esca), head and neck cancer (hnsc) and lung cancer (luca), and downregulated in nasopharyngeal cancer (nsca) (supplementary figure 1). assuming that the binding of mirnas would inhibit viral replication/stability, a higher abundance of a mirna would be protective against infection and a lower abundance would increase the susceptibility towards infection. taken together, these results suggest that if a patient suffering from esca, hnsc or luca is infected with the original virus containing the ggg sequence, the upregulated mir-24-2-5p would be protective against the infection. but if the same patient is infected with the mutated virus containing the aac sequence, mir-24-2-5p will no longer be functional, and mir-299-5p, which targets the mutated site, is also downregulated. this could make patients suffering from the described cancers highly susceptible to infection with the mutant virus. we also checked if these mirnas are associated with other disease conditions and found that mir-299-5p is downregulated in type 2 diabetes mellitus (t2dm) and hence could serve as one of the factors for increased susceptibility of t2dm patients to the mutated viral subtype and increase the risk of comorbidity (huang et al. 2018). another mirna, mir-3162-3p, which targets the original subtype, is reported to be higher in asthma patients (fang et al. 2016).
this could be one of the factors limiting propagation of the original virus, but the loss of its targeting site in the mutated viral subtype could increase host susceptibility towards viral infection. we further checked if there are other conditions that could alter the availability of these mirnas at the site of infection. therefore, we used the tissueatlas database to analyse the presence and correlation of these mirnas in body fluids. we found that there is differential expression of certain mirnas in the saliva of patients suffering from pancreatic cancer. mir-642b-5p, mir-3162-3p and mir-299-5p were found to be upregulated in the saliva of pancreatic cancer patients, which could provide a protective or susceptibility effect similar to that described above (supplementary figure 2). mirnas have been known to affect viral replication and stability by binding to protein coding regions of the genomes of h1n1, ev71, cvb3 and many more viruses (bruscella et al. 2017; trobaugh and klimstra 2017). in most cases, binding of mirnas leads to translational repression of the targeted protein and hence directly affects viral rna replication. targeting by mirnas could decrease the levels of n protein, which is involved in various steps of the viral life cycle including replication, translation and coating of viral rna to form the nucleocapsid. hence, altered levels of the shortlisted mirnas could regulate various viral processes and the severity of sars-cov-2 infection. the effect of the mirnas would be the opposite if they assist in viral replication/stability; this needs to be experimentally confirmed, but in either case the mirnas targeting the original and mutated sites remain important. we analysed the 28881-3 ggg/aac mutations in the nucleocapsid gene, which result in contiguous amino acid changes of r203k and g204r, for their potential role in altering the structure of the encoded protein.
the sites of these mutations are located in the sr-rich region, which is known to be intrinsically disordered (chang et al. 2014). in addition, this region is known to encompass a few phosphorylation sites (surjit et al. 2005), notably the gsk3 phosphorylation site at ser202 and a cdk phosphorylation site at ser206, which are in close proximity to these mutations. the sequence motifs around these sites are entirely consistent with gsk3 and cdk phosphorylation motifs, respectively. when ser202 is phosphorylated, which incorporates a large negative group tethered to the sidechain of ser, as seen in many other substrates of kinases, it is likely that charge neutralization takes place involving positively charged sidechains in the sequential and spatial vicinity. arg203 is a part of the gsk3 phosphorylation motif and its sidechain could potentially contribute to charge neutralization at p-ser202. given the sequential, and therefore spatial, proximity of arg203 to p-ser206, the sidechain of arg203 could potentially also be involved in interaction with the phosphate group at position 206. this interaction would contribute to a reduction of conformational entropy. similarly, arg209, a part of the cdk phosphorylation motif, would contribute to charge neutralization at p-ser206. arg203 and gly204 are mutated to lys and arg respectively (figure 4).

[table 3: mirnas predicted to bind the mutation site, each followed by the count given in the extracted table: hsa-mir-3162-3p, 3; hsa-mir-4699-3p, 2; hsa-mir-6826-3p, 3; hsa-mir-299-5p, 3; hsa-mir-5195-5p, 3; hsa-mir-12132, 1; hsa-mir-24-1-5p, 2; hsa-mir-3679-3p, 3; hsa-mir-642b-5p, 3; hsa-mir-24-2-5p, 2.]

the spike protein (s) of coronaviruses is a class i viral fusion protein which is synthesized as a single chain precursor that trimerizes upon folding. it is composed of two subunits: s1 (at the amino terminus), containing the receptor binding domain (rbd), and s2 (at the carboxy terminus), which drives membrane fusion. in all three structures (6vxx, 6vyb and 6vsb), d614 lies in a loop at the interface between any two out of the three protomers.
the co-ordinates for the d614 side-chain in chain a and c of 6vyb are available only up to cβ atom and the orientation of these atoms are similar to that observed in the respective atoms of d614 in 6vxx. the co-ordinates of all the side-chain atoms of d614 in chain b of 6vyb are available and they are similar to that observed in chain b of 6vxx. the side-chain of d614 in all the protomers of 6vxx and chain b of 6vyb point outward from the core of the protein toward the solvent. the side-chain orientation of d614 in all the three chains of 6vsb is different from the former two structures. this differential orientation of d614 side-chain in 6vsb facilitates formation of hydrogen bond between d614 (present in s 1 subunit) and t859 (present in s 2 subunit) from the neighbouring chain in two out of the three interfaces found in 6vsb (figure 6). taken together, these facts suggest that d614 is highly flexible and support the wobbly nature of the inter-protomeric hydrogen bond observed between d614 and t859. contribution of this transient hydrogen bond toward stability of the pre-fusion state cannot be negated. interestingly, s protein of mouse coronavirus (mhv-a59) which has a similar structural topology as that of the sars-cov-2 s protein but shares a low overall sequence identity (~32%), has a conservative substitution at the position equivalent to d614 of the latter. the asn (n655) of mouse coronavirus (mhv-a59) is replaced with asp (d614) in sars-cov-2 (figure 5 and figure 7). in earlier literature, n655 has been suggested to offer inter-protomeric interactions that contribute toward maintenance of the s 2 fusion machinery in its metastable state (ac walls et al. 2016). given the conservation of asp at this position in closely related coronaviruses (bat coronaviruses: btcov-ratg13 and btcov-hku3; sars-cov) and its conservative substitution in mouse coronavirus (mhv-a59), it is likely that d614 is important for structural stability of s protein. 
as gly lacks a side-chain, the transient hydrogen bond as observed in the wild-type s protein would be lost in the variant with d614g mutation. this can potentially compromise on the structural stability of pre-fusion state of s protein possibly interfering with conformational transitions. moreover, replacement of asp with gly at this position would come with higher conformational freedom at the backbone (c ramakrishnan and gn ramachandran 1965) of the polypeptide resulting in enhancement of local conformational entropy. the gly at this position is solvent exposed and is present at the tip of the c-terminal end of a β-strand. this position is proximal to the region where the s protein attaches itself to the viral membrane (figure 5). it is to be noted that the gly at this position is conserved among the closely related coronaviruses (bat coronavirus ratg13 and hku3, sars-cov) hinting toward its possible role in maintenance of structure and function of the s protein (figure 7). in general, as explained above, gly backbone has higher conformational freedom than any other amino acid residues (ramakrishnan and ramachandran 1965). therefore, substitution of gly with val would impart rigidity to the local region. the possible implication of such rigidity on the association of s protein with viral membrane could be understood from a structure of s protein in association with the viral membrane. however, such a structure is currently unavailable.
figure 6. conformation of d614 in three structures (6vxx, 6vyb, 6vsb). (a), (b), (c) overlay of d614 (6vxx: yellow carbon; 6vyb: white carbon; 6vsb: dark pink carbon) from chain a, b and c of the three structures, respectively. to maintain visual clarity, only the backbone of respective chain of 6vxx is shown in cartoon representation. (d), (e), (f) orientation of d614 (green carbon) from chain a (purple cartoon) and t859 (dark blue carbon) from chain b (teal cartoon) in 6vxx, 6vyb and 6vsb, respectively. hydrogen bond is depicted as black dashed line. (g), (h), (i) orientation of d614 (green carbon) from chain c (orange cartoon) and t859 (dark blue carbon) from chain a (purple cartoon) in 6vxx, 6vyb and 6vsb, respectively. hydrogen bond is depicted as black dashed line. the side-chain co-ordinates for d614 in chain a and c of 6vyb are unavailable. protein rendering has been done using pymol (schrödinger, llc).
substantial uncertainties surround the trajectory of the recent epidemic of covid-19 in india. it is extremely important to track the outbreak by analysing the phylogenetic relationships between different sars-cov-2 genomes prevalent in india and compare them with genomes reported from rest of the world. the error-prone replication process of all rna viruses in general, results in introduction of mutations in their genomes which behave as a molecular clock that can provide insights into the emergence and evolution of the virus. the data till date suggests that sars-cov-2 emerged not long before the first cases of pneumonia in wuhan occurred. in this study, direct massively parallel sequencing of the viral genome was undertaken on nasopharyngeal and oropharyngeal swab samples collected from infected individuals from different districts of west bengal. we have analysed the first nine sequences in this report. recent analysis of sars-cov-2 sequences from all over the globe has revealed that the outbreaks have been initially triggered in most countries by the original strain from wuhan, clade o, which thereafter have diversified into multiple clades (yadav et al. 2020; biswas and majumder 2020). temporal sweeps leading to replacement of the ancestral o and other clades by a2a, have been detected. until our report, initial sequences from samples obtained from individuals with travel history to china, reported genetic similarity to the clade o, which was obtained at the beginning of the outbreak in wuhan, china. 
rest of the sequences reported from india mostly belonged to either clade a3 (18%) or a2a (44%) (supplementary table 3), with evidence of the temporal sweep where the a2a is emerging as the predominant clade (biswas and majumder 2020) . the a2a clade is characterized by the signature nonsynonymous mutations leading to amino acid changes of p323l in the rdrp which is involved in replication of the viral genome and the change of d614g in the spike glycoprotein which is essential for the entry of the virus in the host cell by binding to the ace2 receptor. notably, the d614g mutation is close to the furin recognition site for cleavage of the spike protein, which plays an important role in virus entry. whether both these mutations have resulted in the evolution of a more transmissible viral subtype i.e. the a2a clade, is yet to be verified by in vitro and in silico analyses. interestingly, we also found that one of viral sequences in our study belonged to the b4 clade, which originated in china (gonzalez-reiche et al. 2020). b4 clade sequences have not been reported from india earlier and are only less than 1% of sequences reported worldwide. probably, the individual s10 was transmitted this subtype by contact with others who had travel history to china although this information was not available in the patient clinical history. emergence of viral subclones in an outbreak can affect the transmission patterns and disease severity, which are immensely important for public health (harvala et al. 2017; jones et al. 2019) . given the large size of the infected population in india, with the possibility of regional differences in the population and host-related factors, this can have the potential to affect the course of the outbreak. population surveillance is essential for early detection of emergence of such subclones. we analysed the mutations detected in each sequence that we generated and found preliminary evidence of this. 
we found that three individuals of this study, viz. s2, s3 and s5, shared a rare set of three contiguous mutations in their genome which resulted in the consecutive alterations of r203k and g204r. these mutations were also found to be shared with 3 other sequences reported from western india. interestingly, while two out of the three sequences harbouring these mutations were from individuals who shared contact history with a covid-19 patient with history of travel from italy, two out of the three samples from west bengal shared contact history with the same covid-19 patient with history of travel from uk. the third individual whose sample harboured these mutations, viz. s5, was found to have history of travel from chennai, india, but the possibility of the patient having contact in transit with an individual with international travel history cannot be excluded. additionally, origin of the viral subtypes infecting s5 and s6 has also been predicted by phylogenetic analysis to be europe (uk). s6 had been infected in delhi, india where he had contact with an infected individual who travelled from europe. one of the individuals, s10, harboured a viral subtype which is predicted to have been transmitted in china. s2 and s3, who shared an identical sequence of the virus, also harboured one unique mutation resulting in the amino acid alteration of g1124v in the spike protein. this correlates with the fact that these two individuals had also been known to have contact with the same covid-19 patient. viral rna sequences obtained from two samples, s11 and s12, shared all mutations except a v32l mutation at orf8 harboured by s11 and not by s12. interestingly, both these individuals belonged to the same district of east medinipur, had history of contact with covid-19 patients and did not exhibit any clinical symptom. 
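The comparison of mutations across samples described above amounts to inverting a sample-to-mutation table. The sketch below is illustrative only and is not the authors' method. The dictionary holds only the amino acid changes named in the preceding text for each sample, so it is not a complete mutation profile, and the function names and the `gene:change` labels are hypothetical conventions.

```python
from collections import defaultdict

# Only the changes explicitly named in the text above; NOT full profiles.
reported = {
    "s2": {"N:R203K", "N:G204R", "S:G1124V"},
    "s3": {"N:R203K", "N:G204R", "S:G1124V"},
    "s5": {"N:R203K", "N:G204R"},
    "s11": {"ORF8:V32L"},
}


def samples_by_mutation(profiles):
    """Invert {sample: mutations} into {mutation: [samples]}."""
    index = defaultdict(list)
    for sample in sorted(profiles):
        for mutation in profiles[sample]:
            index[mutation].append(sample)
    return dict(index)


def shared_mutations(profiles, min_samples=2):
    """Keep only mutations carried by at least min_samples samples."""
    index = samples_by_mutation(profiles)
    return {m: s for m, s in index.items() if len(s) >= min_samples}


shared = shared_mutations(reported)
for mutation in sorted(shared):
    print(mutation, shared[mutation])
```

Mutations that survive the filter are candidates for a common transmission source, which would then have to be checked against contact and travel history as done in the text.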
thus our findings indicate that the viral subtypes transmitted in the eastern region of india, in particular west bengal, have mostly originated from europe and also china. sequencing of large number of samples are being presently undertaken to confirm and elaborate these initial findings. rdrp is essential for replication of viral rna genome and hence this gene is expected to be conserved. interestingly, we detected multiple mutations in this gene, the majority of which were non synonymous and hence result in alteration of protein sequence. in particular, the p323l was present in all a2a sequences in our samples. this mutation is located adjacent to a hydrophobic cleft in rdrp which is a promising target for potential drugs (pachetti et al. 2020) . sequences from two samples, s11 and s12, shared a unique rdrp mutation at a88v which has not been detected until date in rest of the sequences submitted from india or worldwide. as observed earlier, these two samples harbour viral subtypes whose genomes are strikingly similar. sequence obtained from one of the samples s10, which belonged to the clade b4, did not possess the p323l mutation. instead, it harboured three different mutations resulting in two non-synonymous changes of h286y, p287t and a synonymous mutation which were not found in any other sequences reported from india until date and are specific for the b4 clade. it remains to be seen whether these amino acid alterations result in substantial changes in structure or function of rdrp, resulting in emergence of drug resistant subtypes or enhancement in mutation rate in the viral genome. we investigated the potential of the mutations detected in the nucleocapsid region to effect alterations in the viral and host processes. we found that this mutation results in considerable alterations in the predicted binding of mirnas, which might play a role in the establishment and progress of infection in the patient. 
we also found that some of the mirnas which are predicted to bind to the mutated subtype might be downregulated in multiple cancer types. this raises the possibility that cancer patients might have higher susceptibility to the mutated sub-clone due to the reduced ability to contain the virus in vivo, compared to infection by the original virus of the same clade. the leads obtained from this study need to be pursued to develop mirna based novel therapeutic approaches. we also analysed the predicted structural alterations in the viral nucleocapsid protein, which might be caused by consecutive alterations of r203k and g204r. as a result of these mutations, we have two strong positively charged residues in close sequential positions as opposed to only one positively charged residue in the other genotype. given the structural vicinity of p-ser202 and p-ser206 and the long sidechains of lys and arg with high positive charge and significant side-chain conformational freedom in this genotype, both these residues potentially could contribute to charge neutralization of the phosphorylated serine residues. this contributes to further reduction of conformational entropy compared to the other genotype. while lys203 is likely to offer electrostatic interactions to p-ser202, arg204 (with a greater number of positively charged centres as compared to lys) could potentially simultaneously interact with the phosphate groups at both p-ser202 and p-ser206. together, these two positively charged residues (lys203 and arg204) have the potential to offer additional interactions to the phosphorylated serine residues at 202 and 206 positions as opposed to only one of them (arg203) in the other genotype. consequently, one can expect a significant difference in conformational entropy as well as in the inter-residue interaction structural network between the two genotypes especially when ser202 and ser206 are phosphorylated. 
further, gly at position 204 in one of the genotypes would confer significantly higher conformational freedom at the backbone (ramakrishnan and ramachandran 1965) of the polypeptide chain compared to arg in the equivalent position in the other genotype. this mutation adds another dimension to the likely structural differences in this local region of the two genotypes. subsequently, phosphorylation-mediated functional events might be different in the two genotypes (surjit and lal 2008; surjit et al. 2006 ). these proposed differences in the inter-residue structural network between the two genotypes are depicted schematically in figure 4 . admittedly, the proposed network of interactions is fraught with uncertainty. however, given two positively charged residues in one genotype compared to only one in the other genotype, the charge neutralization structural interaction networks involving p-ser202 and p-ser206 has to be certainly different going by the highly established literature on kinase substrates (kitchen et al. 2008; krupa et al. 2004) . interestingly, the mutations d614g (in s d domain) is supposed to confer flexibility in the s d domain and the mutation g1124v might impart partial rigidity in the conformation of s 2 domain. obvious question is whether such structural alterations in local region would have any consequence in receptor binding affinity of spike protein. since the mutation resides in rbd domain-s1 subunit of spike protein, residue 614 is not directly involved in the interaction with ace2. but the mutation might have some effect on the positioning of the residues involved in interaction. now to address the concerns whether these mutations are expected to affect the sensitivity of the existing diagnostic kit, we have again explored the implications of the structural changes. 
most likely, the presence of mutation should not affect the rapid detection kits because these kits detect the presence of specific igg/igm antibody against viral n protein or viral s protein. the whole protein is coated for the test and therefore polyclonal antibodies would provide the result here. change in just one epitope might not affect the overall result. we have further checked if the mutation sites fall in immunodominant epitopes. this data is available for sars proteins and the sites where we have found mutation have been shown to be conserved in sars and sars-cov-2. while the mutation site of n protein does not elicit much antibody response, region 603-634 of the s protein of sars has been shown to be a major immunodominant epitope in s protein (he et al. 2004) . change in this epitope by mutation could alter the sensitivity of the igg/igm tests conducted. also, there are certain diagnostic kits being designed to check the presence of viral antigen in the clinical sample. the abundance of antibodies targeting the mutation sites needs to be checked in those kits, to be more effective across the viral strains harbouring different mutations. we also detected interesting relationships between ct value of diagnostic assay as a surrogate of viral copy number and viral sequence reads obtained. we recommend that for future sequencing studies, the shotgun rna-seq approach should be used for high viral copy number represented by low ct values while for rest, a viral genome amplification method should be used. although the sample size of our preliminary report is small, follow up studies are underway to confirm these observations for understanding the impact of the same in the ongoing outbreak of covid-19 in india. we have not commented on the relationship of the viral sequence alterations with disease severity due to the limited sample size of this analysis. 
we hope to provide valuable information on this aspect based on the expanded number of samples being sequenced at present. our findings provide leads which might benefit outbreak tracking and development of therapeutic and prophylactic strategies to contain the infection. finally, we conclude that the initial sequences generated by us from first nine samples in west bengal in eastern india indicate a selective sweep of the a2a clade of sars-cov-2. however, the viral population is not homogenous and other clades like b4 also exist. we have also detected emergence of mutations in the important regions of the viral genome including spike, rdrp and nucleocapsid coding genes. some of these mutations are predicted to have impact on viral and host factors, which might affect transmission and disease severity. this preliminary evidence of emergence of multiple subclones of sars-cov-2, which might have altered phenotypes, can have important consequences on the ongoing outbreak in india. during the ongoing covid-19 pandemic. we also acknowledge the assistance provided by dr. sillarine kurkalang (nibmg), mr. sumitava roy (nibmg) in reviewing the sequence data, ms. soumi sarkar (nibmg) for assistance in statistical analysis, and mr. anand bhushan and ms. meghna chowdhury for providing assistance in laboratory support and logistics. sd and ns would like to acknowledge support from j c bose fellowship. we also thank dbt-iisc partnership programme at iisc, bengaluru, and the national genomics core at nibmg. hr and sc would like to acknowledge support from csir-spm fellowship and dst-inspire fellowship, respectively. tg would like to acknowledge dbt-ra fellowship. 
severe acute respiratory syndrome coronavirus 2 (sars-cov-2): an overview of viral structure and host response analysis of rna sequences of 3636 sars-cov-2 collected from 55 countries reveals selective sweep of one virus type viruses and mirnas: more friends than foes modular organization of sars coronavirus nucleocapsid protein the sars coronavirus nucleocapsid protein -forms and functions the effect of travel restrictions on the spread of the 2019 novel coronavirus (covid-19) outbreak mir-3162-3p is a novel microrna that exacerbates asthma by regulating b-catenin 2020 introductions and early spread of sars-cov-2 in the new york city area emergence of a novel subclade of influenza a(h3n2) virus in london identification of immunodominant sites on the spike protein of severe acute respiratory syndrome (sars) coronavirus: implication for developing sars diagnostics and vaccines sars-cov-2 cell entry depends on ace2 and tmprss2 and is blocked by a clinically proven protease inhibitor glucolipotoxicity-inhibited mir-299-5p regulates pancreatic b-cell function and survival analysis of the mutation dynamics of sars-cov-2 reveals the spread history and emergence of rbd mutant with lower ace2 binding affinity evolutionary, genetic, structural characterization and its functional implications for the influenza a (h1n1) infection outbreak in india from charge environments around phosphorylation sites in proteins structural modes of stabilization of permissive phosphorylation sites in protein kinases: distinct strategies in ser/thr and tyr kinases the molecular biology of coronaviruses structure, function, and evolution of coronavirus spike proteins genomic characterisation and epidemiology of 2019 novel coronavirus: implications for virus origins and receptor binding the embl-ebi search and sequence analysis tools apis in 2019 emerging sars-cov-2 mutation hot spots include a novel rna-dependent-rna polymerase variant stereochemical criteria for polypeptide and protein chain 
conformations: ii. allowed conformations for a pair of peptide units variant review with the integrative genomics viewer the severe acute respiratory syndrome coronavirus nucleocapsid protein is phosphorylated and localizes in the cytoplasm by 14-3-3-mediated translocation the sars-cov nucleocapsid protein: a protein with multifarious activities the nucleocapsid protein of severe acute respiratory syndrome-coronavirus inhibits the activity of cyclin-cyclindependent kinase complex and blocks s phase progression in mammalian cells microrna regulation of rna virus replication and pathogenesis 2020 structure, function, and antigenicity of the sars-cov-2 spike glycoprotein cryo-electron microscopy structure of a coronavirus spike glycoprotein trimer cryo-em structure of the 2019-ncov spike in the prefusion conformation a new coronavirus associated with human respiratory disease in china full-genome sequences of the first two sars-cov-2 viruses from india probable pangolin origin of sars-cov-2 associated with the covid-19 outbreak a novel coronavirus from patients with pneumonia in china we acknowledge the financial and overall support provided by the department of biotechnology, ministry of science and technology, india, and indian council of medical research and all laboratory staff of the niced-vrdl network for laboratory support key: cord-001786-ybd8hi8y authors: dutilh, bas e title: metagenomic ventures into outer sequence space date: 2014-12-15 journal: bacteriophage doi: 10.4161/21597081.2014.979664 sha: doc_id: 1786 cord_uid: ybd8hi8y sequencing dna or rna directly from the environment often results in many sequencing reads that have no homologs in the database. these are referred to as “unknowns," and reflect the vast unexplored microbial sequence space of our biosphere, also known as “biological dark matter." however, unknowns also exist because metagenomic datasets are not optimally mined. 
there is a pressure on researchers to publish and move on, and the unknown sequences are often left for what they are, and conclusions drawn based on reads with annotated homologs. this can cause abundant and widespread genomes to be overlooked, such as the recently discovered human gut bacteriophage crassphage. the unknowns may be enriched for bacteriophage sequences, the most abundant and genetically diverse component of the biosphere and of sequence space. however, it remains an open question, what is the actual size of biological sequence space? the de novo assembly of shotgun metagenomes is the most powerful tool to address this question. metagenomics is the untargeted sequencing of genetic material isolated from communities of micro-organisms and viruses. these communities may be derived from bioreactors, environmental, clinical, or industrial samples; in short, from anywhere in our unsterile biosphere. 
the classical questions in metagenomics that are asked about the sampled microbial community are "who is there?" and "what are they doing?" 1 originally an approach to answer these classical questions, metagenomics as a field has made great progress in the past decade. applications include the use of metagenomics for the discovery of novel genetic functionality, 2 for describing microbial ecosystems and tracking their variation, 3 in untargeted medical diagnostics and forensics, 4 and as a powerful tool to determine the genome sequences of rare, uncultivable microbes. 5 powered by advances in next-generation sequencing technology, metagenomics has the potential to venture beyond the limits of currently explored sequence space by sampling environmental microbes and viruses at an unprecedented scale and resolution. quite literally, sequence space is defined as the multi-dimensional space of all possible nucleotide (or protein) sequences. 6 sequence space contains n dimensions; one dimension per residue that can take one of 4 (or 20, for proteins) states, with a total volume of $\sum_{n} 4^{n}$ sequences when summed over all possible sequence lengths n. evolution may have largely explored this space, 7 but it remains an open question how large the current biological sequence space is, i.e. the fraction occupied by extant life. figuratively, and within the context of this paper, "outer sequence space" is the remainder of this biological sequence space waiting to be explored by science. metagenomics has traditionally addressed the 2 classical questions listed above by aligning the sequencing reads in metagenomic data sets to a reference database containing known, annotated sequences. this allows the taxonomic and functional diversity of the sampled microbes to be described in terms of existing knowledge, allowing for straightforward interpretation of the results. 
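To give a feel for the volume of sequence space defined earlier (4 states per residue for nucleotides, 20 for proteins, summed over sequence lengths), the following sketch evaluates the sum up to a finite maximum length. The cut-off `max_len` is an assumption introduced here so that the sum is finite; the function names are hypothetical.

```python
def sequence_space_volume(max_len, alphabet_size=4):
    """Number of distinct sequences of length 1..max_len over the alphabet."""
    return sum(alphabet_size ** n for n in range(1, max_len + 1))


def sequence_space_volume_closed_form(max_len, alphabet_size=4):
    """Same quantity via the geometric series (a^(N+1) - a) / (a - 1)."""
    a = alphabet_size
    return (a ** (max_len + 1) - a) // (a - 1)


# Nucleotide sequences up to length 3: 4 + 16 + 64.
print(sequence_space_volume(3))
# Protein sequences up to length 2: 20 + 400.
print(sequence_space_volume(2, alphabet_size=20))
# Even short lengths give very large volumes; Python integers do not overflow.
print(sequence_space_volume_closed_form(100))
```

Because the sum is dominated by its last term, the volume grows by roughly a factor of 4 (or 20 for proteins) for every additional residue, which is why only a small fraction of this space can be occupied by extant life.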
however, a persistent concern in the analysis of metagenomes has been the unknown fraction, consisting of the reads that cannot be annotated by using database searches. the level of unknowns can range up to 99% of the metagenomic reads, depending on the sampled environment, the protocols used for nucleotide isolation and sequencing, the homology search algorithm, and the reference database. 8
keywords: biological dark matter, crassphage, human gut, human virome, metagenomics, metagenome assembly, unknowns
unknowns exist for 4 reasons that are not unrelated. the first reason is technical. due to limitations of some next-generation sequencing platforms and library preparation protocols, spurious sequences may be generated that do not reflect true biological molecules. these artificial sequences include artifacts due to the sequencing technology 9 and chimeras, i.e., sequences generated from separate genetic molecules derived from different organisms. since chimeras frequently arise during pcr amplification, they are expected to be more abundant in environmental amplicon sequencing than in shotgun metagenomics, and can be detected using bioinformatic tools. 10 the second reason that unknowns exist is biological, as they reflect the enormous natural diversity of microorganisms that we are only beginning to unveil with metagenomics. this is both overwhelming and exciting, highlighting how much remains to be discovered in biology. this genetic diversity has been referred to as biological "dark matter," 11, 12 and is especially pronounced in viral metagenomes. 8 this issue can only be resolved by expanding reference databases, as exemplified by recent studies of one of the most studied microbial ecosystems: the human gut. the first metagenomic snapshots of the microbiota in the human gut were taken from 2 healthy adults, and revealed a high interindividual diversity and many unknowns. 
13 to a large extent, these unknowns were resolved when a reference catalog was created based on the sequences in the gut metagenomes themselves, decreasing the percentage of unknowns from ~85% to ~20%. 14 moreover, subsequent large scale sequencing efforts revealed that in fact, many people share a similar intestinal flora, regardless of whether these similarities are viewed as discrete enterotypes 15 or as gradients. 16 these results illustrate how unknowns can be depleted by expanding the databases with appropriate reference sequences. this not only requires increased sequencing effort of phylogenetically diverse isolates 17 or single cells, 11 but also mining of draft genomes from metagenomes, 18 sampled from microbial environments around the globe. 19 thus, by mapping the global sequence space, we can provide reassurance that at least some level of sampling saturation can be achieved. for viruses, and particularly for bacteriophages, efforts to provide a denser sampling of sequence space are still lacking. the third reason that unknowns exist is methodological. because the advances in dna sequencing technology have greatly outpaced improvements in computer power, 20 bioinformatic approaches to analyze metagenomes often cut corners. for example, reference databases may be reduced to include only those references that are expected in the sample a priori. moreover, read annotation may be limited to identifying almost exact sequence matches, as this can be computed much faster than if sequence variations need to be taken into consideration in a permissive homology search. these issues lead to an inherent blind spot for discovering true novelty, such as sequences that are not expected in the sample, or organisms that have not been observed before. one way to at least partially resolve this issue is by de novo assembly of the metagenome. 
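To make the idea of de novo assembly concrete, the toy sketch below merges short reads into longer contigs by repeatedly joining the pair with the longest exact suffix-prefix overlap. This is a deliberately simplified greedy scheme chosen for illustration, not the algorithm of any particular metagenome assembler; it ignores sequencing errors, reverse complements and repeats, and the reads, function names and `min_len` threshold are all hypothetical.

```python
def overlap(a, b, min_len):
    """Length of the longest suffix of a equal to a prefix of b, or 0 if < min_len."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len=3):
    """Greedily merge reads on their best exact overlap (min_len must be >= 1)."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            return contigs  # no remaining pair overlaps by at least min_len
        merged = contigs[best_i] + contigs[best_j][best_k:]
        contigs = [c for idx, c in enumerate(contigs)
                   if idx not in (best_i, best_j)] + [merged]


# Three hypothetical overlapping reads collapse into a single contig.
toy_reads = ["ATGGCGT", "GCGTACC", "TACCTGA"]
print(greedy_assemble(toy_reads))
```

Fewer and longer sequences are what make the more expensive searches affordable: one contig can be queried instead of many individual reads.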
depending on the diversity of the sample, assembly can combine many short sequences (individual reads) into fewer, longer ones (assembled contigs). reducing the number, and increasing the length of the sequences allows homology searches to be performed with more sensitive, computationally more expensive algorithms such as translated homology searches or profile searches, leading to more specific annotation and improved biological interpretation. moreover, larger and more comprehensive reference databases can be used, allowing unexpected hits to be found. the fourth reason that unknowns exist is logistical. most research projects that generate metagenomic sequencing datasets deposit the read files in large repositories, provide an accession number in the associated publication, and move on. it is not unlikely that many of these data sets, consisting of files sometimes gigabytes in size, are never looked at again. thus, while a certain sequence may have been "seen" in a metagenome and is thus strictly no longer "dark matter," it will still not be recognized when it is observed again. reidentification of this sequence would only be possible if the publishing researcher identified it as an interesting sequence in his or her (assembled) metagenome, and submitted it to a searchable database like genbank. 21 because genbank maintains very high standards for the sequences it accepts, submission can be a tedious process that is rarely worthwhile for unknown metagenomic contigs. an in depth investigation of the unknowns is rarely within the scope of a research project, and those sequences are thus first ignored and later forgotten. this is a waste of valuable resources: time, money, and work. the metagenomes available in public databases should be better exploited and mined for common sequences. to facilitate this, it is critical that metadata annotations of the metagenomes include a detailed description of the samples and sequencing protocol. 
22 exploiting these datasets will allow us to create more comprehensive maps of sequence space, and greatly improve our understanding and interpretation of metagenomes. in the short term, ignoring the unknowns can facilitate the interpretation of a metagenome. because a taxonomic or functional description cannot be provided, the classical questions in metagenomics are left unanswered for the unknown fraction of the metagenome, and concentrating on the annotated sequences leads to a more straightforward answer. however, unexpected or novel sequences are quickly overlooked, even if they represent highly abundant or widespread organisms. thus, in the long term, stockpiling the unknown sequencing reads in badly accessible bulk sequence repositories can severely slow down research, the discovery of novel species, and the charting of biological sequence space. one striking example of a novel genome discovered among the unknown sequences is crassphage, a bacteriophage whose genome uniquely aligned sequencing reads from 73% of the 466 analyzed human gut metagenomes, and constituted a total of 1.68% of those metagenomic reads. 23 like many bacteriophages, its genome sequence is highly divergent from everything that was present in the annotated part of the genbank database, which is why it was not observed before. it has been suggested that the unknown fraction of metagenomes is enriched for viral sequences, 8, 24 because viral genomes are thought to evolve more rapidly than the genomes of cellular organisms, allowing them to explore a larger region of sequence space in the same amount of time. to summarize, unknowns are genetic sequences that are difficult to identify using standard methods, such as by alignment to an annotated reference database. unknowns remain a persistent elephant in the room in most metagenomics research projects, and exist for technical, biological, methodological, and logistical reasons. 
the most promising option to resolve the unknowns is by creating improved reference databases that chart biological sequence space, including the outer realms that remain unexplored by science (also known as dark matter). besides sequencing reference strains or single cells, it may be expected that metagenomic sequencing, assembly, and binning will greatly add to improving these reference databases, for example by identifying common sequences in many metagenomes, and prioritizing them for targeted characterization. characterizing unknowns will be vital to fully exploit the increasingly available metagenomic data sets from all ecosystems, toward understanding the roles of microbes and viruses in the biosphere. it remains an open question what the actual size of biological sequence space is, but the untargeted, shotgun nature of metagenomics makes it the most powerful tool to address this question.

metagenomics: application of genomics to uncultured microorganisms
fermentation, hydrogen, and sulfur metabolism in multiple uncultivated bacterial phyla
human gut microbiome viewed across age and geography
isolation of a novel coronavirus from a man with pneumonia in saudi arabia
genome sequences of rare, uncultured bacteria obtained by differential coverage binning of multiple metagenomes
natural selection and the concept of a protein space
how much of protein sequence space has been explored by life on earth
metagenomics and future perspectives in virus discovery
tagdust - a program to eliminate artifacts from next generation sequencing data
uchime improves sensitivity and speed of chimera detection
insights into the phylogeny and coding potential of microbial dark matter
scratching the surface of biology's dark matter
metagenomic analysis of the human distal gut microbiome
a human gut microbial gene catalogue established by metagenomic sequencing
enterotypes of the human gut microbiome
a guide to enterotypes across the human body: meta-analysis of microbial community structures in human microbiome datasets
a phylogeny-driven genomic encyclopaedia of bacteria and archaea
genomes from metagenomics
meeting report: the terabase metagenomics workshop and the vision of an earth microbiome project
the pace and proliferation of biological technologies
the minimum information about a genome sequence (migs) specification
a highly abundant bacteriophage discovered in the unknown sequences of human faecal metagenomes
umars: un-mappable reads solution

i thank my collaborators for their contributions in the crassphage project, and the anonymous reviewers of this manuscript for valuable suggestions.

key: cord-016798-tv2ntug6 authors: gautam, ablesh; tiwari, ashish; malik, yashpal singh title: bioinformatics applications in advancing animal virus research date: 2019-06-06 journal: recent advances in animal virology doi: 10.1007/978-981-13-9073-9_23 sha: doc_id: 16798 cord_uid: tv2ntug6 viruses serve as infectious agents for all living entities. various research groups focus on understanding viruses in terms of their host-viral relationships, pathogenesis and immune evasion. however, with current advances in science, the research field has widened to the ‘omics’ level, and the generation of viral sequence data has been increasing. there are numerous bioinformatics tools available that not only aid in analysing such sequence data but also aid in deducing useful information that can be exploited in developing preventive and therapeutic measures. this chapter elaborates on bioinformatics tools that are specifically designed for animal viruses as well as other generic tools that can be exploited to study animal viruses.
the chapter further provides information on tools that can be used to study viral epidemiology, phylogenetic analysis, structural modelling of proteins, epitope recognition and open reading frame (orf) recognition, as well as tools for analysing host-viral interactions, gene prediction in the viral genome, etc. various databases that organize information on animal and human viruses have also been described. the chapter gives an overview of the current advances, online and downloadable tools and databases in the field of bioinformatics that will enable researchers to study animal viruses at the gene level. viruses are known to infect all forms of life, ranging from bacteria to chordates. in humans, viruses are known to cause infectious diseases such as influenza, hepatitis, aids, diarrhoea, encephalitis, dengue fever and, more recently, severe acute respiratory syndrome (sars), ebola (singh et al. 2017a), zika (singh et al. 2017b), etc. despite the availability of vaccines and treatments for such diseases, viral infections still cause both morbidity and mortality. viral diseases of animals not only affect production but also pose a threat to humans (saminathan et al. 2016). sequencing methods have rapidly become more widely available, and a vast amount of viral sequence data has been generated in recent times. thus, it is imperative to decipher these data using more advanced tools such as bioinformatics resources. a large number of bioinformatics tools that can aid in the analysis of viral genomes and in developing preventive and therapeutic strategies have been developed for human as well as animal viruses. this chapter will introduce virologists to some of the common as well as virus-specific bioinformatics tools that researchers can use to analyse viral sequence data to elucidate viral dynamics, evolution and preventive therapeutics.
analysis of viral sequences involves the use of certain tools that are applicable to any novel sequence, for example, for gene identification, orf identification, functional annotation and phylogeny. however, owing to their small genome size, viruses have evolved complex strategies to maximize the coding potential of their genomes. many viruses utilize overlapping reading frames or translational frameshifts to code for multiple proteins from limited genome sequences. also, higher rates of mutation and recombination between related viruses pose a challenge for accurate phylogenetic and evolutionary analysis of viruses using general-purpose software. lately, enormous growth in the volume and diversity of viral sequences in the databases has been seen. it has therefore become imperative to organize these viral sequence data in virus family-specific resources tailored for accurate analysis of a specific virus. one of the most common applications of bioinformatics in virology has been the use of phylogenetic analysis of viral isolates to aid in the epidemiological analysis of viral outbreaks. general-purpose phylogeny programs such as phylip (felsenstein 1989) have been used extensively for the phylogeny and molecular epidemiology of viruses. a comprehensive list of these packages and web servers is maintained by joe felsenstein at http://evolution.genetics.washington.edu/phylip/software.html. an open reading frame (orf) is a part of the genome that can be translated into a protein. finding orfs is one of the key steps in viral genome analysis. it forms the basis for further analyses such as homology searches, protein prediction, functional analysis and the discovery of viral vaccine and antiviral targets. if an orf encodes a surface protein that is unique to that virus, it may elicit immune responses and could potentially be a vaccine candidate. orf finder by ncbi is an orf prediction program (rombel et al. 2002).
the program outputs the range of each orf along with its protein translation in the six possible reading frames of the input dna sequence. it can be used to search newly sequenced dna for potential protein-encoding sequences and to verify predicted proteins using smart blast or blastp (altschul et al. 1990). however, the web version of the program is limited to a query sequence length of 50 kb. a standalone version has no limitation on length but is available only for the 64-bit linux operating system. neg8, a 167-codon novel orf in segment 8 of influenza virus, was visualized using orf finder (clifford et al. 2009). using orf finder in association with the basic local alignment search tool (blast), 154 orfs were found in the hz-1 virus genome (cheng et al. 2002). owing to their small genome size, viruses employ multiple strategies to maximize their coding potential, including frameshifts and alternative codon usage. thus, virus-specific programs have been developed to overcome these challenges. genemark (http://opal.biology.gatech.edu/genemark/genemarks.cgi) provides gene prediction tools for viruses (besemer and borodovsky 2005). viral genome organizer (vgo), a java-based web tool, offers identification of genes and orfs in viral sequences (upton et al. 2000). identification of immune epitopes is important in designing new vaccine candidates and in diagnostics. an epitope is the part of an antigen that is recognized by the receptors of immune system components such as antibodies, b cells or t cells. epitopes are generally classified as either linear or conformational. t cells recognize linear epitopes, short continuous strings of amino acids derived from protein antigens and presented by mhc molecules. b cells and antibodies, on the other hand, recognize conformational epitopes, which are formed by amino acids from multiple discontinuous segments brought together in the three-dimensional structure of the antigen (barlow et al. 1986).
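returning to the orf search step described above for orf finder: the idea of scanning all six reading frames can be illustrated with a minimal python sketch. this is not the ncbi implementation. the function names (`reverse_complement`, `find_orfs`), the `min_codons` parameter and the restriction to atg start codons and the three standard stop codons are assumptions made for illustration; frameshifts, alternative start codons and the other virus-specific mechanisms mentioned in the text are deliberately not handled.

```python
# simplified six-frame orf scan; hypothetical helper names, not orf finder itself
COMPLEMENT = str.maketrans("ACGT", "TGCA")
STOP_CODONS = {"TAA", "TAG", "TGA"}


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=3):
    """Yield (strand, frame, start, end) for each ATG...stop stretch.

    start and end are 0-based, end-exclusive offsets on the scanned strand
    (for the "-" strand they refer to the reverse-complemented sequence).
    min_codons counts the start codon but not the stop codon.
    """
    seq = seq.upper()
    for strand, strand_seq in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None  # position of the first ATG since the last stop
            for i in range(frame, len(strand_seq) - 2, 3):
                codon = strand_seq[i:i + 3]
                if start is None and codon == "ATG":
                    start = i
                elif start is not None and codon in STOP_CODONS:
                    if (i - start) // 3 >= min_codons:
                        yield strand, frame, start, i + 3
                    start = None


if __name__ == "__main__":
    for orf in find_orfs("CCATGAAATTTTAGCC"):
        print(orf)
```

an orf that runs off the end of the sequence without a stop codon is not reported in this sketch; a real tool would have to decide how to treat such partial orfs, and would also translate each reported range into protein.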
owing to the simple linear structure of t cell epitopes, their interaction with receptors can be modelled with high accuracy (delisi and berzofsky 1985). a large number of databases and prediction servers are thus available for linear epitope prediction. mhcpep (brusic et al. 1998), syfpeithi (rammensee et al. 1999), fimm (schonbach et al. 2005), mhcbn (bhasin et al. 2003) and epimhc (reche et al. 2005) are some of the commonly used t cell epitope prediction resources. the immune epitope database and analysis resource (https://www.iedb.org) (vita et al. 2015) offers the most comprehensive set of tools for epitope analysis and prediction, covering hla-a and hla-b for humans as well as chimpanzee, macaque, gorilla, cow, pig and mouse, and is one of the few databases that cover such a variety of organisms. since 2011, iedb has used netmhcpan as its prediction method. the netmhc server uses an artificial neural network method to predict binding of peptides to different alleles from human as well as 41 animals including cattle and pig (38 from core). the database also contains curated data for many viruses including influenza and herpesviruses. b cell receptor-epitope interactions are more complex in nature than those of the linear epitopes of t cells; thus, the accuracy of b cell epitope prediction is relatively low. furthermore, most of the current databases are centred on linear rather than conformational epitopes. bcipep is a tool developed for predicting the linear epitopes of b cells (saha et al. 2005). epitome is a database of structure-inferred antigenic residues in proteins (schlessinger et al. 2006). epitome is especially useful in the prediction of antibody-antigen complex interactions. the database is available at http://www.rostlab.org/services/epitome/. antijen is an intricate database with entries on both t cell and b cell epitopes.
it emphasizes the integration of kinetic, thermodynamic, functional and cellular data within the context of immunology and vaccinology (toseland et al. 2005) (fig. 23.1a). three-dimensional structure prediction of viral proteins can be used to relate the actual protein structure to antigenic sites, folding surfaces and functional motifs. such structural modelling tools may be used to identify and design novel candidates for antiviral inhibitors and vaccine targets. secondary structures may be predicted using the tool predictprotein (http://www.predictprotein.org/) (rost et al. 2004). using this online tool, solvent accessibility and possible transmembrane helices can be predicted along with secondary structures. it also provides the expected accuracy of the prediction methods. swiss-model (http://swissmodel.expasy.org/) is a popular tool for the prediction of the 3-d structure of a protein. 3-d structure prediction programs usually employ homology searching using similar and known protein structures as templates. one of the most commonly used databases for such templates is the protein data bank (pdb) (reddy et al. 2001). output from the swiss-model program includes the template selected, the alignment between the query sequence and the template, and the predicted 3-d model. results of swiss-model are, however, only sent by email (figs. 23.1b, 23.1c, 23.1d and 23.1e). for a long time, bioinformatic analysis of viruses relied on common bioinformatics tools developed for other organisms. however, analysing viral genomes using general bioinformatics tools could compromise the accuracy and sensitivity of the analysis. virus genomes are often too small (e.g. < 10 kb) to compute reliable statistics on their codon usage. to maximize their coding potential, viruses work with unusual codon usage patterns and with overlapping coding and non-coding functional elements.
additionally, viruses also rely on other translational mechanisms such as stop codon read-through, frameshifting, leaky scanning and internal ribosome entry sites. comparative genomic analysis of viruses is complicated by the fact that highly conserved sequences may not be coding for anything. instead, such conservation may indicate the presence of overlapping cdss and/or non-coding functional elements. novel virus types comprise new cdss that are different from previously known cdss. there are multiple databases and tools available for analysis of human viruses; however, there are still only a limited number of resources designed specifically for veterinary viruses. in this section, some of the databases and resources useful for the analysis of veterinary viruses are discussed (table 23.1). viruses are among the most diverse and dynamic microorganisms. with increasing viral genome sequencing, there was a need to develop bioinformatics tools to compare and analyse the voluminous data. one downloadable software package that meets this requirement is base-by-base, which aids in the analysis of whole viral genome alignments at the single-nucleotide level (brodie et al. 2004). moreover, with the online resource genome information broker for viruses (gib-v), comparative studies can be made using generic tools such as clustalw, blast and keyword search algorithms (hirahata et al. 2007). another downloadable web server tool, viroblast, is a dedicated blast tool that can be used for queries against multiple databases (deng et al. 2007). sequences from a variety of viral strains can be analysed simultaneously using the alvira software, a multiple sequence alignment tool that also provides a graphical representation (enault et al. 2007). furthermore, comparative analysis of the genes and genomes of coronaviruses can be carried out using covdb (coronavirus database) (huang et al. 2008).
the digital resource viralzone is designed specifically to help users understand viral diversity and obtain information on viral molecular biology, hosts, taxonomy, epidemiology and structures (hulo et al. 2011). the simmonics program was upgraded to the simple sequence editor (sse) software package, in which user-supplied sequences can be aligned, annotated and further analysed for diversity and phylogeny (simmonds 2012). evolutionary changes in viral genomes lead to polymorphisms in their proteins, which in turn result in changes in viral phenotype such as virulence, virus-host interactions, etc. the digital database viralorfeome not only stores all variants and mutants of viral orfs but also provides tools to design orf-specific cloning primers (pellet et al. 2010). further, degenerate primer pairs can be selected and matched to amplify user-defined viral genomes using the online tool prism (yu et al. 2011). recent advances in next-generation sequencing technologies have made it possible to study viral populations at an advanced level. viral population biodiversity and dynamics can be studied using the first such tool developed, phaccs (phage communities from contig spectrum), which analyses shotgun sequence data to estimate the structure and diversity of phage communities (angly et al. 2005). later on, more tools and resources were developed to analyse viral metagenomic sequences, such as the viral informatics resource for metagenomic exploration (virome), the viral metagenome annotation pipeline (vmgap) and metavir (lorenzi et al. 2011, roux et al. 2011, wommack et al. 2012). novel viruses can be identified from a pool of specimen types using a specific computational pipeline, virushunter. genetic recombination in viruses is responsible for the emergence of new viruses, increased virulence and host range, immune evasion and development of antiviral resistance.
this process of viral recombination can be detected by two bioinformatics tools, viz. jphmm (jumping profile hidden markov model) and virema (schultz et al. 2009; routh and johnson 2014). jphmm, a web server, can be used for predicting recombination in hiv-1 and hbv, whereas virema, a downloadable program, can be used to analyse next-generation sequencing data. additionally, another program called vipr hmm (viral identification with a probabilistic algorithm incorporating hidden markov model) can detect recombinant and nonrecombinant viruses using microbial detection microarrays (allred et al. 2012). further, viral genome sequences can be searched for degenerate locus of recombination (lox)-like sites by a web server called selox (surendranath et al. 2010). a downloadable program, virapops, is a forward simulator that allows simulation of rna virus populations (petitjean and vanet 2014). with this software, features of rapidly evolving rna viruses such as mutability, recombination, variation, covariation, etc. can be simulated to predict their effects on viral populations. seqmap is a tool capable of identifying viral integration sites (vis) from ligation-mediated pcr (lm-pcr), linear amplification-mediated pcr (lam-pcr) and nonrestrictive lam-pcr (nrlam-pcr) reactions and mapping short sequences to the genome (hawkins et al. 2011). further, vis can also be detected by three more distinct tools, virusseq, viralfusionseq and virusfinder (li et al. 2013). for more precise vis prediction, all four tools can be employed by virologists. mirnas: a microrna (mirna) is a small, regulatory, non-coding rna molecule that regulates the translation or stability of viral and host target mrnas, thereby affecting viral pathogenesis. this host-viral regulatory relationship can be investigated using a database called vita, which curates known viral mirna genes and known/putative target sites of host mirnas (hsu et al. 2007).
vita uses miranda and targetscan to scan viral genomes and determine mirna targets. vita is also capable of annotating the viruses, virus-infected tissues and tissue specificity of host mirnas. subtypes of viruses, for example, influenza viruses, and the conserved regions in various viruses can also be compared using the vita database. viral mirna candidate hairpins can be predicted using the database vir-mir. it serves as a platform to query the predicted viral mirna hairpins (based on taxonomic classification) and host target genes (based on the use of the rnahybrid program) in human, mouse, rat, zebrafish, rice and arabidopsis (li et al. 2008). sirna: a small interfering rna (sirna), like a mirna, operates within the rna interference (rnai) pathway. it interferes with the expression of specific genes and is therefore used in post-transcriptional gene silencing. virsirnadb is an online curated repository that stores experimentally validated research data on sirna and short hairpin rna (shrna) targeting diverse genes of 42 important human viruses, including influenza virus (tyagi et al. 2011, thakur et al. 2012). the current database includes experimental information on sirna sequence, virus subtype, target gene, genbank accession, design algorithm, cell type, test object, method, efficacy, etc. sivirus is a web-based antiviral sirna design software that allows analysis of influenza virus, hiv-1, hcv and sars coronavirus (naito et al. 2006). further, viral sirna sequence data sets can be analysed using the software tools visitor and virome (antoniewski 2011; watson et al. 2013). a perl script called paparazzi enables reconstitution of a viral genome from the viral sirnas in a given sample (vodovar et al. 2011). host-pathogen interactions play an important role in determining the pathogenicity of a pathogen or its mechanism of evading host immunity. to understand such interactions between viral and host cellular proteins, various databases and software tools are available.
one such database is phever, which enables exploration of virus-virus and virus-host lateral gene transfers by providing evolutionary and phylogenetic information (palmeira et al. 2011). this database catalogues homologous families between different viral sequences and between viral and host sequences. it compiles extensive data from completely sequenced genomes (2426 non-redundant viral genomes, 1007 non-redundant prokaryotic genomes and 43 eukaryotic genomes ranging from plants to vertebrates). thus, it groups proteins into homologous families, each containing at least one viral sequence, and provides the related alignments and phylogenies for each of these families. with the increasing availability of viral genome sequences, data mining, curation and genome annotation have become essential for better understanding the structure and function of genome components. this information can further be exploited to develop diagnostics, vaccines and therapeutics. a number of tools are available for the annotation and classification of viral sequences, such as the ncbi genotyping tool (rozanov et al. 2004), vigor (viral genome orf reader) (wang et al. 2010), viral genome organizer (vgo) (upton et al. 2000), genome annotation transfer utility (gatu) (tcherepanov et al. 2006), virus genotyping tools (alcantara et al. 2009), zcurve_v (guo and zhang 2006) and star (subtype analyser) (myers et al. 2005). vgo is a web-based genome browser that allows viewing and predicting genes and orfs in one or more viral genomes. it also allows searching within viral genomes and retrieving information about a genome, such as the locations of genes, orfs, start/stop codons, etc. within a genome, the sequences can be searched by regular expression, by fuzzy motif pattern, for genes with the highest at composition, etc. using vgo, comparative analyses can be made between different viral genomes.
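the kinds of within-genome searches listed above (regular-expression motifs and genes ranked by at composition) can be sketched in a few lines of python. this is only an illustration of the idea and says nothing about how vgo implements it; the function names `at_content` and `find_motif`, the example motif and the toy gene sequences are all assumptions.

```python
import re


def at_content(seq):
    """Fraction of A and T bases in a DNA string (0.0 for an empty string)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("A") + seq.count("T")) / len(seq)


def find_motif(seq, pattern):
    """Return (start, matched_text) for every match of a regex motif.

    A lookahead is used so that overlapping occurrences are also reported.
    """
    lookahead = re.compile("(?=(" + pattern + "))", re.IGNORECASE)
    return [(m.start(), m.group(1)) for m in lookahead.finditer(seq)]


if __name__ == "__main__":
    genome = "GGTATAAAGGCTATATAGG"           # toy sequence, not a real genome
    print(find_motif(genome, "TATA[AT]A"))   # motif written as a regular expression

    genes = {"geneA": "ATATGC", "geneB": "GGCCAT"}  # hypothetical gene sequences
    richest = max(genes, key=lambda name: at_content(genes[name]))
    print(richest, at_content(genes[richest]))
```

a fuzzy motif search that tolerates mismatches would need an approximate matching step on top of this, which plain regular expressions do not provide.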
vgo uses a graphical user interface (gui) for constructing alignments and displaying orthologues in a set of genomes. it also allows searching the translated genome for matches to mass spectrometry peptides. vigor is an online gene prediction tool that was developed by the j. craig venter institute in 2010. it started with gene prediction in small viral genomes such as coronavirus, influenza, rhinovirus and rotavirus. with the updated version in 2012 (https://www.ncbi.nlm.nih.gov/pmc/articles/pmc3394299/), vigor is now capable of gene prediction in 12 more viruses: measles virus, mumps virus, rubella virus, respiratory syncytial virus, alphavirus and venezuelan equine encephalitis virus, norovirus, metapneumovirus, yellow fever virus, japanese encephalitis virus, parainfluenza virus and sendai virus. with vigor, based on sequence similarity searches, users are able to predict protein coding regions, start and stop codons and other complex gene features such as rna editing, stop codon leakage and ribosomal shunting. further, various features such as frameshifts, overlapping genes, embedded genes, etc. can be predicted in the virus genome. additionally, a mature peptide can be predicted in a given polypeptide open reading frame. vigor is also capable of genotyping influenza virus and rotavirus. four output files are produced by vigor: a gene prediction file, a complementary dna file, an alignment file and a gene feature table file. the gene feature table can be used directly for genbank submission. genome annotation transfer utility (gatu) facilitates quick and efficient annotation of a target genome using similar reference genomes that have already been annotated. later, the users can manually curate the annotated genome. the newly annotated genomes can be saved in genbank, embl or xml file format. although it does not provide a complete annotation system, gatu serves as a very useful tool for the preliminary work in genome annotation.
gatu utilizes the tblastn and blastn algorithms to map genes onto the new target genome by using an annotated reference genome. as a result, the majority of the new genome's genes are annotated in a single step. with gatu, users can also identify open reading frames that are present in the target genome but absent from the reference genome. these orfs can be scrutinized further using other bioinformatics tools such as blast and vgo to determine whether they should be included in the annotation. multiple-exon genes and mature peptides can also be analysed using gatu. a primer design tool, primerhunter, allows the design of highly sensitive and specific primers for virus subtyping by pcr (duitama et al. 2009). primerhunter predicts specific forward and reverse primers with respect to a given set of dna sequences. phylotype is a web-based as well as downloadable software that uses parsimony to reconstruct ancestral traits and to select phylotypes (chevenet et al. 2013). rotac is an automated genotyping tool for group a rotaviruses (maes et al. 2009). it works by comparing a complete orf of interest to other complete orfs of cognate genes available in the genbank database by performing blast searches. viroligo is a database that acts as a repository of virus-specific oligonucleotides for virus detection (onodera and melcher 2002). the database comprises oligo data and common data tables. the oligo data table lists pcr primers and hybridization probes that are used for viral nucleic acid detection, while the common data table contains the pcr and hybridization experimental conditions used in their detection. each oligo data entry provides information on the name of the oligonucleotide, oligonucleotide sequence, target region, type of usage (pcr primer, pcr probe, hybridization or other), note and direction of the pcr oligonucleotide (forward or reverse).
each oligonucleotide entry also contains direct links to pubmed, genbank, ncbi taxonomy databases and blast. as of the september 2015 update, the viroligo database contains a complete listing of oligonucleotides specific to various animal viruses. the viruses are vaccinia virus; canine parvovirus; porcine parvovirus; rodent parvovirus; tobamovirus; potyvirus; borna virus; bovine herpesvirus types 1, 3, 4 and 5; bovine viral diarrhoea virus; bovine parainfluenza 3 virus; bovine respiratory syncytial virus; bovine adenovirus; bovine rhinovirus; bovine coronavirus; bovine reovirus; bovine enterovirus; foot-and-mouth disease (fmd) virus; and alcelaphine herpesvirus. virus-ploc is a web server for the prediction of subcellular localization of viral proteins within host and virus-infected cells (shen and chou 2007). another web server developed a little later, iloc-virus, is a multi-label learning classifier that predicts the subcellular locations of viral proteins with single and multiple sites (xiao et al. 2011). similarly, a more recent web server, ploc-mvirus (cheng et al. 2017), is a predictor that identifies the subcellular localization of viral proteins with both single and multiple location sites. it works by extracting information from the gene ontology (go) database and is claimed to be more successful than the previous state-of-the-art method, iloc-virus, in predicting subcellular localization of viral proteins. avppred is an antiviral peptide prediction algorithm built on peptides with experimentally proven antiviral activity (thakur et al. 2012). the prediction is based on peptide sequence features, peptide motifs, sequence alignment, amino acid composition and physicochemical properties. vips is a viral internal ribosomal entry site (ires) prediction system that can predict ires secondary structures (hong et al. 2013).
vips uses the rna fold program, which predicts local rna secondary structures, the rna align program, which compares predicted structures, and the pknotsrg program (reeder et al. 2007), which calculates pseudoknot structures. vazymolo, a database that deals with viral sequences at the protein level, defines and classifies viral protein modularity (ferron et al. 2005). it extracts information on complete genome sequences of various viruses from genbank and refseq and organizes the acquired information about the modularity of viral orfs (fig. 23.1f). there are web-based tools available to predict and analyse structural aspects of viruses. learncoil-vmf is a computational tool for predicting coiled-coil-like regions in viral membrane fusion proteins (singh et al. 1999). the membrane fusion proteins are known to be diverse and share no sequence similarity between most pairs of viruses in the same or different families. learncoil-vmf is also capable of characterizing the core structure of these membrane fusion proteins. viperdb (virus particle explorer database) is a web-based database that enables manual curation of icosahedral virus capsid structures (carrillo-tripp et al. 2009). this database serves as a comprehensive resource for the specific needs of structural virology and for comparison of data derived from structural and computational analyses of capsids. with the updated version, viperdb (2), capsid protein residues in the icosahedral asymmetric unit (iau) can be deduced using φ-ψ (phi-psi) diagrams (azimuthal polar orthographic projections) (ref: https://www.ncbi.nlm.nih.gov/pubmed/18981051). these diagrams can be depicted as dynamic interface and surface residues and interface and core residues and can be mapped to the database using a new application programming interface (api). this aids in identifying family-wide conserved residues at the interfaces.
additionally, jmol and strap are built into the system to visualize interactive models of viral molecular structures. vida is a database that organizes animal virus genome open reading frames from partial and complete genomic sequences (alba et al. 2001). presently, vida includes a complete collection of homologous protein families from genbank for herpesviridae, papillomaviridae, poxviridae, coronaviridae and arteriviridae. the homologous proteins in vida include both orthologous and paralogous sequences. vida retrieves virus sequences from genbank and the files are parsed into subfields. the parsed fields contain information such as genbank accession number, genbank identifier (gi number), protein sequence source, sequence length, gene name and gene product. to eliminate fully redundant (100% identical) entries, the retrieved virus protein sequences are filtered and a list of synonymous gis is created for reference. the orfs from complete and partial virus genomes are further organized into homologous protein families on the basis of sequence similarity. furthermore, structures of known viral proteins, or of proteins homologous to viral proteins, are also mapped onto the homologous protein families. vida also provides functional classification of virus proteins into broad functional classes based on typical virus processes such as dna and rna replication, virus structural proteins, nucleotide and nucleic acid metabolism, transcription, glycoproteins and others. this database also provides alignments of the conserved regions based on potential functional importance. apart from functional classification, vida also provides a taxonomical classification of the proteins and protein families. the protein families serve as a tool for functional and evolutionary studies, whereas alignments of conserved sequences provide crucial information on conserved amino acids or for construction of sequence profiles.
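the redundancy filter described above can be pictured with a minimal sketch, assuming that records are simple (gi, sequence) pairs and that "redundant" means exact sequence identity. the function name, the record layout and the toy identifiers are hypothetical and are not taken from vida itself.

```python
from collections import defaultdict


def collapse_identical(records):
    """Group (gi, sequence) pairs by exact sequence identity.

    Returns a dict mapping one representative GI per unique sequence
    to the list of synonymous GIs that carry the same sequence.
    """
    by_sequence = defaultdict(list)
    for gi, sequence in records:
        # Case-insensitive comparison; no alignment is performed.
        by_sequence[sequence.upper()].append(gi)
    # The first GI seen becomes the representative; the rest are synonyms.
    return {gis[0]: gis[1:] for gis in by_sequence.values()}


# Toy input: gi1 and gi2 share a sequence, gi3 is unique (hypothetical data).
records = [("gi1", "MKTAYIAK"), ("gi2", "MKTAYIAK"), ("gi3", "MSTNPKPQ")]
synonyms = collapse_identical(records)
print(synonyms)
```

exact matching is the simplest reading of "100% redundancy"; a production pipeline would probably also need to decide how to treat fragments that are identical to part of a longer sequence.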
the viral bioinformatics resource center (vbrc) is one of eight nih-sponsored bioinformatics resource centers (http://www.oxfordjournals.org/nar/database/summary/798). it is an online platform that provides informational and analytical tools and resources to the scientific community. the vbrc is oriented towards basic and applied research to better understand the viruses included on the nih/niaid list of priority pathogens. these viruses are selected based on their potential as bioterrorism threats or as agents of emerging or re-emerging infectious diseases. the vbrc focuses specifically on large dna viruses. it includes the viruses that belong to the arenaviridae, bunyaviridae, filoviridae, flaviviridae, paramyxoviridae, poxviridae and togaviridae families. it serves as a relational database and web application tool that allows data storage, annotation, analysis and information exchange. the current version (v 4.2) consists of 369 complete genomic sequences. using the vbrc, each viral gene and genome can be curated. as a result, a comprehensive and searchable summary is obtained that details the genotype and phenotype of the genes. the role of the genes in host-pathogen relationships is also emphasized in these curations. additionally, the vbrc houses multiple analytical tools such as tools for genome annotation, comparative analysis, whole genome alignments and phylogenetic analysis. further, this database also aims to include high-throughput data derived from other studies such as microarray gene expression data, proteomic analyses and population genetics data. the poxvirus bioinformatics resource center (pbrc, now merged into vbrc) is an online platform that serves as an informational and analytical resource to better understand the poxviridae family of viruses. it allows data storage, annotation, analysis and information exchange. influenza virus is one of the major global concerns.
it gained attention after the emergence of pandemic influenza a virus (h1n1, swine flu) in 2009. there are a total of 11 web portals and tools that focus only on influenza virus. these include the influenza virus database (ivdb), influenza research database (ird) and ncbi influenza virus resource (ncbi-ivr) (chang et al. 2007; bao et al. 2008; squires et al. 2008). researchers can use all three websites for sequence databases as well as various basic tools such as blast, multiple-sequence alignment and phylogenetic tree construction. ivdb provides access to additional tools such as (i) the sequence distribution tool, which provides the global geographical distribution of a given viral genotype and correlates its genomic data with epidemiological data, and (ii) the quality filter system, which categorizes a given viral nucleotide sequence into one of seven categories, c1 to c4 and p1 to p3, according to its sequence content (coding sequence [cds], 5' untranslated region [5'utr] and 3'utr) and integrity (complete [c] or partial [p]). ncbi-ivr is the most widely used and cited online resource. with ncbi-ivr, viral genomic sequences can be annotated using a genome annotation tool and the flu annotation (flan) tool. additionally, large phylogenetic trees may be constructed and visualized in aggregated form with sub-scale details (bao et al. 2007; bao et al. 2008; zaslavsky et al. 2008). ird provides tools for genomic and proteomic intervention, immune epitope prediction and surveillance data for viral nucleotide sequences (squires et al. 2012). furthermore, this resource is also equipped with tools that provide insight into host-pathogen interactions, type of virulence, host range and the correlation between sequence variation and these processes.
there are other repositories available: the global initiative on sharing avian influenza data (gisaid) consortium, which mediated the epiflu database, and the flugenome database, which exclusively provides genotyping of influenza a virus and aids in detecting reassortments taking place in divergent lineages (lu et al. 2007). furthermore, reassortment events in influenza viruses can be identified by the downloadable program giraf (graph-incompatibility-based reassortment finder) (nagarajan and kingsford 2011). another distinct repository, the influenza sequence and epitope database (ised), provides viral sequences and epitopes from asian countries; the information can be used to study evolutionary divergence and migration of strains (yang et al. 2009). the web server ativs (analytical tool for influenza virus surveillance) provides an antigenic map for conducting surveillance and selection of vaccine strains by scrutinizing the serological data and haemagglutinin sequence data of influenza a/h3n2 viruses and influenza subtypes (liao et al. 2009). another online repository, openfludb (an isolate-centred inventory), provides information on an isolate such as virus type, host, date of isolation, geographical distribution, predicted antiviral resistance, enhanced pathogenicity or human adaptation propensity (liechti et al. 2010). for influenza viruses, primers and probes can be designed using the influenza primer design resource (ipdr) (bose et al. 2008). further, prospective influenza seasonal epidemics or pandemics can be simulated using a stochastic model, flute (chao et al. 2010) (table 23.2). the ncbi virus variation resource (ncbi-vvr) is a web-based database of a set of viruses, viz. influenza virus, dengue virus, rotavirus, west nile virus, ebola virus, zika virus and mers coronavirus (resch et al. 2009).
ncbi-vvr enables users to submit their viral sequences along with relevant metadata such as sample collection time, isolation source, geographic location, host and disease severity. it further allows integrating and analysing the viral sequences using generic tools such as multiple sequence alignment and phylogenetic tree construction. rotavirus a (rva) is the most frequent cause of severe diarrhoea in human and animal infants worldwide and remains a major global threat for childhood morbidity and mortality (minakshi et al. 2005; basera et al. 2010). in recent years, extensive research efforts have been devoted to the development of live, orally administered vaccines. in india, the orally administered vaccine rotavac was introduced after successful clinical trials in 2014 and became available to clinicians in 2016. these vaccines will nevertheless have to be scrutinized and updated regularly to accommodate emerging rotavirus genotype variations, which makes molecular and genetic characterization of new circulating and emerging genotypes of rotavirus strains in humans and animals necessary. recently, a classification system for rvas has been described by the rotavirus classification working group (rcwg) in which each of the 11 genomic rna segments is assigned a particular letter followed by the particular genotype number. the classification system will be helpful in explaining the importance of genetic reassortments among rvas, host range, transfer of gene segments between two different genotypes and adaptation to different hosts. to differentiate between different gene segments of rvas, an online web-based tool, rotac, was developed by leading researchers from the rega institute, ku leuven, belgium, in 2009 (table 23.3). it is an easy-to-use and reliable classification tool for rvas and works in agreement with the rcwg.
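the letter-plus-number scheme can be illustrated with a small parser. this is only a sketch: the segment letters and gene order used below (g-p-i-r-c-m-a-n-t-e-h for vp7-vp4-vp6-vp1-vp2-vp3-nsp1-nsp2-nsp3-nsp4-nsp5) follow the published rcwg nomenclature rather than anything stated in this chapter, and the function name and error messages are hypothetical. it does not reproduce how rotac assigns genotypes from sequence data.

```python
import re

# Segment letter and encoded protein, in the conventional order of the
# RCWG genotype constellation (assumption: taken from the RCWG nomenclature).
SEGMENTS = [
    ("G", "VP7"), ("P", "VP4"), ("I", "VP6"), ("R", "VP1"),
    ("C", "VP2"), ("M", "VP3"), ("A", "NSP1"), ("N", "NSP2"),
    ("T", "NSP3"), ("E", "NSP4"), ("H", "NSP5"),
]

# One token is a letter followed by a number; P genotypes are usually
# written with the number in square brackets, e.g. P[8].
TOKEN = re.compile(r"^([A-Z])\[?(\d+)\]?$")


def parse_constellation(text):
    """Parse a string such as 'G1-P[8]-I1-...' into {protein: genotype}."""
    tokens = text.strip().split("-")
    if len(tokens) != len(SEGMENTS):
        raise ValueError("expected 11 hyphen-separated genotypes")
    result = {}
    for token, (letter, protein) in zip(tokens, SEGMENTS):
        match = TOKEN.match(token)
        if match is None or match.group(1) != letter:
            raise ValueError(f"unexpected token {token!r} for {protein}")
        result[protein] = int(match.group(2))
    return result


# A commonly cited human constellation, used here only as example input.
wa_like = parse_constellation("G1-P[8]-I1-R1-C1-M1-A1-N1-T1-E1-H1")
print(wa_like["VP7"], wa_like["VP4"])
```

comparing two parsed constellations segment by segment is one simple way to spot candidate reassortants, since a reassortant shares some genotypes with one parental constellation and the rest with another.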
rotac is a platform-independent tool that works in any web browser by simply going to its url (http://rotac.regatools.be/) and has been released without any restriction of use by academicians or anyone else. as claimed, the rotac web-based tool will be updated regularly to reflect the established as well as newly emerging genotypes announced by the rcwg from time to time. much research on animal viral diseases is being conducted at the genomic level. handling the enormous volume of data obtained from sequencing is often daunting to researchers. the chapter provides a categorized list of bioinformatics approaches that are useful in data mining. there are tables that list all such bioinformatics programs according to their applications. the tables also list databases that organize information on human and animal viruses such as genomic data, orfs and oligonucleotides. an illustration has also been provided in the chapter showing the application of the tool predictprotein, which is used for prediction of three-dimensional structures of viral proteins. the major goal of the chapter has been to provide a roadmap to bioinformatics approaches in the field of animal viral diseases. although the chapter elaborates on virus-specific bioinformatics programs, most of these programs are designed for human viruses. there are bioinformatics tools that are specific to animal viruses, but these are limited in number. hence, in many cases, researchers have to switch to either human virus-specific tools or other generic tools. application of such tools to animal viruses or animal diseases may, in many situations, not be as accurate as with specialized tools. users should take care when choosing the settings of such tools. furthermore, the results thus obtained also need to be scrutinized. therefore, development of new bioinformatics programs and tools that are specifically designed for animal viruses and diseases should be taken up vigorously.
specialized tools will provide much more accurate results and predictions, thereby accelerating bioinformatics research in the field of animal viral diseases.
vida: a virus database system for the organization of animal virus genome open reading frames
a standardized framework for accurate, high-throughput genotyping of recombinant and non-recombinant viral sequences
hmm: a hidden markov model for detecting recombination with microbial detection microarrays
basic local alignment search tool
phaccs, an online tool for estimating the structure and diversity of uncultured viral communities using metagenomic information
visitor, an informatic pipeline for analysis of viral sirna sequencing datasets
flan: a web server for influenza virus genome annotation
the influenza virus resource at the national center for biotechnology information
continuous and discontinuous protein antigenic determinants
detection of rotavirus infection in bovine calves by rna-page and rt-pcr
genemark: web software for gene finding in prokaryotes, eukaryotes and viruses
mhcbn: a comprehensive database of mhc binding and non-binding peptides
the influenza primer design resource: a new tool for translating influenza sequence data into effective diagnostics
base-by-base: single nucleotide-level analysis of whole viral genome alignments
mhcpep, a database of mhc-binding peptides: update 1997
viperdb2: an enhanced and web api enabled relational database for structural virology
influenza virus database (ivdb): an integrated information resource and analysis platform for influenza virus research
flute, a publicly available stochastic influenza epidemic simulation model
virusseq: software to identify viruses and their integration sites using next-generation sequencing of human cancer tissue
analysis of the complete genome sequence of the hz-1 virus suggests that it is related to members of the baculoviridae
ploc-mvirus: predict subcellular localization of multi-location virus proteins via incorporating the optimal go information into general pseaac
searching for virus phylotypes
evidence for a novel gene associated with human influenza a viruses
t-cell antigenic sites tend to be amphipathic structures
viroblast: a stand-alone blast web server for flexible queries of multiple databases and user's datasets
primerhunter: a primer design tool for pcr-based virus subtype identification
alvira: comparative genomics of viral strains
mathematics vs. evolution: mathematical evolutionary theory
vazymolo: a tool to define and classify modularity in viral proteins
zcurve_v: a new self-training system for recognizing protein-coding genes in viral and phage genomes
identifying viral integration sites using seqmap 2.0
genome information broker for viruses (gib-v): database for comparative analysis of virus genomes
viral ires prediction system - a web server for prediction of the ires secondary structure in silico
vita: prediction of host micrornas targets on viruses
covdb: a comprehensive database for comparative analysis of coronavirus genes and genomes
viralzone: a knowledge resource to understand virus diversity
vir-mir db: prediction of viral microrna candidate hairpins
viralfusionseq: accurately discover viral integration events and reconstruct fusion transcripts at single-base resolution
ativs: analytical tool for influenza virus surveillance
openfludb, a database for human and animal influenza virus
the viral meta genome annotation pipeline (vmgap): an automated tool for the functional annotation of viral metagenomic shotgun sequencing data
flugenome: a web tool for genotyping influenza a virus
rota c: a web-based tool for the complete genome classification of group a rotaviruses
g and p genotyping of bovine group a rotaviruses in faecal samples of diarrheic calves by dig-labeled probes
a statistical model for hiv-1 sequence classification using the subtype analyser (star)
giraf: robust, computational identification of influenza reassortments via graph mining
sivirus: web-based antiviral sirna design software for highly divergent viral sequences
viroligo: a database of virus-specific oligonucleotides
phever: a database for the global exploration of virus-host evolutionary relationships
viralorfeome: an integrated database to generate a versatile collection of viral orfs
virapops: a forward simulator dedicated to rapidly evolved viral populations
syfpeithi: database for mhc ligands and peptide motifs
epimhc: a curated database of mhc-binding peptides for customized computational vaccinology
virus particle explorer (viper), a website for virus capsid structures and their computational analyses
pknotsrg: rna pseudoknot folding including near-optimal structures and sliding windows
virus variation resources at the national center for biotechnology information: dengue virus
orf-finder: a vector for high-throughput gene identification
the predictprotein server
discovery of functional genomic motifs in viruses with virema - a virus recombination mapper - for analysis of next-generation sequencing data
metavir: a web server dedicated to virome analysis
a web-based genotyping resource for viral sequences
bcipep: a database of b-cell epitopes
prevalence, diagnosis, management and control of important diseases of ruminants with special reference to indian scenario
epitome: database of structure-inferred antigenic epitopes
an update on the functional molecular immunology (fimm) database
jphmm: improving the reliability of recombination prediction in hiv-1
virus-ploc: a fusion classifier for predicting the subcellular localization of viral proteins within host and virus-infected cells
sse: a nucleotide and amino acid sequence analysis platform
learncoil-vmf: computational evidence for coiled-coil-like motifs in many viral membrane-fusion proteins
advances in diagnosis, surveillance, and monitoring of zika virus: an update
ebola virus - epidemiology, diagnosis and control: threat to humans, lessons learnt and preparedness plans - an update on its 40 year's journey
biohealthbase: informatics support in the elucidation of influenza virus host pathogen interactions and virulence
influenza research database: an integrated bioinformatics resource for influenza research and surveillance
selox--a locus of recombination site search tool for the detection and directed evolution of site-specific recombination systems
genome annotation transfer utility (gatu): rapid annotation of viral genomes using a closely related reference genome
virsirnadb: a curated database of experimentally validated viral sirna/shrna
antijen: a quantitative immunology database integrating functional, thermodynamic, kinetic, biophysical, and cellular data
hivsirdb: a database of hiv inhibiting sirnas
viral genome organizer: a system for analyzing complete viral genomes
the immune epitope database (iedb) 3.0
in silico reconstruction of viral genomes from small rnas improves virus-derived small interfering rna profiling
vigor, an annotation program for small viral genomes
virusfinder: software for efficient and accurate detection of viruses and their integration sites in host genomes through next generation sequencing data
virome: an r package for the visualization and analysis of viral small rna sequence datasets
virome: a standard operating procedure for analysis of viral metagenome sequences
iloc-virus: a multi-label learning classifier for identifying the subcellular localization of virus proteins with both single and multiple sites
influenza sequence and epitope database
prism: a primer selection and matching tool for amplification and sequencing of viral genomes
visualization of large influenza virus sequence datasets using adaptively aggregated trees with sampling-based subscale representation
identification of novel viruses using virushunter--an automated data analysis pipeline
acknowledgements all the authors of the manuscript thank and acknowledge their respective universities and institutes. there is no conflict of interest.
key: cord-015850-ef6svn8f authors: saitou, naruya title: eukaryote genomes date: 2013-08-22 journal: introduction to evolutionary genomics doi: 10.1007/978-1-4471-5304-7_8 sha: doc_id: 15850 cord_uid: ef6svn8f general overviews of eukaryote genomes are first discussed, including organelle genomes, introns, and junk dnas. we then discuss the evolutionary features of eukaryote genomes, such as genome duplication, c-value paradox, and the relationship between genome size and mutation rates. genomes of multicellular organisms, plants, fungi, and animals are then briefly discussed. genome duplications sometimes occur in eukaryotes, especially in plants and in vertebrates, but genome duplication is so far not known for prokaryotic genomes. because the gene number of typical eukaryotic genomes is much larger than that of prokaryotes, there are many genes shared among most eukaryote genomes but absent from prokaryote genomes. some examples are listed in table 8.2. for example, myosin is located in animal muscle tissues, and its homologous protein exists in the cytoskeleton of all eukaryotes but is not found in prokaryotes. recently, kryukov et al. (2012; [1]) constructed a new database on oligonucleotide sequence frequencies and conducted a series of statistical analyses. frequencies of all possible oligonucleotides of length 1-10 were counted for each genome, and these observed values were compared with expected values computed from the observed frequencies of oligonucleotides of length 1-4. deviations from expected values were much larger for eukaryotes than for prokaryotes, except for fungal genomes. figure 8.1 shows the distribution of the deviation for various organismal groups. the biological reason for this difference is not known. there are two major types of organelles in eukaryotes: mitochondria and plastids. figure 8.2 shows schematic views of mitochondria and chloroplasts. these two organelles have their own independent genomes.
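the oligonucleotide counting used by kryukov et al. can be illustrated with a minimal sketch. it counts overlapping oligonucleotides of a given length and compares one observed count with an expected count. as a simplifying assumption, the expectation below is computed from single-nucleotide frequencies only; the study used frequencies of lengths 1-4, and its exact procedure is not described in this chapter. the function names and the toy sequence are hypothetical.

```python
from collections import Counter
from math import prod


def kmer_counts(seq, k):
    """Count all overlapping oligonucleotides of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def observed_over_expected(seq, kmer):
    """Ratio of the observed count of kmer to its expected count.

    Assumption: positions are independent and follow the sequence's own
    single-nucleotide frequencies (an order-0 model).
    """
    k = len(kmer)
    n_windows = len(seq) - k + 1
    observed = kmer_counts(seq, k)[kmer]
    base_freq = {base: n / len(seq) for base, n in Counter(seq).items()}
    expected = n_windows * prod(base_freq.get(base, 0.0) for base in kmer)
    return observed / expected if expected > 0 else float("nan")


# Toy sequence: a perfect ACGT repeat, so AC is strongly over-represented.
toy = "ACGT" * 5
counts = kmer_counts(toy, 2)
ratio = observed_over_expected(toy, "AC")
print(counts["AC"], counts["TA"], round(ratio, 2))
```

a ratio far from 1 marks an oligonucleotide whose frequency is not explained by base composition alone, which is the kind of deviation the comparison between eukaryotes and prokaryotes is about.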
the existence of independent genomes in these two organelles suggests that they were initially independent organisms which started intracellular symbiosis with primordial eukaryotic cells. because most eukaryotes have mitochondria, the ancestral eukaryotes, a lineage that emerged from archaea, most probably started intracellular symbiosis with the mitochondrial ancestor. the parasitic bacterium rickettsia prowazekii is so far phylogenetically closest to mitochondria [2], and a rickettsia-like bacterium is the best candidate for the mitochondrial ancestor. however, there is an alternative "hydrogen hypothesis" [3]. plastids include chloroplasts, leucoplasts, and chromoplasts and exist in land plants, green algae, red algae, glaucophyte algae, and some protists like euglenoids. mitochondrial genome sizes of some representative eukaryotes are listed in table 8.3. most animal mitochondrial genomes are less than 20 kb, and sizes of protist and fungal mitochondrial genomes are somewhat larger. the mitochondrial genome size of plants is much larger than those of other eukaryotic lineages, yet the size is mostly less than 500 kb. an ancestral eukaryotic cell, probably an archaean lineage, hosted a bacterial cell, and intracellular symbiosis started. initially, archaea and bacteria shared genes responsible for basic metabolism, and the situation was a sort of gene duplication for many genes, though the homologous genes were not identical but had already diverged a long time ago. in any case, division of labor followed, and only limited metabolic pathways were left in the bacterial system, which eventually became mitochondria. animal mitochondrial genomes contain a very small number of genes: 13 for peptide subunits, 20 for trna, and 2 for rrna [4]. [table 8.3, fragment] genome size (kb). animals: homo sapiens (human), 16.5; takifugu rubripes (torafugu fish), 16 [remaining entries missing]. [caption fragment] representative animal species mitochondrial dna genomes. although most vertebrate mitochondrial dna genomes have the same gene order as in human (fig. 8.3a), gene order may vary from phylum to phylum.
yet the gene content and the genome size are more or less constant among animals. it is not clear why animal mitochondrial genomes are so small. one possibility is that animal individuals are highly integrated compared to fungi and plants, and this might have influenced a drastic reduction of the mitochondrial genome size. another interesting feature of animal mitochondrial dna genomes is the heterogeneous rate of gene order change. for example, platyhelminthes exhibit great variability in mitochondrial gene order (sakai and sakaizumi, 2012; [5]). in contrast, plant mitochondrial genomes are much larger (see table 8.3). figure 8.4 shows the genome structure of the tobacco mitochondrial genome (from sugiyama et al. 2005; [6]). horizontal gene transfers are also known to occur in plant mitochondrial dnas even between remotely related species [7]. the melon (cucumis melo) mitochondrial genome size, ca. 2.9 mb, is exceptionally large, and recently its draft genome was determined [8]. interestingly, the melon mitochondrial genome looks like the vertebrate nuclear genome in its contents, in spite of its genome size being similar to that of bacteria. the protein coding gene region accounts for only 1.7 % of the genome, and about half of the genome is composed of repeats. the remaining part is mostly homologous to melon nuclear dna, and 1.4 % is homologous to melon chloroplast dna. most of the protein coding genes of melon mitochondrial dna are highly similar to those of its congeneric species, which are watermelon and squash whose mitochondrial genome sizes are 119 kb and 125 kb, respectively. this indicates that the huge expansion of its genome size occurred only recently. interestingly, cucumber (cucumis sativus), another congeneric species, also has a ~1.8-mb mitochondrial genome with many repeat sequences [9]. it will be interesting to study whether the increases in mitochondrial genome size of melon and cucumber are independent or not.
chloroplasts exist only in plants, algae, and some protists. they may change into leucoplasts and chromoplasts. because of this, the generic name "plastids" may also be used. the origin of the chloroplast seems to be a cyanobacterium that started intracellular symbiosis as in the case of mitochondria. a unique but common feature of chloroplast genomes is the existence of inverted repeats [10], and they mainly contain rrna genes. chloroplast dna contents may [11]. chloroplast genomes were determined for more than 340 species as of december 2013 [106]. their genome sizes range from 59 kb (rhizanthella gardneri) to 521 kb (floydiella terrestris). although the largest chloroplast genome is still much smaller than a typical bacterial genome, its average intergenic length is 4 kb, much longer than that for bacterial genomes. fractions of mitochondrial dna may sometimes be inserted into nuclear genomes, and they are called "numts." an extensive analysis of the human genome found over 600 numts [12]. their sequence patterns are random in terms of mitochondrial genome locations. this suggests that mitochondrial dnas themselves were inserted, not via cdna reverse-transcribed from mitochondrial mrna. a possible source is sperm mitochondrial dna that was fragmented after fertilization [12]. the reverse direction, from nucleus to mitochondria, was observed in melon, as discussed in subsection 8.2.1. an intron is a dna region of a gene that is eliminated during splicing after transcription of a long premature mrna molecule. introns were discovered by phillip a. sharp and richard j. roberts in 1977 as "intervening sequences" [13], but the name "intron" coined by walter gilbert in 1978 [14] is now widely used. it should be noted that some descriptions of introns by kenmochi [15] were used for writing this section. there are various types of introns, but they can be classified into two: those requiring spliceosomes (spliceosome type) and the self-splicing type.
figure 8.5 shows the splicing mechanisms of these two major types. most introns in nuclear genomes of eukaryotes are of the spliceosome type, and there are the common gu-ag type and the rare au-ac type, depending on the nucleotide sequences of the intron-exon boundaries [16]. the spliceosomes involved in these two types differ [17]. self-splicing introns are divided into three groups: groups i, ii, and iii. group i introns exist in organellar and nuclear rrna genes of eukaryotes and in prokaryotic trna genes. group ii introns are found in organellar and some eubacterial genomes. cavalier-smith [18] suggested that spliceosome-type introns originated from group ii introns because of their similarity in splicing mechanism and the structural similarity between group ii introns and spliceosomal rna. group iii introns exist in organellar genomes, and their splicing system is similar to that of group ii introns, though they are smaller and have a unique secondary structure. there is yet another type of intron which exists only in trnas of single-cell eukaryotes and archaea [19]. these introns do not have self-splicing functions, but an endonuclease and an rna ligase are involved in splicing. the location of this type of intron is often at a certain position of the trna anticodon loop. since the discovery of introns, their probable functions and evolutionary origin have long been argued (e.g., [20, 21]). because self-splicing introns can occur at any time, even in the very early stage of the origin of life, we consider only spliceosome-type introns. for brevity, we hereafter call this type of intron simply "intron." there are two major hypotheses: introns early and introns late. the former claims that exons existed as functional units from the common ancestor of prokaryotes and eukaryotes, and "exon shuffling" was proposed as a way of creating new protein functions [14]. introns which separate exons should then also be of quite ancient origin [14, 22].
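the distinction between the gu-ag and au-ac types rests on the first two and last two nucleotides of the intron, which can be checked with a short sketch. the function name and the example sequences are hypothetical, and the check looks at the terminal dinucleotides only; it is an illustration of the naming convention, not a splice-site predictor.

```python
def classify_intron(intron):
    """Classify an intron by its terminal dinucleotides.

    Accepts RNA or DNA letters in either case; DNA 'T' is read as RNA 'U'.
    Returns 'GU-AG', 'AU-AC', or 'other'.
    """
    rna = intron.upper().replace("T", "U")
    if len(rna) < 4:
        raise ValueError("an intron must have at least four nucleotides")
    ends = (rna[:2], rna[-2:])
    if ends == ("GU", "AG"):
        return "GU-AG"
    if ends == ("AU", "AC"):
        return "AU-AC"
    return "other"


# Made-up sequences, chosen only to exercise each branch.
examples = ["GUAAGUCCCUUUCAG", "ATATCCTTTTAC", "GCAAGUCCAG"]
labels = [classify_intron(seq) for seq in examples]
print(labels)
```

real annotation pipelines also weigh the surrounding consensus sequences and the branch point, so a boundary check like this is best read as a first-pass label.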
in contrast to the introns-early view, the introns-late hypothesis holds that introns emerged only in the eukaryotic lineage [23, 24]. the protein "module" hypothesis proposed by go [25] is related to the introns-early hypothesis. the pattern of intron appearance and loss has been estimated by various methods (e.g., [21, 26]). kenmochi and his colleagues analyzed introns of ribosomal proteins of mitochondrial genomes and eukaryotic nuclear genomes in detail [27-29]. these studies supported the introns-late hypothesis, because introns in mitochondrial and cytosolic ribosomal proteins seem to be of independent origin and introns seem to have emerged in many ribosomal protein genes after eukaryotes appeared. introns do not code for amino acid sequences by definition. in this sense, most introns may be classified as junk dnas (see the next section). there are, however, evolutionarily conserved regions in introns, suggesting the existence of some functional roles in introns. ohno (1972; [30]) proclaimed that most of the mammalian genome is nonfunctional and coined the term "junk dna." with the advent of eukaryotic genome sequence data, it is now clear that he was right. there is in fact a great deal of junk dna in eukaryotic genomes. junk dnas or nonfunctional dnas can be divided into repeat sequences and unique sequences. repeat sequences are either of the dispersed type or the tandem type. unique sequences include pseudogenes that keep homology with functional genes. prokaryote genomes sometimes contain insertion sequences; however, this kind of dispersed repeat constitutes the major portion of many eukaryotic genomes. these interspersed elements are divided into two major categories according to their lengths: short ones (sines) and long ones (lines). one well-known example of a sine is the alu element in primate genomes. it is about 300 bp long and originated from the 7sl rna gene. let us look at a real alu element sequence from the human genome.
if we retrieve the ddbj/embl/genbank international sequence database accession number ap001720 (a part of chromosome 21), there are 128 alu elements in the 340-kb sequence. the density is 0.38 alu elements per 1 kb. if we consider the whole human genome of ~3 billion bp, alu repeats are expected to exist in ~1.13 million copies. one example of an alu sequence from this entry, at coordinates 133600 to 133906, is shown below: ggcgggagcg atggctcacg cctgtaatgc cagcactttg ggaggccgag gtgggtggat cacaaggtca ggagatagag accatcctgg ctaacacggt gaaacactgt ctctactaaa aacacaaaaa actagccagg cgtggtggcg ggtgcctgta atcccagcta ctcgggaggc tgaggcagga gaatggtgtg aacccaggaa gtggagcttg cagtgagctc agattgcgcc actgcactcc agcctgggtg acagagtgag actccatctc aaaaaaaata aaataaataa aaaaaa if we do a blast homology search (see chap. 14) using the ddbj system (http://blast.ddbj.nig.ac.jp/blast/blastn) targeted to nonhuman primate sequences (pri division of the ddbj database), the best hit is obtained from chimpanzee chromosome 22, which is orthologous to human chromosome 21. interested readers are encouraged to try this homology search as practice. alu elements were first classified into j and s subfamilies [31]. the reason for selecting these two characters (j and s) is not clear, but the two authors (jurka and smith) probably used the initials of their surnames. in any case, this division was based on the distance from the alu consensus sequence; alu elements which are closer to the consensus were classified as s and those not as j. later, a subset of the s subfamily was found to be highly similar to each other, and these elements were named y after "young," for they appeared relatively recently. rough estimates of the divergence times of alu elements are as follows: the j subfamily appeared about 60 million years ago, and the s subfamily separated from j 44 million years ago, followed by further separation of y 32 million years ago [32].
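the copy-number extrapolation given above for alu elements can be checked with a few lines of arithmetic. all input numbers are taken from the text; the variable names are illustrative, and the calculation assumes that the density observed in this one 340-kb region holds across the whole genome, which is only a rough approximation.

```python
# Figures quoted in the text for accession AP001720.
alu_count = 128                # Alu elements found in the region
region_kb = 340                # length of the region in kb
genome_bp = 3_000_000_000      # approximate human genome size in bp

# Density of Alu elements per kb in the sampled region.
density_per_kb = alu_count / region_kb

# Extrapolate to the whole genome (genome size converted to kb).
expected_copies = density_per_kb * genome_bp / 1_000

# Length of the example element from its inclusive coordinates.
element_length = 133906 - 133600 + 1

print(round(density_per_kb, 2))             # density per kb
print(round(expected_copies / 1e6, 2))      # expected copies, in millions
print(element_length)                       # length in bp
```

the extrapolated figure depends entirely on how representative the sampled region is, so it is better read as an order-of-magnitude estimate than as a count.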
figure 8.6 shows the overall pattern of alu element evolution (based on [32]). tandemly repeated sequences are also abundant in eukaryotic genomes, and the representative ones are heterochromatin regions. heterochromatin is a highly condensed nonfunctional region of nuclear dna, in contrast to euchromatin, in which many genes are actively transcribed. heterochromatin usually resides at telomeres, the terminal parts of chromosomes, and at centromeres, the internal parts of chromosomes to which spindle fibers attach during cell division. centromeric regions of more than 1 mb in arabidopsis thaliana were found to be tandem repeats of a ca. 180-bp repeat unit [33, 34]. the nucleotide sequence below is the arabidopsis thaliana tandemly repeated sequence ar12 (international sequence database accession number x06467): aagcttcttc ttgcttctca atgctttgtt ggtttagccg aagtccatat gagtctttgt ctttgtatct tctaacaagg aaacactact taggctttta ggataagatt gcggtttaag ttcttatact taatcataca catgccatca agtcatattc gtactccaaa acaataacc the human genome also has a similar but nonhomologous sequence in centromeres, called "alphoid dna," with a 171-bp repeat unit [35]. the following is the sequence (international sequence database accession number m21746): catcctcaga aacttctttg tgatgtgtgc attcaagtca cagagttgaa cattcccttt cgtacagcag tttttaaaca ctctttctgt agtatctgga agtgaacatt aggacagctt tcaggtctat ggtgagaaag gaaatatctt caaataaaaa ctagacagaa g if we do a blast homology search (see chap. 13) targeted to the human genome sequences of the ncbi database, there is no hit with this alphoid sequence. this clearly shows that the human genome sequences currently available are far from complete, for they do not include most of these tandem repeat sequences. telomeres of the human genome are composed of hundreds of copies of the 6-bp repeat ttaggg. if we search the human genome with the ncbi blast, using six tandem copies of this 6-bp unit (36 bp in total) as the query, many hits are obtained. as we already discussed in chap.
4, authentic pseudogenes have no function, and they are genuine members of junk dna. when a gene duplication occurs, one of the two copies often becomes a pseudogene. because gene duplication is prevalent in eukaryote genomes, pseudogenes are also abundant. pseudogenes are, by definition, homologous to functional genes. however, over a long evolutionary time, many selectively neutral mutations accumulate in pseudogenes, and eventually they lose sequence homology with their functional counterparts. there are many unique sequences in eukaryote genomes, and the majority of them may be homology-lost pseudogenes of this kind. a long rna is initially transcribed from a genomic region having an exon-intron structure, and then the rnas corresponding to introns are spliced out. these leftover rnas may be called "junk" rnas, for they are soon degraded by rnases. only a limited set of genes is transcribed in each tissue of a multicellular organism, but leaky expression of some genes may happen in tissues in which these genes should not be expressed. these transcripts are again "junk" rnas, and they are swiftly decomposed. a series of studies (e.g., [36, 37]) claimed that many noncoding dna regions are transcribed. however, van bakel et al. [38] showed that most of these transcripts were artifacts of the chip-chip technologies used in these studies. if a nonsense or frameshift mutation occurs in a protein coding sequence, that gene can no longer make its protein. yet its mrna may continue to be produced until the promoter or its enhancer becomes nonfunctional. in this case, the mutated gene produces junk rnas. if rnas are found in cells only in small quantities and are not evolutionarily conserved, they are probably some kind of junk rna. as junk dnas and junk rnas exist, cells may also have "junk" proteins.
if mature mrnas are not produced in the expected way, various aberrant mrna molecules will be produced, and ribosomes will try to translate them into peptides based on this wrong mrna information. proteins produced in this way may be called "junk" proteins, for they often have little or no function. even if a protein is correctly translated and moved to its expected cellular location, it can still be considered a "junk" protein. one good example is the abcc11 transporter protein of dry-type cerumen (earwax), for one nonsynonymous substitution in this gene made the protein essentially nonfunctional [39]. there are various genomic features that are specific to eukaryotes other than the existence of introns and junk dnas, such as genome duplication, rna editing, the c-value paradox, and the relationship between genome size and mutation rates. we will briefly discuss them in this section. the most dramatic and influential change of genome structure is genome duplication. genome duplication is also called polyploidization, but this term is tightly linked to karyotypes or chromosome constellations. prokaryotes are so far not known to have experienced genome duplications, which appear to be restricted to eukaryotes. interestingly, genome duplications are quite frequent in plants, while they are relatively rare in the other two multicellular eukaryotic lineages. an ancient genome duplication was found from the genome analysis of baker's yeast [40], and rhizopus oryzae, a basal lineage fungus, was also found to have experienced a genome duplication [41]. among protists, paramecium tetraurelia is known to have experienced at least three genome duplications [42]. because we humans belong to vertebrates, and two rounds of genome duplication occurred in the common ancestor of vertebrates (see chap. 9), we may be inclined to think that genome duplications often happen in many animal species. this is not the case.
so far, only vertebrates and some insects are known to have experienced genome duplications. the reason for this scattered distribution of genome duplication occurrences is not known. if we plot the distribution of the number of synonymous substitutions between duplogs in one genome, it is possible to detect a relatively recent genome duplication. this is because all genes duplicate at the same time when a genome duplication occurs, producing a peak in the distribution, while only a small number of genes duplicate at a time in other modes of gene duplication (see chap. 3). figure 8.7 shows a schematic view of the two cases: with and without genome duplication. lynch and conery (2000; [44]) applied this method to various genome sequences and found that the arabidopsis thaliana genome showed a clear peak indicative of a relatively recent genome duplication, while the genome sequences of the nematode caenorhabditis elegans and the yeast saccharomyces cerevisiae showed curves of exponential reduction. it is interesting to note that genome duplication was not known for arabidopsis thaliana before its genome sequence was determined, while the genome of saccharomyces cerevisiae was later shown to have been duplicated a long time ago [40]. when a genome duplication occurred in some ancient time, the number of synonymous substitutions may have become saturated and cannot give an appropriate result. in this case, the number of amino acid substitutions may be used, even though the rate of amino acid substitution varies among proteins. in any case, accumulation of mutations will eventually cause two homologous genes to lose their similarity to each other. therefore, although the possibility of genome duplications in prokaryotes has so far been rejected [45], it is not possible to infer events in the remote past simply by searching for sequence similarity. we should be careful in reaching a final conclusion. modification of particular rna molecules after they are produced via transcription is called rna editing.
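as an aside on the synonymous-substitution plot described above for detecting genome duplications, the following python sketch illustrates the idea. it is not the procedure of lynch and conery [44]: the simulated values, the bin width, and the crude peak test (`has_interior_peak`, with its `min_count` and `min_rise` thresholds) are all assumptions made for illustration only.

```python
import random

def ks_histogram(ks_values, bin_width=0.1, max_ks=2.0):
    """count duplicate-gene pairs per bin of synonymous divergence (ks)"""
    n_bins = int(round(max_ks / bin_width))
    counts = [0] * n_bins
    for ks in ks_values:
        if 0 <= ks < max_ks:
            counts[min(int(ks / bin_width), n_bins - 1)] += 1
    return counts

def has_interior_peak(counts, min_count=20, min_rise=1.5):
    """crude test: does a well-populated bin rise clearly above the lowest earlier bin?"""
    lowest_so_far = counts[0]
    for c in counts[1:]:
        if c >= min_count and c > min_rise * lowest_so_far:
            return True
        lowest_so_far = min(lowest_so_far, c)
    return False

rng = random.Random(0)
# hypothetical background: small-scale duplications, roughly exponential decay with ks
background = [rng.expovariate(1 / 0.3) for _ in range(2000)]
# hypothetical burst: gene pairs created together by a single genome duplication
burst = [rng.gauss(0.8, 0.1) for _ in range(1000)]

print("peak without genome duplication:", has_interior_peak(ks_histogram(background)))
print("peak with genome duplication:   ", has_interior_peak(ks_histogram(background + burst)))
```

a real analysis would first have to identify duplogs and estimate synonymous substitutions for each pair, and, as noted above, saturation makes the plot uninformative for ancient duplications.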
all three major classes of rna molecules (mrna, trna, and rrna) may experience editing [46]. there are various patterns of rna editing; substitutions, in particular between c and u, and insertions and deletions, particularly of u, are mainly found in eukaryote genomes. guide rna molecules are used in one of the main rna editing mechanisms, where they specify the location of editing, but there are some other mechanisms [47]. it is not clear how the rna editing mechanism evolved. tillich et al. [47] studied chloroplast rna editing and concluded that many nucleotide sites of the chloroplast genome suddenly started to undergo rna editing, but later the number of edited sites steadily decreased via mutational changes. they claimed that rna editing was not involved in gene expression. this result does not give rna editing a positive significance. because there are many types of rna molecules inside a cell, there also exist many sorts of enzymes that modify rnas. it is possible that some of them suddenly started to edit rnas because of a particular mutation. rna editing that did not cause deleterious effects may have survived by chance in the initial phase. this view suggests the involvement of a neutral evolutionary process in the evolution of rna editing. organisms with complex metabolic pathways have many genes. multicellular organisms are such examples. generally speaking, their genome sizes are expected to be large. in contrast, viruses, whose genomes contain only a handful of genes, have small genome sizes. therefore, their possibilities for genome evolution are rather limited. even if amino acid sequences change rapidly because of high mutation rates, the protein function may not change. unless the gene number and genome size increase, viruses cannot evolve their genome structures. it is thus clear that an increase in genome size is crucial for producing the diversity of organisms.
however, genomes often contain dna regions that are not indispensable. organisms with large genome sizes have many such junk dna regions. because of their existence, the genome size and the gene number are not necessarily highly correlated. this phenomenon was historically called the c-value paradox (e.g., [48]): around 1950, the haploid dna amount was found to be constant within one species, yet to vary considerably among species (e.g., [49-51]). the "c-value" is the amount of haploid dna, and c probably stands for "constant" or "chromosomes." we now know that the majority of eukaryote genome dna is junk, and there is no longer a paradox in c-values among species. ... 56]) found conserved noncoding dna sequences in insects, nematodes, and yeasts by comparing closely related species. we will discuss conserved noncoding sequences (cnss) of vertebrates further in chap. 9. as for plants, kaplinsky ... [62]) compared the genome sequences of arabidopsis, grape, rice, and brachypodium and found cnss to be >100 times more abundant in monocots than in dicots. hettiarachchi and saitou ([63]) compared the genome sequences of 15 plant species and searched for lineage-specific cnss. they found 2 and 22 cnss shared by all vascular plants and by angiosperms, respectively, and also confirmed that monocot cnss are much more abundant than those of dicots. what kind of relationship exists between genome size and mutation rates? if all the genetic information contained in the genome of an organism is necessary for its survival, an individual will die even if only one gene of its genome loses its function by mutation. an organism with a small genome and hence a small number of genes, such as a virus, can survive even if the mutation rate is high. in contrast, organisms with many genes may not be able to survive if highly deleterious mutations happen often. therefore, such organisms must reduce their mutation rate.
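the argument above can be made quantitative with a simple model. the sketch below assumes, as an illustration not taken from the text, that deleterious mutations hit essential sites independently, so that their number per genome per generation follows a poisson distribution with mean u = mu × l (mutation rate per site times number of essential sites). the probability that an offspring carries no new deleterious mutation is then exp(-u). the rates and genome sizes are hypothetical round numbers, not measured values.

```python
import math

def prob_no_deleterious_mutation(mu_per_site, essential_sites):
    """p(no new deleterious mutation) under a poisson model with mean u = mu * l"""
    return math.exp(-mu_per_site * essential_sites)

# hypothetical examples: (mutation rate per site per generation, essential sites)
cases = {
    "small genome, high rate": (1e-5, 1e4),
    "large genome, high rate": (1e-5, 1e7),
    "large genome, low rate": (1e-9, 1e7),
}
for label, (mu, sites) in cases.items():
    p = prob_no_deleterious_mutation(mu, sites)
    print(f"{label}: u = {mu * sites:g}, p(no hit) = {p:.3g}")
```

under these assumptions, a large genome tolerates the high per-site rate of a small genome only with a vanishing probability of producing a mutation-free offspring, which is the intuition behind the expected negative relationship.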
however, when the per-generation rate of nucleotide substitution type mutations was compared with the whole-genome size, lynch (2006; [65]) found a positive correlation. more recently, lynch (2010; [66]) admitted that for organisms with small genomes, these two values were in fact negatively correlated, whereas a positive correlation was observed when eukaryotes with large genomes were compared. we have to be careful when we discuss these two contradictory reports. one measured the rate per physical year, while the other used one generation as the unit. another difference is whether only the total size of protein coding regions or the whole-genome size is used. the relationship between the mutation rate and genome size is not simple. drake et al. (1998; [67]) examined this problem and found that the mutation rate per genome per replication was approximately 1/300 for bacteria, while the mutation rates of multicellular eukaryotes vary between 0.1 and 100 per genome per sexual or individual generation. table 8.4 lists the mutation rates and genome sizes of various organisms. apparently there is no clear tendency. in this section, we will discuss genomes of three multicellular lineages of eukaryotes: plants, fungi, and animals. unfortunately, there seems to be no common feature among genomes of multicellular organisms, so each lineage is discussed independently. arabidopsis thaliana was the first plant species whose genome sequence, 125 mb in size, was determined, in 2000 [68]. a. thaliana is a model organism for flowering plants (angiosperms), with a generation time of only 2 months. in spite of its small genome size, only 4 % of that of the human genome, it has 32,500 protein coding genes. the genome sequence of a closely related species, a. lyrata, was also recently determined [69]. angiosperms are divided into monocots and dicots. a. thaliana is a dicot, and the genome sequences of six more species had been determined as of december 2013 (see table 8.5).
rice, oryza sativa, is a monocot, and its genome size, 370-410 mb, is much smaller than that of the wheat genome. the genomes of its japonica and indica subspecies were determined [70, 71], and the origin of rice domestication, in particular whether there were single or multiple domestication events, is currently under great controversy (e.g., [72, 73]). the number of protein coding genes in the rice genome is 37,000-40,000 [74]. wheat corresponds to the genus triticum, and there are many species in this genus. the typical bread wheat is triticum aestivum, a hexaploid with 42 (7 × 6) chromosomes. its genome arrangement is conventionally written as aabbdd [75]. because it now behaves as a diploid, genomic sequencing of the 21 chromosomes (a1-a7, b1-b7, and d1-d7) is under way (see http://www.wheatgenome.org/ for the current status). the hexaploid genome structure emerged by hybridization of the tetraploid (aabb) cultivated species t. durum and the diploid (dd) wild species aegilops tauschii [75]. a genome duplication followed the hybridization. non-seed land plants are ferns, lycophytes, and bryophytes, in order of closeness to seed plants (e.g., [76]). a draft genome sequence of a moss, physcomitrella patens, was reported in 2008 [77], followed by the genome sequence of a lycophyte, selaginella moellendorffii, in 2011 [78]. these genome sequences of different plant lineages are helping to decipher the stepwise evolution of land plants. the genome sequence of baker's yeast (saccharomyces cerevisiae) was determined in 1996, the first among eukaryotes [79]. there are 16 chromosomes in s. cerevisiae, and its genome size is about 12 mb. there are a total of 8,000 genes in its genome: 6,600 orfs and 1,400 other genes. the genome-wide gc content is 38 %, slightly lower than that of the human genome.
the proportion of introns is very small compared to that of the human genome, and the average length of one intron is only 20 bp, in contrast to the 1,440-bp average length of exons [80]. as we already discussed, the ancestral genome of baker's yeast experienced a genome-wide duplication [40]. pseudogenes, which are common in vertebrate genomes, are rather rare in the genome of baker's yeast; they amount to only 3 % of the protein coding genes [80]. baker's yeast is often considered a model organism for all eukaryotes; however, its genome may not be a typical eukaryote genome. as of december 2013, genome sequences of more than 400 fungal species are available (see the ncbi genome list at http://www.ncbi.nlm.nih.gov/genome/browse/ for the present situation). figure 8.9 shows the relationship between genome size and gene number for 88 genomes. there is a clear positive correlation between them. however, there are some outliers. the perigord black truffle (tuber melanosporum), shown as a in fig. 8.9, has the largest genome size (~125 mb) among the 88 fungal species whose genome sequences have so far been determined, yet its number of genes is only ~7,500 [81]. three other outlier species are postia placenta, ajellomyces dermatitidis, and melampsora larici-populina, shown as b, c, and d in fig. 8.9, respectively. interestingly, these four outlier species do not cluster phylogenetically; two belong to pezizomycotina of ascomycota, and the other two belong to agaricomycotina and pucciniomycotina of basidiomycota. if we exclude these four outlier species, a good linear regression is obtained, as shown in fig. 8.9. this straight line indicates that, on average, one gene corresponds to 2.9 kb in a typical fungal genome. if we apply this average size per gene to the truffle genome, its genome size should be ~22 mb (7,500 × 2.9 kb), but the real size is 103 mb larger. this suggests that there is an unusually large amount of junk dna in this genome.
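the truffle estimate above is a short calculation and can be reproduced as follows. the inputs are the figures quoted in the text (2.9 kb per gene, ~7,500 genes, ~125 mb); the variable names are illustrative only.

```python
avg_kb_per_gene = 2.9      # slope of the regression for typical fungal genomes
truffle_genes = 7_500      # approximate gene number of tuber melanosporum
truffle_genome_mb = 125    # approximate observed genome size in mb

# genome size expected if the truffle followed the typical fungal relationship
expected_mb = truffle_genes * avg_kb_per_gene / 1000
# observed size in excess of that expectation
excess_mb = truffle_genome_mb - expected_mb
# share of the observed genome made up by the excess
excess_fraction = excess_mb / truffle_genome_mb

print(f"expected genome size: ~{expected_mb:.0f} mb")
print(f"excess over expectation: ~{excess_mb:.0f} mb")
print(f"excess as a share of the genome: ~{excess_fraction:.0%}")
```

the last figure, the share of the genome not accounted for by the typical size per gene, is the quantity to compare with the measured proportion of repetitive or otherwise nonfunctional dna.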
in fact, 58 % of its genome consists of transposable elements [81]. because the 103-mb excess amounts to about 82 % of the ~125-mb genome, the truffle genome must still contain about 24 % more junk dna besides transposable elements. gain and loss of genes in each branch of the phylogenetic tree of fungal species are shown in fig. 8.10 (based on [81]). it will be interesting to examine the genome sizes of species related to the perigord black truffle, so as to infer the evolutionary period when the genome size expansion occurred. [figure caption: the relationship between the genome size and gene numbers among 88 fungi genomes] ... system that is responsible for this is hox genes. we thus first discuss this gene system in this subsection. the genome of c. elegans, the first genome determined among animals, will be discussed next, followed by genomes of insects and those of deuterostomes. because genomes of many vertebrate species have been determined, we discuss them in chap. 9, and the human genome in particular in chap. 10. hox genes were initially found by edward b. lewis through studies of homeotic mutations that dramatically change the segmental structure of drosophila [82]. they code for transcription factors, and their dna-binding peptide, now called the homeobox domain, was later found in almost all animal phyla [83]. figure 8.11 shows the hox gene clusters found in 12 animal groups. there are four hox clusters in mammalian and avian genomes, and they were most probably generated by the two-round genome duplication in the common ancestor of vertebrates (see chap. 9). interestingly, the physical order of hox genes on chromosomes corresponds to the order of their expression during development, a property called "collinearity" [84]. this suggests that some sort of cis-regulation is operating in hox gene clusters, and in fact, many long transcripts are found, and some of their transcription start sites are highly conserved among vertebrates [85].
figure 8.12 shows highly conserved ... the hox genes control the expression of different groups of downstream genes, such as transcription factors, elements in signaling pathways, or genes with basic cellular functions. hox gene products interact with other proteins, in particular in signaling pathways, and contribute to the modification of homologous structures and the creation of new morphological structures [87]. there are other gene families that are thought to be involved in the diversity of animal body plans. one of them is the zic gene family [88]. the zic gene family exists in many animal phyla with high amino acid sequence homology in a zinc-finger domain called zf, and members of this gene family are involved in neural and neural crest development, skeletal patterning, and left-right axis establishment. this gene family has two additional domains, zoc and zf-bc. interestingly, cnidaria, platyhelminthes, and urochordata lack the zoc domain, and their zf-bc domain sequences are quite diverged compared to those of arthropoda, mollusca, annelida, echinodermata, and chordata. this distribution suggests that zic family genes with the entire set of three conserved domains already existed in the common ancestor of bilaterian animals, and that some of them may have been lost in parallel in platyhelminthes, nematodes, and urochordates [88]. interestingly, the phyla that lost the zoc domain have quite distinct body plans although they are bilaterian. caenorhabditis elegans was the first animal species whose genome sequence was determined; the 97-mb draft was reported in 1998 [89]. this organism belongs to the phylum nematoda, which includes a vast number of species [90]. brenner (1974; [91]) chose this species as a model organism to study the neuronal system because of its short generation time (~4 days) and small size (~1 mm). the following description in this section is based on the information given in the online "wormbook" [86]. there are 22,227 protein coding genes in c.
elegans, including 2,575 alternatively spliced forms, with 79 % confirmed to be transcribed at least partially. the number of trna genes is 608, of which 274 are located on the x chromosome. the three kinds of rrna genes (18s, 5.8s, and 26s) are located on chromosome i in 100-150 tandem repeats, while ~100 5s rrna genes are also in tandem form but located on chromosome v. the average protein coding gene length is 3 kb, with an average of 6.4 coding exons per gene. in total, protein coding exons constitute 25.6 % of the whole genome. figure 8.13 shows the distribution of the protein coding genes, and fig. 8.14 the distribution of exon numbers per gene. both distributions have long tails. the median sizes of exons and introns are 123 bp and 65 bp, respectively. intron lengths of c. elegans are quite short compared to those of vertebrate genes (see chap. 9). the density of protein coding genes varies among chromosomes and chromosomal regions: it is slightly higher on the five autosomes than on the x chromosome, and higher in the central region than near the ends of a chromosome. processed, i.e., intronless, pseudogenes are rare, and a total of 561 pseudogenes were reported in wormbase version ws133. about half of them are homologous to functional chemoreceptor genes. genome sequences of four congeneric species of c. elegans (c. brenneri, c. briggsae, c. japonica, and c. remanei) have been determined (http://www.ncbi.nlm.nih.gov/genome/browse/). the fruit fly drosophila melanogaster was used by thomas hunt morgan's group in the early twentieth century and has been used for many genetic studies. because of this importance, its genome sequence was the first to be determined among arthropods, in 2000 [92]. heterochromatin regions of ~50 mb were excluded from sequencing, ... [93]. their genome sizes vary from 145 to 258 mb, and the number of genes is 15,000-18,000. interestingly, d. melanogaster has the largest genome size and the smallest number of genes.
genomes of a total of 12 insect species other than the 12 drosophila species had been sequenced by the end of 2011 [1]. as of december 2013, their genome sizes range from 108 mb to 540 mb, a fivefold difference, and their gene numbers range from 9,000 to 23,000. deuterostomes contain five phyla: echinodermata, hemichordata, chaetognatha, xenoturbellida, and chordata. the genome of the sea urchin strongylocentrotus purpuratus was determined in 2006 [94]. its genome size is 814 mb, with 23,300 genes. genomes of other echinoderms, the sea urchin lytechinus variegatus and the sea star patiria miniata, are also being sequenced, as is that of the hemichordate saccoglossus kowalevskii. chordata is classified into urochordata (ascidians), cephalochordata (lancelets or amphioxus), and vertebrata (vertebrates). because we will discuss genomes of vertebrates in chap. 9, let us discuss only genomes of ascidians and lancelets here. the genome of the ascidian ciona intestinalis was determined in 2002 [95], and the genome sequence of its congeneric species, c. savignyi, was determined three years later [96]. the genome size of c. intestinalis is ~155 mb, with ~16,000 genes. interestingly, it contains a group of genes for cellulose-synthesizing enzymes, which were probably introduced from some bacterial genome via horizontal gene transfer [8, 97]. the c. intestinalis genome also contains several genes that are considered to be important for heart development [95], and this suggests that the hearts of ascidians and vertebrates may be homologous. through the superimposition of phylogenetic trees (see chapter a2) for five genes coding for muscle proteins, oota and saitou ([98]) estimated that vertebrate heart muscle is phylogenetically closer to vertebrate skeletal muscle. if both results are true, the muscle used in the heart might have been replaced in the vertebrate lineage. the genome sequence of an amphioxus (the cephalochordate branchiostoma floridae) was determined by holland et al.
(2008; [104]), and it provides good outgroup sequence data for vertebrates. eukaryotic viruses rely on their eukaryotic host species for most metabolic pathways. therefore, the number of genes in virus genomes is usually very small. for example, influenza a virus has 8 rna segments coding for 11 proteins, and its total genome size is ~13.6 kb. as in bacteriophages, there are both dna type and rna type genomes in eukaryotic viruses. table 8.6 shows one example of a classification of eukaryotic viruses based on their genome structure [99]. genomes of double-strand dna viruses are of four types: circular, simple linear, linear with proteins covalently attached to both ends, and linear with both ends closed. genomes of single-strand dna viruses are either circular or linear. rna genomes are all linear, in both the single- and double-strand types. single-strand rna genomes are classified into two types: plus strand and minus strand. a subset of the single plus-strand rna genome type is experiencing ... [100]. megavirus is phylogenetically close to mimivirus [101], a member of the nucleocytoplasmic large dna viruses, which include pox virus. recently, a virus with an even larger genome, pandoravirus, with a genome of more than 2.5 mb, was discovered [105]. the phylogenetic status of these large-genome dna viruses is unknown at this moment.
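the classification of virus genomes by structure described above can be written down as a small data structure. the python sketch below is only one possible encoding: the class `ViralGenome`, its field names, and the topology labels are hypothetical, and the rule that strand sense is recorded only for single-strand rna genomes is a simplifying assumption. the influenza a entry uses the segment number and genome size quoted in the text; its minus-strand sense is general knowledge rather than something stated there.

```python
from dataclasses import dataclass
from typing import Optional

# allowed topologies per (nucleic acid, number of strands), following the text
ALLOWED_TOPOLOGIES = {
    ("dna", 2): {"circular", "linear", "linear_protein_attached", "linear_closed_ends"},
    ("dna", 1): {"circular", "linear"},
    ("rna", 2): {"linear"},
    ("rna", 1): {"linear"},
}

@dataclass(frozen=True)
class ViralGenome:
    name: str
    nucleic_acid: str              # "dna" or "rna"
    strands: int                   # 1 (single-strand) or 2 (double-strand)
    topology: str                  # one of the labels in ALLOWED_TOPOLOGIES
    sense: Optional[str] = None    # "plus" or "minus", single-strand rna only
    segments: int = 1
    size_kb: Optional[float] = None

    def __post_init__(self):
        key = (self.nucleic_acid, self.strands)
        if key not in ALLOWED_TOPOLOGIES:
            raise ValueError(f"unknown genome type: {key}")
        if self.topology not in ALLOWED_TOPOLOGIES[key]:
            raise ValueError(f"topology {self.topology!r} not allowed for {key}")
        is_ssrna = key == ("rna", 1)
        if is_ssrna and self.sense not in ("plus", "minus"):
            raise ValueError("single-strand rna genomes need sense 'plus' or 'minus'")
        if not is_ssrna and self.sense is not None:
            raise ValueError("sense is recorded only for single-strand rna genomes")

influenza_a = ViralGenome(
    name="influenza a virus", nucleic_acid="rna", strands=1,
    topology="linear", sense="minus", segments=8, size_kb=13.6,
)
print(influenza_a)
```

encoding the constraints in `__post_init__` means that a combination the text rules out, such as a circular rna genome, is rejected when the record is created.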
analysis of the genome sequence of the fl owering plant arabidopsis thaliana the genome of the cucumber, cucumis sativu s l draft genome sequence of the oilseed species ricinus communis the genome of black cottonwood, populus trichocarpa the grapevine genome sequence suggests ancestral hexaploidization in major angiosperm phyla genome sequence of foxtail millet ( setaria italica ) provides insights into grass evolution and biofuel potential a new database (gcd) on genome composition for eukaryote and prokaryote genome sequences and their initial analyses the genome sequence of rickettsia prowazekii and the origin of mitochondria the hydrogen hypothesis for the fi rst eukaryote mitochondrial genome the complete mitochondrial genome of dugesia japonica (platyhelminthes; order tricladida) the complete nucleotide sequence of the tobacco mitochondrial genome: comparative analysis of mitochondrial genomes in higher plants and multipartite organization widespread horizontal transfer of mitochondrial genes in fl owering plants determination of the melon chloroplast and mitochondrial genome sequences reveals that the largest reported mitochondrial genome in plants contains a significant amount of dna having a nuclear origin small, repetitive dnas contribute signifi cantly to the expanded mitochondrial genome of cucumber the complete nucleotide sequence of the tobacco chloroplast genome: its gene organization and expression changes in the structure of dna molecules and the amount of dna per plastid during chloroplast development in maize pattern of organization of human mitochondrial pseudogenes in the nuclear genome why genes in pieces? introns. in encyclopedia of evolution . 
key: cord-103029-nc5yf6x4 authors: wichmann, stefan; scherer, siegfried; ardern, zachary title: computational design of genes encoding completely overlapping protein domains: influence of genetic code and taxonomic rank date: 2020-09-25 journal: biorxiv doi: 10.1101/2020.09.25.312959 sha: doc_id: 103029 cord_uid: nc5yf6x4 overlapping genes (olgs) with long protein-coding overlapping sequences are often excluded by genome annotation programs, with the exception of virus genomes. a recent study used a novel algorithm to construct olgs from arbitrary protein domain pairs and concluded that virus genes are best suited for creating olgs, a result which fitted with common assumptions. however, improving sequence evaluation using hidden markov models shows that the previous result is an artifact originating from dataset-database biases. 
when parameters for olg design and evaluation are optimized we find that 94.5% of the constructed olg pairs score at least as highly as naturally occurring sequences, while 9.6% of the artificial olgs cannot be distinguished from typical sequences in their protein family. constructed olg sequences are also indistinguishable from natural sequences in terms of amino acid identity and secondary structure, while the minimum nucleotide change required for overprinting an overlapping sequence can be as low as 1.8% of the sequence. separate analysis of datasets containing only sequences from either archaea, bacteria, eukaryotes or viruses showed that, surprisingly, virus genes are much less suitable for designing olgs than bacterial or eukaryotic genes. an important factor influencing olg design is the structure of the standard genetic code. success rates in different reading frames strongly correlate with their code-determined respective amino acid constraints. there is a tendency indicating that the structure of the standard genetic code could be optimized in its ability to create olgs while conserving mutational robustness. the findings reported here add to the growing evidence that olgs should no longer be excluded in prokaryotic genome annotations. determining the factors facilitating the computational design of artificial overlapping genes may improve our understanding of the origin of these remarkable genetic constructs and may also open up exciting possibilities for synthetic biology. the triplet nature of the standard genetic code and double-stranded configuration of dna together enable more than one protein to be encoded within the same nucleotide sequence in different reading frames. this property of the code has long been known to be utilised in viruses [1, 2] and there is increasing evidence for overlapping encoding in other organisms [3, 4, 5] , including many genes fully embedded within other coding sequences in alternate reading frames [6] . 
while a mutation in a stop codon can easily create a short, trivial overlap in neighbouring genes as a chance event, longer, non-trivial overlaps should only be maintained in a genome if the overlapping region encodes a part of the protein essential for its function for both genes. there are a few hypothetical reasons why genes might overlap, and the evidence for functional antisense overlaps in prokaryotes has been discussed in a recent review [7] . while the reduction of genome size is particularly relevant only for some viruses [8, 9] , it has also been studied in bacteria [10] . effects on gene regulation [11] conceivably could affect all organisms, for instance there is the possibility of co-expression of same-strand overlapping genes (olgs) with the mother gene, given that they are potentially expressed from the same mrna. genes within an antisense overlapping pair could also influence each other, for instance in a way similar to what has recently been termed a "noncontiguous operon", where genes in antisense to each other are nonetheless co-expressed as an operon [12] . other proposed benefits of overlapping genes relate to templating structure based on the existing 'mother gene', namely, for genes directly in antisense ("-1 frame"), the creation of proteins with a complementary polarity structure to the gene on the antisense strand [13, 14, 15] or, in the case of sense overlaps, a similar hydrophobicity profile [16] . overlapping open reading frames may play an important role in the origin of de novo genes, exploring new territory in the total space of sequences and functions [17, 18, 19, 20] . while most currently extant olgs are not taxonomically conserved and therefore appear to be evolutionarily young [51] , one claimed example of an ancient olg pair is comprised of the two classes of aminoacyl-trna synthetases which can be encoded in an overlapping manner [21, 22, 23] . 
despite the many possible effects of overlapping genes (olgs), they are generally not considered a significant phenomenon outside of viruses, due perhaps to perceived difficulties in their evolution for some or all reading frames [24, 26] . the idea that they have been more widespread has long been theorized [25, 26, 56] . as a consequence, most gene prediction algorithms still exclude non-trivially overlapping genes [27] , especially outside of bacteriophages and other viruses. the ncbi rules for annotation of prokaryotic genes do not allow genes completely embedded in another gene in a different frame without individual justification [28] . even in viruses, relatively few overlapping genes have been annotated, particularly antisense gene pairs, although more are regularly being discovered including, for instance, in the pandemic viruses hiv and sars-cov-2 [2, 29, 50] . a recent study [30] quantified the difficulty of constructing olgs by picking random pairs of protein domains and rewriting them so as to overlap, with an algorithm minimizing the amino acid changes in each domain. this is a new approach, as previous studies tried to create overlaps without changing the amino acid sequence of the two genes, which resulted in either a very limited overlap length [31] or could only be done for very specific genes [32] . they found that, remarkably, 16% of 125250 arbitrary protein domain pairs were able to successfully overlap in at least one of the 3 reading frames they investigated, and one of two positions tested. virus domains were much more likely to create putatively functional overlaps than domains from prokaryotes or eukaryotes, as determined by blast searches of the swiss-prot database. this result suggests that creating overlaps is not as difficult as might be expected, implying that an abnormally high threshold of evidence as compared to other gene types should not be required for verifying their existence. 
this high success rate also opens up many possibilities for synthetic biology. for instance, mutations in overlapping regions are expected to be more deleterious on average, so an artificial genome with many olgs is not only smaller but also expected to be more stable over time on a population level, as mutations are more likely to be strongly selected against. a recent method for stabilizing synthetic genes [33] , where an arbitrary orf was constructed to overlap a gene of interest and was concatenated with an essential gene downstream, could be taken a large step forward by overlapping whole genes thereby creating a system where not only 'polar' mutations are selected against but also more minor mutations, if they also affect the mother gene. genome size has become a significant limiting factor for biomolecular computing, in which genetic programs are inserted into cells [34] . existing compression methods [35] could be greatly improved by using olgs, making more complex systems possible. in this context a well designed stable synthetic genome could include fail-safe measures, such that faulty genetic programs would shut down. here the algorithm provided in [30] is used but improved in the evaluation of the constructed sequences as the analysis in the previous study has some weaknesses resulting in incorrect claims. determining whether an artificial sequence has a specific function from its amino acid sequence only is a very hard problem and not possible today. progress is being made in predicting the protein structure from amino acid sequence [53] , but protein structure does not determine function as essential binding sites can be rendered useless if the amino acid is changed while not changing the overall protein structure. ultimately only experiment can definitively determine the function of a given amino acid sequence. 
in order to aid the design of expensive experimental setups however, it can at least be determined bioinformatically how similar an artificial sequence is to sequences with known functions. in this study the artificially designed sequences are compared to their original sequences in terms of amino acid identity, amino acid similarity, hidden markov model profile and secondary structure in order to determine the impact of olg construction and which sequences are potentially functional. firstly, it is explained how some technical artifacts arose and how they can be avoided. to further improve the analysis, hidden markov models rather than blast are used in this study. while the previous study [30] tried to estimate an upper limit of how many domains can be successfully overlapped in at least one reading frame and position, here the average success rate for olg construction is determined instead, which is more relevant both for understanding constraints on the formation rate of naturally occurring olgs and for assessing the likelihood of successful synthetic creation of olgs. these results in one sense give an upper estimate of the ease of creating overlaps as the difficulty of obtaining an overlapping gene pair naturally is not directly addressed here. on the other hand, overlapping functional domains directly is a "worst case scenario" as there is some evidence that the critical functional domains of one protein in an olg pair tend to overlap less constrained regions of the other protein [36], and this segregation is also intuitively plausible. in order to estimate the difficulty of achieving overprinting naturally, the minimal number of nucleotide changes needed to create the olg sequence is determined. whether functional domains do in fact overlap in nature, however, deserves further attention. by expanding the analysis of the previous study [30] from the reading frames '+2', '-1' and '-3' to all reading frames (see fig. 
1 for reading frame definitions), the observed differences between reading frames can be related to the structure of the standard genetic code. through constructing olgs using randomly generated genetic codes it can be studied whether the standard code shows evidence of optimisation regarding olgs. using the improved evaluation of the designed olgs it can be shown that virus genes, surprisingly, are less suited than bacterial and eukaryotic genes to design olgs. figure 1 : illustration of the alternative reading frames. the '+1' frame is the standard or reference reading frame and '+2'/'+3' the sense overlaps, while frames '-1' to '-3' are on the antisense strand. in [30] constructed sequences were evaluated with a blast search against the swiss-prot database. if both overlapping sequences had a match to the best hit with at most an e-value of 10^(-10) and a match length of 85%, the overlap was considered successful. however, the initial sequences were picked from the pfam seed database and it can be shown that most of the chosen sequences are not well represented in the swiss-prot database (see left panel in fig. 2) , with the exception of virus genes. in a search against the swiss-prot database, identities of over 80% were only found for 15% of the non virus genes, while 70% of the virus genes could be found in this category. a curated set in which all sequences have a 100% match in the swiss-prot database but otherwise the same properties has a remarkable 95% success rate for overlaps and the virus vs non-virus difference vanishes (see right panel in fig. 2 ). the advantage reported for virus genes is thus fully explained by dataset-database biases. in any case, the extremely high overall success rate obtained should be investigated. either creating overlaps is indeed unexpectedly easy or the evaluation of functionality used in [30] is not conservative enough. it can be shown that both factors appear to contribute to the surprising result. 
figure 2 : left: sequences from [30] with different match identities in swiss-prot. virus genes from this dataset have a higher average identity to a swiss-prot entry than non-virus genes. right: percentage of functional olgs for the original dataset used in [30] and the average of 10 curated datasets grouped into virus and non-virus genes. in curated datasets all original sequences have an exact match in swiss-prot. each curated dataset has 100 sequences with 70-100 amino acids. the virus versus non-virus difference observed in the dataset of [30] vanishes for the curated datasets. when introducing the minimal number of changes required for two random sequences to fully overlap each other, a similar percentage of each sequence is expected to change. in such a case the e-values of the constructed sequences would be strongly length dependent, as a longer sequence with the same similarity has a lower probability of being found by chance in a database of a given size. when picking datasets with different sequence lengths such a length dependence can be found in the blast evaluation (supplementary fig. s1). a fixed e-value cutoff cannot adequately evaluate sequences in such a situation as the cutoff value fully determines the result and is chosen arbitrarily. the sequences used in [30] have a length of 70-100 amino acids, and the high success rate for the curated set can be explained by a combination of the sequence length and the choice of the cutoff value. in order to find a reasonable alternative to the fixed e-value cutoff, hidden markov models (hmms) can be used to score the constructed sequences. here hmmer3 (v3.2.1) [37] is used to create profiles for each protein domain family in the pfam database [38] in order to score the constructed sequences. 
the pfam database consists of a 'seed' database, containing trusted sequences for each family which are used to create hmm profiles, and a 'full' database, containing all the sequences of the uni-prot database sorted into the different families according to the previously constructed profiles. here the hmm profiles are also constructed from the 'seed' sequences and in order to find the sequence most closely representing the profile all full sequences are tested against the profiles. the highest scoring sequences are used to construct olgs. the rest of the 'full' sequences are used as a comparison for the overlapping region of the constructed sequences. a constructed sequence is judged successful if it has a higher score than a sequence at a defined threshold percentile of the 'full' sequences, thereby creating a threshold value which is individual for each protein family. here results for different threshold percentiles are discussed, while highlighting two particular percentile values. firstly, the 50th percentile (median), which marks the score of a typical sequence in the protein family. in this analysis, sequences meeting this threshold can not be distinguished from the naturally occurring protein domains and they will be categorised as typical proteins. since all sequences in the 'full' group are naturally occurring sequences, scoring at least as highly as any of these sequences renders a sequence biologically relevant. in order to avoid extreme outliers which may be misclassified, the 5th percentile is used as the biologically relevant threshold. a relative threshold could alternatively be established with e.g. blast by first picking a single sequence as a starting point for construction and also for comparing the rest of the protein family to in order to find the threshold score as described above. in this case however, it is not clear which sequence to choose as a starting point. 
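the family-specific threshold test described above can be sketched as follows. this is a minimal illustration under stated assumptions, not the authors' code: the hmm scores of the 'full' sequences are assumed to be available as plain numbers, a nearest-rank percentile is used because the text does not say which percentile definition was applied, a score equal to the threshold counts as a success, and all function names and example scores are hypothetical.

```python
import math

def percentile_threshold(full_scores, percentile):
    """Nearest-rank percentile of the scores of a family's 'full' sequences.

    Assumption: nearest-rank definition; percentile 0 maps to the lowest score.
    """
    if not full_scores:
        raise ValueError("need at least one comparison score")
    if not 0 <= percentile <= 100:
        raise ValueError("percentile must lie in [0, 100]")
    ordered = sorted(full_scores)
    rank = max(1, math.ceil(percentile * len(ordered) / 100))
    return ordered[rank - 1]

def is_successful(constructed_score, full_scores, percentile):
    """True if the constructed sequence scores at least as high as the threshold."""
    return constructed_score >= percentile_threshold(full_scores, percentile)

# Made-up scores for one protein family, for illustration only.
family_scores = [12.0, 15.5, 18.0, 21.0, 22.5, 25.0, 27.5, 30.0, 33.0, 40.0]
biologically_relevant = is_successful(16.0, family_scores, 5)   # 5th percentile
typical = is_successful(16.0, family_scores, 50)                # median
```

because the threshold is computed separately for each family, a constructed sequence is only ever compared with natural members of its own family.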
a randomly picked sequence could be an outlier of the protein family, resulting in unreliable comparison scores and a higher chance of losing function after constructing olgs. hmms on the other hand provide a profile reflecting the 'average' sequence, which is a better representative for the whole protein family. choosing a family-specific threshold value takes care of most of the length dependencies, but in order to be sure and to be able to compare sequences of different lengths, each score resulting from a comparison between a sequence and a hmm profile is divided by the sequence length. here scores are used instead of e-values, as the latter also depend on the database size, an arbitrary factor in this analysis. aligning the best sequence with the 'seed' sequences using mafft (v7.419) [39] , weights used for sequence construction can be determined just as in [30] . a more detailed description of the calculation of the weights and their influence can be found below. when studying the influence of a protein family's taxonomic classification on the construction of olgs, the 'seed' and the 'full' database are first filtered by the four major taxonomic groups -archaea, bacteria, eukaryotes and viruses -before creating the profiles and the thresholds. muscle (v3.8.31) [40] was used for realigning the 'seed' sequences after taxonomic filtering. for subsequent analyses, random sets from the ~17000 pfam families were chosen, with the condition that each family must have at least 10 'seed' sequences and 4 'full' sequences in order for the weights and the thresholds to be reasonably defined. each dataset consists of 150 families since the variance of the resulting olg success rate barely declines for larger sets (see supplementary fig. s2 ). fig. 3 summarizes the workflow. hmm profiles are constructed from the seed sequences. the sequence with the highest score from the full group is used for olg design. 
the remaining sequences in the full group are used to construct threshold scores used to evaluate the designed olgs. in order to estimate the expected success rate of an individual overlap attempt, the domains are overlapped at random positions such that one domain is fully embedded into the other. just as in [30] the sequence with the lower quality of the two constructed olgs is used as a conservative representative of the pair. after determining the success for each position, the percentage of successful positions for each olg pair, the average success rate in each reading frame, and the overall success rate averaged across reading frames are calculated. the number of possible positions for each olg pair is equal to their difference in length plus one, so using more than one overlap position in each pair is only possible for genes with different lengths. increasing the number of positions for each gene does not change the expected success rate but reduces its variation between different sets (see supplementary fig. s3). comparing the variation caused by choosing random positions and the variation caused by choosing random pfam families, the former turns out to be negligible and consequently only a single randomly chosen position for each olg pair is used for subsequent analyses. the distribution of the percentage of successful positions in each olg pair is calculated from up to 50 different positions (see fig. 4). 50.3% of all olg pairs form biologically relevant sequences at all positions in every reading frame while only 2.5% cannot form a biologically relevant sequence at any position (see fig. 4). 1.9% of the pairs even form typical proteins, as determined by the 50th percentile threshold, at every position in any reading frame (see right panel in fig. 4). 
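the counting of overlap positions and the per-pair success percentage can be sketched as below. this is an illustration only: lengths are assumed to be given in amino acids, positions are drawn without replacement, and the function names are hypothetical rather than taken from the published code.

```python
import random

def n_overlap_positions(len_a, len_b):
    """Number of ways to fully embed the shorter domain in the longer one."""
    return abs(len_a - len_b) + 1

def sample_positions(len_a, len_b, max_positions=50, rng=None):
    """Draw up to max_positions distinct embedding offsets without replacement."""
    rng = rng or random.Random()
    total = n_overlap_positions(len_a, len_b)
    return sorted(rng.sample(range(total), min(total, max_positions)))

def success_rate(outcomes):
    """Percentage of tested positions that were judged successful."""
    outcomes = list(outcomes)
    return 100.0 * sum(outcomes) / len(outcomes)
```

two domains of equal length give exactly one position, which is why more than one position per pair is only possible for genes of different lengths.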
this result is strongly dependent on the threshold percentile chosen, but due to the wide range of possible results it can still be concluded that the chance of success of a constructed olg pair depends strongly on the particular genes used, as might be expected. figure 4 : in each olg pair 30 sets of up to 50 random positions were tested against the pfam group hmms using the 'biologically relevant' threshold (5th percentile) and the 'typical sequence' threshold (50th percentile) for a successful overlap. while 50.3% of the pairs can be overlapped at any position and 2.5% in no position using the biological threshold, only 1.9% can be overlapped at any position and 66.7% in no position using the threshold of typical sequences. the sequence threshold strongly influences the result. in order to determine whether the relative evaluation of olgs really removed the length dependency, the average quality 'q' of an olg pair is determined and compared for olg pairs with different lengths. q is defined as the ratio of the score of the constructed sequence (s) to the score of the original sequence (s_max) times 100. the quality is therefore the percentage of the original score that is retained after the overlap. supplementary fig. s4 shows the mean quality for datasets with different sequence lengths. starting from around 50 amino acids, q is indeed mostly independent of sequence length. the low q values of shorter sequences arise because these sequences are less frequently matched to their respective hmm profile, which results in a score of zero. the reason is probably that the shorter sequences fall below internal detection thresholds of hmmer3 more easily. changing a single amino acid in a short gene changes its quality to a greater extent than in a long gene, resulting in larger fluctuations, which can lower the sequence below detection thresholds. lowering internal thresholds of hmmer3 did not lead to more sequences being recognized by their respective profile. 
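as a small worked example of the quality measure, the sketch below computes q from two scores and picks the conservative representative of a pair. the function names are hypothetical, and the scores are assumed to be hmm scores of the constructed and the original sequence against the same profile.

```python
def quality(constructed_score, original_score):
    """Q = 100 * s / s_max; a constructed sequence scored as 0 gives Q = 0."""
    if original_score <= 0:
        raise ValueError("original sequence must have a positive score")
    return 100.0 * constructed_score / original_score

def pair_quality(q_first, q_second):
    """Conservative representative of an OLG pair: the lower of the two qualities."""
    return min(q_first, q_second)
```

a constructed sequence scoring 38 against a profile on which its original scores 50 has q = 76, and a sequence that is no longer recognised by its profile has q = 0.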
in further analysis the minimum sequence length of 70 amino acids is used so that the percentage of olg pairs in which at least one sequence is not recognised is below 5% (see supplementary fig. s4). when taking both sequences of each pair and not only the one with the lower quality, the quality distribution converges to a broad peak at around 76% with increasing sequence length (see supplementary fig. s5). since the quality also depends on the flexibility of the hmm profiles used to score the sequences, the peak is not expected to get any narrower with increasing sequence length and thus to reduce variations in sequence similarities between the constructed and the original sequences. the algorithm to construct olg sequences from [30] uses an exchange matrix (blosum62 [41]) to find the closest overlapping sequences to the original ones. it determines the codon with the highest sum of the scores for the exchanges in both sequences at each position. sequence weights can prioritise the score of either one or the other sequence at different positions in order to increase the chance of obtaining functional sequences. in [30], the weight w_i at position i of the sequence is w_i=e^(-s_i), where s_i is the entropy calculated at position i in the alignment. the weights could be defined differently such that their influence on olg construction is stronger or weaker. in order to optimize the weight strength, the entropy is multiplied by a factor k in their calculation such that w_i=e^(-k s_i). varying k>0, the optimal weight strength for constructing olgs can be determined, while k=0 means no weights are being used. in the hmm evaluation the influence of k is very weak. a value of k=0.5 is used in order to maximise the quality, q (see supplementary fig. s6). for very high k values, q goes to zero. in this case at each position the sequence with the higher conservation maintains its amino acid. 
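the entropy-based weights can be illustrated with a short sketch. it assumes shannon entropy with the natural logarithm and ignores gap characters, neither of which is specified in the text, and it uses a toy alignment and hypothetical function names.

```python
import math
from collections import Counter

def column_entropy(column):
    """Shannon entropy (natural log) of the residues in one alignment column."""
    residues = [r for r in column if r != "-"]   # assumption: gaps are ignored
    if not residues:
        return 0.0
    counts = Counter(residues)
    n = len(residues)
    return -sum((c / n) * math.log(c / n) for c in counts.values())

def position_weights(alignment, k=0.5):
    """w_i = exp(-k * s_i) for every column i of an equal-length alignment."""
    return [math.exp(-k * column_entropy(col)) for col in zip(*alignment)]

seed_alignment = ["ACDE", "ACDF", "AGDG", "ATDH"]   # toy alignment, not real data
weights = position_weights(seed_alignment, k=0.5)
```

as k grows, the weights of variable columns shrink towards zero while fully conserved columns keep a weight of one, so the more conserved sequence increasingly dictates the codon choice at each position.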
this indicates that it is crucial that at each position both sequences are changed in order to create functional olgs. in the blast evaluation k=0 is optimal (see supplementary fig. s7), such that no better value can be found for k>0. blast does not take special account of conserved regions of a sequence, so weights can improve one sequence but at the same time will reduce the score of the other. since the lowest scoring of the two sequences is taken to represent the olg pair, introducing weights has a high chance of reducing the success rate in an evaluation using blast, despite increasing biological relevance. this makes an evaluation using hmm or any other method that takes into account sequence conservation significantly preferable for judging constructed olg pairs. the five alternative reading frames differ strongly in the combinatorial constraints imposed by the reference gene (mother gene) via the standard genetic code [24], e.g. the sequence n|gcn|, with n being any nucleotide, always translates to alanine in the +1 and the -2 frame. it is of interest whether this difference in constraint transfers to the success rate for designing olgs. for olgs resembling typical proteins of their respective families, the success rates for olg construction vary from 14.9% in the '-3' frame to 3.0% in the '-2' frame with an average value of 9.6% across all reading frames (see fig. 1). calculating the e-value just as in [30] as a reference, the constructed olgs have a median e-value of 10^(-28) to 10^(-37), decreasing with increasing threshold percentile. the result is strongly threshold dependent as 94.5% of the constructed sequences score at least as highly as the worst sequence in the full group, while only 0.02% score better than 95% of the full group. 
considering combinatorial restrictions of different reading frames [24], the ranking of frames by success rate is exactly as expected, insofar as the success rate of each reading frame is inversely related to the extent of combinatorial restrictions found in [24] (see fig. 5): the '-2' frame is the least successful reading frame and has the highest restrictions, followed by the '-1' frame, which is the second most restricted frame. next are reading frames '+2' and '+3', which have exactly the same restrictions and surprisingly almost the same success rates, not only in their average value but also in every single dataset (data not shown), despite expected stochastic fluctuations due to some genes simply fitting better to each other. last is the '-3' frame, which has no combinatorial restrictions and the highest success rate. plotting the success rates in the different reading frames as a function of the number of combinatorial constraints found in [24] results in a linear relation for the lowest possible threshold, namely that all sequences which are at least as good as the worst in the comparison group are judged successful. as the threshold is increased the linear relation is gradually lost (see supplementary fig. s8). for higher thresholds most of the sequences are below the threshold and very little data is left, which might lead to the observed behaviour. in summary, the structure of the standard genetic code appears to strongly influence the construction of olgs. whether the observed relationship between predicted constraints in different frames and the difficulty of constructing olgs is borne out by the proportion of natural olgs found across frames deserves attention across diverse taxa. the threshold chosen within the pfam group has a very strong influence on success rates. 
the ordering of the reading frames by success rates, namely '-3', '+2'/'+3', '-1' and '-2', matches the ordering by combinatorial restrictions in the standard genetic code, beginning with the least restricted frame [24]. the impact of olg construction on amino acid sequence identity is another indicator of functionality. it has been argued that a 34% amino acid identity between naturally occurring sequences ensures that both sequences have the same structure [54]. comparing the altered part due to olg construction with the original sequence, in 96.5% of cases both olg sequences share at least 34% of amino acids with their original sequence. in some olg pairs both sequences have an amino acid identity of up to 60% compared to their original sequence. in the biologically more relevant property of amino acid 'similarity', the worst-scoring of the two olgs can be even up to 80% similar to its respective original sequence (cf. left panel of fig. 6). determining the average amino acid identity and similarity between the two olg sequences, the average olg design impact can be determined. the average amino acid identity is 60% in most cases (right panel of fig. 6), showing that in almost all olg pairs one sequence is above and one is below 60% amino acid identity. the average amino acid similarity is 75% in most cases (right panel of fig. 6), which again shows that in almost all cases one of the two olg sequences is above and one below 75% similarity. the double peak structure of both panels in fig. 6 can be explained by differences for olg pairs in different relative reading frames, which are pooled here (cf. supplementary fig. s14). 
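the two averages just quoted can be turned into a per-position breakdown with a few lines of arithmetic. the sketch below assumes that at every overlap position at least one of the two sequences keeps its original amino acid and that identical residues also count as similar; the variable names are illustrative.

```python
from fractions import Fraction

mean_identity = Fraction(60, 100)     # average identity of the two sequences
mean_similarity = Fraction(75, 100)   # average similarity of the two sequences

# a: both sequences keep their amino acid
# b: one keeps it, the other changes to a similar amino acid
# c: one keeps it, the other changes to a dissimilar amino acid
# identity:   a + (b + c) / 2 = mean_identity
# similarity: a + b + c / 2   = mean_similarity
# total:      a + b + c       = 1
a = 2 * mean_identity - 1        # from identity and total
c = 2 * (1 - mean_similarity)    # from similarity and total
b = 1 - a - c
```

solving the three equations gives a = 1/5, b = 3/10 and c = 1/2 of all overlap positions.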
it follows that in an average olg design, in 20% of all overlap positions the amino acids of both sequences can be maintained, in 30% one sequence maintains its amino acid while the amino acid in the other sequence is changed to a similar one and in 50% one sequence maintains its amino acid and the other sequence cannot maintain a similar amino acid. how well the two sequences can be maintained after the overlap is determined by the standard genetic code and the two specific sequences, the overlap position, their amino acid composition and the amino acid order. while the standard genetic code is a constant factor across all overlaps, all other factors are specific in each case and create the variability in the results. figure 6 : probability density for different amino acid identities and similarities in constructed olg pairs. the data is calculated from 505,000 olg pairs. left: the sequence with the lower identity is representative of the pair. the black line indicates the 34% amino acid identity threshold. right: the mean similarity of both olg sequences represents the pair. the impact of olg design on secondary structure is the last factor studied here. comparing the secondary structure of the olg sequence with its original sequence, a secondary structure similarity is determined. secondary structure is predicted using porter 5 [42] with the "--fast" flag. it can distinguish between the eight different secondary structure motifs of the dictionary of protein secondary structure (dssp) [47, 48, 49], which are 3_10-, alpha- and pi-helices, hydrogen bonded turns, beta sheets, beta bridges, bends and coils. determining the same secondary structure similarity for all sequences in the seed group of the pfam database yields a control group. this way the typical deviations between domains with the same function can be determined. 
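a per-residue secondary structure identity between two predictions can be sketched as follows. the sketch assumes that both predictions are strings of equal length using the usual one-letter dssp state codes, with 'C' standing for coil; how the original analysis handled sequences of different lengths is not stated, and the function name is hypothetical.

```python
# One-letter codes: G 3_10 helix, H alpha helix, I pi helix, T turn,
# E beta sheet, B beta bridge, S bend, C coil (assumed coil symbol).
DSSP_STATES = set("GHITEBSC")

def ss_identity(predicted_a, predicted_b):
    """Percentage of positions with the same secondary structure state."""
    if len(predicted_a) != len(predicted_b) or not predicted_a:
        raise ValueError("need two non-empty strings of equal length")
    unknown = (set(predicted_a) | set(predicted_b)) - DSSP_STATES
    if unknown:
        raise ValueError(f"unexpected state codes: {sorted(unknown)}")
    matches = sum(x == y for x, y in zip(predicted_a, predicted_b))
    return 100.0 * matches / len(predicted_a)
```

applying the same function to pairs of seed sequences gives the control distribution against which the constructed olgs are compared.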
comparing probability densities for different secondary structure identities in both groups it can be seen that the constructed olg sequences barely deviate from the seed sequences (c.f. fig. 7 ). in conclusion, in regards to secondary structure the change inflicted on a sequence to create olgs is no more than the differences within naturally occurring protein domain families. it is noteworthy that only amino acid identity and similarity have a strong correlation (r=0.82) so combined with the other parameters, namely the relative hmm score and the secondary structure identity, there is a set of three more or less independent properties for evaluating constructed olgs, and probably for protein homologs in general. the relative hmm score is the hmm score of the olg sequence divided by the hmm score of a sequence at any threshold percentile as discussed above. between each pair of parameters the pearson's correlation is below 0.2, with the exception of the correlation between secondary structure identity and hmm score being r=0.37 or r=0.39 for thresholds of 95% or 100% respectively. olgs are as similar to their original sequences in secondary structure as observed for comparisons of seed sequences of naturally occurring protein domains to the sequence best representing the respective domain family. by comparing olg sequences constructed with the standard genetic code (sgc) to sequences constructed with artificial codes the level of optimality of the sgc can be inferred. since such an approach depends strongly on the codeset used [43] , four different versions with increasing restrictions will be tested. there are two factors defining a genetic code, namely its amino acid composition and the arrangement of amino acids on the 64 codons. the first code set is the random code set and does not constrict any of the two factors. each code can have any of the 20 amino acids used in the sgc at any codon. 
the second set only restricts the composition of its codes and is called the degeneracy code set. all codes in this set contain the same number of codons for each amino acid as in the sgc, thus conserving its amino acid composition. the third set is the blocks code set whose codes have a very similar structure to the sgc and while it also restricts the composition of the codes to some degree it mostly determines their arrangement. this code set is created by assigning all codons of the sgc that code for the same amino acid into blocks and shuffling the amino acids assigned to each block and thus conserves the degeneracy structure of the sgc on the third nucleotide. lastly a code set that maintains the mutational robustness of the sgc as calculated in [43] is tested. in short, the mutational robustness is the average change of amino acids due to point mutations and has been shown to be extremely optimal in the sgc relative to similar codes [44]. this set contains block codes like in the third set but only the codes whose mutational robustness is at least as high as the sgc are kept. since these codes are fundamentally block codes they are partly restricted in their amino acid composition, but the arrangement of amino acids in these codes is even more restricted as point mutations from any codons should result in similar amino acids. this code set reflects the fact that different properties of the sgc have a different impact on the fitness or biological optimality of the sgc with the mutational robustness most likely being one of the most important features. here this code set is called the mutational robustness blocks set (mr-blocks set) and it tests the optimality of constructing olgs as an additional property of the sgc after taking into account the mutational robustness. comparing the degeneracy, the block and the mr-blocks code set to the random set, the influence of code composition and arrangement can be determined (see left panel of fig. 8).
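the first three code sets described above can be sketched as follows. this is an illustration under stated assumptions, not the authors' implementation: stop codons are kept fixed at their sgc positions (the source does not say how stops are treated), and the function names are hypothetical. the mr-blocks set is omitted because it additionally needs the mutational robustness measure of [43].

```python
import random
from collections import Counter
from itertools import product

BASES = "TCAG"
# Standard genetic code in the usual TCAG codon order; '*' marks stop codons.
SGC_AAS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = ["".join(c) for c in product(BASES, repeat=3)]
SGC = dict(zip(CODONS, SGC_AAS))
SENSE = [c for c in CODONS if SGC[c] != "*"]
AMINO_ACIDS = sorted(set(SGC_AAS) - {"*"})


def random_code(rng: random.Random) -> dict:
    """Random set: any of the 20 amino acids at any sense codon."""
    code = dict(SGC)  # assumption: stop codons stay where they are
    for codon in SENSE:
        code[codon] = rng.choice(AMINO_ACIDS)
    return code


def degeneracy_code(rng: random.Random) -> dict:
    """Degeneracy set: same number of codons per amino acid as the SGC."""
    aas = [SGC[c] for c in SENSE]
    rng.shuffle(aas)
    code = dict(SGC)
    code.update(zip(SENSE, aas))
    return code


def blocks_code(rng: random.Random) -> dict:
    """Blocks set: keep the SGC codon blocks, shuffle which amino acid
    is assigned to each block."""
    shuffled = AMINO_ACIDS[:]
    rng.shuffle(shuffled)
    relabel = dict(zip(AMINO_ACIDS, shuffled))
    return {c: (aa if aa == "*" else relabel[aa]) for c, aa in SGC.items()}


if __name__ == "__main__":
    rng = random.Random(0)
    print(Counter(degeneracy_code(rng).values()) == Counter(SGC.values()))
```

in this sketch a block is the full set of codons sharing an amino acid in the sgc, so the six serine codons form a single block.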
the degeneracy code set reflecting the composition of the sgc has the codes with the highest average success rates indicating that the composition of the sgc is a major factor for this property, but the sgc itself has a very low success rate in comparison, indicating that the amino acid arrangement is an even stronger -in this case negative -factor as the sgc is worse than both the random codes and the degeneracy codes. the block structure of the sgc has a strong negative impact on successful olg design and the sgc is a typical member of this set. enforcing even more structure on the artificial codes in order to maintain the mutation robustness of the sgc further reduces the ability of the sgc to create successful olgs. studying the optimalities of each of the four code sets for flexibility in olg design, it is apparent that the more restricted the code set is, the more optimal the sgc is relative to the set (see right panel of fig. 8) . especially in the mr-blocks code set only a few codes are better than the sgc, however no codeset or reading frame has fewer than 5% of codes doing better (see fig. s19 -s12), which has been a recommended threshold for inferring optimality [45] . this is an expected result even if the code has been optimised for olgs as the success rate for constructing olgs reflects merely the 'flexibility' of a code system, but olg sequences also need to be conserved, which is an almost directly opposing property which also has not been found to be strongly optimal by itself [43] ; it might indeed be expected that overall optimality involves a trade-off between the two. if the sgc has been optimized in this way this could indicate a turning point at which a further increase in mutational robustness results in a smaller fitness increase compared to an increase in the flexibility to create olgs -how to measure fitness for a genetic code is however not clear. 
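the optimality comparison against the 5% threshold can be sketched as below. it assumes that "optimality" is measured as the fraction of alternative codes with a higher olg success rate than the sgc, averaged over several sets of codes; the function names and the example numbers are hypothetical.

```python
from statistics import mean, stdev


def fraction_better(sgc_rate: float, alt_rates: list) -> float:
    """Fraction of alternative codes whose success rate exceeds the SGC's."""
    if not alt_rates:
        raise ValueError("need at least one alternative code")
    return sum(rate > sgc_rate for rate in alt_rates) / len(alt_rates)


def summarise_sets(sgc_rate: float, code_sets: list, threshold: float = 0.05):
    """Average the per-set fractions and compare with the threshold.

    Returns (mean fraction, standard deviation, is_optimal), where the SGC
    counts as optimal only if fewer than `threshold` of codes do better.
    """
    fractions = [fraction_better(sgc_rate, rates) for rates in code_sets]
    avg = mean(fractions)
    sd = stdev(fractions) if len(fractions) > 1 else 0.0
    return avg, sd, avg < threshold


if __name__ == "__main__":
    # Two made-up sets of alternative-code success rates.
    print(summarise_sets(0.9, [[0.1, 0.2], [0.3, 0.95]]))
```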
while the code composition of the sgc is beneficial for both the ability to create successful olgs and the mutational robustness, the code arrangement of the sgc is only beneficial for mutational error robustness and the sgc (see fig. 2 of [43]), indicating that the mutational robustness is the more important property. only in the set of codes with the same mutational robustness does the optimality for olg design become stronger, supporting the turning point hypothesis.
figure 8: olg design success rates for different alternative code sets. the average is calculated from 20 sets of 100 alternative codes, except for the mr-blocks set with 10 sets of 500 codes. the error bars indicate the standard deviation. left: the average success rates compared to the sgc. while the composition of the sgc is a positive factor, the arrangement of the sgc is a negative factor. right: the optimality of different code sets. the black line indicates the 5% threshold. the more restricted the code set, the more optimal the sgc appears, indicating that the ability to successfully create olgs has only been optimized while maintaining other properties.
besides the four basic taxonomic groups (three domains of cellular life: archaea, bacteria, eukaryotes, plus viruses), old genes can also be studied by picking only families which have at least one sequence in all four taxonomic groups, since it is expected that these families have already been present in luca or another ancient ancestor (although this high level categorisation is not perfect due to widespread horizontal gene transfer). surprisingly, bacterial and eukaryotic genes are generally significantly better suited to olg construction than virus and archaeal genes with only minimal dependence on the threshold percentile, c.f. fig. s13.
the largest dependence on the threshold percentile is found for the "found in all" genes, for which only a total of 50 sequences can be found in the pfam database, so higher stochastic fluctuations are to be expected. using the 'biologically relevant' threshold, the biggest difference is between eukaryotic and archaeal genes, which have a 20% difference in their success rate (see left panel of fig. 9). for olgs which are typical proteins of their respective family, eukaryotic genes are almost twice as likely to be successful as virus genes (see right panel of fig. 9). eukaryotes and "found in all" genes are typically the easiest to overlap, which is somewhat unexpected as eukaryotic genes would perhaps be expected to have the youngest protein families, and so to appear less 'flexible' due to having sampled less of the functional space through mutations. more understandable however is that due to being closer to mutational saturation (if more ancient on average) and therefore having explored a larger proportion of functional sequence space, "found in all" genes might appear more 'flexible', resulting in lower weights and thresholds. in order to estimate the difficulty of naturally forming olg sequences, the minimum number of nucleotide changes needed in order to reach the olg sequence from any of the two original sequences is determined (see fig. 10). by only taking olgs in which both sequences are above a certain hmm threshold, extreme outliers are gradually removed with increasing threshold but the rest of the distribution stays the same. this indicates that this property is independent of the threshold value, just as for the amino acid identity and similarity, as fewer and fewer designed olgs pass a higher threshold which makes extreme outliers less likely to occur. on average a designed olg sequence has a 25% difference in nucleotides to its original, with half of constructed sequences in the range of 20-30% change.
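the nucleotide-change measure described above can be sketched as a per-position comparison. this assumes that the designed and original sequences have equal length and are compared on the same strand without gaps (any reverse-complementing for antisense frames would have to happen beforehand); the function names are hypothetical.

```python
from statistics import mean, quantiles


def percent_nucleotide_change(designed: str, original: str) -> float:
    """Percentage of positions at which the designed sequence differs
    from its original sequence (Hamming distance / length)."""
    if len(designed) != len(original) or not designed:
        raise ValueError("sequences must be non-empty and of equal length")
    changes = sum(a != b for a, b in zip(designed.upper(), original.upper()))
    return 100.0 * changes / len(designed)


def summarise_changes(percentages: list):
    """Mean and quartiles of a collection of percentage changes.

    The middle half of the data lies between the first and third quartile.
    """
    q1, median, q3 = quantiles(percentages, n=4)
    return mean(percentages), (q1, median, q3)


if __name__ == "__main__":
    pairs = [("ATGAAACCC", "ATGAAGCCC"), ("ATGCCCGGG", "ATGCACGTG"),
             ("TTTAAACCC", "TTTAAACCC"), ("GGGCCCAAA", "GAGCTCATA")]
    values = [percent_nucleotide_change(d, o) for d, o in pairs]
    print(summarise_changes(values))
```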
most interesting are outliers on the lower end of the distribution as they indicate whether olgs exist that are potentially reachable by naturally occurring mutations. the lowest nucleotide difference observed is 1.8%, which was for an olg pair that scores better than 25% of the domains in the comparison group. 0.6% of olgs required less than 10% nucleotide change, i.e. 5843 sequences of the 955846 sequences created in this dataset that scored at least as highly as the worst sequence in the comparison group. this suggests intuitively that creating overlaps of the sort constructed here could be possible naturally through accumulation of random mutations. the population genetics of such a hypothetical process is a potential topic for further study, as is an experimental evaluation of functionality. figure 10 : percentage nucleotide change of olgs as a function of hmm threshold %. the minimal nucleotide distance of each of 1010000 olg sequences (two per pair), with a minimal length of 210 nucleotides, to their respective original sequence is determined. there are many aspects of the synthetic construction of olg pairs which can be studied. here factors such as sequence length and the influence of sequence conservation are taken into account. the analysis shows that an evaluation with blast and a fixed e-value cutoff cannot accurately assess the potential functionality of the designed olgs. while the combination of sequence length and an e-value cutoff completely determines the success rate of the constructed olgs, adding in positional weights can only negatively influence the sequences constructed with this method. both problems can be solved however by instead using hmm profiles to determine sequence similarity and then using these to define a threshold for successful olgs derived from sequences in the same protein family. 
the hmm profiles and the thresholds are though both derived from the pfam database [38] , which makes these results strongly dependent on the database quality. for example, if in one taxonomic group sequences are very similar due to being mostly from the same species or genus, thresholds would appear to be higher and it would be harder for designed olgs to pass these thresholds. further optimization of the construction algorithm can be achieved by determining the optimal weight strength (influence of sequence conservation), which is k=0.5. 94.5% of the constructed olg sequences score at least as highly as the worst-scoring biological sequences in pfam groups, while 9.6% of the sequences cannot be distinguished from naturally occurring domains in their respective protein family. this indicates that the typical variation inside protein families is of the same order of magnitude as the change needed in order to construct artificial olgs by arbitrary pairing of protein domains. this result also holds true for other bioinformatic factors like amino acid identity and secondary structure, since the constructed olgs are typically very similar to naturally occurring domains in these properties. studying artificial olg design success from the perspective of an even more constricting biological parameter like tertiary structure would be an important next step; but besides the amino acid sequence, also codon usage can impact protein structure [52] , along with environmental factors such as the presence of chaperone proteins, which together make it a much harder problem. ultimately, proof of the functionality of artificial sequences cannot yet be realised bioinformatically, and experimental verification is essential. to this end all known independent protein properties available from the sequence should be tested in order to create a gold standard for possibly functional sequences. 
from this study it is clear that sequence similarity (or identity), hmm-scores and secondary structure should be part of the judged properties. determining relative hmm scores for high thresholds could be used to prefilter sequences for secondary structure prediction as it is the computationally most intensive part of this analysis. considering that domain-domain overlaps are expected to be much harder than overlapping a domain with a less conserved region in another gene, it appears that de novo origin of genes from overlapping orfs may be much less difficult than widely assumed. some constructed olg sequences varied only by 1.8% from their original sequence, and there might be other natural sequences from the same domain that are even closer to the olg sequence. this result could be a starting point for estimating the difficulty of evolving olgs from different starting sequences in natural systems, which is still relatively unexplored despite some early work [46]. the structure of the standard genetic code explains differences between reading frames and is a strong factor in the overall success rate of olg construction. olgs can maintain an average 60% amino acid identity and an average 75% amino acid similarity, which is mostly due to the genetic code. the structure of the standard genetic code is defined by its composition, namely how many codons code for each amino acid, and its arrangement, namely which codons code for each amino acid. it is known that the composition alone cannot explain the strong optimality of the standard genetic code for mutational robustness, as it stands out among codes with the same composition as the standard genetic code [43, 55]. considering that the arrangement of the standard genetic code creates such high mutational robustness values [44], it is remarkable that designing olgs also works so well. another factor which deserves further exploration is the age of a protein family, i.e. the time since gene birth.
this may correlate with apparent 'sequence flexibility', which is the strongest influence on the result via the threshold values, due to increasing mutational saturation in older protein families. being able to distinguish genuine sequence flexibility from mutational saturation, even in broad terms, could be very useful here. the analysis presented here depends primarily on the reliability of hmm profiles of pfam groups as a guide to biological functionality in constructed sequences. reliability for classifying biological protein sequences into ortholog families, the main use of these hmms, may not correlate well with reliability in scoring artificially constructed sequences for functionality. in other words it may well be that these profiles fail to capture important requirements for protein tertiary structure and/or functionality. future research ought to test the best candidates experimentally, and if the best candidates from the methods developed here are not successful, additional factors could be considered in comparing constructed sequences and their natural precursors. for instance, many protein characteristics can be assessed using servers or packages incorporating multiple bioinformatic tools such as predictprotein, for various secondary structural elements [57], and many sequence properties, such as hydrophobicity profiles, can be computed using the volpes server [58], which has been applied to the related case of frame-shifted sequences compared to their mother genes [16]. other properties required for functional protein sequences can be inferred from the evolutionary information contained in sequence alignments of protein families. for instance, it has been calculated based on a study of residue-residue co-evolution in ten well-characterized protein families that the proportion of all sequences which fold to the family's structure ranges from approximately 10^-24 to 10^-126 [59].
these principles have recently been successfully used in the design of functional proteins [60] , and could conceivably also be applied to olg construction. factors facilitating the existence of olgs may possibly help in predicting olgs in sequenced genomes and should be explored further. for instance, a careful study of relatively 'flexible' sequence regions in taxonomically widespread genes may help find more overlapping genes. most interestingly, bacterial and eukaryotic genes can be overlapped more easily than virus genes, contrary to the findings in [30] . these earlier results can be explained entirely with dataset-database biases, so this algorithm gives no support for the common assumption of a higher intrinsic olg formation capacity of viruses compared with bacteria or eukaryotes. two of the main differences between the taxonomic groups are the expected mutation rates and the average length of a protein. while genomes with higher mutation rates explore sequence space faster and therefore their proteins should appear to be more flexible, virus domains do not appear to be very flexible, despite having the highest mutation rate. the length of the sequences on the other hand has been removed as a factor in this analysis. an artificial factor not considered could be database biases or an exchange matrix (blosum62) biased towards certain kinds of proteins. the latter could be tested by using different matrices created from sequences from different taxonomic groups. it would be important to use the new matrix not only in the construction of the olgs but also in the evaluation by the hmms. so far it is not clear why protein families from different taxonomic groups are so different in their calculated ability to create olgs. a better theoretical understanding of overlapping genes will be extremely useful in microbial genome annotation methods, the study of evolution, and in synthetic biology, and therefore deserves renewed attention. 
overlapping genes in bacteriophage φx174 concomitant emergence of the antisense protein gene of hiv-1 and of the pandemic the novel ehec gene asa overlaps the tegt transporter gene in antisense and is regulated by nacl and growth phase overlapping genes in parasitic protist giardia lamblia overlapping genes in vertebrate genomes are antisense proteins in prokaryotes functional? the evolution of genome compression and genomic novelty in rna viruses gene overlapping and size constraints in the viral world comparative study of overlapping genes in bacteria, with special reference to rickettsia prowazekii and rickettsia conorii overlapping genes in bacterial and phage genomes noncontiguous operon is a genetic organization for coordinating bacterial gene expression is genetic code redundancy related to retention of structural information in both dna strands? complementarity of peptides specified by 'sense' and 'antisense' strands of dna genetic coding algorithm for sense and antisense peptide interactions frameshifting preserves key physicochemical properties of proteins viral proteins originated de novo by overprinting can be identified by codon usage: application to the "gene nursery" of deltaretroviruses origins of genes:" big bang" or continuous creation gene birth contributes to structural disorder encoded by overlapping genes evolution of viral proteins originated de novo by overprinting a minimal trprs catalytic domain supports sense/antisense ancestry of class i and ii aminoacyl-trna synthetases two types of aminoacyl-trna synthetases could be originally encoded by complementary strands of the same nucleic acid functional class i and ii amino acid-activating enzymes can be coded by opposite strands of the same gene the combinatorics of overlapping genes overlapping genes: more than anomalies? do overlapping genes violate molecular biology and the theory of evolution? 
missing genes in the annotation of prokaryotic genomes a case for a negative-strand coding sequence in a group of positive-sense rna viruses computational design of fully overlapping coding schemes for protein pairs and triplets two proteins for the price of one: the design of maximally compressed coding sequences designing of a single gene encoding four functional proteins engineering gene overlaps to sustain genetic constructs in vivo biomolecular computing systems: principles, progress and potential genetic programs can be compressed and autonomously decompressed in live cells functional segregation of overlapping genes in hiv profile hidden markov models pfam: a comprehensive database of protein domain families based on seed alignments mafft multiple sequence alignment software version 7: improvements in performance and usability muscle: multiple sequence alignment with high accuracy and high throughput amino acid substitution matrices from protein blocks deeper profiles and cascaded recurrent and convolutional neural networks for state-of-the-art protein secondary structure prediction optimality in the standard genetic code is robust with respect to comparison code sets the genetic code is one in a million a neutral origin for error minimization in the genetic code evolution of overlapping genes dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features a series of pdb related databases for everyday needs the structure of proteins: two hydrogenbonded helical configurations of the polypeptide chain a previously uncharacterized gene in sars-cov-2 illuminates the functional dynamics and evolutionary origins of the covid-19 pandemic properties and abundance of overlapping genes in viruses codon usage regulates protein structure and function by affecting translation elongation speed in drosophila cells critical assessment of methods of protein structure prediction (casp)-round xiii twilight zone of protein sequence 
alignments extreme genetic code optimality from a molecular dynamics calculation of amino acid polar requirement evolution by gene duplication predictprotein-an open resource for online prediction of protein structural and functional features volpes: an interactive web-based tool for visualizing and comparing physicochemical properties of biological sequences how many protein sequences fold to a given structure? a coevolutionary analysis co-evolutionary fitness landscapes for sequence design key: cord-000642-mkwpuav6 authors: moreira, rebeca; balseiro, pablo; planas, josep v.; fuste, berta; beltran, sergi; novoa, beatriz; figueras, antonio title: transcriptomics of in vitro immune-stimulated hemocytes from the manila clam ruditapes philippinarum using high-throughput sequencing date: 2012-04-19 journal: plos one doi: 10.1371/journal.pone.0035009 sha: doc_id: 642 cord_uid: mkwpuav6 background: the manila clam (ruditapes philippinarum) is a worldwide cultured bivalve species with important commercial value. diseases affecting this species can result in large economic losses. because knowledge of the molecular mechanisms of the immune response in bivalves, especially clams, is scarce and fragmentary, we sequenced rna from immune-stimulated r. philippinarum hemocytes by 454-pyrosequencing to identify genes involved in their immune defense against infectious diseases. methodology and principal findings: high-throughput deep sequencing of r. philippinarum using 454 pyrosequencing technology yielded 974,976 high-quality reads with an average read length of 250 bp. the reads were assembled into 51,265 contigs and the 44.7% of the translated nucleotide sequences into protein were annotated successfully. the 35 most frequently found contigs included a large number of immune-related genes, and a more detailed analysis showed the presence of putative members of several immune pathways and processes like the apoptosis, the toll like signaling pathway and the complement cascade. 
we have found sequences from molecules never described in bivalves before, especially in the complement pathway where almost all the components are present. conclusions: this study represents the first transcriptome analysis using 454-pyrosequencing conducted on r. philippinarum focused on its immune system. our results will provide a rich source of data to discover and identify new genes, which will serve as a basis for microarray construction and the study of gene expression as well as for the identification of genetic markers. the discovery of new immune sequences was very productive and resulted in a large variety of contigs that may play a role in the defense mechanisms of ruditapes philippinarum. the manila clam (ruditapes philippinarum) is a cultured bivalve species with important commercial value in europe and asia, and its culture has expanded in recent years. nevertheless, diseases produced by a wide range of microorganisms, from viruses to metazoan parasites, can result in large economical losses. among clam diseases, the majority of pathologies are associated with the vibrio and perkinsus genera [1] [2] [3] . although molluscs lack a specific immune system, the innate response involving circulating hemocytes and a large variety of molecular effectors seems to be an efficient defense method to respond to external aggressions by detecting the molecular signatures of infection [4] [5] [6] [7] [8] ; however, not many immune pathways have been identified in these animals. although knowledge of bivalve immune-related genes has increased in the last few years, the available information is still scarce and fragmentary. most of the data concern mussels and eastern and pacific oysters [9] [10] [11] [12] [13] [14] , and very limited information is available on the expressed immune genes of r. philippinarum. 
recently, the expression of 13 immune-related genes of ruditapes philippinarum and ruditapes decussatus was characterized in response to a vibrio alginolyticus challenge [15]. also, a recent 454 pyrosequencing study was carried out by milan et al. [16], who sequenced two normalized cdna libraries representing a mixture of adult tissues and larvae from r. philippinarum. even more recently, ghiselli et al. [17] de novo assembled the r. philippinarum gonad transcriptome with the illumina technology. moreover, a few transcripts encoded by genes putatively involved in the clam immune response against perkinsus olseni have been reported by cdna library sequencing [18]. currently (19/12/2011), there are 5,662 ests belonging to r. philippinarum in the genbank database. the european marine genomics network has increased the number of ests for marine mollusc species, particularly for ecologically and commercially important groups that are less studied, such as mussels and clams [19]. unfortunately, most of the available resources are not annotated or well described, limiting the identification of important genes and genetic markers for future aquaculture applications. the use of 454-pyrosequencing is a fast and efficient approach for gene discovery and enrichment of transcriptomes in non-model organisms [20]. this relatively low-cost technology facilitates the rapid production of a large volume of data, which is its main advantage over conventional sequencing methods [21]. in the present work, we undertook an important effort to significantly increase the number of r. philippinarum ests in the public databases. specifically, the aim of this work was to discover new immune-related genes using pyrosequencing on the 454 gs flx (roche-454 life sciences) platform with the titanium reagents. to achieve this goal, we sequenced the transcriptome of r. 
philippinarum hemocytes previously stimulated with different pathogen-associated molecular patterns (pamps) to obtain the greatest possible number of immune-related transcripts. the raw data are accessible in the ncbi short read archive (accession number: sra046855.1). the r. philippinarum normalized cdna library was sequenced with 454 gs flx technology as shown in figure 1. sequencing and assembly statistics are summarized in table 1. briefly, a total of 975,190 raw nucleotide reads averaging 284.1 bp in length were obtained. of these, 974,976 exceeded our minimum quality standards and were used in the mira assembly. a total of 842,917 quality reads were assembled into 51,265 contigs, corresponding to 29.9 megabases (mb). the length of the contigs varied from 40 to 5565 bp, with an average length of 582.4 bp and an average coverage of 5.7 reads. singletons were discarded, resulting in 37,093 contigs formed by at least 2 ests, and 26,675 of these contigs were longer than 500 bp. clustering the contigs resulted in 1,689 clusters with more than one contig. the distribution of contig length and the number of ests per contig, as well as the contig distribution by cluster, are all shown in figure 2. even though the knowledge of expressed genes in bivalves has increased in the last few years, it is still limited. indeed, only 41,598 nucleotide sequences, 362,149 ests, 24,139 proteins and 704 genes from the class bivalvia have been deposited in the genbank public database (19/12/11), and the top entries are for the mytilus and crassostrea genera. for ruditapes philippinarum, these numbers are reduced to 5,662 ests, 612 proteins and 12 genes. this shows the lack of information, which prompted the recent efforts to increase the number of annotated sequences of bivalves in the databases. for non-model species, functional and comparative genomics is possible after obtaining good est databases. 
these studies seem to be the best resource for deciphering the putative function of novel genes, which would otherwise remain ''unknown''. ncbi swissprot, ncbi metazoan refseq, the ncbi nonredundant and the uniprotkb/trembl protein databases were chosen to annotate the contigs that were at least 100 bp long (49,847). the percentage of contigs annotated with a cut-off e-value of 10e-3 was 44.7%. contig sequences and annotations are included in table s1. of these contigs, 3.26% matched sequences from bivalve species and the remaining matched to non-bivalvia mollusc classes (4.13%), other animals (81.38%), plants (2.58%), fungi (1.78%), protozoa (1.50%), bacteria (4.95%), archaea (0.20%), viruses (0.21%) and undefined sequences (0.01%). as shown in figure 3a, the species with the most sequence matches was homo sapiens with 3,106 occurrences. the first mollusc in the top 35 list was lymnaea stagnalis at position 11. the first bivalve, meretrix lusoria, appeared at position 17. r. philippinarum was at position 25 with 124 occurrences. notably, a high percentage of the sequences had homology with chordates, arthropods and gastropods (figure 3b and c), and only 343 contigs matched with sequences from the veneroida order (figure 3d). these values can be explained by the higher representation of those groups in the databases as compared to bivalves and the quality of the annotation in the databases, which has been reported in another bivalve transcriptomic study [22]. the data shown highlight, once again, the necessity of enriching the databases with bivalve sequences. a detailed classification of predicted protein function is shown for the top 35 blastx hits (figure 4a). the list is headed by actin with 903 occurrences, followed by ferritin, an angiopoietin-like protein and lysozyme. 
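the annotation step described above (contigs of at least 100 bp, hits kept below an e-value cut-off) can be sketched as a filter over tabular blast output. this is an illustration only: it assumes the standard 12-column tab-separated format in which the query name is the first column and the e-value the eleventh, which the source does not state, and it leaves the cut-off as a parameter because the source's "10e-3" can be read either as 10^-3 or as 0.01. the function name and the example records are hypothetical.

```python
def annotated_fraction(blast_lines, contig_lengths, max_evalue,
                       min_length=100):
    """Fraction of eligible contigs with at least one hit at or below
    the e-value cut-off.

    blast_lines    -- iterable of tab-separated hit lines (12 columns assumed)
    contig_lengths -- mapping from contig name to length in bp
    """
    eligible = {name for name, n in contig_lengths.items() if n >= min_length}
    annotated = set()
    for line in blast_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.rstrip("\n").split("\t")
        query, evalue = fields[0], float(fields[10])
        if query in eligible and evalue <= max_evalue:
            annotated.add(query)
    return len(annotated) / len(eligible) if eligible else 0.0


if __name__ == "__main__":
    lengths = {"c1": 250, "c2": 90, "c3": 400, "c4": 120}
    hits = [
        "c1\thitA\t80.0\t80\t5\t0\t1\t240\t1\t80\t1e-30\t120",
        "c3\thitB\t30.0\t40\t20\t1\t1\t120\t5\t44\t0.5\t25",
        "c2\thitC\t90.0\t30\t2\t0\t1\t90\t1\t30\t1e-40\t150",
    ]
    print(annotated_fraction(hits, lengths, max_evalue=1e-3))
```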
an abundance of proteins directly involved in the immune response was predicted for this 454 run; ferritin, lysozyme, c1q domain containing protein, galectin-3 and hemagglutinin/amebocyte aggregation factor precursor are immune-related proteins present on the top 35 list. ferritin has an important role in the immune response. it captures circulating iron to overcome an infection and also functions as a proinflammatory cytokine via the iron-independent nuclear factor kappa b (nf-κb) pathway [23]. lysozyme is a key protein in the innate immune responses of invertebrates against gram-negative bacterial infections and could also have antifungal properties. in addition, it provides nutrition through its digestive properties, as it is a hydrolytic protein that can break the glycosidic bonds of the peptidoglycans of the bacterial cell wall [24]. the c1q domain containing proteins are a family of proteins that form part of the complement system. the c1q superfamily members have been found to be involved in pathogen recognition, inflammation, apoptosis, autoimmunity and cell differentiation. in fact, c1q can be produced in response to infection and it can promote cell survival through the nf-κb pathway [25]. galectin-3 is a central regulator of acute and chronic inflammatory responses through its effects on cell activation, cell migration, and the regulation of apoptosis in immune cells [26]. the hemagglutinin/amebocyte aggregation factor is a single chain polypeptide involved in blood coagulation and in adhesion processes such as self/non-self recognition, agglutination and aggregation. the hemagglutinin/amebocyte aggregation factor and lectins play important roles in defense, specifically in the recognition and destruction of invading microorganisms [27].
other proteins that are not specifically related to the immune response but could play a role in defense mechanisms include angiopoietin-like proteins, apolipoprotein d and the integral membrane protein 2b. in other animals, angiopoietin-like proteins (angptl) potently regulate angiogenesis, but a subset also function in energy metabolism. specifically, angptl2, the most represented angptl, promotes vascular inflammation rather than angiogenesis in skin and adipose tissues. inflammation occurs via the α5β1 integrin/rac1/nf-κb pathway, which is evidenced by an increase in leukocyte infiltration, blood vessel permeability and the expression of inflammatory cytokines (tumor necrosis factor-α, interleukin-6 and interleukin-1β) [28]. apolipoprotein d (apod) has been associated with inflammation. pathological and stressful situations involving inflammation or growth arrest have the capacity to increase its expression. this effect seems to be triggered by lps, interleukin-1, interleukin-6 and glucocorticoids and is likely mediated by the nf-κb pathway, as there are several conserved nf-κb binding sites in the apod promoter (apre-3 and ap-1 binding sites are also present). the highest affinity ligand for apod is arachidonic acid, which apod traps when it is released from the cellular membrane after inflammatory stimuli, thus preventing its subsequent conversion into pro-inflammatory eicosanoids. within the cell, apod could modulate signal transduction pathways and nuclear processes such as transcription activation, cell cycling and apoptosis. in summary, apod induction is specific to ongoing cellular stress and could be part of the protective components of mild inflammation [29] [30] [31]. finally, the short form of the integral membrane protein 2b (itm2bs) can induce apoptosis via a caspase-dependent mitochondrial pathway [32]. to avoid redundancy, the longest contig of each cluster was used for gene ontology terms assignment.
a total of 23.05% of the representative clusters matched with at least one go term. concerning cellular components (figure 4b), the highest percentage of go terms were in the groups of cell and cell part with 25.9% in each; organelle and organelle part represented 19.67% and 11.38%, respectively. within the molecular function classification (figure 4c), the most represented group was binding with 49.25% of the terms, followed by catalytic activity (29.12%) and structural molecule activity (4.60%). with regard to biological process (figure 4d), cellular and metabolic processes were the most represented groups with 16.78% and 12.43% of the terms, respectively, followed by biological regulation (10.18%). similarities between the r. philippinarum transcriptome and the sequences of another four bivalve species were analyzed by comparative genomics (crassostrea gigas of the family ostreidae, bathymodiolus azoricus and mytilus galloprovincialis of the family mytilidae and laternula elliptica of the family laternulidae). this analysis could identify specific transcripts that are conserved in these five species. a venn diagram was constructed using unique sequences from these databases according to the gene identifier (gi id number) of each sequence in its respective database: 207,764 from c. gigas, 76,055 from b. azoricus, 121,318 from m. galloprovincialis and 1,034,379 from l. elliptica. c. gigas was chosen because it is the most represented bivalve species in the public databases. the other three species are bivalves that have been studied in transcriptomic assays. figure 5 shows that of the total 29,679 clusters, 72% were found exclusively in the r. philippinarum group, while only 7.59% shared significant similarity with all five species. the number of coincidences among other groups was very low (4.14% to 0.31% of sequences), suggesting that 21,454 new sequences were discovered within the bivalve group.
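the venn diagram counts described above amount to grouping clusters by the set of species in which they found a significant hit. the following python sketch shows one way to tabulate such regions; the function names and the toy input are hypothetical, and the sketch assumes the blastn hits have already been reduced to one set of species per cluster.

```python
from collections import Counter


def venn_regions(cluster_hits):
    """Count clusters per combination of species with a significant hit.

    cluster_hits maps a cluster id to the set of species in which it had
    at least one hit; an empty set means the cluster matched no other species.
    """
    return Counter(frozenset(species) for species in cluster_hits.values())


def region_percentages(regions):
    """Express each region as a percentage of all clusters."""
    total = sum(regions.values())
    return {region: 100.0 * count / total for region, count in regions.items()}


# Toy input (hypothetical cluster ids, not data from the study).
toy_hits = {
    "cluster_1": set(),
    "cluster_2": {"c. gigas"},
    "cluster_3": {"c. gigas", "m. galloprovincialis"},
    "cluster_4": set(),
}
regions = venn_regions(toy_hits)
shares = region_percentages(regions)
print(shares[frozenset()])  # share of clusters with no hit in the other species
```

applied to the figures reported above, 21,454 clusters without hits out of 29,679 correspond to about 72.3%, consistent with the 72% stated for the r. philippinarum-only group.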
the percentage of new sequences is very high compared to previous transcriptomic studies [33] [34], in which the fraction of new transcripts was approximately 45%. one possible explanation for this discrepancy is the low number of nucleotide and est sequences currently available in public databases for r. philippinarum, but these transcripts could also be regions in which homology is not reached, such as 5′ and 3′ untranslated regions or genes with a high mutation rate. on the other hand, a comparison between our 454 results and the milan et al. [16] transcriptome using a blastn approach is summarized in table 2.

immune-related sequences

r. philippinarum hemocytes were subjected to immune stimulation using several different pamps to enrich the est collection with immune-related sequences. the objective was to obtain a more complete view of clam responses to pathogens. a keyword list and go immune-related terms were used to find proteins putatively involved in the immune system. after this selection step, we found that more than 10% of the proteins predicted from the contig sequences had a possible immune function. some sequences were found to be clustered in common, well-recognized immune pathways, such as the complement, apoptosis and toll-like receptor pathways, indicating conserved ancient mechanisms in bivalves (figures 6, 7, 8). the complement system is composed of over 30 plasma proteins that collaborate to distinguish and eliminate pathogens. c3 is the central component in this system. in vertebrates, it is proteolytically activated by a c3 convertase through the classic, lectin-induced and alternative routes [35]. although the complement pathway has not been extensively described in bivalves, there is evidence that supports the presence of this defense mechanism. ests with homology to the c1q domain have been detected in the american oyster, c. virginica [36], the tropical clam codakia orbicularis [37], the zhikong scallop chlamys farreri [38] and the mussel m. galloprovincialis [39] [40]. more recently, a novel c1q adiponectin-like protein, a c3 and a factor b-like protein have been identified in the carpet shell clam r. decussatus [41] [42]. these data support the putative presence of the complement system in bivalves. our pyrosequencing results, using the blastx similarity approach, showed that the complement pathway in r. philippinarum was almost complete as compared to the kegg reference pathway (figure 6). only the complement components c1r, c1s, c6, c7 and c8 were not detected.

i. lectins. lectins are a family of carbohydrate-recognition proteins that play crucial self- and non-self-recognition roles in innate immunity and can be found in soluble or membrane-associated forms. they may initiate effector mechanisms against pathogens, such as agglutination, immobilization and complement-mediated opsonization and lysis [43]. several types of lectins have been cloned or purified from the manila clam, r. philippinarum [44] [45] [46], and their function and expression were also studied [18, 47]. also, a manila clam tandem-repeat galectin, which is induced upon infection with perkinsus olseni, has been characterized [46]. lectin sequences have been found in the stimulated hemocytes studied in our work: 23 of the contigs are homologous to c-type lectins (calcium-dependent carbohydrate-binding lectins that have characteristic carbohydrate-recognition domains), 115 are homologous to galectins (characterized by a conserved sequence motif in their carbohydrate recognition domain and a specific affinity for β-galactosides), 4 contigs have homology with ficolin a and b (a group of oligomeric lectins with subunits consisting of both collagen-like and fibrinogen-like domains) and 34 contigs have homology with other groups of lectins such as lactose-, mannose- or sialic acid-binding lectins.

ii. β-glucan recognition proteins. β-glucan recognition proteins are involved in the recognition of invading fungal organisms. they bind specifically to β-1,3-glucan, stimulating short-term immune responses. although these receptors have been partially sequenced in several bivalves, there is only one complete description of them, in the scallop chlamys farreri [48]. two contigs with homology to the beta-1,3-glucan-binding protein were found in our study.

iii. peptidoglycan recognition proteins. peptidoglycan recognition proteins (pgrps) specifically bind peptidoglycan, which is a major component of the bacterial cell wall. this family of proteins influences host-pathogen interactions through pro- and anti-inflammatory properties that are independent of their hydrolytic and antibacterial activities. in bivalves, they were first identified in the scallops c. farreri and a. irradians [49, 50] and in the pacific oyster c. gigas, from which four different types of pgrps were identified [51]. peptidoglycan-recognition proteins and a peptidoglycan-binding domain containing protein have been found for the first time in r. philippinarum in our results and were present 4 and 1 times, respectively.

iv. toll-like receptors. toll-like receptors (tlrs) are an ancient family of pattern recognition receptors that play key roles in detecting non-self substances and activating the immune system. the only known bivalve tlr was identified and characterized in the zhikong scallop, c. farreri [52]. tlr 2, 6 and 13 were present among the pyrosequencing results. tlr2 and tlr6 form a heterodimer, which senses and recognizes various components from bacteria, mycoplasma, fungi and viruses [53]. tlr13 is a novel and poorly characterized member of the toll-like receptor family. although the exact role of tlr13 is currently unknown, phylogenetic analysis indicates that tlr13 is a member of the tlr11 subfamily [54], suggesting that it could recognize uropathogenic e. coli [55].
it has been demonstrated that tlr13 colocalizes and interacts with unc93b1, a molecule located in the endoplasmic reticulum, which strongly suggests that tlr13 might be found inside cells and might play a role in recognizing viral infections [56]. figure 7 summarizes the tlr signaling pathway with the corresponding molecules found in the r. philippinarum transcriptome. pathogen proteases are important virulence factors that facilitate infection, diminish the activity of lysozymes and quench the agglutination capacity of hemocytes. because protease inhibitors play important roles in invertebrate immunity by protecting hosts through the direct inactivation of pathogen proteases, many bivalves have developed protease inhibitors to regulate the activities of pathogen proteases [1]. some genes encoding protease inhibitors were identified in c. gigas [57], a. irradians [58], c. farreri [59] and c. virginica; in the latter, a novel family of serine protease inhibitors was also characterized [60] [61] [62]. a total of 23 contigs with homology to serine, cysteine, kunitz- and kazal-type protease inhibitors and metalloprotease inhibitors were found among our results. lysozyme was one of the most represented groups of immune genes in this transcriptome study with 208 contigs present. it is an antibacterial molecule present in numerous animals including bivalves. although lysozyme activity was first reported in molluscs over 30 years ago, complete sequences were published only recently, including those of r. philippinarum [24]. antimicrobial peptides (amps) are small, gene-encoded, cationic peptides that constitute important innate immune effectors in organisms spanning most of the phylogenetic spectrum. amps alter the permeability of the pathogen membrane and cause cellular lysis [63]. in bivalves, they were first purified from mussel hemocyte granules [64, 65].
in mussels, the amp myticin c was found to have a high polymorphic variability as well as chemotactic and immunoregulatory roles [66, 67]. in clams, two amps with similarity to mussel myticin and mytilin [68] and a big defensin [69] are known. we were able to detect 36 contigs with homology to different defensins: defensin-1 (american oyster defensin), defensin mgd-1 (mediterranean mussel defensin) and the big defensin previously mentioned. four contigs were similar to an unpublished defensin sequence from venerupis (= ruditapes) philippinarum. the primary role of heat shock proteins (hsps) is to function as molecular chaperones. their up-regulation also represents an important mechanism in the stress response [70], and their activity is closely linked to the innate immune system. hsps mediate the mitochondrial apoptosis pathway and affect the regulation of nf-κb [71]. hsps are well studied in bivalves. for r. philippinarum, several assays have been developed to better understand the hsp profile in response to heavy metal and pathogen stresses [72] [73] [74]. the most important and well-studied groups of hsps were present in our r. philippinarum transcriptome (hsp27, hsp40/dnaj, hsp70 and hsp90), but other, less common hsps were also represented (hsp10, hsp22, hsp83 and some members of the hsp90 family). recently, several genes related to the inflammatory response against lps stimulation have been detected in bivalves. such is the case of the lps-induced tnf-α factor (litaf), which is a novel transcription factor that critically regulates the expression of tnf-α and various inflammatory cytokines in response to lps stimulation. it has been described in three bivalve species: pinctada fucata [75], c. gigas [76] and c. farreri [77].
other tnf-related genes have been identified in the zhikong scallop, such as a tnfr homologue [78] and a tumor necrosis factor receptor-associated factor 6 (traf6), which is a key signaling adaptor molecule common to the tnfr superfamily and to the il-1r/tlr family [79]. figure 7 shows several components of the tlr signaling pathway that are present in our transcriptomic sequences (myd88, irak4, traf-3 and -6, tram, btk, rac-1, pi3k, akt and tank). a total of 1,918 contigs, 8.43% of those annotated, had homology with the main groups of putatively pathogenic organisms such as viruses (47 hits), bacteria (1,126 hits), protozoa (341 hits) and fungi (404 hits). figure 9 displays the taxonomic classification of these sequences and table 3 summarizes a list of the known bivalve pathogens found in our results. bacteria constitute the main group found among the sequences not belonging to the clam. as filter-feeding animals, bivalves can concentrate a large amount of bacteria, which could be one of their sources of food [24]. because vibrio spp. are ubiquitous in aquatic ecosystems, it was expected that the vibrionales order, with 141 hits, would be the most predominant. several species of the vibrio genus are among the main causes of disease in bivalves, specifically causing bacillary necrosis in larval stages [80]. it is notable that sequences belonging to vibrio tapetis, the causative agent of brown ring disease in adult manila clams, have not been found. perkinsus marinus, with 2 matches, is the only bivalve pathogen found within the protozoa (alveolata) group. perkinsosis is produced by species from the genus perkinsus. both p. marinus and p. olseni have been associated with mortalities in populations of various groups of molluscs around the world and are catalogued as notifiable pathogens by the oie. viruses were the least represented among pathogens.
the baculoviridae family was the most predominant with 21 matches, but the corresponding sequences were inhibitors of apoptosis (iaps) [81] that could also be part of the clam's transcriptome. five viral families were found in our transcriptome study: iridoviridae, herpesviridae, malacoherpesviridae, picornaviridae and retroviridae. a well-known bivalve pathogen was also identified, the ostreid herpesvirus 1, which has previously been found to infect clams [82]. fungi had 404 matches in our results. it is known that bivalves are sensitive to fungal diseases, which can degrade the shell or affect the larval bivalve stages [83, 84]. this study represents the first r. philippinarum transcriptome analysis focused on its immune system using a 454-pyrosequencing approach and complements the recent pyrosequencing assay carried out by milan et al. [16]. the discovery of new immune sequences was effective, resulting in an enormous variety of contigs corresponding to molecules that could play a role in the defense mechanisms. more than 10% of our results were related to immunity. this new resource is now available in the ncbi short read archive with the accession number sra046855.1. our results will provide a rich source of data to discover and identify new genes, which will serve as a basis for microarray construction and gene expression studies as well as for the identification of genetic markers for various applications, including the selection of families in the aquaculture sector. we have found sequences from molecules never described in bivalves before, such as c2, c4, c5, c9, aif, bax, akt, tlr6 and tlr13, among others. as a part of this work, three immune pathways in r. philippinarum have been characterized (apoptosis, the toll-like receptor signaling pathway and the complement cascade), which could help us to better understand the resistance mechanisms of this economically important aquaculture clam species.
animal sampling and in vitro stimulation of hemocytes

r. philippinarum clams were obtained from a commercial shellfish farm (vigo, galicia, spain). clams were maintained in open circuit filtered sea water tanks at 15 °c with aeration and were fed. a total of 100 clams were notched in the shell in the area adjacent to the anterior adductor muscle. a sample of 500 µl of hemolymph was withdrawn from the adductor muscle of each clam with an insulin syringe, pooled and then distributed in 6-well plates, 7 ml per well, in a total of 7 wells, one for each treatment. hemocytes were allowed to settle to the base of the wells for 30 min at 15 °c in darkness. then, the hemocytes were stimulated with 50 µg/ml of polyinosinic:polycytidylic acid (poly i:c), peptidoglycans, β-glucan, vibrio anguillarum dna (cpg), lipopolysaccharide (lps), lipoteichoic acid (lta) or 1×10^6 ufc/ml of heat-inactivated vibrio anguillarum (one stimulus per well) for 3 h at 15 °c. all stimuli were purchased from sigma.

pyrosequencing

after stimulation, hemolymph was centrifuged at 1700 g at 4 °c for 5 minutes, the pellet was resuspended in 1 ml of trizol (invitrogen) and rna was extracted following the manufacturer's protocol. after rna extraction, samples were treated with turbo dnase free (ambion) to eliminate dna. next, the concentration and purity of the rna samples were measured using a nanodrop nd1000 spectrophotometer. the rna quality was assessed in a bioanalyzer 2010 (agilent technologies). from each sample, 1 µg of rna was pooled and used for the production of normalized cdna for 454 sequencing in the unitat de genòmica (sct-ub, barcelona, spain). full-length-enriched double stranded cdna was synthesized from 1.5 µg of pooled total rna using the mint cdna synthesis kit (evrogen, moscow, russia) according to the manufacturer's protocol, and was subsequently purified using the qiaquick pcr purification kit (qiagen usa, valencia, ca).
the amplified cdna was normalized using the trimmer kit (evrogen, moscow, russia) to minimize differences in representation of transcripts. the method involves denaturation-reassociation of cdna, followed by digestion with a duplex-specific nuclease (dsn) enzyme [85, 86]. the enzymatic degradation occurs primarily on the highly abundant cdna fraction. the single-stranded cdna fraction was then amplified twice by sequential pcr reactions according to the manufacturer's protocol. normalized cdna was purified using the qiaquick pcr purification kit (qiagen usa, valencia, ca). to generate the 454 library, 500 ng of normalized cdna were used. cdna was fractionated into small, 300- to 800-basepair fragments and the specific a and b adaptors were ligated to both the 3′ and 5′ ends of the fragments. the a and b adaptors were used for purification, amplification, and sequencing steps.

figure 8 (legend, partial). pkc: protein kinase c; pten: phosphatidylinositol-3,4,5-trisphosphate 3-phosphatase and dual-specificity protein phosphatase pten; raidd: caspase and rip adapter with death domain; tnf r1: tumor necrosis factor receptor 1; tnf-α: tumor necrosis factor alpha; tradd: tnf receptor type 1-associated death domain protein; traf2: tnf receptor-associated factor 2; trail: tnf-related apoptosis-inducing ligand; trail decoy: decoy trail receptor without death domain; trail-r: trail receptor. doi:10.1371/journal.pone.0035009.g008

one sequencing run was performed on the gs-flx using titanium chemistry. 454 sequencing is based on sequencing-by-synthesis: the addition of one or more nucleotides complementary to the template strand results in a chemiluminescent signal recorded by the ccd camera within the instrument. the signal strength is proportional to the number of nucleotides incorporated in a single nucleotide flow. all reagents and protocols used were from roche 454 life sciences, usa.
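the proportionality between signal strength and the number of incorporated nucleotides can be made concrete with a small python sketch. it is an idealised model only: the cyclic flow order tacg, the function names and the noise-free signals are assumptions made for illustration, and the actual roche base-calling software is not reproduced here.

```python
def ideal_flowgram(read, flow_order="TACG", max_flows=400):
    """Return idealised per-flow signals for synthesising `read`.

    Each flow offers one nucleotide; the signal equals the number of
    consecutive bases incorporated in that flow (0 if none).
    """
    signals = []
    pos = 0
    flow = 0
    while pos < len(read) and flow < max_flows:
        base = flow_order[flow % len(flow_order)]
        run = 0
        # A homopolymer run is incorporated within a single flow.
        while pos < len(read) and read[pos] == base:
            run += 1
            pos += 1
        signals.append(run)
        flow += 1
    return signals


def call_bases(signals, flow_order="TACG"):
    """Convert (possibly noisy) signals back to bases by rounding each flow."""
    return "".join(
        flow_order[i % len(flow_order)] * round(s) for i, s in enumerate(signals)
    )


signals = ideal_flowgram("TTACCCG")
print(signals, call_bases(signals))
```

because a whole homopolymer run is read from the intensity of one flow, small errors in signal strength translate into insertions or deletions within long runs, which is one reason homopolymers are screened before assembly.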
pyrosequencing raw data, comprising 975,190 reads, were processed with the roche quality control pipeline using the default settings. seqclean (http://compbio.dfci.harvard.edu/tgi/software/) software was used to screen for and remove normalization adaptor sequences, homopolymers and reads shorter than 40 bp prior to assembly. a total of 974,973 quality reads were subjected to mira, version 3.2.0 [87], to assemble the transcriptome. by default, mira takes into account only contigs with at least 2 reads. the other reads go into debris, which might include singletons, repeats, low complexity sequences and sequences shorter than 40 bp. ncbi blastclust was used to group similar contigs into clusters (groups of transcripts from the same gene). two sequences were grouped if at least 60% of the positions had at least 95% identity. the 51,265 contigs were grouped into a total of 29,679 clusters. an iterative blast workflow was used to annotate the r. philippinarum contigs with at least 100 bp (49,847 contigs out of 51,265). blastx [88], with a cut-off e-value of 10e-3, was used to compare the r. philippinarum contigs with the ncbi swissprot, the ncbi metazoan refseq, the ncbi nr and the uniprotkb/trembl protein databases. after annotation, blast2go software [89] was used to assign gene ontology terms [90] to the longest contig of each representative cluster (minimum of 100 bp). this strategy was used to avoid redundant results. default values in blast2go were used to perform the analysis and ontology level 2 was selected to construct the pie charts. to make a comparison between r. philippinarum and other bivalve species, the nucleotide sequences and ests from c. gigas, m. galloprovincialis, l. elliptica and b. azoricus were obtained from genbank and from dedicated databases, when available [93]. unique sequences (based on gi number) were used from each of these databases.
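the choice of one representative contig per cluster, as used above for the gene ontology assignment, can be sketched in a few lines of python. the function name, the input dictionaries and the toy identifiers are hypothetical; the sketch assumes that cluster membership (for example from blastclust output) has already been parsed into a contig-to-cluster mapping.

```python
def representative_contigs(contig_lengths, contig_cluster, min_length=100):
    """Pick the longest contig of each cluster, if it reaches min_length.

    contig_lengths maps contig id -> length in bp;
    contig_cluster maps contig id -> cluster id.
    Ties keep the contig encountered first.
    """
    best = {}
    for contig, cluster in contig_cluster.items():
        current = best.get(cluster)
        if current is None or contig_lengths[contig] > contig_lengths[current]:
            best[cluster] = contig
    return {
        cluster: contig
        for cluster, contig in best.items()
        if contig_lengths[contig] >= min_length
    }


# Toy input (hypothetical identifiers, not data from the study).
lengths = {"c1": 450, "c2": 900, "c3": 80, "c4": 120}
clusters = {"c1": "k1", "c2": "k1", "c3": "k2", "c4": "k3"}
representatives = representative_contigs(lengths, clusters)
print(representatives)
```

using a single representative per cluster keeps several contigs of the same gene from being counted more than once in downstream summaries.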
the unique sequences from the four bivalve species were compared by blastn against the longest contig from each of the 29,679 r. philippinarum clusters with a cut-off e-value of 10e-05. hits to r. philippinarum sequences were represented in a venn diagram. the comparison between our 454 results (the longest contig from each of the 29,679 clusters) and the milan et al. [16] transcriptome (contigs downloaded from ruphibase, http://compgen.bio.unipd.it/ruphibase/query/) was made by blastn with a cut-off e-value of 10e-05. another analysis was carried out to compare only the longest contig from each of the 2,005 clusters identified as immune-related with the milan et al. contigs. the results are summarized in table 2. the percentage of coverage is the average % of query coverage by the best blast hit, and the percentage of hits is the % of queries with at least one hit in the database; the total number of hits is given in parentheses.

identification of immune-related genes

all the contig annotations were revised based on an immunity and inflammation-related keyword list (e.g. apoptosis, bactericidal, c3, lectin, socs…) developed in our laboratory to select the candidate sequences putatively involved in the immune response. the presence or absence of these words in the blastx hit descriptions was checked to identify putative immune-related contigs. the remaining non-selected contigs were revised using the go terms at levels 2, 3 and 4 that had been assigned to each sequence after the annotation step and that had a direct relationship with immunity. selected contigs were checked again to eliminate non-immune ones and were distributed into functional categories. immune-related genes were grouped in three reference immune pathways (complement cascade, tlr signaling pathway and apoptosis) to describe each route indicated by our pyrosequencing results. to identify and classify the groups of organisms that had high similarity with our clam sequences, the uniprot taxonomy [94] was used except for the protozoa group.
because protozoa are a highly complex group, a specific taxonomy [95] was followed. briefly, after the blastx annotation step all the hit descriptions included the species name (e.g. homo sapiens) or a code (e.g. human), meaning that the protein has been previously identified as belonging to that species. with this information, sequences were classified in taxonomical groups and represented in pie charts.

table s1. list of contigs (e-value < 10^-3) of ruditapes philippinarum, including sequence, length, description (hit description), accession number of description (hit acc), e-value obtained and database used for annotation (blast).

[1] study of diseases and the immune system of bivalves using molecular biology and genomics
[2] bacterial disease in marine bivalves, review of recent studies. trends and evolution
[3] perkinsosis in molluscs: a review
[4] bacteria-hemocyte interactions and phagocytosis in bivalves
[5] role of lectins (c-reactive protein) in defense of marine bivalves against bacteria
[6] modulation of the chemiluminescence response of mediterranean mussel (mytilus galloprovincialis) haemocytes
[7] immune parameters in carpet shell clams naturally infected with perkinsus atlanticus
[8] nitric oxide production by carpet shell clam (ruditapes decussatus) hemocytes
[9] generation and analysis of a 29,745 unique expressed sequence tags from the pacific oyster (crassostrea gigas) assembled into a publicly accessible database, the gigasdatabase
[10] immune gene discovery by expressed sequence tags generated from hemocytes of the bacteria-challenged oyster, crassostrea gigas
[11] sequence variability of myticins identified in haemocytes from mussels suggests ancient host-pathogen interactions
[12] mytibase, a knowledgebase of mussel (m. galloprovincialis) transcribed sequences
[13] insights into the innate immunity of the mediterranean mussel mytilus galloprovincialis
[14] development of expressed sequence tags from the pearl oyster, pinctada martensii dunker
[15] gene expression analysis of clams ruditapes philippinarum and ruditapes decussatus following bacterial infection yields molecular insights into pathogen resistance and immunity
[16] transcriptome sequencing and microarray development for the manila clam, ruditapes philippinarum: genomic tools for environmental monitoring
[17] de novo assembly of the manila clam ruditapes philippinarum transcriptome provides new insights into expression bias, mitochondrial doubly uniparental inheritance and sex determination
[18] analysis of est and lectin expression in hemocytes of manila clams (ruditapes phylippinarum) (bivalvia, mollusca) infected with perkinsus olseni
[19] increasing genomic information in bivalves through new est collections in four species, development of new genetic markers for environmental studies and genome evolution
[20] rapid transcriptome characterization for a nonmodel organism using 454 pyrosequencing
[21] sequencing technologies - the next generation
[22] transcriptomic analysis of the clam meretrix meretrix on different larval stages
[23] ferritin functions as a proinflammatory cytokine via iron-independent protein kinase c zeta/nuclear factor kappab-regulated signaling in rat hepatic stellate cells
[24] cloning and characterization of an invertebrate type lysozyme from venerupis philippinarum
[25] c1q and tumor necrosis factor superfamily: modularity and versatility
[26] the regulation of inflammation by galectin-3
[27] isolation, cdna cloning, and characterization of an 18-kda hemagglutinin and amebocyte aggregation factor from limulus polyphemus
[28] angiopoietin-like proteins: emerging targets for treatment of obesity and related metabolic diseases
[29] modulation of apolipoprotein d expression and translocation under specific stress conditions
[30] neuroprotective effect of apolipoprotein d against human coronavirus oc43-induced encephalitis in mice
[31] apolipoprotein d
[32] itm2bs regulates apoptosis by inducing loss of mitochondrial membrane potential
[33] transcriptomic signatures of ash (fraxinus spp.) phloem
[34] transcriptomics of the bed bug (cimex lectularius)
[35] complement and its role in innate and adaptive immune responses
[36] potential indicators of stress response identified by expressed sequence tag analysis of hemocytes and embryos from the american oyster, crassostrea virginica
[37] analysis of a cdna-derived sequence of a novel mannose-binding lectin, codakine, from the tropical clam codakia orbicularis
[38] a novel c1q-domain-containing protein from zhikong scallop chlamys farreri with lipopolysaccharide binding activity
[39] the c1q domain containing proteins of the mediterranean mussel mytilus galloprovincialis: a widespread and diverse family of immune-related molecules
[40] mgc1q, a novel c1q-domain-containing protein involved in the immune response of mytilus galloprovincialis
[41] differentially expressed genes of the carpet shell clam ruditapes decussatus against perkinsus olseni
[42] characterization of a c3 and a factor b-like in the carpet-shell clam, ruditapes decussatus
[43] structural and functional diversity of lectin repertoires in invertebrates, protochordates and ectothermic vertebrates
[44] purification and characterisation of a lectin isolated from the manila clam ruditapes philippinarum in korea
[45] characterization, tissue expression, and immunohistochemical localization of mcl3, a c-type lectin produced by perkinsus olseni-infected manila clams (ruditapes philippinarum)
[46] noble tandem-repeat galectin of manila clam ruditapes philippinarum is induced upon infection with the protozoan parasite perkinsus olseni
[47] lectin from the manila clam ruditapes philippinarum is induced upon infection with the protozoan parasite perkinsus olseni
[48] cdna cloning and mrna expression of the lipopolysaccharide- and beta-1,3-glucan-binding protein gene from scallop chlamys farreri
[49] molecular cloning and characterization of a short type peptidoglycan recognition protein (cfpgrp-s1) cdna from zhikong scallop chlamys farreri
[50] molecular cloning and mrna expression of peptidoglycan recognition protein (pgrp) gene in bay scallop (argopecten irradians, lamarck 1819)
[51] distribution of multiple peptidoglycan recognition proteins in the tissues of pacific oyster, crassostrea gigas
[52] molecular cloning and expression of a toll receptor gene homologue from zhikong scallop, chlamys farreri
[53] pattern recognition receptors and inflammation
[54] the evolution of vertebrate toll-like receptors
[55] a toll-like receptor that prevents infection by uropathogenic bacteria
[56] unc93b1 delivers nucleotide-sensing toll-like receptors to endolysosomes
[57] cg-timp, an inducible tissue inhibitor of metalloproteinase from the pacific oyster crassostrea gigas with a potential role in wound healing and defense mechanisms
[58] molecular cloning, characterization and expression of a novel serine proteinase inhibitor gene in bay scallops (argopecten irradians, lamarck 1819)
[59] molecular cloning and expression of a novel kazal-type serine proteinase inhibitor gene from zhikong scallop chlamys farreri, and the inhibitory activity of its recombinant domain
[60] a novel slow-tight binding serine protease inhibitor from eastern oyster (crassostrea virginica) plasma inhibits perkinsin, the major extracellular protease of the oyster protozoan parasite perkinsus marinus
[61] evidence indicating the existence of a novel family of serine protease inhibitors that may be involved in marine invertebrate immunity
[62] serine protease inhibitor cvsi-1 potential role in the eastern oyster host defense against the protozoan parasite perkinsus marinus
[63] antimicrobial peptides: pore formers or metabolic inhibitors in bacteria?
[64] innate immunity. isolation of several cysteine-rich antimicrobial peptides from the blood of a mollusc, mytilus edulis
[65] a member of the arthropod defensin family from edible mediterranean mussels (mytilus galloprovincialis)
[66] evidence of high individual diversity on myticin c in mussel (mytilus galloprovincialis)
[67] mytilus galloprovincialis myticin c: a chemotactic molecule with antiviral activity and immunoregulatory properties
[68] analysis of differentially expressed genes in response to bacterial stimulation in hemocytes of the carpet-shell clam ruditapes decussatus: identification of new antimicrobial peptides
[69] molecular characterization of a novel big defensin from clam venerupis philippinarum
[70] heat shock proteins: facts, thoughts, and dreams
[71] heat shock proteins, cellular chaperones that modulate mitochondrial cell death pathways
[72] djla, a membrane-anchored dnaj-like protein, is required for cytotoxicity of clam pathogen vibrio tapetis to hemocytes
[73] alternation of venerupis philippinarum hsp40 gene expression in response to pathogen challenge and heavy metal exposure
[74] identification of two small heat shock proteins with different response profile to cadmium and pathogen stresses in venerupis philippinarum
[75] molecular characterization and expression analysis of a putative lps-induced tnf-alpha factor (litaf) from pearl oyster pinctada fucata
[76] cloning, characterization and expression analysis of the gene for a putative lipopolysaccharide-induced tnf-alpha factor of the pacific oyster
[77] molecular cloning and characterization of a putative lipopolysaccharide-induced tnf-alpha factor (litaf) gene homologue from zhikong scallop chlamys farreri
[78] first molluscan tnfr homologue in zhikong scallop: molecular characterization and expression analysis
[79] identification and expression of traf6 (tnf receptor-associated factor 6) gene in zhikong scallop chlamys farreri
[80] diversity and pathogenecity of vibrio species in cultured bivalve molluscs
[81] an apoptosis-inhibiting baculovirus gene with a zinc finger-like motif
detection of ostreid herpesvirus 1 dna by pcr in bivalve molluscs: a critical review synopsis of infectious diseases and parasites of commercially exploited shellfish a fungus disease in clam and oyster larvae a novel method for snp detection using a new duplex-specific nuclease from crab hepatopancreas simple cdna normalization using kamchatka crab duplex-specific nuclease using the miraest assembler for reliable and automated mrna transcript assembly and snp detection in sequenced ests basic local alignment search tool blast2go, a universal tool for annotation, visualization and analysis in functional genomics research gene ontology, tool for the unification of biology. the gene ontology consortium pyrosequencing of mytilus galloprovincialis cdnas: tissue-specific expression patterns insights into shell deposition in the antarctic bivalve laternula elliptica: gene discovery in the mantle transcriptome using 454 pyrosequencing highthroughput sequencing and analysis of the gill tissue transcriptome from the deep-sea hydrothermal vent mussel bathymodiolus azoricus newt, a new taxonomy portal the new higher level classification of eukaryotes with emphasis on the taxonomy of protists key: cord-033010-o5kiadfm authors: durojaye, olanrewaju ayodeji; mushiana, talifhani; uzoeto, henrietta onyinye; cosmas, samuel; udowo, victor malachy; osotuyi, abayomi gaius; ibiang, glory omini; gonlepa, miapeh kous title: potential therapeutic target identification in the novel 2019 coronavirus: insight from homology modeling and blind docking study date: 2020-10-02 journal: egypt j med hum genet doi: 10.1186/s43042-020-00081-5 sha: doc_id: 33010 cord_uid: o5kiadfm background: the 2019-ncov which is regarded as a novel coronavirus is a positive-sense single-stranded rna virus. it is infectious to humans and is the cause of the ongoing coronavirus outbreak which has elicited an emergency in public health and a call for immediate international concern has been linked to it. 
the coronavirus main proteinase, also known as the 3c-like protease (3clpro), is an essential protein in all coronaviruses because of its role in viral replication and in the proteolytic processing of the viral polyproteins. the cytotoxic effect that results from sustained viral replication and polyprotein processing can be greatly reduced by inhibiting the main proteinase. this makes the 3c-like protease a promising target for therapeutic agents against the viral infection. results: this study describes the detailed computational process by which the 2019-ncov main proteinase coding sequence was mapped out from the full viral genome and translated, and the resulting amino acid sequence used to model the protein 3d structure. comparative physicochemical studies were carried out on the target protein and its template, while selected hiv protease inhibitors were docked against the protein binding sites, which contained no co-crystallized ligand. conclusion: in line with the results of this study, which are consistent with other scientific findings on coronaviruses, we recommend the administration of the selected hiv protease inhibitors as first-line therapeutic agents for the treatment of the current coronavirus epidemic.

the first cluster of pneumonia cases of unknown origin was identified in early december 2019 in the city of wuhan, hubei province, china [1] . a novel beta coronavirus, currently referred to as the 2019 novel coronavirus [2] , was identified after high-throughput sequencing of the viral genome, which closely resembles that of the severe acute respiratory syndrome coronavirus (sars-cov) [3] .
the 2019-ncov is the seventh member of the enveloped rna coronavirus family (subgenus sarbecovirus, orthocoronavirinae) [3] , and there is accumulating evidence from family settings and hospitals confirming that the virus is most likely transmitted from person to person [4] . the 2019-ncov has also recently been declared a public health emergency of international concern by the world health organization [5] . as of the 5th of february 2020, over 24,000 cases had been confirmed and documented by laboratories around the world [6] , while more than 28,000 such cases had been documented in china through laboratory confirmation as of the 6th of february 2020 [7] . despite the rapid global spread of the virus, the clinical characteristics peculiar to the 2019-ncov acute respiratory disease (ard) remain largely unclear [8] . over 8000 infections and 900 deaths were recorded worldwide before the severe acute respiratory syndrome wave was successfully contained in the summer of 2003; the disease was likewise a major public health concern worldwide [9, 10] . the infection responsible for these deaths was linked to a new coronavirus, known as the sars coronavirus (sars-cov). coronaviruses are positive-stranded rna viruses and possess the largest known viral rna genomes. the first major step in containing the sars-cov-linked infection was to sequence the viral genome, the organization of which was found to be similar to the genomes of other coronaviruses [11] . the crystal structures of the main proteinase from both the transmissible gastroenteritis virus (tgev) and the human coronavirus (hcov 229e) have been determined, revealing that the enzyme exists as a dimer in which the individual protomers are oriented perpendicular to each other. each of the protomers is made up of three domains [12] .
the first and second domains of the protomers have a two-β-barrel fold that resembles the fold of the chymotrypsin-like serine proteinases. domain iii has five α-helices and is linked to the second domain by a long loop. each protomer has its own substrate-binding region, positioned in the cleft between the first and second domains. dimerization of the protein is thought to be a function of the third domain [13] . the main proteinase of the sars cov is a cysteine proteinase with a cysteine-histidine catalytic dyad in its active site. the sars cov main proteinase is very highly conserved across the genome sequences of all sars coronaviruses, and it is also highly homologous to the main proteinases of other coronaviruses. because the crystal structures of the different coronavirus main proteinases are highly similar, and almost all the amino acid side chains involved in dimer formation are conserved, it was proposed that the dimer might be the only biologically functional form of the coronavirus main proteinase [14] . more recently, chen et al., in a study that combined molecular dynamics simulations with enzyme activity measurements on a hybrid enzyme, showed that the proteinase is active only in its dimeric state [15] . recent studies based on a structural model of the coronavirus main proteinase built by sequence homology with tgev, as well as on the solved crystal structure, have involved the docking of substrate analogs and the virtual screening of natural products, a collection of synthetic compounds, and approved antiviral therapeutic agents to evaluate inhibition of the coronavirus main proteinase [16] . some compounds from these studies were identified as inhibitors of the viral proteinase.
these compounds include l-700,417, which is an hiv-1 protease inhibitor; calanolide a and nevirapine, both of which are reverse transcriptase inhibitors; glycovir, an α-glucosidase inhibitor; sabadinine, which is a natural product; and ribavirin, a general antiviral agent [17] . ribavirin was shown to exhibit antiviral activity against the sars coronavirus in vitro, at cytotoxic concentrations. at the start of the first outbreak of the sars epidemic, ribavirin was administered as a first line of defense, both as a monotherapy and in combination with corticosteroids or the hiv protease inhibitor kaletra [18] . in a very recent study conducted by cao et al., a total of 199 laboratory-confirmed sars-cov-2-infected patients underwent a randomized, controlled, open-label trial in which 100 patients were assigned to the standard care group and 99 patients to the lopinavir-ritonavir group. 48.4% of the patients in the lopinavir-ritonavir group (46 patients) and 49.5% of the patients in the standard care group (49 patients) exhibited serious adverse events between randomization and the 28th day. the adverse events included acute respiratory distress syndrome (ards), acute kidney injury, severe anemia, acute gastritis, hemorrhage of the lower digestive tract, pneumothorax, unconsciousness, sepsis, and acute heart failure, among others. in addition, patients in the lopinavir-ritonavir group specifically exhibited gastrointestinal adverse events, including diarrhea, vomiting, and nausea [19] . our current study took advantage of the availability of the sars cov main proteinase amino acid sequence to map out the nucleotide coding region for the same protein in the 2019-ncov. two selected hiv protease inhibitors (lopinavir and ritonavir) were then targeted at the catalytic site of the protein 3d structure, which was modeled using already available templates.
the predicted activity of the drug candidates was validated by targeting them against a recently crystallized 3d structure of the enzyme, which has been made available for download in the protein data bank. lopinavir is an antiretroviral protease inhibitor used in combination with ritonavir in the therapy and prevention of human immunodeficiency virus (hiv) infection and the acquired immunodeficiency syndrome (aids). it acts as an antiviral drug and an hiv protease inhibitor. it is a member of amphetamines and a dicarboxylic acid diamide (fig. 1 ).

the complete genome of the isolated wuhan seafood market pneumonia virus (2019-ncov) was downloaded from the genbank database under the accession number mn908947.3. the nucleotide sequence of the full genome was copied out in fasta format. the genbank sequence database is an open-access, annotated collection of all publicly available nucleotide sequences together with their translated protein segments. the database is designed and managed by the national center for biotechnology information (ncbi) as part of the international nucleotide sequence database collaboration (insdc) [20] . nucleotides 10055 to 10972 of the 2019-ncov genome were selected as the sequence of interest. the nucleotide sequence of interest in the 2019-ncov was translated, and the sars cov main proteinase amino acid sequence was back-translated, using the emboss transeq and backtranseq tools, respectively [21] . transeq reads one or more nucleotide sequences and writes the resulting translated protein sequence to file, while backtranseq makes use of a codon usage table which gives the usage frequency of each codon for every amino acid [22] . for every amino acid in the input sequence, the most frequently occurring codon is used in the nucleotide sequence that forms the output.
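the translation and back-translation steps described above can be sketched in a few lines of python. this is a minimal illustration and not the emboss implementation: the function names `extract_region`, `translate` and `back_translate` are hypothetical, the standard genetic code is assumed, and the codon preference table used for back-translation is a placeholder rather than a real codon usage table.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code (NCBI translation table 1); codons enumerated as TTT, TTC, TTA, TTG, TCT, ...
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def extract_region(genome: str, start: int, end: int) -> str:
    """Return the segment start..end using 1-based, inclusive coordinates."""
    return genome[start - 1:end]


def translate(nucleotides: str) -> str:
    """Translate in frame 1; '*' marks a stop codon, 'X' an unrecognised codon."""
    seq = nucleotides.upper().replace("U", "T")
    usable = len(seq) - len(seq) % 3  # ignore a trailing partial codon
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, usable, 3))


def back_translate(protein: str, preferred_codon: dict) -> str:
    """Replace each amino acid with the codon chosen for it in `preferred_codon`."""
    return "".join(preferred_codon[aa] for aa in protein.upper())


# Placeholder preference table (assumption): the first codon listed for each amino acid.
# A real run would take the most frequent codon from a codon usage table instead.
PLACEHOLDER_CODONS = {}
for codon, aa in CODON_TABLE.items():
    PLACEHOLDER_CODONS.setdefault(aa, codon)

# The region 10055..10972 spans 918 nucleotides, i.e. 306 codons.
region_length = 10972 - 10055 + 1
assert region_length == 918 and region_length // 3 == 306
```

with the full genome loaded as a string, `translate(extract_region(genome, 10055, 10972))` would give the 306-residue sequence of interest, and a result free of `*` characters corresponds to the absence of stop codons.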
the amino acid sequence generated by the transeq translation of the nucleotide sequence of interest had no stop codons and as such was used directly for protein homology modeling without the need for any deletion. two sets of sequence alignments were carried out in this study. the first was the alignment between the translated nucleotide sequence of the 2019-ncov genome, which was used for the protein homology modeling, and the amino acid sequence of the sars cov main proteinase, while the second was the alignment between the back-translated sars cov main proteinase nucleotide sequence and the 2019-ncov full genome. the latter was used in mapping out the protein coding sequence in the 2019-ncov full genome. these alignments were carried out using the clustal omega software package. clustal omega can read nucleotide and amino acid sequence inputs in formats such as a2m/fasta, clustal, msf, phylip, selex, stockholm, and vienna [23] .

template search with blast and hhblits was performed against the swiss-model template library (smtl). the target sequence was searched with blast against the primary amino acid sequences contained in the smtl. a total of 120 templates were found. an initial hhblits profile was built using the procedure outlined in remmert et al. [24] followed by 1 iteration of hhblits against nr20. the obtained profile was then searched against all profiles of the smtl. a total of 192 templates were found. models were built based on the target-template alignment using promod3. coordinates which are conserved between the target and the template are copied from the template to the model. insertions and deletions are remodeled using a fragment library. side chains are then rebuilt and, finally, the geometry of the resulting model is regularized using a force field [25] .
for the estimation of the protein structure model quality, we used qmean (qualitative model energy analysis), a composite scoring function that describes the main aspects of protein structural geometry and can derive both global (i.e., for the entire structure) and local (i.e., per residue) absolute quality estimates on the basis of a single model [26] . a considerable number of alternative models were produced, from which the final model was selected on the basis of its scores. the qmean score was thus used in the selection of the most reliable model, against which the consensus structural scores were calculated. molprobity (version 4.4) was used as the structure-validation tool that produced the broad-spectrum evaluation of the quality of the target protein at both the global and local levels. it relies heavily on the sensitivity and power provided by optimized hydrogen placement and all-atom contact analysis, complemented by updated versions of covalent-geometry and torsion-angle criteria [27] . the torsion angles of the individual residues of the target protein were assessed using the ramachandran plot. this is a plot of the backbone torsion angles [phi (φ) and psi (ψ)] of the amino acid residues making up a peptide. in the order of sequence, φ is the torsion angle of c(i-1), n(i), ca(i), c(i) while ψ is the torsion angle of n(i), ca(i), c(i), n(i+1). the values of φ were plotted on the x-axis while the values of ψ were plotted on the y-axis [28] . plotting the torsion angles in this way shows graphically which combinations of angles are allowed. the quaternary structure annotation of the template is employed to model the target sequence in its oligomeric state. the methodology proposed by bertoni et al. [29] is based on a supervised machine learning algorithm, support vector machines (svm), which combines interface conservation, structural clustering, and other template features to produce a quaternary structure quality estimate (qsqe). the qsqe score is a number between 0 and 1 that reflects the expected accuracy of the inter-chain contacts for a model built from a given template and alignment. a higher score indicates a more reliable result. this complements the gmqe score, which estimates the accuracy of the 3d structure of the resulting model.

the 3d structural homology modeling of the translated segment of the 2019-ncov genome was followed by a structural comparison with the sars cov main proteinase 3d structure (pdb: 1uj1). this was achieved using ucsf chimera, a highly extensible tool for interactive analysis and visualization of molecular structures and related data, including docking results, supramolecular assemblies, density maps, sequence alignments, trajectories, and conformational ensembles [30] . high-quality animation videos were also generated. the amino acid constituents of the target protein secondary structures were colored and visualized in 3d using the pymol molecular visualizer, which uses the opengl extension wrangler library (glew) and freeglut. the py in pymol refers to python, the programming language in which the software is written [31] . the percentage composition of each component making up the secondary structure was calculated using the chou and fasman secondary structure prediction (cfssp) server. this is a secondary structure predictor that predicts, from an input amino acid sequence, the regions making up the alpha helices, beta sheets, and turns.
the secondary structure prediction output is displayed in a linear sequential graphical view according to the occurrence probability of each secondary structure component. the methodology implemented in cfssp is the chou-fasman algorithm, which is based on analyses of the relative frequencies of each amino acid residue in alpha helices, beta sheets, and loops in known protein structures solved by x-ray crystallography [32]. the expasy server calculates protein physicochemical parameters as one of its functions, primarily for the identification of proteins [33] . we used the protparam tool to calculate various physicochemical parameters of the model and template proteins for comparison purposes. the calculated parameters include the molecular weight, theoretical isoelectric point, amino acid composition, extinction coefficient, instability index, etc.

the evolutionary relationship was inferred using the maximum likelihood method based on the jtt matrix-based model [34] . the bootstrap consensus tree inferred from a thousand replicates was taken to represent the evolutionary history of the analyzed taxa. branches corresponding to partitions reproduced in less than 50% of the bootstrap replicates were automatically collapsed. the percentage of replicate trees in which the associated taxa clustered together in the bootstrap test (a thousand replicates) is displayed next to each branch. initial trees for the heuristic search were obtained automatically by applying the neighbor-join and bionj algorithms to a matrix of pairwise distances estimated using a jtt model, and then selecting the topology with the superior log likelihood value. the phylogenetic analysis was carried out on 12 amino acid sequences with close identity. the complete dataset contained a total of 306 positions.
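some of the sequence-derived parameters listed above for the protparam comparison can be illustrated with a short python sketch. this is not the protparam code: the function names are hypothetical, the molecular weight is approximated from average residue masses plus one water molecule, and the extinction coefficient uses the commonly quoted 280 nm contributions of 5500 per tryptophan, 1490 per tyrosine and 125 per cystine, with the number of cystines supplied by the caller as an assumption.

```python
from collections import Counter

# Average residue masses in daltons (free amino acid minus one water molecule).
RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER_MASS = 18.01528


def composition(sequence: str) -> dict:
    """Percentage of each amino acid present in the sequence."""
    counts = Counter(sequence.upper())
    return {aa: 100.0 * n / len(sequence) for aa, n in counts.items()}


def molecular_weight(sequence: str) -> float:
    """Sum of average residue masses plus one water for the free termini."""
    return sum(RESIDUE_MASS[aa] for aa in sequence.upper()) + WATER_MASS


def extinction_coefficient(sequence: str, cystines: int = 0) -> int:
    """Approximate molar extinction coefficient at 280 nm (M^-1 cm^-1)."""
    counts = Counter(sequence.upper())
    return 5500 * counts["W"] + 1490 * counts["Y"] + 125 * cystines
```

calling these functions on the single-letter sequences of the model and the template gives values that can be placed side by side for comparison.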
the whole analysis was conducted using the molecular evolutionary genetics analysis (mega) software (version 7) [35] .

ligand preparation and molecular docking protocol

2d structures of the experimental ligands were viewed in the pubchem repository and sketched using the chemaxon software [36] . the sketched structures were downloaded and saved as mrv files, which were converted into smiles strings with open babel. the compounds prepared as ligands were docked against each of the prepared protein receptors using autodock vina [37] . blind docking analysis was performed at extra precision mode with minimized ligand structures. after a successful docking, a file consisting of all the poses generated by autodock vina along with their binding affinities and rmsd scores was generated. in the vina output log file, the first pose was considered the best because it has a stronger binding affinity than the other poses and, being the reference pose, an rmsd value of zero. the polar interactions and binding orientation at the active site of the proteins were viewed in pymol, and the docking scores for each ligand screened against each receptor protein were recorded. the same docking protocol was performed against the sars-cov main proteinase 3d structure that was downloaded from the protein data bank with a pdb identity of 6m2n. the outputs obtained were visualized, compared, and documented for validation purposes.

the full genome of the 2019-ncov (https://www.ncbi.nlm.nih.gov/nuccore/mn908947.3?report=fasta) consists of 29903 nucleotides, but for the purpose of this study, nucleotides 10055 to 10972 were considered in order to locate the protein of interest. the direct translation of this segment of nucleotides produced a sequence of 306 amino acids (fig. 2 ). because the segment contained no stop codons, no deletion of any form was needed to reach this amino acid count. as depicted in fig. 3 , few structural differences were noticed. the amino acid sequences making up these nonconserved regions are shown in fig. 4 . notwithstanding, a 96% identity was observed between both sequences, showing that the conserved domains were predominant. figure 4 shows the amino acid sequence alignment between the target and the template protein, where positions marked with an asterisk (*) are fully conserved residues, positions marked with a colon (:) indicate conservation between amino acid residues with strongly similar properties, and positions marked with a period (.) indicate conservation between amino acids with weakly similar properties. the amino acid sequence of the sars coronavirus main proteinase was back-translated to generate the corresponding nucleotide sequence, which was then aligned with the 2019-ncov full genome. this was carried out to map out the region of the 2019-ncov full genome where the proteinase coding sequence is located. as depicted in fig. 5 , the target protein coding sequence is located between nucleotides 10055 and 10972 of the viral genome.

the outcome of a qmean analysis is anchored on a composite scoring function which evaluates several features of the structure of the target protein. the estimated absolute quality of the model is expressed in terms of how well the model score agrees with the values expected from a set of high-resolution experimental structures. the global score can be either qmean4 or qmean6. qmean4 is a linear combination of four statistical potential terms, while qmean6 additionally uses two agreement terms that evaluate the consistency of structural features with sequence-based predictions.
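the percentage identity reported for the target-template alignment can be reproduced from any pair of aligned sequences with a short python sketch. the function name `percent_identity` is hypothetical, and the choice of denominator is an assumption: here only columns in which neither sequence has a gap are counted, whereas alignment tools differ in how they treat gapped columns.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percentage of identical residues over the ungapped columns of a pairwise alignment."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == "-" or b == "-":
            continue  # skip columns containing a gap in either sequence
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0
```

the two arguments are the rows of the alignment exactly as written by the alignment program, gaps included, so that equal positions in the two strings refer to the same alignment column.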
both qmean4 and qmean6 originally lie in the range of 0 to 1, with 1 being the best score, and are by default transformed into z-scores (table 1) so that they can be related to what would be expected from high-resolution x-ray structures. the local scores are also a linear combination of four statistical potential terms, with the agreement terms evaluated on a per-residue basis. the local scores are also estimated in the range of 0 to 1, where 1 is the best score (fig. 6) . when compared to the set of non-redundant protein structures, the qmean z-score of the target protein as shown in fig. 7 was 0. the models located in the dark zone of the graph have z-scores with an absolute value of less than 1, while models outside the dark zone have either 1 < |z-score| < 2 or |z-score| > 2. good models are often located in the dark zone.

whenever torsion-angle values outside the clustered α- and β-regions of the ramachandran plot are found, they result in some strain in the polypeptide chain; in such cases, the stability of the structure depends greatly on additional interactions, but the conformation may be conserved in a protein family for its structural significance. another exception to the principle of clustering in the α- and β-regions can be seen in the right-hand plot of fig. 8 , where only the torsion angles of the glycine residues are displayed on the ramachandran plot. glycine has no side chain, and this gives the polypeptide chain flexibility, making otherwise forbidden rotation angles accessible. for this reason, glycine is more concentrated in the loop regions, where sharp bends can occur in the polypeptide, and it is highly conserved in protein families, as the presence of turns at specific positions is a characteristic feature of particular structural folds.

the comparative physicochemical parameters of the template and target proteins were computed by protparam from the amino acid sequences of the individual proteins.
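the φ and ψ values shown on a ramachandran plot are dihedral angles computed from four consecutive backbone atoms. the python sketch below illustrates the calculation using the standard backbone definition, φ from c(i-1), n(i), ca(i), c(i) and ψ from n(i), ca(i), c(i), n(i+1). it is not the code used by molprobity or any other validation server, and the function names and the coordinate format (x, y, z tuples in angstroms) are assumptions made for illustration.

```python
import math


def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dihedral(p0, p1, p2, p3) -> float:
    """Signed dihedral angle in degrees about the p1-p2 bond."""
    b0 = _sub(p0, p1)
    b1 = _sub(p2, p1)
    b2 = _sub(p3, p2)
    norm = math.sqrt(_dot(b1, b1))
    b1 = (b1[0] / norm, b1[1] / norm, b1[2] / norm)
    # Project b0 and b2 onto the plane perpendicular to the central bond.
    v = _sub(b0, tuple(_dot(b0, b1) * c for c in b1))
    w = _sub(b2, tuple(_dot(b2, b1) * c for c in b1))
    x = _dot(v, w)
    y = _dot(_cross(b1, v), w)
    return math.degrees(math.atan2(y, x))


def phi(c_prev, n, ca, c) -> float:
    """phi of residue i from C(i-1), N(i), CA(i), C(i)."""
    return dihedral(c_prev, n, ca, c)


def psi(n, ca, c, n_next) -> float:
    """psi of residue i from N(i), CA(i), C(i), N(i+1)."""
    return dihedral(n, ca, c, n_next)
```

collecting the (φ, ψ) pair of every residue that has both neighbours, and plotting φ on the x-axis against ψ on the y-axis, gives the scatter that a ramachandran plot displays.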
no additional information was required about the proteins under consideration, and each of the complete sequences was analyzed. the amino acid sequence of the target protein has not been deposited in the swiss-prot database. for this reason, the physicochemical properties were computed by entering the standard single-letter amino acid sequences of both proteins into the text field, as shown in tables 3, 4.

the two hiv protease inhibitors (lopinavir and ritonavir), when targeted at the modeled 2019-ncov catalytic site, gave significant inhibition attributes; hence, the in silico study was planned through molecular docking analysis with autodock vina. the binding orientation of the drugs in the protein active site as viewed in the pymol molecular visualizer (fig. 11) showed an induced-fit binding conformation. the same compounds were targeted against the active site of the downloaded pdb 3d structure of the sars-cov main proteinase (pdb 6m2n) for comparison purposes (fig. 12) . the active site residues as visualized in pymol are shown in fig. 13 . the binding of lopinavir to the target protein, which produced the best binding score, was used as the predictive model. residues within 5 angstroms of the bound ligand were assumed to form the binding site.

fig. 13 the combined view of the 3d structural comparison between the modeled target protein and the downloaded pdb structure of the viral protein (left column) and their primary sequence alignment (right column). the target protein is colored in grey while its protein data bank equivalent is colored in red. the high structural similarity between the two proteins was validated through their sequence alignment, which produced a 99.34% sequence identity score.

fig. 8 depicted here are two ramachandran plots. the plot on the left-hand side shows the general torsion angles for all the residues in the target protein while the plot on the right-hand side is specific for the glycine residues of the protein.

fig. 9 the target protein secondary structures with bound lopinavir. at the top is the secondary structure visualization in pymol with regions making up the alpha helices, beta sheets, and loops shown in light blue, purple, and brown, respectively. below is the prediction by cfssp where the red, green, yellow, and blue lines depict regions of the helices, sheets, turns, and coils (loops), respectively. the predicted secondary structure composition shows a high proportion of alpha helices and beta sheets, occupying 45 and 47% of the total residues, respectively, with the loop occupancy at 8%.

homology modeling, a computational method for modeling the 3d structure of proteins that is also regarded as comparative modeling, constructs atomic models based on known structures, that is, structures that have been determined experimentally, which share more than 40% sequence homology with the target. the underlying principle is that two proteins with highly similar amino acid sequences are likely to have similar three-dimensional structures. when one of the proteins has an already determined 3d structure, the structure of the unknown one can be modeled on it with a high degree of confidence. in homology modeling, alpha carbon positions are predicted more accurately than side chains and loop regions, which are mostly inaccurate. as regards template selection, homologous proteins with determined structures are searched for in the protein data bank (pdb); to be considered for selection, templates must have, alongside a minimum of 40% identity with the target sequence, the highest possible resolution and appropriate cofactors [29] . in this study, the target protein was modeled using the sars coronavirus main proteinase as template.
this selection was based on the high resolution of the template and its identity with the target protein, which is as high as 96%. qualitative model energy analysis (qmean) is a composite scoring function that describes protein structures on the basis of major geometrical aspects. the qmean scoring function calculates the global quality of a model as a linear combination of six structural descriptors, four of which are statistical potentials of mean force. local geometry is analyzed by a torsion angle potential over three consecutive amino acids. in protein structure prediction, final models are often selected after a considerable number of alternative models have been produced; hence, the prediction relies on a scoring function that identifies the best structural model within a collection of alternatives. two distance-dependent interaction potentials, based on c_β atoms and on all atoms, respectively, are used to assess long-range interactions. a solvation potential describes the burial status of the amino acid residues, and the two remaining terms reflect the agreement between predicted and calculated secondary structure and solvent accessibility [38]. the resultant target protein can be considered a good model, as the z-scores of the c_β interaction energy, all-atom pairwise energy, solvation energy, and torsion angle energy are −0.35, −0.65, −0.77, and 0.36, respectively, as shown in table 1. the quality of a model can be compared with high-resolution reference structures obtained by x-ray crystallography through the z-score, where 0 is the average z-score value for a good model [26].
according to benkert et al., qmean z-score provides an estimate value of the degree of nativeness of the structural features that can be observed in a model, and this is an indication that the model is of a good quality in comparison to other experimental structures [26] . our study shows the z-score of the target is "0" as indicated in fig. 6 and such a score is an indication of a relatively good model as it possesses the average z-score for a perfect model. properties of the model that is predicted determines the molprobity scores. work initially done on all-atom contact analysis has shown that proteins possess exquisitely well-packed structures with favorable van der waals interactions which overlap between atoms that do not form hydrogen bonds [39] . unfavorable steric clashes are correlated strongly with the quality of data that are often poor where a near zero occurrence of such steric clashes occurs in the ordered regions of crystal structures with high resolution. therefore, low values of clash scores are indications of a very good model which likewise has been proven by the clash score value exhibited by the target protein that was modeled for the purpose of this study (table 2 ). in addition to the clash score, the protein conformation details are remarkably relaxed, such as staggered χ angles and even staggered methyl groups [40] . applied forces to a given local motif in environments predominantly made up of folded protein interior can produce a locally strained conformation but significant strain are kept near functionally needed minimum by evolution and this is on the presumption that the stability of proteins is too marginal for high tolerance. in traditional validation measures updates, there has been a compilation of rigorously quality-filtered crystal structures through homology, resolution, and the overall score validation at file level, by b-factor and sometimes at residue level, by all-atom steric clashes. 
the resulting multi-dimensional distributions, generated after adequate smoothing, are used to score the "protein-like" nature of each local conformation in relation to known structures, either for backbone ramachandran values or for side chain rotamers [41]. at high resolution, rotamer outliers are expected at < 1%, general-case ramachandran outliers at < 0.05%, and ramachandran favored residues at 98%. in this regard, the molprobity score (mpscore) is defined as

$\mathrm{mpscore} = 0.426 \ln(1 + \mathrm{clashscore}) + 0.33 \ln(1 + \max(0, \mathrm{rota\_out} - 1)) + 0.25 \ln(1 + \max(0, \mathrm{rama\_iffy} - 2)) + 0.5$

where the clashscore is the number of unfavorable all-atom steric overlaps ≥ 0.4 å per 1000 atoms [38]. rota_out is the percentage of side chain conformations classed as rotamer outliers, out of the side chains that can be evaluated, while rama_iffy is the percentage of backbone ramachandran conformations that fall outside the favored region, out of the residues that can be evaluated. the coefficients are derived from a log-linear fit to crystallographic resolution on a filtered set of pdb structures, so that the mpscore of a model is the resolution at which each of its scores would be the expected value; thus, lower mpscores indicate a better model. with a clash score of 2.06, a ramachandran favored value of 95.66%, and ramachandran and rotamer outliers of 0.83% and 5.21%, respectively, we arrived at a molprobity score of 1.82, which is low enough to indicate a good-quality model for our experimental protein. the characteristic repetitive conformation of amino acid residues is the basis for the repetitive nature of the secondary structures, hence the repetitive values of φ and ψ.
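as a quick check of the mpscore definition given earlier in this passage, the following minimal python sketch evaluates the formula with the values reported for the modeled protein. the function and argument names are illustrative, and it assumes that rama_iffy equals 100 minus the ramachandran favored percentage.

```python
import math


def molprobity_score(clashscore, rotamer_outlier_pct, rama_favored_pct):
    """Combine three validation statistics into one resolution-like score."""
    # Assumption: rama_iffy is the percentage of residues outside the
    # favored Ramachandran region, i.e. 100 minus the favored percentage.
    rama_iffy = 100.0 - rama_favored_pct
    return (
        0.426 * math.log(1 + clashscore)
        + 0.33 * math.log(1 + max(0.0, rotamer_outlier_pct - 1))
        + 0.25 * math.log(1 + max(0.0, rama_iffy - 2))
        + 0.5
    )


# Values reported in the text for the modeled target protein.
score = molprobity_score(
    clashscore=2.06, rotamer_outlier_pct=5.21, rama_favored_pct=95.66
)
print(round(score, 2))  # 1.82
```

the result agrees with the molprobity score of 1.82 quoted in the text, which suggests the reported statistics are mutually consistent.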
the range of φ and ψ values can be used to distinguish the different secondary structural elements, as the φ and ψ values of each secondary structure element map to their own region on the ramachandran plot. on the ramachandran plot, α-helices cluster around average values of φ = −57° and ψ = −47°, while twisted beta sheets cluster around φ = −130° and ψ = +140° [42]. the core region (green in fig. 8) has the most favorable combinations of φ and ψ values and the highest number of dots. the figure also shows a small third core region in the upper right quadrant. this is known as the allowed region; allowed regions can lie around the core regions or be unassociated with them, and they contain fewer data points than the core regions. the other areas on the plot are regarded as disallowed. since glycine has only one hydrogen atom as its side chain, steric hindrance is less likely to occur as φ and ψ are rotated through a series of values. the glycine residues with φ and ψ values of +55° and −116°, respectively [43], do not exhibit steric hindrance and can for that reason be positioned in the otherwise disallowed region of the ramachandran plot, as shown in the right hand side plot in fig. 8. the extinction coefficient indicates the intensity of light absorbed by a protein at a specific wavelength. estimating this coefficient is important for monitoring a protein undergoing purification in a spectrophotometer. woody in his experiment [44] has shown that a protein's molar extinction coefficient can be estimated from knowledge of its amino acid composition, which is presented in table 3.
the extinction coefficient of the proteins (both the template and the target proteins) was calculated using the equation:

$e(\mathrm{prot}) = \mathrm{numb}(\mathrm{tyr}) \times \mathrm{ext}(\mathrm{tyr}) + \mathrm{numb}(\mathrm{trp}) \times \mathrm{ext}(\mathrm{trp}) + \mathrm{numb}(\mathrm{cystine}) \times \mathrm{ext}(\mathrm{cystine})$

the absorbance (optical density) was calculated using the following formula:

$\mathrm{absorb}(\mathrm{prot}) = e(\mathrm{prot}) / \mathrm{molecular\ weight}$

for these equations to be valid, the following conditions must be met: ph 6.5, 6.0 m guanidium hydrochloride, 0.02 m phosphate buffer.

the identity of the n-terminal residue of a protein is an important factor in determining its stability in vivo and also plays a major role in the proteolytic degradation process mediated by ubiquitin [45]. β-galactosidase proteins with different n-terminal amino acids were designed through site-directed mutagenesis, and the designed proteins have strikingly different half-lives in vivo, ranging from over a hundred hours to less than 2 min, depending on the nature of the amino-terminal residue and on the experimental model (yeast in vivo; mammalian reticulocytes in vitro; e. coli in vivo). individual amino acid residues can thus be ordered by the half-lives they confer when located at the protein's amino terminus [46]. this is referred to as the "n-end rule", on which the estimated half-lives of both the template and target proteins were based.

the instability index provides an estimate of the protein's stability in a test tube. statistical analysis of 32 stable and 12 unstable proteins has shown [47] that certain dipeptides occur at significantly different frequencies in unstable proteins than in stable ones. the authors of this method assigned an instability weight value (diwv) to each of the 400 different dipeptides.
the computation of a protein's instability index is thus possible using these weight values. it is defined as:

$\mathrm{ii} = \frac{10}{l} \sum_{i=1}^{l-1} \mathrm{diwv}(x[i]x[i+1])$

where l is the sequence length and diwv(x[i]x[i + 1]) is the instability weight value for the dipeptide starting at position i. a protein with an instability index below 40 is predicted to be stable, while a value above the 40 threshold indicates that the protein may be unstable. the instability index values for the template and target proteins were 29.67 and 27.65, respectively (table 4), showing that both are stable proteins.

table 3 amino acid composition table for both template and target protein (amino acid residues in one letter codes)
template 17 12 19 17 12 14 9 26 8 12 30 11 10 16 13 16 26 3 11 24
target 17 11 21 17 12 14 9 26 7 11 29 11 10 17 13 16 24 3 11 27

the relative volume occupied by the aliphatic side chains (alanine, valine, isoleucine, and leucine) of a protein is known as its aliphatic index. it may be regarded as a positive factor for increased thermostability of globular proteins. the aliphatic index of the experimental proteins was calculated according to the following formula [48]:

$\mathrm{aliphatic\ index} = x(\mathrm{ala}) + a \times x(\mathrm{val}) + b \times (x(\mathrm{ile}) + x(\mathrm{leu}))$

where x(ala), x(val), x(ile), and x(leu) are the mole percent (100 × mole fraction) of alanine, valine, isoleucine, and leucine. the coefficients a and b are the relative volumes of the valine side chain (a = 2.9) and of the leu/ile side chains (b = 3.9) to the alanine side chain. the calculated aliphatic index shows that the thermostability of the target protein is slightly higher than that of the template.

the most common secondary structures are the alpha helices and beta sheets, although beta turns and omega loops also occur. elements of the secondary structures form spontaneously as intermediates before folding into the corresponding three-dimensional tertiary structures [49].
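the three sequence-derived quantities described in this part of the text (extinction coefficient, instability index, and aliphatic index) can be sketched in a few lines of python. this is an illustrative sketch, not the tool used in the study. the per-residue extinction values (1490, 5500, 125) are the ones documented for expasy protparam and are an assumption here, since the text does not list them. the dipeptide weight table `diwv` is a hypothetical toy dictionary, because the 400 published weights are not reproduced in the text.

```python
# Assumed molar extinction coefficients at 280 nm (M^-1 cm^-1); the text
# does not give them, these are the values documented for ProtParam.
EXT_TYR, EXT_TRP, EXT_CYSTINE = 1490, 5500, 125


def extinction_coefficient(seq, cystines=0):
    """E(Prot) = Numb(Tyr)*Ext(Tyr) + Numb(Trp)*Ext(Trp) + Numb(Cystine)*Ext(Cystine)."""
    return seq.count("Y") * EXT_TYR + seq.count("W") * EXT_TRP + cystines * EXT_CYSTINE


def instability_index(seq, diwv):
    """(10 / L) times the sum of dipeptide weights over positions 1..L-1.

    Dipeptides missing from the table count as 0 in this sketch.
    """
    total = sum(diwv.get(seq[i:i + 2], 0.0) for i in range(len(seq) - 1))
    return 10.0 / len(seq) * total


def aliphatic_index(seq, a=2.9, b=3.9):
    """X(Ala) + a*X(Val) + b*(X(Ile) + X(Leu)), with X in mole percent."""
    def mole_pct(residue):
        return 100.0 * seq.count(residue) / len(seq)

    return mole_pct("A") + a * mole_pct("V") + b * (mole_pct("I") + mole_pct("L"))


toy_seq = "AVILWY"                   # hypothetical six-residue peptide
toy_diwv = {"AV": 1.0, "VI": 2.0}    # hypothetical weights, not published values

ext = extinction_coefficient(toy_seq)
ii = instability_index(toy_seq, toy_diwv)
ai = aliphatic_index(toy_seq)
print(ext, ii, ai)
```

with the real 400-entry weight table and the full 306-residue sequences, the same functions would give values comparable to those in table 4, under the assumptions stated above.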
the stability and how robust the α-helices are to mutations in natural proteins have been shown in previous studies. they have also been shown to be more designable than the beta sheets; thus, designing a functional all-α helix protein is likely to be easier than designing proteins with both α helix and strands, and this has recently been confirmed experimentally [50] . the template and target proteins both have a total of 306 amino acid residues (table 4 ) with the composition of individual residues shown in table 3 . as shown in fig. 9 , the target protein which shares a structural homology with the template (fig. 3 and the animation video) is predominantly occupied by residues forming alpha helix and beta sheets, with very low percentage of the residues forming loops. the stability of these two proteins is revealed in their physiochemical characteristics which can therefore be linked to the high percentage of residues forming alpha helix. the ultimate goal of genome analysis is to understand the biology of organisms in both evolutionary and functional terms, and this involves the combination of different data from various sources [51] . for the purpose of this study, we compared our protein of interest to similar proteins in the ncbi database to predict the evolutionary relationships between homologous proteins represented in the genomes of each divergent species. this makes the amino acid sequence alignment the most suitable form of alignment for the phylogenetic tree construction. organisms with common ancestors were positioned in the same monophyletic group in the tree, and the same node where the protein of interest (the 2019-ncov main proteinase) is positioned also houses the non-structural polyprotein of the 1ab bat sars-like coronavirus. this shows that the two viral proteins share a common source with shorter divergence period. bootstrapping allows evolutionary predictions on the level of confidence. 
one hundred is a very high level of confidence in the positioning of the node in the topology. the lower scores are more likely to happen by chance than it is in the real tree topology [52] . the bootstrap value of the above-mentioned viral proteins which is exactly 100 is a very high level of statistical support for their positioning at the nodes in the branched part of the tree. the length of the branches is a representation of genetic distance. it is also the measure of the time since the viral proteins diverged, which means, the greater the branch length, the likelihood that it took a longer period of time since divergence from the most closely related protein [53] . the tw9 and tjf strains of the sars coronavirus orf1a polyprotein and replicase, respectively, are the most distantly related, based on their branch length and as such can be regarded as the out-group in the tree. structure-based drug discovery is the easiest molecular docking methodology as it screens variety of listed ligands (compounds) in chemical library by "docking" them against proteins of known structures which in this study is the modeled 3d structure of the 2019-ncov main proteinase and showing the binding affinity details alongside the binding conformation of ligands in the enzyme active site [54] . ligand docking can be specific, that is, focusing only on the predicted binding sites of the protein of interest or can be blind docking where the entire area of the protein is covered. most docking tool applications focus on the predicted primary binding site of ligands; however, in some cases, the information about the target protein binding site is missing. blind docking is known to be an unbiased molecular docking approach as it scans the protein structure in order to locate the ideal binding site of ligands [55] . 
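to illustrate what "covering the entire protein" means for a blind docking search space, the following python sketch derives an axis-aligned box from atom coordinates. it is a minimal illustration under stated assumptions: the function name, the 5 å padding, and the toy coordinates are hypothetical and are not taken from the study.

```python
def blind_docking_box(coords, padding=5.0):
    """Return (center, size) of a box enclosing all atoms plus padding on each side.

    coords is an iterable of (x, y, z) tuples in angstroms.
    """
    xs, ys, zs = zip(*coords)
    lows = (min(xs), min(ys), min(zs))
    highs = (max(xs), max(ys), max(zs))
    center = tuple((lo + hi) / 2 for lo, hi in zip(lows, highs))
    size = tuple((hi - lo) + 2 * padding for lo, hi in zip(lows, highs))
    return center, size


# Hypothetical coordinates standing in for a parsed protein structure.
atoms = [(0.0, 0.0, 0.0), (10.0, 4.0, 2.0), (6.0, -2.0, 8.0)]
center, size = blind_docking_box(atoms)
print(center, size)
```

a site-specific docking run would instead use a much smaller box centered on the known or predicted binding pocket.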
the autodock-based blind docking approach was used in this study to search the entire surface of the target and template proteins for binding sites while simultaneously optimizing the conformation of the peptides. for this reason, it was necessary to set up our docking parameters to search the entire surface of the modeled main proteinase of the 2019-ncov. this was achieved by using autogrid to create a very large grid map (center 77 å × −10 å × 15 å and size 30 å × 60 å × 35 å) with the maximum number of points in each dimension, in order to cover the whole protein. we observed a partial overlap in the docking pose of lopinavir at the active site of both template and target protein, in contrast to the conspicuous difference observed in the binding orientation of ritonavir at the protein active sites. these differential poses can be viewed distinctly in the attached animation video. a close view of the binding orientation of the two drug candidates at the 2019-ncov main proteinase active site (fig. 11) is also consistent with the proposed induced fit binding model. in a comparative docking study, the same drug candidates (lopinavir and ritonavir) were docked against the active site of the pdb downloaded version of the viral main proteinase. the docking grid for this purpose was set with precision, as the solved pdb structure of the virus included a cocrystallized ligand at the enzyme active site (center −32 å × −65 å × 42 å and size 25 å × 30 å × 25 å), and the experimental ligands bind to this site with precision and variation in poses (fig. 12). the binding energy results showed a difference of −0.3 kcal/mol between the binding of lopinavir to the template and to the pdb 3d structure of the enzyme (pdb 6m2n), and a difference of −0.5 kcal/mol between the pdb 3d structure of the enzyme and the target protein (tables 5 and 6). the same comparative study was repeated for the binding of ritonavir, and differences of −0.1 and −1.0 kcal/mol were observed for the binding of the drug to the template and target proteins, respectively, in comparison with the binding to the 3d structure of the enzyme downloaded from the pdb. the observed consistency in the binding energies of the drug candidates can also serve as a reference for the validity and quality of the modeled protein, which has exhibited a high sequence and structural similarity with the 3d structure downloaded from the protein data bank (fig. 13).

table 5 the docking results of lopinavir and ritonavir against the template and target protein. the binding of ritonavir to the template protein produced the highest number of inter-model hydrogen bonds, while the binding of lopinavir to the target protein formed polar interactions with three residues at the active site, as compared to the two formed by the other interactions.

table 6 the amino acid residues involved in polar interaction, the number of inter-model hydrogen bonds, and the docking score of lopinavir and ritonavir upon binding to the 3d pdb download of the sars-cov main proteinase (pdb 6m2n).

in an effort to make potent therapeutic agents available against the fast-rising 2019 novel coronavirus epidemic, we identified the coding region from the viral genome and modeled the main proteinase of the virus, and we evaluated the efficacy of existing hiv protease inhibitors by targeting the protein active site using a blind docking approach. our study has shown that lopinavir displays a broader spectrum of inhibition against both the sars coronavirus and 2019-ncov main proteinase than ritonavir does.
the modeled 3d structure of the enzyme has also provided interesting insights regarding the binding orientation of the experimental drugs and possible interactions at the protein active site. the conclusion from the study of cao et al., as previously discussed, has however shown that the administration of the lopinavir-ritonavir therapy might elicit additional health concerns as a result of the extreme adverse events exhibited by the experimental subjects in their study. it was also observed that the drugs showed no increased benefit when compared with the standard supportive care. in view of these findings, we therefore suggest a drug modification approach aimed at avoiding the health concerns posed by the lopinavir-ritonavir combined therapy while retaining their proteinase inhibitory activity.

supplementary information accompanies this paper at https://doi.org/10.1186/s43042-020-00081-5. additional file 1. supplementary information to this article can be found online at https://www.rcsb.org/structure/6m2n

clinical features of patients with 2019 novel coronavirus in wuhan
genomic characterization and epidemiology of 2019 novel coronavirus: implications of virus origins and receptor binding
a novel coronavirus from patients with pneumonia in china
a familial cluster of pneumonia associated with the 2019 novel coronavirus indicating person-to-person transmission: a study of a family cluster
importation and human-to-human transmission of a novel coronavirus in vietnam
national health commission of the people's republic of china
transmission of 2019-ncov infection from an asymptomatic contact in germany
alert, verification and public health management of sars in the post-outbreak period
coronavirus in severe acute respiratory syndrome (sars)
a novel coronavirus and sars
crystal structures of the main peptidase from the sars coronavirus inhibited by a substrate-like aza-peptide epoxide
dissection study on the sars 3c-like protease reveals the critical role of the extra domain in dimerization of the enzyme: defining the extra domain as a new target for design of highly-specific protease inhibitors
3c-like proteinase from sars coronavirus catalyzes substrate hydrolysis by a general base mechanism
only one protomer is active in the dimer of sars 3c-like proteinase
biosynthesis, purification, and substrate specificity of severe acute respiratory syndrome coronavirus 3c-like proteinase
a trial of lopinavir-ritonavir in adults hospitalized with severe covid-19
emboss: the european molecular biology open software suite
srs, an indexing and retrieval tool for flat file data libraries
issues in bioinformatics benchmarking: the case study of multiple sequence alignment
hhblits: lightning-fast iterative protein sequence searching by hmm-hmm alignment
the swiss-prot protein knowledgebase and its supplement trembl in 2003
toward the estimation of the absolute quality of individual protein structure models
molprobity: more and better reference data for improved all-atom structure validation
chapter 2: protein composition and structure
modeling protein quaternary structure of homo- and hetero-oligomers beyond binary interactions by homology
ucsf chimera-a visualization system for exploratory research and analysis
fasman gd (1974) prediction of protein conformation
protein identification and analysis tools on the expasy server
the rapid generation of mutation data matrices from protein sequences
mega7: molecular evolutionary genetics analysis version 7.0 for bigger datasets
chemoinformatics: theory, practice, & products
critical assessment of the automated autodock as a new docking tool for virtual screening
critical assessment of methods of protein structure prediction (casp) round 6
visualizing and quantifying molecular goodness-of-fit: small-probe contact dots with explicit hydrogen atoms
a test of enhancing model accuracy in high-throughput crystallography
the penultimate rotamer library
protein geometry database: a flexible engine to explore backbone conformations and their relationships to covalent geometry
circular dichroism spectrum of peptides in the poly(pro)ii conformation
calculation of protein extinction coefficients from amino acid sequence data
universality and structure of the n-end rule
the n-end rule in bacteria
correlation between stability of a protein and its dipeptide composition: a novel approach for predicting in vivo stability of a protein from its primary sequence
thermostability and aliphatic index of globular proteins
alpha helices are more robust to mutations than beta strands
global analysis of protein folding using massively parallel design, synthesis, and testing
time of the deepest root for polymorphism in human mitochondrial dna
intraspecific nucleotide sequence differences in the major noncoding region of human mitochondrial dna
limitation of the evolutionary parsimony method of phylogenetic analysis
efficient docking of peptides to proteins without prior knowledge of the binding site
molecular recognition and docking algorithms

we appreciate the leadership of the laboratory of cellular dynamics (lcd), university of science and technology of china, for the all-around support and academic advisory role. we also acknowledge the strong support from the ustc office of international cooperation all through the challenging period of the coronavirus epidemic. the authors received no funding for this project from any organization. ethics approval and consent to participate: not applicable. the authors declare that they have no competing interests.

key: cord-266288-buc4dd5y authors: dong, rui; he, lily; he, rong lucy; yau, stephen s.-t.
title: a novel approach to clustering genome sequences using inter-nucleotide covariance date: 2019-04-09 journal: front genet doi: 10.3389/fgene.2019.00234 sha: doc_id: 266288 cord_uid: buc4dd5y classification of dna sequences is an important issue in the bioinformatics study, yet most existing methods for phylogenetic analysis including multiple sequence alignment (msa) are time-consuming and computationally expensive. the alignment-free methods are popular nowadays, whereas the manual intervention in those methods usually decreases the accuracy. also, the interactions among nucleotides are neglected in most methods. here we propose a new accumulated natural vector (anv) method which represents each dna sequence by a point in ℝ(18). by calculating the accumulated indicator functions of nucleotides, we can further find an accumulated natural vector for each sequence. this new accumulated natural vector not only can capture the distribution of each nucleotide, but also provide the covariance among nucleotides. thus global comparison of dna sequences or genomes can be done easily in ℝ(18). the tests of anv of datasets of different sizes and types have proved the accuracy and time-efficiency of the new proposed anv method. with the rapid development of next generation sequencing technology, more and more information of the genome sequences is available. studying sequence similarity is a crucial question in research and can explain phylogenetic relationships by constructing trees. one of the most commonly used methods, multiple sequence alignment (msa) uses dynamic programming, a regression technique that finds an optimal alignment by assigning scores to different possible alignments and taking the one with the highest score (yu et al., 2013a) . however, the computational cost of msa is extremely high and msa may not produce accurate phylogeny for diverse systems of different families of rna viruses (yu et al., 2013b) . 
alignment-free approaches have been developed to overcome those limitations. published alignment-free methods include markov chain models (apostolico and denas, 2008), chaos theory (hatje and kollmar, 2012), and methods based on the frequency statistics of fixed-length oligomers, known as k-mers (sims et al., 2009). yau and his team proposed the natural vector method, which takes the position of each nucleotide into consideration. the natural vector method performs well on many datasets (deng et al., 2011; yu et al., 2013b; hoang et al., 2016; li et al., 2016); however, it only considers the number, average position, and dispersion of positions of each nucleotide. relationships between nucleotides are also important, especially when function may be related to interactions between nucleotides, such as the folding of a chromosome. in this paper, we propose a new accumulated natural vector (anv) method, which considers not only the basic properties of each nucleotide but also the covariance between them.

in the traditional natural vector (nv) method, each sequence is uniquely represented by a single point in $\mathbb{R}^{12}$. the traditional natural vector approach was first introduced in deng et al. (2011). for a sequence of length n:

$n_\alpha$ ($\alpha \in \{a, c, g, t\}$) denotes the number of nucleotide α in the sequence.
$s[\alpha][v]$ is the distance from the first nucleotide (regarded as origin) to the $v$th nucleotide α in the dna sequence.
$t_\alpha = \sum_{v=1}^{n_\alpha} s[\alpha][v]$ denotes the total distance of each set of a, c, g, t from the origin.
$\mu_\alpha = t_\alpha / n_\alpha$ is the mean value of the distances of nucleotide α from the origin.
$d_2^\alpha = \sum_{v=1}^{n_\alpha} (s[\alpha][v] - \mu_\alpha)^2 / (n_\alpha n)$ is the normalized central moment of order 2, which can also be seen as the variance of the positions of nucleotide α.
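the quantities defined above can be computed directly from a sequence. the sketch below is illustrative only and rests on two assumptions that are not confirmed by the text: positions are counted from 1, and the second central moment is normalized by $n_\alpha n$ as in the usual statement of the natural vector. the function name `natural_vector` is hypothetical.

```python
def natural_vector(seq, alphabet="ACGT"):
    """Return the 12 values (4 counts, 4 mean positions, 4 second moments)."""
    seq = seq.upper()
    n = len(seq)
    counts, means, moments = [], [], []
    for base in alphabet:
        # Assumption: positions are 1-based distances along the sequence.
        positions = [i + 1 for i, ch in enumerate(seq) if ch == base]
        n_a = len(positions)
        mu = sum(positions) / n_a if n_a else 0.0
        # Assumption: normalized central moment of order 2 divides by n_a * n.
        d2 = sum((p - mu) ** 2 for p in positions) / (n_a * n) if n_a else 0.0
        counts.append(n_a)
        means.append(mu)
        moments.append(d2)
    return counts + means + moments


nv = natural_vector("ATCTAGCT")
print(nv[:4])   # counts of A, C, G, T: [2, 2, 1, 3]
print(nv[4])    # mean position of A (positions 1 and 5): 3.0
```

grouping the output as counts, then means, then moments is a presentation choice of this sketch.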
therefore, a dna sequence can be represented by a 12-dim vector:

$(n_a, n_c, n_g, n_t, \mu_a, \mu_c, \mu_g, \mu_t, d_2^a, d_2^c, d_2^g, d_2^t)$

in this paper, we propose an accumulated natural vector approach, which projects each sequence onto a point in $\mathbb{R}^{18}$, where the additional six dimensions describe the covariance between nucleotides. anv thus carries more information than the traditional nv method and does not require human intervention, such as choosing the optimal value of k in the k-mer method. therefore, it can distinguish different sequences and classify species into correct clusters with higher accuracy and less time cost.

the following six datasets were used to validate the method. the coronavirus dataset includes 36 viral genomes, of which 34 are from the same dataset as in (woo et al., 2005; yu et al., 2010; hoang et al., 2015) and the other two are new members of the coronaviruses. the second dataset consists of the genomes of 38 influenza a viruses, a classic dataset for testing whether a newly proposed method performs well. the third dataset includes 72 viruses from zheng et al. (2015), which focuses on the classification of ebolaviruses. the fourth is from our colleagues' previous paper (li et al., 2016) and includes 351 viruses chosen randomly under some criteria. the fifth consists of the mitochondrial genomes of 31 mammals, which can be clustered into seven well-known categories. all the sequences can be found on ncbi with the reference numbers provided in the appendices. we also generated different mutations by simulation in a dna sequence and constructed phylogenetic trees of the simulated sequences to test our anv method. all computations in this paper were done on a dell laptop equipped with an intel i7 processor and 8 gb ram under windows 10 home premium, using matlab (version r2017a) and mega x.
for a given genomic sequence, we first define four indicator functions for adenine, cytosine, guanine and thymine, respectively:

$u_\alpha(i) = 1$ if α appears at the $i$th position of the sequence, and $u_\alpha(i) = 0$ if α does not appear at the $i$th position of the sequence, (1)

where $\alpha \in \{a, c, g, t\}$ and $i = 1, 2, \ldots, n$. here n is the length of the whole sequence. for example, if the genomic sequence is "atctagct", then the four indicator functions are shown in table 1. here are some simple properties of the indicator functions:
1. each column has the sum of 1.
2. each row has the sum of the number of the corresponding nucleotide.

now we define four accumulated indicator functions as follows:

$\tilde{u}_\alpha(i) = \sum_{j=1}^{i} u_\alpha(j)$ (2)

the four accumulated indicator functions for the example above ("atctagct") are shown in table 2. here are some properties of the accumulated indicator functions:
1. the $i$th column has the sum of i: $\sum_{\alpha \in \{a, c, g, t\}} \tilde{u}_\alpha(i) = i$.
2. the last column is the total number of the nucleotide α in the sequence: $\tilde{u}_\alpha(n) = n_\alpha$.
3. $\sum_{i=1}^{n} \tilde{u}_\alpha(i) = n_\alpha (n + 1 - \mu_\alpha)$, where $\mu_\alpha$ is the average position in the natural vector in deng et al. (2011).

properties 1 and 2 follow directly from the definitions of the indicator function $u_\alpha$ and the accumulated indicator function $\tilde{u}_\alpha$. we now prove property 3, which relates the accumulated indicator function to the average position of a specific nucleotide. if we assume that the positions of nucleotide α are $t_1, t_2, \ldots, t_{n_\alpha}$, where $n_\alpha$ is the number of nucleotide α in the sequence, then the accumulated indicator function has the basic form

$[0, \ldots, 0, 1, \ldots, 1, 2, \ldots, 2, \ldots, n_\alpha, \ldots, n_\alpha]$,

in which the value k first appears at position $t_k$ and which satisfies $1 \le t_1 < t_2 < \ldots < t_{n_\alpha} \le n$. each occurrence of α at position $t_k$ contributes 1 to the $n - t_k + 1$ entries from $t_k$ to n. if we add up those n elements and denote the sum as $\xi_\alpha$, we have:

$\xi_\alpha = \sum_{k=1}^{n_\alpha} (n - t_k + 1) = n_\alpha (n + 1) - \sum_{k=1}^{n_\alpha} t_k = n_\alpha (n + 1 - \mu_\alpha)$

therefore, we use $\xi_\alpha / n_\alpha$ (10) to describe the average position of nucleotide α, which indicates the distance of the average position to the end of the sequence.

for two finite point sets in $\mathbb{R}$ with an equal number of elements, $a = \{a_1, a_2, \ldots, a_n\}$ and $b = \{b_1, b_2, \ldots, b_n\}$, which satisfy $a_1 < a_2 < \ldots < a_n$ and $b_1 < b_2 < \ldots < b_n$, the covariance of the two sets, cov(a, b), is defined by formula (11) in terms of the paired deviations $a_i - u_a$ and $b_i - u_b$, where $u_a = \sum_{i=1}^{n} a_i / n$ and $u_b = \sum_{i=1}^{n} b_i / n$. now we apply the covariance formula above to the accumulated indicator functions. a set is a collection of definite, distinct objects, known as the elements or members of the set. for each nucleotide $\alpha \in \{a, c, g, t\}$, we have an array of n elements, which is its accumulated indicator function:

$[0, 0, \ldots, 0, 1, 1, \ldots, 1, 2, 2, \ldots, 2, \ldots, (n_\alpha - 1), \ldots, (n_\alpha - 1), n_\alpha, \ldots, n_\alpha]$

however, those n elements cannot form a set of n elements, since many of them are repeated. hence, we extend the definition of a set to a generalized concept in which the elements of a set can be the same. under this generalized definition, each nucleotide has a set of n elements, and they can be arranged in ascending order, i.e., from the smallest to the largest number. thus, we can use the covariance formula (11). for the example sequence "atctagct", the covariance of nucleotides a and c can be computed in this way: the generalized set of nucleotide a is {1,1,1,1,2,2,2,2} and that of c is {0,0,1,1,1,1,2,2}. each generalized set has n = 8 elements, and the generalized covariance cov(a, c) follows from formula (11). similarly, we can get cov(a, g), cov(a, t), cov(c, g), cov(c, t), cov(g, t). for two nucleotides α and β, the covariance formula (15) is obtained by applying (11) to $\tilde{u}_\alpha$ and $\tilde{u}_\beta$. when α = β, the corresponding formula (16) defines the variance of the positions of nucleotide α.

for a given nucleotide sequence, we can now build its accumulated natural vector. the first four dimensions describe the number of each nucleotide, denoted as $n_a, n_c, n_g, n_t$, which form the last column of the accumulated indicator functions. the second four dimensions describe the average distance of the nucleotides to the end of the sequence, as given in (10). the third four dimensions describe the divergence of each nucleotide, as given in (16).
please note that this d α is a little different from the d α 2 in the traditional natural vector method since the previous definition of variance cannot be extended to a reliable definition of covariance. the last six dimensions describe the covariances between each two nucleotides, denoted as cov (a, g), cov (a, t), cov (c, g), cov (c, t), cov (g, t) as formula (15). and the universal form of accumulated natural vector is from section 2.2.1 to section 2.2.5, we introduce how a dna sequence is represented by a vector in r 18 space. therefore, the distance between two sequences can be measured by the euclidean distance between two vectors. suppose that now we have two sequences in r w (in our case, w = 18), denoted as x = (x 1 , ..., x w ) and (y 1 , ..., y w ), the euclidean distance between them is for a dataset of m different sequences, we can construct a distance matrix d = (d ij ) m×m , and d ij (≥ 0) represents the euclidean distance between sequence i and sequence j. d is a symmetric matrix and the diagonal element is zero. in this research, we use mega x to build up phylogenetic trees. in order to eliminate the influences of different algorithms of constructing trees, we apply the unweighted pair group method with arithmetic mean (upgma) algorithm (sneath and sokal, 1973) for analysis on the four datasets. for comparison with other common alignment or alignmentfree method, we also perform k-mer and msa (clustalw or muscle) on the same dataset. the feature frequency profile (ffp) (woo et al., 2005) , which is based on k-mer frequency, calculates the frequency of each k-mer in the sequence and turns a dna sequence into a vector in a 4 k -dimensional space. the euclidean distance between two k-mer vectors can also be computed by formula (17). we apply msa method, clustalw on several datasets as well, with the default parameters in mega x. clustalw is much slower than another msa algorithm, muscle, while clustalw can give a better result. 
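as a small numerical check of the accumulated indicator functions described above, the following python sketch rebuilds them for the example sequence "atctagct" and verifies the column-sum and last-column properties. the third check (sum of the accumulated function equals n_α(n + 1) minus the sum of the positions of α) is a property derived here for illustration. the function names are hypothetical, and the covariance uses a plain 1/n normalisation, which is an assumption and may differ from the normalisation used in the anv formulas.

```python
from itertools import accumulate

NUCLEOTIDES = "ACGT"


def indicator(seq, alpha):
    """u_alpha(i): 1 where alpha occurs at position i, else 0."""
    return [1 if ch == alpha else 0 for ch in seq.upper()]


def accumulated_indicator(seq, alpha):
    """Running count of alpha over positions 1..i (cumulative sum of u_alpha)."""
    return list(accumulate(indicator(seq, alpha)))


def population_covariance(xs, ys):
    """Covariance with a 1/n normalisation (an assumption, see lead-in)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n


seq = "ATCTAGCT"
n = len(seq)
acc = {a: accumulated_indicator(seq, a) for a in NUCLEOTIDES}

# Property 1: the i-th column sums to i.
assert all(sum(acc[a][i] for a in NUCLEOTIDES) == i + 1 for i in range(n))
# Property 2: the last column holds the nucleotide counts.
assert all(acc[a][-1] == seq.count(a) for a in NUCLEOTIDES)
# Derived check: sum of the accumulated function = n_alpha*(n+1) - sum of positions.
for a in NUCLEOTIDES:
    positions = [i + 1 for i, ch in enumerate(seq) if ch == a]
    assert sum(acc[a]) == len(positions) * (n + 1) - sum(positions)

print(acc["A"])  # [1, 1, 1, 1, 2, 2, 2, 2]
print(acc["C"])  # [0, 0, 1, 1, 1, 1, 2, 2]
print(population_covariance(acc["A"], acc["C"]))  # 0.25 under the 1/n assumption
```

the two printed arrays match the generalized sets quoted in the text for nucleotides a and c.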
muscle is applied to the fourth dataset of 351 viruses; after the viruses are aligned, the distance matrix is calculated using the hamming distance in order to find the nearest neighbor of each virus. the hamming distance between two strings of equal length is the number of positions at which the corresponding symbols differ. it measures the minimum number of substitutions required to change one string into the other, or the minimum number of errors that could have transformed one string into the other. since alignment arranges the sequences to identify regions of similarity between them, the aligned sequences all have the same fixed number of positions. therefore, the hamming distance can be calculated by simply counting the number of pairwise differences in character states. for the simulated dataset, we use the pairwise alignment distance computed by the "seqpdist" function in the matlab bioinformatics toolbox, which uses the jukes-cantor algorithm, and we take the resulting tree as the correct tree, since the sequences are simulated from a base sequence. the trees obtained from the distance matrices are then compared using robinson-foulds distances, which measure the congruence with the reference topology. we apply the accumulated natural vector method to five datasets and compare the results with common methods such as msa, k-mer (ffp) and the traditional natural vector method. in these comparisons, the results of the accumulated natural vector are more accurate and its computational cost is very small compared with the others. a dataset of 351 viruses has also been tested; aligning them is too computationally heavy for a laptop, but the alignment-free methods still finish in a reasonable time. we also use a server to align segments of the 351 sequences, in order to compare the results with anv and the other methods. anv also gives the best performance on this dataset.
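the robinson-foulds comparison mentioned above can be sketched in python as follows. this is a minimal illustration and not the program used in the study: trees are represented as nested tuples (a hypothetical encoding), and because upgma trees are rooted, the sketch counts clades present in one tree but not the other, which is the rooted variant of the distance. the function names are assumptions.

```python
def clades(tree):
    """Return (non-trivial clades, all leaves) of a rooted tree given as nested tuples."""
    found = set()

    def leaves(node):
        if not isinstance(node, tuple):
            return frozenset([node])  # a leaf name
        members = frozenset().union(*(leaves(child) for child in node))
        found.add(members)
        return members

    all_leaves = leaves(tree)
    found.discard(all_leaves)  # the root clade is shared by every tree on these leaves
    return found, all_leaves


def robinson_foulds(tree1, tree2):
    """Number of clades that appear in exactly one of the two rooted trees."""
    clades1, leaves1 = clades(tree1)
    clades2, leaves2 = clades(tree2)
    if leaves1 != leaves2:
        raise ValueError("trees must be built on the same set of leaves")
    return len(clades1 ^ clades2)


reference = (("a1", "a2"), ("b1", "b2"))
same_groups = (("a2", "a1"), ("b2", "b1"))
ladder = ((("a1", "a2"), "b1"), "b2")
mixed = (("a1", "b1"), ("a2", "b2"))

print(robinson_foulds(reference, same_groups))  # 0: identical topology
print(robinson_foulds(reference, ladder))       # 2
print(robinson_foulds(reference, mixed))        # 4
```

a distance of 0 means the two topologies agree; larger values mean more clades are grouped differently from the reference tree.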
besides, we simulate another dataset of 20 sequences from a randomly generated sequence of length 1,000 bp, and test the phylogenetic trees from this and other methods. we have chosen datasets of different sizes (numbers of sequences and sequence lengths) to test whether anv is suitable in all cases. most of the datasets have been analyzed in previous studies, so we can compare our results with theirs to evaluate performance. four datasets consist of viruses that are closely related to human health, and the mammal dataset and the simulated dataset show that this method also works on other types of sequences.
coronaviruses belong to the subfamily coronavirinae in the family coronaviridae, in the order nidovirales. in this paper, we construct a dataset of 36 coronaviruses, of which 34 are from exactly the same dataset as in (woo et al., 2005; yu et al., 2010; hoang et al., 2015). the other two viruses are new members of the coronaviruses. details of the coronaviruses can be found in table s1. the new chinagd01 (lu et al., 2015) was identified in guangdong province (china) in 2015 and is an imported middle east respiratory syndrome coronavirus (mers-cov). the other one, mers-cov/kor, is from south korea (kim et al., 2015). as of 15 june 2015, mers-cov was spreading in south korea, and the chinagd01 case was a south korean national who traveled to guangdong in may 2015. therefore, these two members were considered highly correlated with each other. the genomic size of the coronaviruses ranges from about 9 to 31 kbp, with an average of 27,567 nucleotides. using our accumulated natural vector and the upgma method (sneath and sokal, 1973), we build the phylogenetic tree shown in figure 1. figure 1 shows that the two new members are clustered together with group 4, which is also well known as sars (severe acute respiratory syndrome).
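as a rough illustration of the upgma step used to turn a distance matrix into a tree, here is a minimal pure-python sketch. it is not the mega x implementation: it returns only the topology as nested tuples (no branch lengths), breaks ties by insertion order, and the function name and toy distances are hypothetical.

```python
def upgma(names, dist):
    """Cluster by UPGMA; return the topology as nested tuples (no branch lengths)."""
    m = len(names)
    clusters = {i: (names[i], 1) for i in range(m)}  # id -> (subtree, size)
    d = {(i, j): dist[i][j] for i in range(m) for j in range(i + 1, m)}
    next_id = m
    while len(clusters) > 1:
        # Merge the closest pair of active clusters.
        (a, b), _ = min(d.items(), key=lambda kv: kv[1])
        (tree_a, size_a), (tree_b, size_b) = clusters.pop(a), clusters.pop(b)
        # Distance to every other cluster: size-weighted mean, i.e. the
        # arithmetic mean over all original pairs of members.
        merged = {}
        for c in clusters:
            d_ac = d[(min(a, c), max(a, c))]
            d_bc = d[(min(b, c), max(b, c))]
            merged[c] = (size_a * d_ac + size_b * d_bc) / (size_a + size_b)
        d = {k: v for k, v in d.items() if a not in k and b not in k}
        for c, value in merged.items():
            d[(c, next_id)] = value  # c is always smaller than the new id
        clusters[next_id] = ((tree_a, tree_b), size_a + size_b)
        next_id += 1
    tree, _size = next(iter(clusters.values()))
    return tree


# Toy distance matrix (hypothetical values): a and b are closest, d is the outlier.
names = ["a", "b", "c", "d"]
dist = [
    [0, 2, 6, 10],
    [2, 0, 6, 10],
    [6, 6, 0, 10],
    [10, 10, 10, 0],
]
print(upgma(names, dist))  # ('d', ('c', ('a', 'b')))
```

in practice the rows of the distance matrix would come from the euclidean distances between the 18-dimensional vectors of the sequences.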
between november 2002 and july 2003, an outbreak of sars in southern china eventually caused 8,098 cases, resulting in 774 deaths reported in 37 countries. both mers-cov and the sars viruses are betacoronaviruses; however, they belong to different lineages (for more details, see drexler et al., 2013; hilgenfeld and peiris, 2013). the phylogenetic tree indicates that chinagd01 and mers-cov/kor form a monophyletic clade, sister to the sars clade, and may possibly be variants of some sars viruses.
we also performed the same procedure with the k-mer method on the coronavirus dataset. however, choosing an optimal k-value is an interesting problem that requires manual intervention. sims et al. (2009) showed that the location of the peak in the distribution of k-mers, i.e., the k with the largest vocabulary, is related to the sequence length n. the k with maximum information is empirically determined but may be closely approximated by
$$k_{hmax} = \log_4 (n),$$
where 4 is the alphabet size. they have also shown (sims et al., 2009) that reliable tree topologies are typically obtained with k-mer resolutions where $k > k_{hmax}$, whereas lengths below $k_{hmax}$ yield unreliable trees. the upper limit of resolution can be determined empirically by the criterion that the tree topology for feature length k equals that for k + 1, i.e., the tree topologies converge. according to this principle, we have 7 ≤ k ≤ 9. we show the result for k = 7 in figure 2a; the results for k = 8 and k = 9 are in figures s1 and s2. the four outgroup viruses are not clustered together as a separate branch of the coronavirus tree, and group 1 is divided into smaller groups. the traditional clustalw algorithm for multiple sequence alignment (msa) is also applied to the same dataset, and the result is shown in figure 2b. msa cannot cluster viruses from the same groups together either. from this example, we can see that our anv method performs better than the k-mer and msa methods.
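the k-mer (ffp) comparison discussed above can be sketched in python as follows. the sketch assumes the approximation k_hmax ≈ log_4(n) for the lower bound on k, builds a normalised k-mer frequency vector over all 4^k words, and compares two vectors by euclidean distance. the function names and the toy sequences are hypothetical, and windows containing symbols other than a, c, g, t are simply ignored, which is an assumption.

```python
import math
from collections import Counter
from itertools import product


def k_hmax(n, alphabet_size=4):
    """Approximate k with maximum information for a sequence of length n."""
    return math.log(n, alphabet_size)


def kmer_profile(seq, k, alphabet="ACGT"):
    """Normalised k-mer frequency vector in a len(alphabet)**k dimensional space."""
    seq = seq.upper()
    words = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts[w] for w in words)  # ignores windows with other symbols
    return [counts[w] / total if total else 0.0 for w in words]


def euclidean(x, y):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


# Average coronavirus genome length quoted in the text: 27,567 nucleotides.
print(round(k_hmax(27567), 1))  # about 7.4

p = kmer_profile("ATCTAGCT", 2)
q = kmer_profile("ATCTAGGT", 2)
print(len(p))           # 16 dimensions for k = 2
print(euclidean(p, q))  # distance between the two toy profiles
```

because the vector length grows as 4^k, the cost of the k-mer method rises quickly with k, whereas the anv vector always has 18 dimensions.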
influenza a viruses are single-stranded rna viruses, which have been a major health threat to both humans and animals. the nomenclature of influenza a viruses is based on the surface glycoproteins hemagglutinin (ha) and neuraminidase (na) (obenauer et al., 2006). ha has 15 subtypes and na has 9 subtypes, which form 135 different combinations. the ncbi numbers of the 38 influenza a viruses analyzed can be found in table s2. our result agrees with previous work by hoang et al. (2015). furthermore, we find that all the influenza a viruses are clustered with those of the same h and n type in figure 3, with the sole exception of a/turkey/minnesota/1/1988(h7n9). there is no specific research on this virus, and we infer that it may be an intermediate between h7n3 and h7n9. h7n3 had an outbreak in july 2012, infecting millions of poultry, but there is no report of human-to-human infection yet. h7n9, however, was identified in shanghai, china at the end of march 2013. considering that the ha glycoproteins of these two subtypes are the same and that the outbreak dates are close, we suggest that the h7n9 of march 2013 might be a variant of h7n3, and that a/turkey/minnesota/1/1988(h7n9) plays a key role in this variation. we reach the same conclusion in another work as well (dong et al., 2018). more biological research on this virus should be done to deepen our understanding of influenza a viruses, to accelerate the development of an effective vaccine and to prevent more dangerous variants.
the k-mer method and msa are also applied to this dataset, as shown in figures 4a,b. the k-value, determined by the same procedure as for the coronavirus dataset, is 5. in figure 4a, the h1n1 and h5n1 viruses are mixed together, while msa gives a worse result in figure 4b. these results indicate that k-mer and msa cannot reveal the real relationships among the viruses.
to get a direct picture of the relationships among the influenza a viruses, we draw their natural graph. the natural graph was first introduced by zheng et al. (2015). in figure 5, the blue lines represent the 1-level connected components and the red lines the 2-level ones. classes are marked in different colors, and it is clear that after the construction of two levels, the influenza a viruses with the same h and n types are clustered together, including a/turkey/minnesota/1/1988(h7n9), which is number 34 in figure 3. h7n9 and h7n3 are clustered together at level 2, indicating that they have a closer relationship, which accords with our previous conjecture.
to illustrate that the newly proposed anv method is an important improvement on the traditional natural vector method, a dataset of 72 ebolaviruses is tested, which is a subset of the 163 viruses used in zheng et al. (2015). it consists of 38 ebola virus (ebov), 11 sudan virus (sudv), 9 reston virus (restv), 1 taï forest virus (tafv), 6 bundibugyo virus (bdbv), 6 marburg virus (marv) and 1 lloviu virus (llov). details of this dataset are shown in table s3. in figure 6a, the phylogenetic tree from the novel accumulated natural vector method classifies all viruses into the right groups; in figure 6b, however, the traditional natural vector method divides the ebov class into two clusters, and sudv is misclassified with some ebov viruses. this indicates that including the covariance between nucleotides helps improve the accuracy of classification. hence this is an important improvement on the traditional natural vector and other alignment-free methods.
we also test a large dataset of 351 viruses from li et al. (2016); the details of this dataset can be found in table s4. their average length is 11,952 nucleotides, which makes alignment on a laptop impractical. only a server or cloud computing can finish such a task.
here we use the 1-nearest neighbor (1-nn) method (li et al., 2016) to assess the accuracy of the prediction. this evaluation is motivated by the high rate of missing labels in many virus databases. for example, if a virus with a missing family label is added to the database, and it should share the family label of the virus already stored in the database that is closest to it, then we can predict the missing family label from its nearest neighbor. therefore, for a dataset with no missing labels, we can count how many viruses share the same label as their nearest neighbor. for the alignment-free methods, the "nearest neighbor" of a specific virus is defined as the virus in the dataset with the smallest euclidean distance to it. for alignment results, we use the hamming distance to measure the distance between two sequences. if a virus shares its label with its nearest neighbor, we count it as "correct", since even if its label were missing we could still predict it from its nearest neighbor. the accuracy is computed by dividing the number of correct viruses by the total number of viruses, in this case 351. we compare the result of anv with that of the k-mer method, since both are alignment-free methods, and the results are shown in table 3. the optimal k is chosen by the same procedure as for the other datasets. from table 3, it is evident that anv has much higher accuracy than the k-mer method while using much less time. thus, we have shown that anv can be applied in practice with high time efficiency and high accuracy. for the alignment in this part, we tried to align all the full-length sequences on our server, but it failed to give a reliable result. therefore, we extract the first 3,000 bp of each sequence and align the 351 segments, all of length 3,000 bp. the results are also shown in table 3, and the accuracy is still not as good as that of anv.
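the 1-nn evaluation described above can be sketched in python as follows. each item is compared with every other item, its nearest neighbor is found under a chosen distance, and the item counts as correct when the two labels agree. the function names, the toy vectors and the toy aligned strings are hypothetical, and gaps are treated as ordinary symbols in the hamming distance, which is an assumption.

```python
import math


def hamming(a, b):
    """Number of positions at which two aligned strings differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))


def one_nn_accuracy(items, labels, dist):
    """Fraction of items whose nearest neighbor (excluding itself) shares their label."""
    correct = 0
    for i, item in enumerate(items):
        others = (j for j in range(len(items)) if j != i)
        nearest = min(others, key=lambda j: dist(item, items[j]))
        correct += labels[nearest] == labels[i]
    return correct / len(items)


# Alignment-free case: feature vectors compared by Euclidean distance.
vectors = [(0, 0), (0, 1), (5, 5), (5, 6), (2, 2)]
vector_labels = ["x", "x", "y", "y", "y"]
print(one_nn_accuracy(vectors, vector_labels, math.dist))  # 0.8

# Alignment case: aligned sequences compared by Hamming distance.
aligned = ["ACGT-A", "ACGTTA", "TTGACA", "TTGACC"]
aligned_labels = ["p", "p", "q", "q"]
print(one_nn_accuracy(aligned, aligned_labels, hamming))  # 1.0
```

in the first toy example the last point has label "y" but lies closest to an "x" point, so four of the five items are counted as correct.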
our accumulated natural vector performs well not only on virus datasets but also on other common species. we extract 31 mammalian mitochondrial genomes with an average length of 16,696 nucleotides; their ncbi numbers can be found in table s5. the genomes are from seven known clusters: primates, carnivora, cetartiodactyla, perissodactyla, eulipotyphla, lagomorpha and rodentia. the accumulated natural vector method can still distinguish the seven clusters, as shown in figure 7a. the ffp (k-mer) method has also been tested (the optimal k-value for this dataset is 8), as shown in figure 7b. since the species included in different papers are not all the same, it is hard to compare the whole topology of the phylogenetic trees; nevertheless, our tree differs only slightly from those of previous work in murphy et al. (2001) and tarver et al. (2016). the difference can be attributed to the fact that mitochondrial genomes in mammals may not always reflect the organismal evolutionary history (morgan et al., 2014). however, anv still retains more information than k-mer does in figure 7b: since the distances within each group are smaller than the distances among groups, we can still distinguish the clusters based on the current dataset. ladoukakis and zouros (2017) point out that most of the information researchers have gained about the tree of life through the use of mtdna remains valid, while more attention should be paid to its role in the function of the organism and its value as a tool in the study of major evolutionary novelties in the history of life. therefore, the result implies that our anv method can capture the key information hidden inside the dna sequences and gives a reliable topology among mammals.
to verify whether the similarity distance from our method can be used to cluster dna sequences effectively, we also generated different mutations in dna sequences and constructed phylogenetic trees by various methods.
we simulated a sequence of length 1,000 bp as a base sequence and generated two new sequences, named "a_original" and "b_original", using point mutations. both a and b differ from the base sequence at 100 nucleotides. we then similarly evolved a and b into different mutants by four types of mutation (substitution, deletion, insertion and transposition), as in yin et al. (2014). table 4 gives a detailed description of the simulated dna sequences with different mutations. since the sequences are mutated only slightly from an exon sequence, we take the alignment result as the "correct" relationship among the sequences; the alignment is done by the "seqpdist" function in the matlab bioinformatics toolbox. this function uses the classical jukes-cantor algorithm, and we calculate the pairwise alignment distance. for comparison, we use the anv method and the ffp method (we test k = 4, 5, 6 in this case, since the sequences are about 1,000 bp long). the upgma trees of the alignment, anv and ffp (k = 4) methods are shown in figures 8, 9a and 9b, respectively. among these trees, it is not obvious which one is more similar to the alignment result; therefore we calculate the robinson-foulds distance between each method's result and the "correct" one, and the results are shown in table 5. here we apply the program named "robinson-foulds" (robinson and foulds, 1981) when calculating table 5. the simulated dataset is in table s6. the differences among the trees lie mainly in the branch of sequences generated from b. anv gives a result more similar to the alignment, in which the order is only slightly disturbed by b5 and the transpositional sequences, whereas in figure 9b the whole b branch differs from the alignment result.
in this paper, we propose a novel vector named the accumulated natural vector to analyze sequences, genomes and their phylogenetic relationships.
results from our analysis largely agree with earlier studies, which indicates that our approach can detect the similarities and differences among sequences. therefore, phylogenetic trees can be constructed accurately from sequence data alone in a very reasonable time, without large computing platforms or costly biological experiments. our method can be applied to a global comparison of all genomes and provides a powerful new tool by including the correlations of nucleotides. we are working on extending the anv method to protein sequences; however, it would produce a 1,830-dimensional vector for each protein sequence, and the computational cost of this is too large with current technology. the covariance for three amino acids at a time may be more reasonable, since three consecutive nucleotides can also form a codon in the expressed region of a sequence.
fast algorithms for computing sequence distances by exhaustive substring composition
a novel method of characterizing genetic sequences: genome space with biological distance and applications
a new method to cluster genomes based on cumulative fourier power spectrum
ecology, evolution and classification of bat coronaviruses in the aftermath of sars
a phylogenetic analysis of the brassicales clade based on an alignment-free sequence comparison method
from sars to mers: 10 years of research on highly pathogenic human coronaviruses
numerical encoding of dna sequences by chaos game representation with application in similarity comparison
a new method to cluster dna sequences using fourier power spectrum
complete genome sequence of middle east respiratory syndrome coronavirus kor/knih/002_05_2016, isolated in south korea
evolution and inheritance of animal mitochondrial dna: rules and exceptions
virus classification in 60-dimensional protein space
complete genome sequence of middle east respiratory syndrome coronavirus (mers-cov) from the first imported mers-cov case in china
mitochondrial data are not suitable for resolving placental mammal phylogeny
molecular phylogenetics and the origins of placental mammals
large-scale sequence analysis of avian influenza isolates
comparison of phylogenetic trees
alignment-free genome comparison with feature frequency profiles (ffp) and optimal resolutions
numerical taxonomy
the interrelationships of placental mammals and the limits of phylogenetic inference
characterization and complete genome sequence of a novel coronavirus, coronavirus hku1, from patients with pneumonia
a measure of dna sequence similarity by fourier transform with applications on hierarchical clustering
protein sequence comparison based on k-string dictionary
real time classification of viruses in 12 dimensions
a novel construction of genome space with biological geometry
ebolavirus classification based on natural vectors
ss-ty and rh conceived the idea of covariance. rd implemented the idea and wrote the first draft of the manuscript. lh discussed and revised the first draft. rd, lh, rh, and ss-ty all contributed to the writing of the manuscript and agreed with the manuscript results and conclusions. they jointly developed the structure and arguments for the paper, made critical revisions, and reviewed and approved the final manuscript.
this study is supported by the national natural science foundation of china (91746119) (to ss-ty) and the tsinghua university start-up fund (to ss-ty). the corresponding author would like to thank the national center for theoretical sciences (ncts) for providing an excellent research environment while part of this research was done.
the supplementary material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.
conflict of interest statement: the authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
copyright © 2019 dong, he, he and yau. this is an open-access article distributed under the terms of the creative commons attribution license (cc by). the use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. no use, distribution or reproduction is permitted which does not comply with these terms.
key: cord-253436-dz84icdc authors: wille, michelle; muradrasoli, shaman; nilsson, anna; järhult, josef d. title: high prevalence and putative lineage maintenance of avian coronaviruses in scandinavian waterfowl date: 2016-03-03 journal: plos one doi: 10.1371/journal.pone.0150198 sha: doc_id: 253436 cord_uid: dz84icdc
coronaviruses (covs) are found in a wide variety of wild and domestic animals, and constitute a risk for zoonotic and emerging infectious disease. in poultry, the genetic diversity, evolution, distribution and taxonomy of some coronaviruses have been well described, but little is known about the features of covs in wild birds. in this study we screened 764 samples from 22 avian species of the orders anseriformes and charadriiformes in sweden collected in 2006/2007 for cov, with an overall cov prevalence of 18.7%, which is higher than many other wild bird surveys. the highest prevalence was found in the diving ducks, mainly greater scaup (aythya marila; 51.5%), and the dabbling duck mallard (anas platyrhynchos; 19.2%). sequences from two of the greater scaup cov fell into an infrequently detected lineage, shared only with a tufted duck (aythya fuligula) cov.
coronavirus sequences from mallards in this study were highly similar to cov sequences from the same species and location in 2011, suggesting long-term maintenance in this population. a single black-headed gull represented the only positive sample from the order charadriiformes. globally, anas species represent the largest fraction of avian cov sequences, and there seems to be no host species, geographical or temporal structure. to better understand the etiology, epidemiology and ecology of these viruses, more systematic surveillance of wild birds and subsequent sequencing of detected cov is imperative.
coronaviruses (covs) are found in a wide variety of animals, in which they can cause respiratory, enteric, hepatic and neurological disease of varying severity. in humans, covs cause a large fraction of common colds [1], as well as rarer but serious disease such as the outbreak of severe acute respiratory syndrome (sars, caused by sars-cov) in 2003, and the polymerase (rdrp) were generated using methods described in [14] or [15], and sequences generated in this study have been deposited in genbank under accession numbers kt882615-28. resulting sequences were aligned using the mafft algorithm [16] implemented in geneious r7 (biomatters, new zealand). the nucleotide substitution model was determined in mega 5.2 [17], and maximum likelihood trees were constructed using phyml [18] with alrt branch support in seaview [19]. trees were projected using figtree v1.4 [20].
a total of 764 birds from 11 species of anseriformes (ducks, geese, swans, n = 691) and 11 species of charadriiformes (gulls, terns, shorebirds, n = 143) were sampled in this study. overall, the prevalence of cov was high, at 18.7%, but species, groups and orders were unevenly represented. diving ducks had the highest prevalence (39%), driven by high prevalence in greater scaup (aythya marila; 51.5%); however, the sample size was small (n = 37).
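to illustrate why a small sample size matters for a prevalence estimate, the following python sketch computes a wilson score confidence interval for a proportion. it is not part of the study's analysis. the overall count of 143 positives is inferred here from the rounded figure of 18.7% of 764 samples, and the count of 7 positives out of 37 is purely hypothetical, chosen only to show how the interval widens at a similar proportion. the function name is an assumption.

```python
import math


def wilson_interval(positives, n, z=1.96):
    """Wilson score interval for a binomial proportion (z = 1.96 for about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = positives / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


# Inferred count: 18.7% of 764 samples is roughly 143 positives.
low, high = wilson_interval(143, 764)
print(f"n = 764: {low:.3f} to {high:.3f}")

# Hypothetical small group with a similar proportion (7 of 37).
small_low, small_high = wilson_interval(7, 37)
print(f"n = 37:  {small_low:.3f} to {small_high:.3f}")
```

the interval for the small group is several times wider than for the full sample, which is why prevalence estimates for sparsely sampled species should be read with caution.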
dabbling ducks of the genus anas also had high prevalence; mallard, in particular, had a prevalence of 19.2%. while anseriformes had the highest prevalence, in the order charadriiformes cov was detected in a black-headed gull (chroicocephalus ridibundus) but was absent in the tern and wader species sampled (table 1). eleven sequenced mallard cov and three sequenced scaup cov were gammacoronaviruses.
putative maintenance, despite no global spatial, temporal, host species patterns
a global phylogeny of all sequenced avian gammacoronavirus (not ibv-like) rdrp shows no spatial, temporal or host species differentiation between clades. that is, all clades contain sequences from multiple geographic locations, different years of sampling, and different host species (fig 1a, s1 and s2 figs). indeed, viruses sequenced in this study were placed in a clade containing sequences from hong kong, china, beringia, the united states of america and sweden (fig 1b and s2 table). however, despite limited global patterns of clade differentiation, 8/10 sequences from mallards in 2007 were most closely related (within 98-99% identity) to cov sequences from mallards at ottenby in 2011 (fig 1b and s1 table), suggesting local circulation of these rdrp. mallard cov sequence 69998 had the highest pairwise identity to viruses from hong kong, but is nested in the broad and highly similar clade containing the largest number of sequences from ottenby in 2011 (fig 1b). a single mallard cov sequence (69740) was in a differentiated clade, most similar to sequences from hong kong and more distantly china (fig 1b, s2 table and s2 fig). sequences from scaup cov were not similar to mallard cov; scaup 67699 was similar to sequences from beringia and china (within 98% identity; s1 table). two scaup cov sequences (67693 and 67703) were highly similar to only one sequence in genbank, which was collected from a tufted duck (aythya fuligula) in hong kong (jn788847).
all other sequences were less than 92% similar, indicating a differentiated lineage that is poorly sampled (fig 1b and s1
in this study we aimed to further contribute to the natural history of avian coronaviruses by screening an array of waterbirds for these viruses. we found a cov prevalence of 18.7%, which is higher than the 0-15% reported previously in wild bird studies [11, 14, 15, 21-27]. this may be due to the high proportion of dabbling ducks, particularly mallard, in our study, and to the temporal and spatial features of the dabbling duck sampling. more specifically, sampling of dabbling ducks occurred at a migratory stopover site, and thus high prevalence could be driven by migrants, which not only import viruses through infected individuals but also replenish the pool of susceptible individuals across the migratory period, resulting in high prevalence at these locations [28, 29]. directly comparing prevalence estimates with previous studies, however, is challenging due to the array of different methods utilized, ranging from conventional pcr to multiplex real-time pcr, in addition to species distributions and seasonal variations. regardless, this study corroborates previous studies suggesting the importance of dabbling ducks (anas sp.) as hosts of avian coronaviruses [11-15, 21-25], while also highlighting the importance of diving ducks. the high diving duck prevalence in this study was driven by high prevalence in greater scaup, which could be due to congregation in very large flocks facilitating transmission, but could also be due to other factors such as local variation in disease prevalence, or could reflect outbreak dynamics wherein the virus rapidly expands across the susceptible population. overall, diving ducks have only been sampled in this study and by chu et al. (2011), so it is unclear whether this pattern of high prevalence is present globally.
we found a low prevalence in waders, which is in contrast to [14] and [21], which found a very high prevalence (>20%) in waders, but corroborates findings in [22, 25, 26]. this may be due to differences in congregation and feeding patterns of waders between the locations, but has to be interpreted with caution, as the numbers of waders sampled in our study, and in others, were small. finally, while assessment of gulls is limited, this study corroborates the findings of muradrasoli et al. (2010), indicating that this group requires further scrutiny.
in order to better assess cov dynamics, resampling the same site across time is imperative, and this is the first study to do so. coronaviruses from mallards were previously assessed using samples collected at ottenby in 2011 [15], and we detected high similarity between viruses from 2007 and those from 2011. indeed, 8/10 cov were most closely related to sequences from 2011, which strongly suggests maintenance and/or local transmission of these rdrp lineages at this location. this is unlikely to be a coincidence, as swedish mallard cov make up less than 1/3 of the available sequences, and all current rdrp clades contain sequences from asia. the waterfowl host appears important in these samples; however, assessing host species biases is challenging due to the overwhelming number of sequences from anas species. coronaviruses from diving ducks were nonetheless rather different from existing diversity, as demonstrated by the similarity between two of the three scaup cov sequences from this study and the tufted duck cov sequence from hong kong. the limited number of sequences in this clade could be due to sampling shortcomings, or to currently developed primers not amplifying this clade effectively.
despite few studies, small sample sizes and differences in prevalence, what is clear is that in the northern hemisphere waterfowl species, especially dabbling and diving ducks, are important in the epidemiology of avian covs.
it is interesting to note that these patterns are very similar to those found in low pathogenic influenza a viruses: high prevalence in waterfowl and gulls in the northern hemisphere [30], and little host species and temporal structuring within waterfowl-derived viruses in the conserved polymerase genes (such as pb2, pb1) [31]. further, detection in fecal or cloacal samples suggests these viruses may also replicate in the gastrointestinal tract, which is true for turkey coronavirus [9]. as demonstrated by years of influenza sampling [32], to better understand the etiology, epidemiology and ecology of these viruses, more systematic surveillance needs to be undertaken and more viruses need to be sequenced, particularly full genomes. given the importance of coronaviruses as human pathogens, as exemplified by sars and mers-cov, and the potential for wild birds as reservoirs, spreaders and mixing vessels, further studies of coronaviruses in wild birds are warranted.
[1] human coronaviruses: what do they cause? antivir ther
[2] coronaviruses: important emerging human pathogens
[3] bats are natural reservoirs of sars-like coronaviruses
[4] middle east respiratory syndrome (mers): a new zoonotic viral pneumonia. virulence
[5] virus taxonomy: classification and nomenclature of viruses: ninth report of the international committee on taxonomy of viruses
[6] interspecies transmission and emergence of novel viruses: lessons from bats and birds
[7] discovery of seven novel mammalian and avian coronaviruses in the genus deltacoronavirus supports bat coronaviruses as the gene source of alphacoronavirus and betacoronavirus and avian coronavirus as the gene source of gammacoronavirus and deltacoronavirus
[8] evolutionary insights into the ecology of coronaviruses
[9] turkey coronavirus is more closely related to avian infectious bronchitis virus than to mammalian coronaviruses: a review
[10] pathogenicity of turkey coronavirus in turkeys and chickens
[11] avian coronavirus in wild aquatic birds
[12] genomic analysis and surveillance of the coronavirus dominant in ducks in china
[13] broadly targeted multiprobe qpcr for detection of coronaviruses: coronavirus is common among mallard ducks (anas platyrhynchos)
[14] prevalence and phylogeny of coronaviruses in wild birds from the bering strait area (beringia)
[15] temporal dynamics, diversity, and interplay in three components of the viriodiversity of a mallard population: influenza a virus, avian paramyxovirus and avian coronavirus
[16] multiple alignment of dna sequences with mafft
[17] molecular evolutionary genetics analysis using maximum likelihood, evolutionary distance, and maximum parsimony methods
[18] estimating maximum likelihood phylogenies with phyml
[19] seaview version 4: a multiplatform graphical user interface for sequence alignment and phylogenetic tree building
[20] figtree v1.1.1: tree figure drawing tool
[21] diverse gammacoronaviruses detected in wild birds from madagascar
[22] detection and molecular characterization of infectious bronchitis-like viruses in wild bird populations
[23] genetically diverse coronaviruses in wild bird populations of northern england
[24] molecular identification and characterization of novel coronaviruses infecting graylag geese (anser anser), feral pigeons (columbia livia) and mallards (anas platyrhynchos)
[25] identification of avian coronavirus in wild aquatic birds of the central and eastern usa
[26] surveillance of avian coronaviruses in wild bird populations of korea
[27] absence of coronaviruses, paramyxoviruses, and influenza a viruses in seabirds in the southwestern indian ocean
[28] animal migration and infectious disease risk
[29] juveniles and migrants as drivers for seasonal epizootics of avian influenza virus
[30] global patterns of influenza a virus in wild birds
[31] the evolutionary genetics and emergence of avian influenza a viruses in wild birds
[32] spatial, temporal, and species variation in prevalence of influenza a viruses in wild migratory birds
we wish to thank the duck trappers at ottenby bird observatory and jonas waldenström for collecting and providing samples used in this study, and jonas blomberg for kindly providing sequence.
mallard cov sequences generated in this study are indicated with a filled circle and scaup cov sequences with an asterisk. scale bar represents number of substitutions per site. traditional projection of panel a with support values is presented in s2 fig.
key: cord-022348-w7z97wir authors: sola, monica; wain-hobson, simon title: drift and conservatism in rna virus evolution: are they adapting or merely changing? date: 2007-09-02 journal: origin and evolution of viruses doi: 10.1016/b978-012220360-2/50007-6 sha: doc_id: 22348 cord_uid: w7z97wir
this chapter argues that the vast majority of genetic changes or mutations fixed by rna viruses are essentially neutral or nearly neutral in character. in molecular evolution one of the remarkable observations has been the uniformity of the molecular clock. an analysis of proteins derived from complete potyvirus genomes, positive-stranded rna viruses, yielded highly significant linear relationships.
these analyses indicate that viral protein diversification is essentially a smooth process, the major parameter being the nature of the protein more than the ecological niche it finds itself in. synonymous changes are invariably more frequent than nonsynonymous changes. positive selection exploits a small proportion of genetic variants, while functional sequence space is sufficiently dense, allowing viable solutions to be found. although evolution has connotations of change, what has always counted is natural selection or adaptation. it is the only force for the genesis of a novel replicon. there is no such thing as a perfect machine. accordingly, nucleic acid polymerization is inevitably error-prone. yet the notoriety and abundance of rna viruses attests to their great success as intracellular parasites. indeed some estimates suggest that 80% of viruses have rna genomes. it follows that replication without proofreading can be a successful strategy. there is a price to pay, however. manfred eigen was the first to point out that without proofreading there is a limit on the size of rna genomes. obviously, if the mutation rate is too high, any rna virus will collapse under mutation pressure. as it happens, rna viral genomes are up to 32 kb long while mutation rates are 1-2 per genome per cycle or less. possibly, rna viruses and retroviruses have simply not invested in proofreading, in which case mutations represent an inevitable genetic noise, to be tolerated or eliminated. hence there would be no loss of fitness, fixed mutations being neutral. a corollary of this would be that the intrinsic life style of a virus is set in its genes. the alternative is to suppose that most fixed mutations are beneficial to the virus in allowing it to keep ahead of the host and/or host population. by this token variation is an integral part of the viral modus vivendi. the twin requirements of a successful virus are replication and transmission. 
under the rubric replication, a virus could vary to increase its fitness, exploit different target cells or evade adaptive immune responses. in terms of transmission, variation might allow a virus to overcome herd immunity. these two scenarios emphasize the two sides of the molecular evolution debate; one highlights neutrality while the other puts a premium on positive selection. purifying or negative selection is ever operative: a poor replicon invariably goes asunder. through rounds of error and trial, positive selection is the only means of creating a novel replicon. so long as the ecological niche occupied doesn't change, the virus doesn't need to change, purifying selection being sufficient to ensure existence. this raises an important issue: we know that, over the time that we are living and loving, as well as doing experiments, writing papers and reviewing, humans are not evolving. ernst mayr noted that "the brain of 100 000 years ago is the same brain that is now able to design computers" (mayr, 1997). positive fitness selection among mammals is effectively inoperative over our lifetimes, and certainly since we have known about hiv and aids. how is it that vertebrates, invertebrates, plants, fungi and bacteria, all species with a low genomic mutation rate, can control viruses which mutate so much faster, sometimes by a factor of $10^{6}$ (holland et al., 1982; gojobori and yokoyama, 1985; domingo et al., 1996)? yet they do. we come to the basic question: to what extent is genetic variation exploited by an rna virus, if at all? and if so, what is the virus adapting to? the answer invariably given to the second question is "the adaptive immune system" (seibert et al., 1995). yet apart from the vertebrates none of the other groups mentioned above mounts antigen-specific immune responses. this chapter will argue that most fixed mutations are neutral. in molecular evolution one of the remarkable observations has been the uniformity of the molecular clock.
although there has been intense debate as to what molecular clocks mean and quite how far they deviate from null hypotheses, fibronectin fixes mutations faster than alpha- or beta-globin, which do so faster than cytochrome c, etc. rates of amino acid fixation are intrinsic to different proteins. yet some viruses give rise to persistent infections, others to sequential acute infections. all succumb to the vagaries of transmission bottlenecks. how many rounds of infection are necessary to fix mutations? for example, the tremendous dynamics of viral replication have been described. whether it be hiv, hbv or hcv, plasma viral turnover is of the order of $10^{8}$-$10^{12}$ virions per day (ho et al., 1995; wei et al., 1995; nowak et al., 1996; zeuzem et al., 1998). between 10% and 90% of plasma virus is cleared. in the case of hiv this can involve more than 200 rounds of sequential replication per year (wain-hobson, 1993a,b; ho et al., 1995; pelletier et al., 1995; wei et al., 1995). many of these variables and unknowns can be removed by comparing the fixation of amino acid substitutions in pairs of viral proteins from two genomes. if one assumes that the two gene fragments remain linked, through the hellfire of immune responses and the bottlenecking inherent in transmission, relative degrees of fixation should be attainable. note that, so long as frequent recombination between highly divergent genomes is not in evidence, this assumption should be valid. this procedure is outlined in figure 6.1. the first example is taken from the vast primate immunodeficiency virus database (lanl, 1998). when normalized to the p66 reverse transcriptase product designated rt, amino acid sequence divergence for p17 gag, p24 gag, integrase, vif, gp120, the ectodomain of gp41 and nef all reveal highly significant linear relationships (figure 6.2, table 6.1). the relative rates vary by a factor of two or more.
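The paired-protein comparison described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' pipeline: it uses uncorrected fractional divergence rather than the BLOSUM-weighted distances of the chapter, it takes all pairwise genome comparisons (which yields dependent data points), and the function names and toy sequences are hypothetical.

```python
from itertools import combinations
from math import sqrt


def fractional_divergence(seq_a, seq_b):
    """Fraction of aligned positions that differ, ignoring gapped columns."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    if not pairs:
        raise ValueError("no ungapped positions to compare")
    return sum(a != b for a, b in pairs) / len(pairs)


def paired_divergences(genomes, reference, protein):
    """For every pair of genomes, collect x = divergence of the reference
    protein and y = divergence of the second protein."""
    xs, ys = [], []
    for g1, g2 in combinations(sorted(genomes), 2):
        xs.append(fractional_divergence(genomes[g1][reference],
                                        genomes[g2][reference]))
        ys.append(fractional_divergence(genomes[g1][protein],
                                        genomes[g2][protein]))
    return xs, ys


def linear_fit(xs, ys):
    """Least-squares line y = slope * x + intercept and Pearson r.
    Assumes xs and ys each contain at least two distinct values."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / sqrt(sxx * syy)
    return slope, intercept, r


# Toy, made-up aligned sequences (not real viral data).
toy_genomes = {
    "g1": {"rt": "AAAAAAAAAA", "gag": "CCCCCCCCCC"},
    "g2": {"rt": "AAAAAAAAAT", "gag": "CCCCCCCCGG"},
    "g3": {"rt": "AAAAAAAATT", "gag": "CCCCCCGGGG"},
}
xs, ys = paired_divergences(toy_genomes, "rt", "gag")
slope, intercept, r = linear_fit(xs, ys)
print(f"gag = {slope:.2f} * rt + {intercept:.2f}, r = {r:.2f}")
```

In this toy case the second protein diverges twice as fast as the reference in every comparison, so the fitted slope is the relative rate of fixation. A real analysis would also need a substitution-matrix correction and a check that no single genome dominates the regression.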
why the hypervariable gp120 protein shows a relatively low degree of change with respect to the reverse transcriptase (rt) can be explained by gap stripping, which eliminates the hypervariable regions. consequently the gp120 data effectively reflect the conserved regions.
(henikoff and henikoff, 1993). it is well established that protein sequence comparisons are more informative when weighted for genetic and structural biases in amino acid replacements. in the blosum weight matrices series, the actual matrix that was used depends on how similar the sequences to be aligned are. different matrices work differently at each evolutionary distance.
for a given virus, different protein sequence sets were compared to a given reference such as rt in the case of hiv/siv. n indicates the number of independent two-by-two comparisons. the data were checked for the possibility that a rogue genome strongly influenced the data. only in the case of the inoviridae were there insufficient complete sequences, six in fact, to yield satisfying analyses. instead all pairwise comparisons were made, hence the data points reflect dependent data (#). the forms of the linear regressions are given where y and x refer to the first and second protein listed in the column "paired proteins". the correlation coefficients r were highly significant in all cases, the corresponding probabilities being: + < 0.02; " < 0.005; * < 0.001.
figure 6.2 graphical representation of paired divergence for orthologous proteins taken from complete hiv-1, hiv-2 and siv genome sequences, y = different proteins, x = p66 sequence of the reverse transcriptase (rt). x and y values correspond to blosum-corrected fractional divergence. only non-overlapping regions were taken into account. the straight lines were obtained by linear regression analysis. their characteristics are given in table 6.1.
the linearity, even out to considerable differences, indicates that multiple substitutions and back mutations, which must be occurring, do so to comparable degrees. although these data were derived from completely sequenced primate immunodeficiency viral genomes, analyses on larger data sets, such as p17 gag/p24 gag or gp120/gp41, yielded relative values that differed from those given in table 6.1 by at most 14%. the absence of points far from the linear regression substantiates the assumption that recombination between highly divergent genomes is rare. this does not preclude recombination between closely related genomes. the linear regressions passed close to the origin in nearly all cases. only for nef was there some deviation, suggesting that nef was saturating to a different extent from all other proteins. however, as linear correlations involving nef data were always statistically significant, this trend may be fortuitous. note that the data cover the earliest phase, intrapatient variation (generally <10%), continuing smoothly to cover interclade, intertype and finally interspecies comparisons. yet this is in spite of different environments: that of an individual's immune system, different immune systems stigmatized by highly polymorphic hla, and finally differences between humans, chimpanzees, mandrills and african green monkeys accumulated over 30 million years. the same forces were uppermost during all stages of diversification. it is remarkable that the very different proteins, such as gp120 and the gp41 ectodomain (surface glycoproteins), p17 gag and p24 gag (structural), rt and integrase (enzymes) and nef and vif (cytoplasmic), all yield linear relationships (figure 6.2), as though fixation was an intrinsic property of the protein. applying the same analysis to complete rhinoviral genomes yielded comparable results, i.e.
highly significant linear relationships for vp1, vp2 and vp3 (capsid proteins), p2a and 3c (proteases), p2c (cytoplasmic proteins involved in membrane reorganization) compared with the rna-dependent rna polymerase (3d) as reference ( figure 6 .3, table 6 .1). hence figure 6 .2 does not represent some quirk of primate lentiviruses. of course, vertebrate viruses have a redoubtable adversary in the host adaptive immune system. the swiftness of secondary responses is reminder enough. an analysis of proteins derived from complete potyvirus genomes, positive-stranded rna viruses, yielded highly significant linear relationships (table 6 .1). a number of revealing points can be made. firstly, the linear relationships hold out to very large blosum distances (0.9). secondly, potyviruses infect a wide variety of different plants, as their florid names betray. finally, the linear relationships cannot result from adaptive immune pressure because plants are devoid of adaptive immune systems. they only have powerful innate immune responses. unfortunately there are insufficient insect rna viral sequences to allow a comparable study. however, a glance at a few beetle nodavirus capsid sequences (dasgupta et al., 1984; dasgupta and sgro, 1989) shows extensive genetic variation with a majority of synonymous base substitutions, typical of most comparisons of mammalian viral sequences (see below). for the time being there doesn't seem to be anything obviously different about insect virus sequence variation. although insects do not mount adaptive immune responses, the breadth and complexity of their innate immune systems is salutary (brey and hultmark, 1998) . a final example is afforded by the inoviruses, bacteriophages of the fd group, which includes m13. although dna viruses fix mutations at a slower rate than rna viruses, they too show linear relationships among comparisons of their i, ii, iii and iv proteins (table 6 .1). 
and of course bacteria are devoid of adaptive immunity as well. whether the comparisons were between capsid proteins versus enzymes, or secretory versus cytoplasmic molecules, significant linear relationships were obtained for pairwise comparisons in amino acid variation in all cases. such proteins are vastly different in their threedimensional folds and functions. some are "seen" by humoral immunity, others are not. for the plant viruses and bacteriophages, only innate immunity is operative. it is as though the rate of amino acid sequence accumulation is an intrinsic feature of the protein, reminiscent of the differing slopes for the accumulation of substitutions by alpha-globin and cytochrome c already alluded to. of course pairwise comparisons of these two proteins from differing organisms would yield a straight line going through the origin in a manner typical of figures 6.2 and 6.3. hence it is fairly safe to assume that, for viral proteins too, amino acid substitutions are accumulated smoothly over time. indeed, this has been shown explicitly for a number of proteins from a varied group of viruses, including the influenza a, coronaviruses, hiv and herpes viruses (hayashida et al., 1985; gojobori et al., 1990; querat et al., 1990; villaverde et al., 1991; elena et al., 1992; sanchez et al., 1992; mcgeoch et al., 1995; yang et al., 1995; leitner et al., 1997; plikat et al., 1997) . the above analyses indicate that viral protein diversification is essentially a smooth process, the major parameter being the nature of the protein more than the ecological niche it finds itself in. the simplest hypothesis to explain the smoothness of protein sequence diversification is that the majority of fixed amino acid substitutions are neutral, being accumulated at rates intrinsic to each protein. 
this is not to say that positive selection is inoperative, merely that the majority of fixed substitutions are essentially neutral, so much so that it does not strongly distort the data from a linear relationship expected for genetic drift. in other words, neither the impact of different environments nor the ferocity of the adaptive immune response has much to do with fixation of most substitutions. this is important for the one-dimensional man in all of us sequencers who see all mutations and ask questions about genotype and phenotype, usually about genotype. a short aside is necessary here. it is interesting that in a few areas of rna virology much has been made of escape from the adaptive immune response, particularly cytotoxic t lymphocytes, so leading to persistence (nowak and mcmichael, 1995; mcmichael and phillips, 1997). however, it is not at all obvious that this is the case (wain-hobson, 1996). it must not be forgotten that it is possible to vaccinate against a number of rna viruses such as measles, polio and yellow fever. be that as it may, many dna viruses, intracellular bacteria and parasites persist. in these cases de novo genetic variation arising from point mutations is too slow a means to thwart an adaptive immune response. for example, after 1700 generations, under experimental conditions whereby muller's ratchet was operative, s. typhimurium accumulated mutations such that only 1% of the 444 lineages tested had suffered an obvious loss of fitness (anderson and hughes, 1996). that this number of generations could be achieved within as little as 45 days gives an idea of the time necessary to generate a mutation affecting fitness. this is more than enough time to make a vigorous immune response. some inklings of immune system escape for the herpes virus ebv (de campos-lima et al., 1993) came to nought (burrows et al., 1996; khanna et al., 1997).
when antigenic variation is in evidence among dna-based microbes, it invariably results from the use of cassettes and multicopy genes rather than point mutations resulting from dna replication. and of course such complex systems could have only come about by natural selection. finally, de novo genetic variation of an rna virus has never been suggested or shown to be necessary for the course of an acute infection. for a virus to persist thanks to genetic variation, the phenomenon of epitope escape must be strongly in evidence by the time of seroconversion, generally 5-6 weeks. yet such data are not forthcoming, and not for want of trying. when viruses do play tricks with the immune system it is invariably by way of specific viral gene products that interfere with the mechanics of adaptive and innate immunity (ploegh, 1998). in the clear cases where genetic variation is exploited by rna viruses, it is used to overcome barriers to transmission set up by the host population, e.g. herd immunity. the obvious example is influenza a virus antigenic variation in mammals. another way of assessing the contribution of positive selection to sequence variation is to compare the relative proportions of synonymous (ks) and non-synonymous (ka) base substitutions per site. a ka/ks ratio of less than 1 indicates that purifying selection is uppermost, while a ratio of more than 1 is taken as evidence of an excess of positive selection. comparisons for hiv proteins from different isolates have yielded the same result (myers and korber, 1994). some mileage was made out of the fact that this ratio increased with increasing distance of sivs with respect to hiv-1, which in turn led to a discussion of siv pathogenesis (shpaer and mullins, 1993). however, this may reflect a lack of adequate correction for multiple hits. this effect is illustrated by a comparison of the set of 72 orthologous proteins encoded by herpes simplex viruses 1 and 2 (hsv-1 and hsv-2; figure 6.4a).
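The distinction between synonymous and non-synonymous substitutions can be made concrete with a short Python sketch. It is only an illustration under simplifying assumptions: it returns raw counts rather than per-site Ka and Ks, applies no correction for multiple hits, skips codons that differ at more than one position (their mutational pathway is ambiguous), and uses the standard genetic code with DNA letters. The function name and the example sequences are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def count_substitutions(cds_a, cds_b):
    """Count synonymous and non-synonymous single-base codon changes
    between two aligned, in-frame coding sequences.

    Returns (synonymous, nonsynonymous, skipped), where 'skipped' counts
    differing codons with several changes or unrecognised symbols."""
    if len(cds_a) != len(cds_b) or len(cds_a) % 3:
        raise ValueError("need aligned in-frame sequences of equal length")
    synonymous = nonsynonymous = skipped = 0
    for i in range(0, len(cds_a), 3):
        codon_a = cds_a[i:i + 3].upper()
        codon_b = cds_b[i:i + 3].upper()
        if codon_a == codon_b:
            continue
        n_diff = sum(x != y for x, y in zip(codon_a, codon_b))
        if n_diff != 1 or codon_a not in CODON_TABLE or codon_b not in CODON_TABLE:
            skipped += 1
        elif CODON_TABLE[codon_a] == CODON_TABLE[codon_b]:
            synonymous += 1
        else:
            nonsynonymous += 1
    return synonymous, nonsynonymous, skipped


# Made-up example: Met-Lys-Leu-Gly versus Met-Lys-Leu-Asp.
print(count_substitutions("ATGAAACTGGGT", "ATGAAGCTAGAT"))
```

Turning such counts into a Ka/Ks ratio additionally requires the numbers of synonymous and non-synonymous sites and a multiple-hit correction, which is where saturation of Ks becomes a concern for divergent sequences.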
dolan et al., 1998). a. ka/ks ratio as a function of uncorrected percentage amino acid sequence divergence (linear regression was ka/ks = 1.25 divergence + 0.04, r = 0.87 (p < 0.001)). b. individual ks and ka variation with percentage divergence (ks = 0.53 divergence + 0.35 and ka = 0.76 divergence - 0.03 with correlation coefficients of 0.54 and 0.97 respectively, p < 0.001 for both). note how at small degrees of divergence ks >> ka, an excess that decreases as divergence increases. basically, ks is approaching saturation, being uncorrected for multiple and/or back mutations.
the more divergent the protein sequence, the greater the ka/ks ratio. that some proteins fix substitutions faster than others is no surprise. yet as figure 6.4b shows, the ks values change little as they are near to saturation. when ka is small, ks > ka. this suggests that reliable interpretation of ka/ks ratios is possible only when the degree of nucleic acid divergence is small. now this is the realm of viral quasispecies rendered accessible by pcr. hiv studies abound, reflecting both the phenomenal degree of sequence variation and its importance as a pathogen, so we'll stick to some such examples that are illustrative. concerning ka/ks ratios for hiv gene segments, widely varying conclusions have been published supporting all sides (meyerhans et al., 1989; pelletier et al., 1995; wolinsky et al., 1996; leigh brown, 1997; price et al., 1998), so much so that three comments are in order. firstly, many studies have used small numbers of sequences and substitutions and even regions as small as nonameric hla class i-restricted epitopes. in such cases statistical analyses are essential to test the significance of the distribution of synonymous and non-synonymous substitutions. this is particularly important as the point substitution matrix is highly biased (pelletier et al., 1995; plikat et al., 1997).
it turns out that when the proportions are so analysed the distributions are rarely significantly different from the neutral hypothesis (leigh brown, 1997) . secondly, the method for counting substitutions is highly variable, ranging from two-by-two comparisons, scoring the number of altered sites in a data set, to phylogenetic reconstruction. this latter method reflects more closely the process of genetic diversification. when so analysed, almost all of the data sets indicated proportions of synonymous to non-synonymous substitutions indistinguishable from that suggested by genetic drift and/or purifying selection (pelletier et al., 1995; plikat et al., 1997) . thirdly, prudence is called for. the fact that obviously defective sequences can be identified, occasionally accounting for large fractions of the sample (martins et al., 1991; gao et al., 1992) , indicates that not all genomes have undergone the rigours of selection (nietfield et al., 1995) . indeed, in peripheral blood, hiv is invariably lurking as a silent provirus within a resting memory t-cell. such t-cells have half lives of 3 months or more (michie et al., 1992) . hence it would be erroneous to interpret findings based on a single or clustered samples (price et al., 1997) . only when the above caveats are borne in mind is there any hope of discerning how hiv accumulates mutations. when these issues are attended to, purifying selection is dominant (pelletier et al., 1995; leigh brown, 1997; plikat et al., 1997) . one must not deny that positive selection is operative, merely that it is hard to pinpoint when looking at full-length sequences. indeed it is like looking for the proverbial needle in a haystack. in the context of ka/ks-type analyses, the two classic cases in the literature are the hla class i and ii molecules and influenza a haemagglutinin (hughes, 1998; hughes and nei, 1988; ina and gojobori, 1994) . 
the peptide contact residues of both class i and ii molecules have been under tremendous positive selection. changes in the five antigenic sites on the flu a haemagglutinin help the virus overcome herd immunity set up during previous flu epidemics. consequently, finding ka/ks > 1 in these regions was, in some ways, a pyrrhic victory because the papers needed experimental data to identify the positively selected segments in the first place. more recently endo et al. (1996) have screened the sequence databases for proteins in which ka/ks > 1. of 3595 homologous gene groups screened, covering about 20 000 sequences, only 17 groups came up positive, of which two were encoded by rna viruses: the equine infectious anaemia virus envelope proteins and the reovirus g1 (outer capsid) proteins. the former case is intriguing as there is no obvious correlation between sequence changes and neutralizing antibodies (carpenter et al., 1987). the authors noted that, when a comparable ka/ks analysis was restricted to small segments, the number of protein groups scoring positive rose to 5% (endo et al., 1996). despite the explanatory power of these ratios, the number of identifiable cases of positively selected segments is small indeed. these numbers would probably shrink were phylogenetic reconstruction used. to summarize the section, synonymous changes are invariably more frequent than nonsynonymous changes. positive selection may be operative in the evolution of viral protein sequences. when it is, it apparently exploits only a small fraction of mutants. the two rates touted by evolutionary-minded virologists are the mutation rate and the mutation fixation rate. the first describes the rate of genesis of mutations, the second attempts to describe their fixation within the population sampled over a period of time. in the case where all substitutions are neutral, the mutation rate (m) equals the fixation rate (f) per round of replication.
it appears that such a situation applies to the evolution of parts of the siv and hiv-1 genomes over 1-3 years (pelletier et al., 1995; plikat et al., 1997). if fixation rates are measured over one year, then $f = n \cdot m$, where n is the annual number of consecutive rounds of replication. it is simple to show that several hundred rounds of sequential replication are required (wain-hobson, 1993b; pelletier et al., 1995). given that the proviral load of an hiv-1-positive patient (~$10^{7}$-$10^{9}$) changes by less than a factor of 10 over 5 years or more, and given the assumption that an infected cell produces sufficient virus to generate two productively infected cells, then annual production would be something akin to $2^{200}$, or $10^{60}$, which is impossible. clearly even a productive burst size of 2 is too large (wain-hobson, 1993a,b). this must be reduced to 1.1 to achieve a realistic proviral load ($1.1^{200} \approx 10^{8}$). note that the real value for the effective burst size must be even lower, as proviral load is turning over more slowly than once a day. yet to explain the temporal increase in proviral load, the productive burst size must be 2 or more. thus the calculation reveals massive destruction of infected cells, precisely what was to be expected from immensely powerful innate and adaptive immune responses. when purifying selection is in evidence, some additional factor must be introduced to couple the fixation and mutation rates. as the accumulation of most substitutions proceeds in a protein-specific linear manner for small degrees of divergence, the above equation can be modified to $f = p \cdot n \cdot m$, where $1 > p > 0$ is a constant indicating the degree of negative selection. note immediately that, as $p < 1$, more rounds of replication are needed to produce the same percentage amino acid fixation. a corollary is an even greater degree of destruction of infected cells.
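The arithmetic behind this argument is easy to check. The Python sketch below assumes 200 sequential rounds of replication per year, as in the text, and treats the burst size as the number of newly infected cells produced per infected cell; the function names are hypothetical, and the values passed to `selection_factor` in the note that follows are illustrative only.

```python
import math


def infected_cells(burst_size, rounds, start=1):
    """Cells produced after a number of sequential rounds if every
    infected cell gives rise to `burst_size` productively infected cells."""
    return start * burst_size ** rounds


def fixation_rate(p, n, m):
    """Annual fixation rate f = p * n * m (p: fraction of mutations that
    survive purifying selection, n: rounds per year, m: mutation rate)."""
    return p * n * m


def selection_factor(f, n, m):
    """Solve f = p * n * m for p."""
    return f / (n * m)


for burst in (2, 1.1):
    exponent = math.log10(infected_cells(burst, 200))
    print(f"burst size {burst}: about 10^{exponent:.1f} cells after 200 rounds")
```

With illustrative rates such as f = 1e-5 per site per year and m = 1e-4 per site per cycle, `selection_factor` gives p between 2e-3 (n = 50) and 5e-4 (n = 200), which shows how strongly purifying selection must filter new mutations when fixation is slow.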
consider the example of a virus that is fixing substitutions only slowly, about $10^{-5}$ per site per year, something like the ebola virus glycoprotein. the mutation rate for ebola is not known but is probably around $10^{-4}$ per site per cycle (drake, 1993). hence $p \cdot n \approx 10^{-1}$. what is the value of n? most mammalian viruses replicate within 24 h, while obviously outside of a body they do not replicate. consequently a value of n = 50-200 is probably not unreasonable. accordingly $p \approx 2 \times 10^{-3}$ to $5 \times 10^{-4}$. this means that most mutations generated are deleterious. of those that are fixed, most are neutral, as has been discussed above. the last two sentences describe a profoundly conservative strategy: rna viruses are seen merely to replicate far more than giving rise to genetically distinct, even exotic, siblings. what a stultifying picture, in contrast to the shock-horror of tabloid newspaper virology and that atmospheric, yet profoundly ambiguous term, emerging viruses. conservative perhaps, but is there any suggestion that viruses are more or less so than other replicons? like extrapolation, choosing examples can be problematic. however let's consider one example, the eukaryotic and retroviral aspartic proteases (doolittle et al., 1989). the former exist as a monomer with two homologous domains, while the retroviral counterpart functions as a homodimer. despite these differences the folding patterns are almost identical, meaning that the enzymes may be considered orthologous. between humans and chickens there is approximately 38% amino acid divergence among typical aspartic proteases (figure 6.5). the hiv-1 and hiv-2 proteases differ by a little more, 52%. no one would doubt the considerable differences in design, metabolism and lifestyle separating us and chickens. on either side of the hiv protease coding region one finds differences: hiv-1 is vpx-vpu+ while hiv-2 is the opposite, i.e.
vpx+vpu-; there are differences in the size and activities of the tat gene product; the ltrs are subtly different. yet both replicate in the same cells in vivo, produce the same disease, albeit with different kinetics: hiv-2 infection progresses more slowly. if these differences are esteemed too substantial, consider the 28% divergence between the hiv and chimpanzee siv proteases. these two viruses are isogenic. pig and human chromosomal aspartic proteases may differ by around 17%, the differences between these two species being, george orwell apart, obvious to all. even by this crude example, the aids viruses would seem to be more conservative than mammals in their evolution. the same argument pertains to the rhinoviral p2a and 3c serine proteases (figure 6.5). this conclusion is even more surprising when it is realized that hiv is fixing mutations at a rate of $10^{-2}$-$10^{-3}$ per base per year. by contrast, mammals are fixing mutations approximately one million times less rapidly, i.e. approximately $10^{-8}$-$10^{-9}$ per base per year (gojobori and yokoyama, 1987). however, the generation times of the two are vastly different, about 1 day for hiv and about 15-25 years for humans. normalizing for this yields a 100-fold higher fixation rate per generation for hiv than for humans. amalgamating this with the preceding paragraph, we see that hiv is not only evolving qualitatively in a conservative manner, but it is doing so despite a 100-fold greater propensity to accommodate change. the same arguments go for almost all rna viruses and retroviruses. why is this? although they mutate rapidly, their hosts are effectively invariant in an evolutionary sense. probably sticking to the niche is all that matters, which is no mean task given the strength of innate and adaptive antiviral immune responses. john maynard smith's argument was simply put.
for organisms with a base substitution rate of less than 1 per genome per cycle, he reasoned that all intermediates linking any two sequences must be viable, otherwise the lineage would go extinct. the example used was self explanatory: word → wore → gore → gone → gene (maynard smith, 1970). the same is true for viruses, even though their mutation rates are 6 orders higher; the rate for a given protein is still less than 1 substitution per cycle. even for rather stable viruses like ebola/marburg and human t-cell leukaemia virus type 1 and 2 (htlv-1/-2), the number of intermediates is huge. while the enormity of sequence space is basically impossible to comprehend, the amount accessible to a virus remains vast. for the lineage to exist, the probability of finding a viable mutant must be at least 1/population size within the host. imagine a stem-loop structure. any replacement of a g:c base pair must proceed by a single substitution, given that the probability of a double mutation is approximately 10^-4 that of a single mutation. let substitution of a g:c pair pass by a g:u intermediate, finishing up as a:u. although g:u mismatches are the most stable of all mismatches, they are less so than either a g:c or an a:u pair. there are two scenarios: either the g:u substitution is of so little consequence that it is fixed per se, in which case there would be no selection pressure to complete the process to a:u. alternatively, the g:u substitution is sufficiently deleterious for selection of a secondary mutation to occur from a pool of variants, so completing the process. yet the g:u intermediate cannot be so debilitating, otherwise the process would have little chance of going to completion. note also that if the fitness difference is small with respect to the g:c or a:u forms, more rounds of replication are necessary to achieve fixation of g:u to a:u. a corollary is that there must be a range within which fitness variation is tolerated.
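maynard smith's condition, that each sequence in a lineage lies one substitution away from its neighbour, can be checked mechanically. the python sketch below is an illustration only: the function names `hamming` and `is_single_step_path`, and the two-letter encoding of a base pair, are assumptions introduced here rather than anything taken from the original argument. it tests only the single-step property and says nothing about whether an intermediate is viable.

```python
def hamming(a: str, b: str) -> int:
    """Count the positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def is_single_step_path(path: list[str]) -> bool:
    """True if every consecutive pair differs by exactly one substitution."""
    return all(hamming(a, b) == 1 for a, b in zip(path, path[1:]))


# Maynard Smith's word example (hypothetical encoding: one letter per site).
word_path = ["word", "wore", "gore", "gone", "gene"]

# One stem position written as 5' base + 3' base: G:C -> G:U -> A:U.
pair_path = ["GC", "GU", "AU"]

print(is_single_step_path(word_path))     # every step changes one letter
print(is_single_step_path(pair_path))     # the G:U intermediate is one step from each end
print(hamming("word", "gene"))            # the end points differ at all four sites
print(is_single_step_path(["GC", "AU"]))  # the direct jump needs a double mutation
```

the last call makes the point about the g:u intermediate: going from g:c to a:u without it requires two simultaneous changes, which the text puts at roughly 10^-4 the probability of a single one.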
this tolerance of a range of fitness values is reminiscent of nearly neutral theories of evolution and their extension to rna viruses (chao, 1997; ohta, 1997). note also that from a theoretical perspective the same secondary structure can be found in all parts of sequence space with easy connectivity (schuster, 1995; schuster et al., 1997). figure 6.6 shows a number of variations on an hiv stem-loop structure, crucial for ribosomal frameshifting between the gag and pol open reading frames. there have been substitutions at positions 1, 2, 5, 8 and 11 and even an opening up of the loop. all come from viable strains, yet the environment in which these structures are operative, the human ribosome, is invariant. if the changes are all neutral the situation is formally comparable to the steady accumulation of amino acid substitutions. however, if the intermediates are less fit, it has to be understood how they can survive long enough in the face of a plethora of competitors, approximately 1/mutation rate or about 10 000 for hiv. the latter is probably the case as there are hiv-1 genomes with c:g to u:a substitutions at positions 5 and 8 (figure 6.6). extensions of nearly neutral theory would fit these findings well (chao, 1997; ohta, 1997). that there are many solutions to this stem-loop problem is clear. if hiv-2 is brought into the picture, the remarkable plurality of solutions is further emphasized (figure 6.6). degeneracy in solutions found by viruses is revealed by some interesting experiments on viral revertants. the initial lesions substantially inactivated the virus. yet with a bit of patience, sometimes more than 6 months, replication-competent variants that were not back mutations were identified (klaver and berkhout, 1994; olsthoorn et al., 1994; berkhout et al., 1997; willey et al., 1988; escarmis et al., 1999). as the frequencies of mutation and back mutation are not equivalent, such findings are, perhaps, not surprising.
what they show is the range of possible solutions adjacent to that created by the experimentalist. loss of fitness can be achieved by sequential plaquing of rna viruses, the so-called muller's ratchet experiment, which has been analysed at the genetic level for fmdv. different lesions characterized different lineages. recent work was aimed at characterizing the molecular basis of fitness recovery following large population passage. not one solution was found but a variety, even in parallel experiments (escarmis et al., 1999). this reveals the impact of chance in fitness selection on a finite population of variants, which is trivially small given the immensity of sequence space.

figure 6.6 "shifty" rna stem-loop structures from hiv-1 m, n and o group strains as well as from hiv-2 rod. this structure is part of the information that instructs the ribosome to shift from the gag open reading frame to that of pol. in addition to the hairpin is a heptameric sequence (underlined). frameshifting occurs within the gag uua codon within the heptamer and continues agg.gaa etc. * highlights differences in nucleotide sequences compared with the m reference strain lai.

another example of degeneracy in viable solutions is the isolation of functional ribozymes from randomly synthesized rna (bartel and szostak, 1993; ekland et al., 1995). from a pool of approximately 10^14 variants, through repeated rounds of positive selection, it was estimated that the frequency of the ribozyme was of the order of 10^-8, which is small indeed. yet 10^-8 × 10^14 = 10^6. even erring by four orders of magnitude, 100 distinct ribozymes could well have been present in the initial pool. although the sequence space occupied may well represent a tiny proportion of that possible for an rna molecule of length n, the space is so large that the number of viable solutions is large, large enough to permit a plethora of parallel solutions to the same problem.
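the ribozyme arithmetic above is easy to reproduce. the sketch below is only a worked check of the two figures quoted in the text, a pool of about 10^14 variants and a frequency of the order of 10^-8. the names `POOL_SIZE`, `RIBOZYME_FREQUENCY` and `expected_hits` are assumptions made for illustration, and exact fractions are used simply to avoid floating-point rounding.

```python
from fractions import Fraction

# Figures quoted in the text; both are order-of-magnitude estimates.
POOL_SIZE = 10**14                        # random RNA variants in the initial pool
RIBOZYME_FREQUENCY = Fraction(1, 10**8)   # estimated frequency of functional ribozymes


def expected_hits(pool_size: int, frequency: Fraction) -> Fraction:
    """Expected number of functional molecules in a pool of the given size."""
    return pool_size * frequency


expected = expected_hits(POOL_SIZE, RIBOZYME_FREQUENCY)
# Allow the estimate to be wrong by four orders of magnitude, as in the text.
pessimistic = expected / 10**4

print(int(expected))      # 1000000
print(int(pessimistic))   # 100
```

even the pessimistic figure leaves on the order of a hundred candidates in the starting pool, which is the sense in which a rare solution can still be plural.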
these experiments, ribozyme from dust, are cases in plurality. further evidence of the large proportion of viable solutions in protein sequence space comes from in vitro mutagenesis. for example bacteriophage t4 lysozyme can absorb large numbers of substitutions (rennell et al., 1991) with very few sites resisting replacement (figure 6.7). other examples include the lymphokine, interleukin 3, in which some forms with enhanced characteristics were noted (olins et al., 1995; klein et al., 1997). with modern mutagenic methods allowing mutation rates of 0.1 per base per site or less, hypermutants of the e. coli r67 dihydrofolate reductase (dhfr) were found by random sequencing of as few as 30 clones (martinez et al., 1996). whatever the mutation bias, mutants with 3-5 amino acid replacements within the 78-residue protein could be attained (figure 6.8). other mutagenesis studies sought enzymes with enhanced catalytic constants or chemical stability. for subtilisin e, variants with enhanced features for two parameters could be identified from a relatively small population of randomly mutagenized molecules (kucher and arnold, 1997). these data indicate that functional sequence space is probably far more dense than hitherto thought. most of the above examples concern maintenance or enhancement of function. an interesting example was recently afforded by engineering cyclophilin into a proline-specific endopeptidase (quéméneur et al., 1998). the proline binding pocket of cyclophilin was modified such that a single amino acid change (a91s) generated a novel serine endopeptidase with a 101~ proficiency with respect to cyclophilin. addition of two further substitutions (f104h and n106d) generated a serine-aspartic-acid-histidine catalytic triad, the hallmark of serine proteases. the final enzyme proficiency was 3.5 × 10^11 mol/l, typical of many natural enzymes. this shows the interconnectedness of sequence spaces for two functionally very different proteins.
if sequence space were sparsely populated, the probability of observing such phenomena would be small. many viruses recombine, and via molecular biology more can be made, some of which are tremendously useful research tools, such as the shivs, chimeras between siv and hiv (figure 6.9). although many groups have tried to recombine naturally hiv-1 with hiv-2 or siv, none has succeeded. natural and artificial recombination represent major jumps in sequence space. that one can observe such genomes means that the new site in functional sequence space must be only a few mutations from a reasonably viable solution, otherwise it would take too long to generate large numbers of cycles and, along with them, mutants. the ferocity of innate and adaptive immunity must never be forgotten.

figure 6.7 systematic amino acid replacement of bacteriophage t4 lysozyme residues. amber stop codons were engineered singly into each residue apart from the initiator methionine. the plasmids were used to transform 13 suppressor strains. of the resulting 2015 single amino acid substitutions, 328 were found to be sufficiently deleterious to inhibit plaque formation. more than half (55%) of the positions in the protein tolerated all substitutions examined. the side chains of residues that were refractory to substitution were generally inaccessible to solvent. the catalytic residues are glu11 and asp20. adapted from rennell et al., 1991.

the e. coli r67 plasmid. all were trimethoprim resistant. only differences with respect to the parent sequence are shown. a representation of the three-dimensional structure is shown above. adapted from martinez et al., 1996, with permission.

off on an apparent tangent, the phylogeny of geoffrey chaucer's the canterbury tales was recently analysed by programs tried and tested for nucleic acid sequences. the authors used 850 lines from 58 fifteenth-century manuscripts (barbrook et al., 1998).
apart from the fact that it appears that chaucer did not leave a final version but some annotated working copy, the radiation in medieval english space is fascinating. all the versions are viable and "phenotypically" equivalent even though the "genotypes" are not so. it is ironic that william caxton's first printed edition was far removed from the original (n.b., printers merely make fewer errors than scribes, tantamount to adding a 3' exonuclease domain to an rna polymerase). given the inevitability of mutation, is it possible that over the aeons natural selection has selected for proteins that are robust, those that are capable of absorbing endless substitutions? for if amino acid substitutions were very difficult to fix, huge populations would need to be explored before change could be accommodated. recently the unstructured n-terminal segment of the e. coli r67 dhfr was shown to stabilize amino acid substitutions in a non-functional miniprotein devoid of this segment (figure 6.10; martinez et al., 1996). while the mechanism by which this occurs is unknown, it suggests that there may be parts of proteins, even multiple or discontinuous segments, that may help the protein accommodate inevitable change. formally it can be seen that such proteins would have both short- and long-term selective advantages, for they would permit the generation of larger populations of relatively viable variants as well as buffering the lineage against the effects of bottlenecking.

figure 6.9 genetic organization of naturally occurring hiv-1 and siv recombinants and unnatural, genetically engineered, siv-hiv-1 chimeras called shivs. segments are hatched according to strain origin. references are hiv-1 mal and hiv-1 ibng (gao et al., 1996), hiv-1 92rw009.6 (gao et al., 1998), sivagm sab-1 (jin et al., 1994) and shivsbg (dunn et al., 1996).

what fraction of amino acid residues is necessary for function? answer: very few.
a few examples taken from among the primate immunodeficiency viruses are typical. almost all these viruses infect the same target cell using the membrane proteins cd4 and ccr5. primary hiv-1 isolates use the chemokine receptor ccr5 and rarely the homologous molecule cxcr4, which differs by 81% in its extracellular domains. yet two substitutions in the viral envelope protein gp120 are sufficient to allow use of the cxcr4 molecule (hwang et al., 1991). curiously, the ccr5 chemokine receptor homologue, us28, encoded by human cytomegalovirus, can be used by hiv-1 despite the fact that us28 and ccr5 differ by 88% in the same extracellular regions. clearly only a small set of residues is necessary for docking. another example is afforded by the vpu protein, which is unique to hiv-1 and the chimpanzee virus sivcpz (huet et al., 1990). vpu is a small protein inserted into the endoplasmic reticulum, tucked well away from humoral immunity. despite an average amino acid sequence difference of 0.5% among orthologous human and chimpanzee proteins, hiv/sivcpz vpu divergence is almost beyond reliable sequence alignment (figure 6.11): an n-terminal hydrophobic membrane anchor and a couple of perfectly conserved serine residues, which are phosphorylated, and that's about it. among hiv-1 strains, or between sivcpz sequences, the situation was a little better. yet the necessity of keeping vpu is beyond doubt. a fine final example concerns the hiv/siv rev proteins. these small nuclear proteins are crucial to viral replication. despite this, only 5 residues are perfectly conserved. the situation has been taken beyond the limit, at least ex vivo, in that the htlv-1 rex protein can functionally complement for hiv-1 rev (rimsky et al., 1989), despite the fact that they are completely different proteins. the above is reminiscent of what is known about enzymes and surface recognition.
provided the protein fold is maintained, only a small fraction of residues actually contribute to function, a point made recently in two reviews on rna viral proteases (ryan and flint, 1997; ryan et al., 1998). insertions and deletions are generally less than 2-3 residues in length and confined to turns, loops and coils (pascarella and argos, 1992). if globular proteins or at least domains are, to a first approximation, taken as spheres, then the surface area is the least for any volume. if amino acids are equally viewed as smaller, closely packed spheres, then a minimum number will be exposed on the surface, ready to partake in recognition and function. the molecular biologist frequently thinks like an engineer who can redesign from scratch. yet replicons have been constrained by a series of historical events representing variations on a founding theme. while they are fit enough to survive, are they the best possible? this question is salutary, for we live in a society that is more and more competitive and, thanks to global communications, knows about the most successful athletes or businessmen worldwide. yet who can remember the name of any olympic athlete who came in fourth? is not fourth best in any large population remarkable? how good are viruses as machines? once again let us look at some examples from hiv-1. reverse transcription feeds on cytoplasmic dntps. yet supplementing the culture milieu with deoxycytidine, which is scavenged and phosphorylated to the triphosphate, substantially increased viral replication (meyerhans et al., 1994). it is known that good expression of a foreign protein is frequently compromised by inappropriate codon usage. by redesigning codon usage of the jellyfish (aequorea victoria) green fluorescent protein gene to correspond to that typical of mammalian genes, greatly improved expression was achieved in mammalian cells (haas et al., 1996).
the same group engineered codon usage of the hiv-1 gp120 glycoprotein gene segment to correspond to that of the abundantly expressed human thy-1 surface antigen. again expression was greatly improved (haas et al., 1996). the coup de grace came with the reciprocal experiment: engineering thy-1 gene codon usage to correspond to that of gp120. thy-1 surface expression was greatly reduced (haas et al., 1996). since hiv-1 was first sequenced, it has been known that its codon usage is highly biased (wain-hobson et al., 1985; bronson and anderson, 1994). something is clearly overriding maximal envelope expression. furthermore, gp120 codon usage is similar for all other hiv-1 genes whether they be structural or regulatory. for that matter, codon usage is comparable for most lentiviruses (bronson and anderson, 1994). it was possible to show via dna vaccination that codon-engineered gp120 elicited stronger immune responses in mice than the normal counterpart (andre et al., 1998). might this finding suggest that the optimum is actually away from mass production? yet if there is a shadow of reality in this thesis, it indicates that fitness optima in vivo may not necessarily parallel the expectations of fitness based on ex-vivo models. in this context note also that htlv-1 infects exactly the same cell as hiv, yet its codon usage is very different from that of hiv and the thy-1 gene (seiki et al., 1983). if fitness optimization were ever operative in vivo, then one would predict steady increases in virulence for those viruses that do not set up herd immunity. at some point a plateau would be reached. yet the higgledy-piggledy way by which virulent strains come and go suggests that this is not so. some might use the word stochastic. whatever. if fitness selection can be overridden and we don't have a good theory for it, then we're in a sorry state.
there is abundant evidence that, as a good first approximation, rna viruses ex vivo perform as expected from the quasi-species model (holland et al., 1982; eigen and biebricher, 1988; duarte et al., 1992; clarke et al., 1993; eigen, 1993; novella et al., 1995; domingo et al., 1996; quer et al., 1996; domingo and holland, 1997), which is fitness dominated. problems arise transposing it to the in vivo situation, notably:
• first and foremost: how does one determine fitness in vivo? should such measurements score intrahost viral titres or transmission probabilities from an index case? (if a virus doesn't spread it's dead.) for outbred populations, is it in fact virulence?
• second: host innate immunity is hugely powerful, a fact leading rolf zinkernagel to remark with typical aplomb that in terms of immunity "an interferon receptor knock-out mouse is a 1% mouse" (huang et al., 1993; van den broek et al., 1995a,b). yet the enhanced susceptibility of scid humans or various knock-out mice to infections indicates the part played by adaptive immunity. for example, influenza a can persist in scid children (rocha et al., 1991). how are innate and adaptive immune responses coupled and how are they influenced by genetic polymorphisms?
• third: with acquired immunity rising by day 3 in an acute infection, the virus is replicating in the face of a predator whose amplitude is increasing.
• fourth: immune responses are density-dependent. that is, the more the virus replicates the stronger the immune response. if the relationship were simply linear one could see how a virus might be able to keep just ahead, given a short lag in the immune response time. but if it were non-linear? indeed it must be so, otherwise it would not be possible to resolve an acute infection. it is not easy to discern where optimal viral fitness would lie.
• fifth: the wrath of combined immune responses is such that there is massive viral turnover.
for the three best known cases, hiv, hbv and hcv, between 10^8 and 10^12 virions are turning over daily, representing between 10% and 90% of the whole (ho et al., 1995; wei et al., 1995; nowak et al., 1996; zeuzem et al., 1998). indeed, these are probably underestimates, given beautiful data from the late 1950s and 1960s showing that, for a variety of rna viruses, plasma titres decay with a half-life of 15-20 min, whether the animal be immunologically naive or primed (mimms, 1959; nathanson and harrington, 1967). from this one may conclude that any viral population is unlikely to be in equilibrium. and if a population is not in equilibrium, fitness selection is compromised.
• sixth: a glance at any histology slide or textbook is a salient reminder of spatial discontinuities over distances of one or two cell diameters. for example the hugely delocalized immune system is characterized by a multitude of different lymphoid organs, a myriad of subtly different susceptible cell types, and a mêlée of membrane molecules. the exquisite spatial heterogeneity of hiv within the epidermis and splenic white pulps has been described (cheynier et al., 1994, 1998; sala et al., 1994). the same seems to be true for hcv-infected liver (martell et al., 1992). for hpv infiltration of skin, spatial discontinuities and gradients are also apparent. discontinuities reduce the possibilities for competition and hence selection of the fitter forms. indeed the muller's ratchet experiment and clonal heterogeneity are the most vivid expressions of this.
• seventh: much has been made of privileged sites and viral reservoirs. basically this is reminding us of the fact that immune surveillance is modulated in some organs like the brain. there are some suggestions that cytotoxic t-cells have difficulty infiltrating the kidney. viral reservoirs undermine fitness selection.
• eighth: in the case of the immunodeficiency viruses, antigenic stimulation of infected yet resting memory t-cells means that variants may become amplified for reasons that have nothing to do with the fitness of the variant (cheynier et al., 1994, 1998).
mayr again: "wherever one looks in nature, one finds uniqueness" (mayr, 1997). as mentioned, the cardinal difference between the behaviour of rna viruses ex vivo and in vivo is the existence of spatial discontinuities. for replicons, cloning is the ultimate separation. it allows a variant to break away from dominating competitors, disrupts or uncouples a fitter variant locked in competitive exclusion (de la torre and holland, 1990). the effect of bottlenecking on fitness, as well as the muller's ratchet experiments, have been described (chao, 1990; duarte et al., 1992; novella et al., 1995; escarmis et al., 1996). transmission frequently involves massive bottlenecking, and is very much an exercise in cloning. all this should not surprise because allopatric speciation is omnipresent in the origin of species, darwin's galapagos finches being an obvious example. given the non-equilibrium structure of viral variants and the vastly restricted population sizes with respect to sequence space, founder effects in vivo take on great importance. while answers for some of these issues seem far away, constraints on fitness selection cannot be so strong that a chain of infections becomes a muller's ratchet experiment. yet is that correct? in the experiments with phage φ6, vsv and fmdv, most of the lineages resulted in decreased fitness. yet for some there were no changes, while for a few there were even increases in the fitness vectors (chao, 1990; duarte et al., 1992; escarmis et al., 1996). could symptomatic infections reflect bottleneck transmission of those fitter clones with asymptomatic (subclinical) infection representing fitness-compromised clones?
analysis of rna viruses ex vivo is analogous to the study of bacteria in chemostats. fitness selection dominates. yet there is a world of difference between bacterial strains so selected and natural isolates. one of the observations frequently made upon isolation of pathogenic bacteria is the loss of bacterial virulence determinants (miller et al., 1989). indeed, ex-vivo passage of rna viruses has been used to select for attenuated strains used in vaccination. a virus must replicate sufficiently within a host to permit infection of another susceptible host. if the new host is of the same species, differences between the two are minimal, a small degree of polymorphism being inevitable in outbred populations. given that viruses with a small coding capacity interact particularly intimately with the host-cell machinery, it follows that infection of a host from a related species has a greater probability of succeeding if the cellular machinery is comparable. indeed, the closer the two species, the greater the probability. in turn, if the virus gets a toehold and can generate a quasispecies, then only few mutations would probably be necessary to adapt to the new niche. yet species is a difficult word. what might a viral species be? martin (1993) wrote a fascinating review on the number of extinct primate species estimated from the fossil record. depending on the emergence time of primates of modern aspect, he was able to estimate the total number that existed as 5500-6500. the present number of 200 primate species would thus represent about 3.4-3.8%. more importantly from our viewpoint was his calculation of the average survival time of fossil primate species as a mere 1 million years (martin, 1993). given that rna viruses are fixing mutations approximately 1 million times faster than mammals (holland et al., 1982; gojobori and yokoyama, 1985), a viral species would become extinct after approximately 1 year! immediately the annual influenza a strain comes to mind.
yet rabies, polio and htlv-1 have arguably been around for millennia. clearly the word "species", when taken from primatology, cannot apply to the viral world. frogs provide a more interesting example. they have been around for several hundred million years, and members of some lineages can interbreed despite 75 million years separation. naturally, their protein sequences have not stood still during that time (wilson et al., 1977). enough is conserved to allow breeding. maybe the primate picture has undue weight in our appreciation of virology. phenotype can be maintained despite changes in genotype, obvious to a biologist. as usual, holland wasn't far from the mark when he wrote: as human populations continue to grow exponentially, the number of ecological niches for human rna virus evolution grows apace and new human virus outbreaks will likely increase apace. most new human viruses will be unremarkable, that is they will generally resemble old ones. inevitably, some will be quite remarkable, and quite undesirable. when discussing rna virus evolution, to call an outbreak (such as aids) remarkable is merely to state that it is of lower probability than an unremarkable outbreak. new viruses can and do emerge but on a scale that is probably 15-20 logs less than the number of viral mutants generated up to that defining moment (wain-hobson, 1993). they will result from a small number of mutations and a dose of reproductive isolation. the above has attempted to show that the vast majority of genetic changes fixed by rna viruses are essentially neutral or nearly neutral in character. positive selection exploits a small proportion of genetic variants, while functional sequence space is sufficiently dense, allowing viable solutions to be found. although evolution has connotations of change, what has always counted is natural selection or adaptation. it is the only force for the genesis of a novel replicon. once adapted to its niche, there is no need to change.
in such circumstances an rna virus would no longer be adapting, even though it could be changing.

figure 6.12 latitude in microbial genome sizes. rna viruses and retroviruses are confined to one log variation in size (3 to ~32 kb). by contrast, dna viruses span more than 2.5 logs going from the single-stranded porcine circovirus (1.8 kb) to chlorella virus (~330 kb, encoding at least 12 dna endonuclease/methyltransferase genes; zhang et al., 1998) and bacteriophage g (~670 kb). the distinction between phage dna and a plasmid has often proven difficult (waldor and mekalanos, 1996). as can be seen, the genome size of the largest dna viruses overlaps the smallest intracellular bacteria such as mycoplasmas (580 and 816 kb) and is not too far short of autonomous bacteria such as haemophilus influenzae (1.83 mb).

why is the evolution of rna viruses so conservative? why do they mutate rapidly yet remain phenotypically stable? the lack of proofreading proscribes the genesis of large genomes, restricting their genome sizes to a 1 log range (figure 6.12). among the smallest rna and retroviruses are ms2 and hepatitis b virus, both about 3 kb, while the largest are the coronaviruses at 32 kb or more. most of their proteins are structural or regulatory and take up the largest part of the coding capacity of the virus. additional proteins broadening the range of interactions with the host cell, or rendering the replicon more autonomous, are relatively few. large, gene-sized duplications that may contribute to diversification and novel phenotypes are rare, reducing the exploration of new horizons. thus, evolution of rna viruses is probably conservative because they cannot shuffle domains so generating new combinations. that the information capacity of rna viral genomes is limited by a lack of proofreading is neither here nor there, for they are remarkably successful parasites. rna viruses change far more than they adapt.
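the "one log" and "2.5 logs" ranges quoted for genome sizes follow directly from the sizes given in the text and in the caption to figure 6.12. the sketch below is a simple check using only those figures. the helper name `log10_span` is an assumption introduced for illustration.

```python
import math


def log10_span(smallest_kb: float, largest_kb: float) -> float:
    """Width of a genome-size range expressed in log10 units."""
    return math.log10(largest_kb / smallest_kb)


# Sizes in kb as quoted in the text and the figure 6.12 caption.
rna_span = log10_span(3, 32)      # ms2 / hepatitis b virus up to the coronaviruses
dna_span = log10_span(1.8, 670)   # porcine circovirus up to bacteriophage g

print(f"rna viruses and retroviruses: {rna_span:.2f} logs")  # 1.03 logs
print(f"dna viruses: {dna_span:.2f} logs")                   # 2.57 logs
```

on these numbers the dna viruses span a size range roughly 35 times wider (in linear terms) than the rna viruses and retroviruses, consistent with the argument that the absence of proofreading caps rna genome size.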
muller's ratchet decreases fitness of a dna-based microbe
increased immune response elicited by dna vaccination with a synthetic gp120 sequence with optimized codon usage
the phylogeny of the canterbury tales
isolation of new ribozymes from a large pool of random sequences
forced evolution of a regulatory rna helix in the hiv-1 genome
role of the first and third extracellular domains of cxcr-4 in human immunodeficiency virus coreceptor activity
molecular mechanisms of immune responses in insects
nucleotide composition as a driving force in the evolution of retroviruses
unusually high frequency of epstein-barr virus genetic variants in papua new guinea that can escape cytotoxic t-cell recognition: implications for virus evolution
role of host immune response in selection of equine infectious anemia virus variants
fitness of rna virus decreased by muller's ratchet
evolution of sex and the molecular clock in rna viruses
hiv and t-cell expansion in splenic white pulps is accompanied by infiltration of hiv-specific cytotoxic t-lymphocytes
antigenic stimulation by bcg as an in vivo driving force for siv replication and dissemination
genetic bottlenecks and population passages cause profound fitness differences in rna viruses
nucleotide sequences of three nodavirus rna2's: the messengers for their coat protein precursors
primary and secondary structure of black beetle virus rna2, the genomic messenger for bbv coat protein precursor
hla-a11 epitope loss isolates of epstein-barr virus from a highly a11+ population
t cell responses and virus evolution: loss of hla a11-restricted ctl epitopes in epstein-barr virus isolates from highly a11-positive populations by selective mutation of anchor residues
rna virus quasispecies populations can suppress vastly superior mutant progeny
the genome sequence of herpes simplex virus type 2
rna viral mutations and fitness for survival
basic concepts in rna virus evolution
origins and evolutionary relationships of retroviruses
rates of
spontaneous mutations among rna viruses
rapid fitness losses in mammalian rna virus clones due to muller's ratchet
high viral load and cd4 lymphopenia in rhesus and cynomolgus macaques infected by a chimeric primate lentivirus constructed using the env, rev, tat, and vpu genes from hiv-1 lai
the viral quasispecies
sequence space and quasispecies distribution
structurally complex and highly active rna ligases derived from random rna sequences
does the vp1 gene of foot-and-mouth disease virus behave as a molecular clock?
large-scale search for genes on which positive selection may operate
genetic lesions associated with muller's ratchet in an rna virus
multiple molecular pathways for fitness recovery of an rna virus debilitated by operation of muller's ratchet
determining divergence times with a protein clock: update and reevaluation
human infection by genetically diverse siv-sm related hiv-2 in west africa
the heterosexual human immunodeficiency virus type 1 epidemic in thailand is caused by an intersubtype (a/e) recombinant of african origin
a comprehensive panel of near-full-length clones and reference sequences for non-subtype b isolates of human immunodeficiency virus type 1
rates of evolution of the retroviral oncogene of moloney murine sarcoma virus and of its cellular homologues
molecular evolutionary rates of oncogenes
molecular clock of viral evolution, and the neutral theory
codon usage limitation in the expression of hiv-1 envelope glycoprotein
evolution of influenza virus genes
performance evaluation of amino acid substitution matrices
rapid turnover of plasma virions and cd4 lymphocytes in hiv-1 infection
rapid evolution of rna genomes
rna virus populations as quasispecies
immune response in mice that lack the interferon-gamma receptor
genetic organization of a chimpanzee lentivirus related to hiv-1
protein phylogenies provide evidence of a radical discontinuity between arthropod and vertebrate immune systems
pattern of nucleotide substitution at major
histocompatibility complex class i loci reveals overdominant selection identification of the envelope v3 loop as the primary determinant of cell tropism in hiv-1 statistical analysis of nucleotide sequences of the hemagglutinin gene of human influenza a viruses mosaic genome structure of simian immunodeficiency virus from west african green monkeys the role of cytotoxic t-lymphocytes in the evolution of genetically stable viruses evolution of a disrupted tar rna hairpin structure in the hiv-1 virus the receptor binding site of human interleukin-3 defined by mutagenesis and molecular modeling directed evolution of enzyme catalysts analysis of hiv-1 env gene sequences reveals evidence for a low effective number in the viral population tempo and mode of nucleotide substitutions in gag and env gene fragments in human immunodeficiency virus type 1 populations with a known transmission history molecular phylogeny and evolutionary timescale for the family of mammalian herpesviruses escape of human immunodeficiency virus from immune control hepatitis c virus (hcv) circulates as a population of different but closely related genomes: quasispecies nature of hcv genome distribution primate origins: plugging the gaps exploring the functional robustness of an enzyme by in vitro evolution independent fluctuation of human immunodeficiency virus type 1 rev and gp41 quasispecies in vivo natural selection and the concept of a protein space this is biology temporal fluctuations in hiv quasispecies in vivo are not reflected by sequential hiv isolations in vivo persistence of a hiv-l-encoded hla-b27-restricted cytotoxic t-lymphocyte epitope despite specific in vitro reactivity restriction and enhancement of human immunodeficiency virus type 1 replication by modulation of intracellular deoxynucleoside triphosphate pools lifespan of human lymphocyte subsets defined by cd45 isoforms coordinate regulation and sensory transduction in the control of bacterial virulence the response of mice to 
large intravenous injections of ectromelia virus. i. the fate of injected virus experimental infection of monkeys with langat virus sequence constraints and recognition by ctl of an hla-b27-restricted hiv-1 gag epitope size of genetic bottlenecks leading to virus fitness loss is determined by mean initial population fitness how hiv defeats the immune system viral dynamics in hepatitis b virus infection the meaning of near-neutrality at coding and non-coding regions saturation mutagenesis of human interleukin-3 leeway and constraints in the forced evolution of a regulatory rna helix analysis of insertions/deletions in protein structures the tempo and mode of siv quasispecies development in vivo calls for massive viral replication and clearance identification of a chemokine receptor encoded by human cytomegalovirus as a cofactor for hiv-1 entry genetic drift can dominate short-term human immunodeficiency virus type 1 nef quasispecies evolution in vivo viral strategies of immune evasion positive selection of hiv-1 cytotoxic t lymphocyte escape variants during primary infection antigen-specific release of beta-chemokines by anti-hiv-1 cytotoxic t lymphocytes engineering cyclophilin into a proline-specific endopeptidase reproducible nonlinear population dynamics and critical points during replicative competitions of rna virus quasispecies nucleotide sequence analysis of sa-omvv, a visna-related ovine lentivirus: phylogenetic history of lentiviruses systematic mutation of bacteriophage t4 lysozyme trans-dominant inactivation of htlv-i and hiv-1 gene expression by mutation of the htlv-i rex transactivator antigenic and genetic variation in influenza a (hin1) virus isolates recovered from a persistently infected immunodeficient child virus-encoded proteinases of the picornavirus super-group virus-encoded proteinases of the flaviviridae spatial discontinuities in human immunodeficiency virus type 1 quasispecies derived from epidermal langerhans cells of a patient with aids 
we would like to thank past and present members of the laboratory and numerous colleagues for endless discussions over the years. mark mascolini needs a special word of thanks for painstakingly going through the manuscript. 
this laboratory is supported by grants from the institut pasteur and the agence nationale pour la recherche sur le sida.
key: cord-265857-fs6dj3dp authors: liu, yu-tsueng title: infectious disease genomics date: 2010-12-24 journal: genetics and evolution of infectious disease doi: 10.1016/b978-0-12-384890-1.00010-8 sha: doc_id: 265857 cord_uid: fs6dj3dp
the history and development of infectious disease genomics are discussed in this chapter. hgp must not be restricted to the human genome and should include model organisms including mouse, bacteria, yeast, fruit fly, and worm. the completed or ongoing genome projects will provide enormous opportunities for the discovery of novel vaccines and drug targets against human pathogens as well as the improvement of diagnosis and discovery of infectious agents and the development of new strategies for invertebrate vector control. the polysaccharide capsule is important for meningococci to escape from complement-mediated killing. with the completion of the genome sequence of a virulent menb strain, a "reverse vaccinology" approach was applied for the development of a universal menb vaccine by novartis. the indispensable fatty acid synthase (fas) pathway in bacteria has been regarded as a promising target for the development of antimicrobial agents. through a systematic screening of 250,000 natural product extracts, a merck team identified a potent and broad-spectrum antibiotic, platensimycin, which is derived from streptomyces platensis. the vector biology network was formed to achieve three goals: (1) to develop basic tools for the stable transformation of anopheline mosquitoes by the year 2000; (2) to engineer a mosquito incapable of carrying the malaria parasite by 2005; and (3) to run controlled experiments to test how to drive the engineered genotype into wild mosquito populations by 2010. the most immediate impact of a completely sequenced pathogen genome is for infectious disease diagnosis. 
the history and development of infectious disease genomics are tightly associated with the human genome project (hgp) (watson, 1990). a series of important discussions about the hgp took place in 1985 and 1986 (dulbecco, 1986; watson, 1990), which led the national academy of sciences to appoint a special national research council (nrc) committee to address needs and concerns such as the project's impact, leadership, and funding sources. the committee recommended that the united states begin the hgp in 1988 (nrc, 1988). they emphasized the need for technological improvements in the efficiency of gene mapping, sequencing, and data analysis capabilities. in order to understand potential functions of human genes through comparative sequence analyses, they also advised that the hgp must not be restricted to the human genome and should include model organisms including mouse, bacteria, yeast, fruit fly, and worm. in the meantime, the office of technology assessment (ota) of the us congress also issued a similar report to support the hgp (ota, 1988). in 1990, the department of energy (doe) and the national institutes of health (nih) jointly presented an initial 5-year plan for the hgp (dhhs and doe, 1990). in october 1993, the sanger center/institute (hinxton, uk) officially opened to join the hgp. the cost of dna sequencing was about $2-5 per base in 1990, and the initial aim was to reduce the cost to less than $0.50 per base before large-scale sequencing (dhhs and doe, 1990). the sequencing cost gradually declined during the subsequent years. in 2004, the national human genome research institute (nhgri) challenged scientists to achieve a $100,000 human genome (3 gb/haploid genome) by 2009 and a $1000 genome by 2014 to meet the need of genomic medicine. the first complete genome to be sequenced was the phix174 bacteriophage (5.4 kb) by sanger's group in 1977 (sanger et al., 1977). 
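the cost figures quoted above can be put on a common scale with a little arithmetic. the sketch below is a back-of-the-envelope illustration only: it uses the per-base costs and the 3 gb haploid genome size given in the text, and it assumes every base is read exactly once (no coverage redundancy), which is a simplification introduced here and not a claim from the source. the function names are hypothetical.

```python
# Back-of-the-envelope sequencing costs, using only figures quoted in the text.
# Assumption (not from the source): each base is read exactly once.
HAPLOID_GENOME_BASES = 3_000_000_000  # 3 gb per haploid human genome


def genome_cost(cost_per_base: float, bases: int = HAPLOID_GENOME_BASES) -> float:
    """Total cost of sequencing `bases` bases at a flat per-base price."""
    return cost_per_base * bases


def implied_cost_per_base(genome_price: float, bases: int = HAPLOID_GENOME_BASES) -> float:
    """Per-base price implied by a target price for one whole genome."""
    return genome_price / bases


cost_1990_low = genome_cost(2.0)    # lower bound of the 1990 price range
cost_1990_high = genome_cost(5.0)   # upper bound of the 1990 price range
cost_at_target = genome_cost(0.50)  # initial hgp target of $0.50 per base

per_base_100k_genome = implied_cost_per_base(100_000)  # nhgri goal for 2009
per_base_1k_genome = implied_cost_per_base(1_000)      # nhgri goal for 2014

print(f"1990 prices: ${cost_1990_low:,.0f} to ${cost_1990_high:,.0f} per genome")
print(f"at $0.50/base: ${cost_at_target:,.0f} per genome")
print(f"$100,000 genome implies ${per_base_100k_genome:.2e} per base")
print(f"$1000 genome implies ${per_base_1k_genome:.2e} per base")
```

under these assumptions, even the initial $0.50 target corresponds to a ten-figure price for a single genome, which gives a sense of how large a reduction the later $100,000 and $1000 goals represented.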
the complete genome sequence of sv40 polyomavirus (5.2 kb) was published in 1978 (fiers et al., 1978; reddy et al., 1978). the human epstein-barr virus (170 kb) genome was determined in 1984 (baer et al., 1984). the first genome of a free-living organism to be completed was that of haemophilus influenzae (1.8 mb), sequenced through a whole-genome shotgun approach in 1995 (fleischmann et al., 1995). the second sequenced bacterial genome, mycoplasma genitalium (600 kb), was completed in less than a month in the same year using the same approach (smith, 2004). the doe was the first to start a microbial genome program (mgp) as a companion to its hgp in 1994 (doe, 2009). the initial focus was on nonpathogenic microbes. along with the development of the hgp, there was exponential growth of the number of completely sequenced free-living organism genomes. the fungal genome initiative (fgi) (fgi, 2010) was established in 2000 to accelerate the slow pace of fungal genome sequencing since the report of the genome of saccharomyces cerevisiae in 1996 (goffeau et al., 1996). one of the major interests was to sequence organisms that are important in human health and commercial activities. as of september 2009, 1100 completed genome projects had been documented, a 1.7-fold increase over 2 years earlier (liolios et al., 2010). these include 914 bacterial, 68 archaeal, and 118 eukaryotic genomes. in addition, more than 4000 other ongoing sequencing projects were reported. the genomes of the human malaria parasite plasmodium falciparum and its major mosquito vector anopheles gambiae were published in 2002 (gardner et al., 2002; holt et al., 2002). the effort to sequence the malaria genome began in 1996 by taking advantage of a clone derived from a laboratory-adapted strain (hoffman et al., 1997). many parasites have complex life cycles that involve both vertebrate and invertebrate hosts and are difficult to maintain in the laboratory. 
currently, a few other important human pathogenic parasites, such as trypanosomes (el-sayed et al., 2005), leishmania (ivens et al., 2005), and schistosomes (berriman et al., 2009; consortium, 2009), have been either completely or partially sequenced (brindley et al., 2009; aurrecoechea et al., 2010). in the meantime, the genome sequence of aedes aegypti, the primary vector for yellow fever and dengue fever, was published in 2007. the genome size (1376 mb) of this mosquito vector is about 5 times larger than the previously sequenced genome of the malaria vector anopheles gambiae. approximately 50% of the genome consists of transposable elements. in 2010, the genome sequence of the body louse (pediculus humanus humanus), an obligatory parasite of humans and the main vector of epidemic typhus (rickettsia prowazekii), relapsing fever (borrelia recurrentis), and trench fever (bartonella quintana), was reported (kirkness et al., 2010). its 108 mb genome is the smallest among the known insect genomes. genome sequencing projects for other important human disease vectors are in progress (megy et al., 2009). these include culex pipiens (mosquito vector of west nile virus), ixodes scapularis (tick vector of lyme disease, babesia, and anaplasma), and glossina morsitans (tsetse fly vector of african trypanosomiasis). the challenge of sequencing the genome of an insect vector is much greater than that of sequencing a microbe. for example, the genomes of ticks were estimated to be between 1 and 7 gb and may have a significant proportion of repetitive dna sequences, which may be a problem for genome assembly (pagel van zee et al., 2007). furthermore, the evolutionary distances among insect species may also affect homology-based gene predictions. from the perspective of human health, understanding the sequence diversity within a species is as important as de novo sequencing of a reference genome. this is true for both hosts and pathogens (feero et al., 2008; alcais et al., 2009). 
the goal of the 1000 genomes project is to find most genetic variants that have frequencies of at least 1% in the human populations studied (kaiser, 2008). one of the similar efforts for human pathogens is the nih influenza genome sequencing project. when this project began in november 2004, only seven human influenza h3n2 isolates had been completely sequenced and deposited in the genbank database (fauci, 2005; ghedin et al., 2005). as of may 2010, more than 5000 human and avian isolates had been completely sequenced, including the 1918 "spanish" influenza virus (taubenberger et al., 2005). databases for human immunodeficiency virus (hiv) and hepatitis c virus have also been established. while most human studies of microbes have focused on the disease-causing organisms, interest in resident microorganisms has also been growing. in fact, it has been estimated that the human body is colonized by at least 10 times more prokaryotic and eukaryotic microorganisms than the number of human cells (savage, 1977). a "second human genome project" to sequence the human microbiome was therefore proposed (relman and falkow, 2001). highly variable intestinal microbial flora among normal individuals has been well documented (eckburg et al., 2005; costello et al., 2009; turnbaugh et al., 2009). therefore, the human microbiome project (hmp) was initiated by the nih to study samples from multiple body sites from each of at least 250 "normal" volunteers to determine whether there are associations between changes in the microbiome and several different medical conditions, and to provide both standardized data resources and new technological approaches (peterson et al., 2009). 
the completed or ongoing genome projects (table 10.1) will provide enormous opportunities for the discovery of novel vaccines and drug targets against human pathogens as well as the improvement of diagnosis and discovery of infectious agents and the development of new strategies for invertebrate vector control. specific examples will be provided to illustrate how the information provided by various genome projects may help achieve the goal of promoting human health. meningococcal isolates produce 1 of 13 antigenically distinct capsular polysaccharides, but only 5 (a, b, c, w135, and y) are commonly associated with disease (lo et al., 2009). the polysaccharide capsule is important for meningococci to escape from complement-mediated killing. while conventional vaccines consisting of capsular polysaccharides conjugated to carrier proteins have been clinically successful for meningococcus serogroups a, c, y, and w-135, the same approach failed to produce a clinically useful vaccine for serogroup b (menb). the capsule polysaccharide (α2-8 n-acetylneuraminic acid) of menb is identical to human polysialic acid and therefore is poorly immunogenic (finne et al., 1987). alternatively, vaccines consisting of outer membrane vesicles (omv) have been successfully developed to control menb outbreaks in areas where epidemics are dominated by one particular strain (bjune et al., 1991; sierra et al., 1991; boslego et al., 1995; jackson et al., 2009). the most significant limitation of this type of vaccine is that the immune response is strain-specific, mostly directed against the porin protein, pora, which varies substantially in both expression level and sequence across strains (martin et al., 2000; pizza et al., 2000). with the completion of the genome sequence of a virulent menb strain, a "reverse vaccinology" approach was applied for the development of a universal menb vaccine by novartis (pizza et al., 2000; tettelin et al., 2000; giuliani et al., 2006). 
through bioinformatic searching for surface-exposed antigens, which may be the most suitable vaccine candidates due to their potential to be readily recognized by the immune system, 570 open reading frames (orfs) were selected from a total of 2158 orfs of the mc58 genome. eventually, five antigens were chosen as the vaccine components based on a series of criteria including the ability of candidates to be expressed in escherichia coli as recombinant proteins (350 candidates), the confirmation of surface exposure by immunological analyses, the ability to induce protective antibodies in experimental animals (28 candidates), and the conservation of antigens within a panel of diverse meningococcal strains, primarily the disease-associated menb strains (pizza et al., 2000; giuliani et al., 2006; rinaudo et al., 2009). the vaccine formulation consists of an fhbp-gna2091 fusion protein, a gna2132-gna1030 fusion protein, nada, and omv from the new zealand menzb vaccine strain, which contains the immunogenic pora. initial phase ii clinical results in adults and infants showed that this vaccine could induce a protective immune response against three diverse menb strains in 89-96% of subjects following three vaccinations and 93-100% after four vaccinations (rinaudo et al., 2009). in 2010, a phase iii trial for this vaccine (4cmenb) met its primary endpoint. targeting an essential pathway is a necessary but not sufficient requirement for an effective antimicrobial agent (brinster et al., 2009). identification of essential genes in a completely sequenced genome has been actively pursued with various approaches (hutchison et al., 1999; ji et al., 2001). the indispensable fatty acid synthase (fas) pathway in bacteria has been regarded as a promising target for the development of antimicrobial agents (wright and reynolds, 2007). 
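the reverse vaccinology selection described above is, in effect, a sequence of filters applied to a shrinking candidate list (2158 orfs, then 570, 350, 28, and finally five antigens). the following is a minimal sketch of that funnel logic on toy data. the `Candidate` fields, the filter labels, and the example records are hypothetical placeholders chosen for illustration; the source does not describe how each criterion was actually computed or recorded.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """One open reading frame with yes/no outcomes for each selection step.

    All field names are illustrative assumptions, not the authors' schema.
    """
    orf_id: str
    predicted_surface_exposed: bool
    expressed_in_ecoli: bool
    surface_exposure_confirmed: bool
    induces_protective_antibodies: bool
    conserved_across_strains: bool


# Filters in the order the criteria are listed in the text.
FILTERS = [
    ("predicted surface-exposed", lambda c: c.predicted_surface_exposed),
    ("expressed in e. coli", lambda c: c.expressed_in_ecoli),
    ("surface exposure confirmed", lambda c: c.surface_exposure_confirmed),
    ("protective antibodies induced", lambda c: c.induces_protective_antibodies),
    ("conserved across strains", lambda c: c.conserved_across_strains),
]


def run_funnel(candidates):
    """Apply each filter in turn; return survivors and the count after each step."""
    remaining = list(candidates)
    counts = [("all orfs", len(remaining))]
    for label, keep in FILTERS:
        remaining = [c for c in remaining if keep(c)]
        counts.append((label, len(remaining)))
    return remaining, counts


# Toy input: five made-up orfs that drop out at different stages.
toy_orfs = [
    Candidate("orf_a", True, True, True, True, True),
    Candidate("orf_b", True, True, True, True, False),
    Candidate("orf_c", True, True, False, False, False),
    Candidate("orf_d", True, False, False, False, False),
    Candidate("orf_e", False, False, False, False, False),
]

survivors, counts = run_funnel(toy_orfs)
for label, n in counts:
    print(f"{label}: {n}")
```

recording the count after every step mirrors the way the text reports intermediate numbers, and makes it easy to see which criterion removes the most candidates.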
the subcellular organization of the fatty acid biosynthesis components is different between mammals (type i fas) and bacteria (dissociated type ii fas), which raises the likelihood of host specificity of the targeting drugs. comparison of the available genome sequences of various species of prokaryotes reveals highly conserved fas ii systems, suggesting that an antimicrobial agent targeting them could be broad spectrum (zhang et al., 2003). in addition, through computational analyses, new members of the fas ii system have been discovered in different bacterial species (heath and rock, 2000; marrakchi et al., 2002). one of the protein components in this system, fabi, is the target of the anti-tuberculosis drug isoniazid and of a general antibacterial and antifungal agent, triclosan (banerjee et al., 1994; levy et al., 1999; zhang et al., 2006). through a systematic screening of 250,000 natural product extracts, a merck team identified a potent and broad-spectrum antibiotic, platensimycin, which is derived from streptomyces platensis and is a selective fabf/b inhibitor in the fas ii system (wang et al., 2006). treatment with platensimycin eradicated staphylococcus aureus infection in mice. platensimycin showed no cross-resistance with other antibiotic-resistant strains in vitro, including methicillin-resistant s. aureus, vancomycin-intermediate s. aureus, and vancomycin-resistant enterococci. no toxicity was observed using a cultured human cell line. the activity of platensimycin was not affected by the presence of human serum in this study. however, the fas ii system appears to be dispensable for another gram-positive bacterium, streptococcus agalactiae, when exogenous fatty acids are available, such as in human serum (brinster et al., 2009; balemans et al., 2010). the susceptibility to inhibitors targeting the fas ii system indicates heterogeneity in fatty acid synthesis or in acquiring exogenous fatty acids among gram-positive pathogens (balemans et al., 2010). 
comparative genomic approaches may be useful to identify and develop a strategy to target the salvage pathway for streptococcus agalactiae. alternatively, approaches similar to those described earlier for the menb vaccine may also be applied to streptococcus agalactiae (group b streptococcus) (maione et al., 2005). an early mathematical model for malaria control suggested that the most vulnerable element in the malaria cycle was survivorship of adult female mosquitoes (macdonald, 1957; enayati and hemingway, 2010). therefore, insect control is an important part of reducing transmission. the use of ddt as an indoor residual spray in the global malaria eradication program from 1957 to 1969 reduced the population at risk of malaria to ~50% by 1975 compared with 77% in 1900 (hay et al., 2004; enayati and hemingway, 2010). engineering genetically modified mosquitoes refractory to malaria infection appeared to be an alternative approach (curtis, 1968) given the environmental impact of ddt and the emergence of insecticide-resistant insects. the vector biology network (vbn) was formed in 1989 and proposed a 20-year plan with the world health organization (who) in 2001 to achieve three major goals: (1) to develop basic tools for the stable transformation of anopheline mosquitoes by the year 2000; (2) to engineer a mosquito incapable of carrying the malaria parasite by 2005; and (3) to run controlled experiments to test how to drive the engineered genotype into wild mosquito populations by 2010 (alphey et al., 2002; morel et al., 2002; beaty et al., 2009). while some proof-of-concept experiments for the first two aims had been achieved by 2002, when the anopheles gambiae genome was completely sequenced (catteruccia et al., 2000; ito et al., 2002), the progress has been relatively slow (marshall and taylor, 2009). 
genomic loci of anopheles gambiae responsible for plasmodium falciparum resistance have been identified through surveying a mosquito population in a west african malaria transmission zone (riehle et al., 2006). a candidate gene, anopheles plasmodium-responsive leucine-rich repeat 1 (apl1), was discovered. subsequently, other resistance genes have also been identified (blandin et al., 2009; povelones et al., 2009). studying the genetic basis of resistance to malaria parasites and immunity of the mosquito vector will be important for controlling malaria transmission. perhaps the most immediate impact of a completely sequenced pathogen genome is for infectious disease diagnosis. the information may be of great importance to public health when a newly emerged or re-emerged pathogen is discovered. the 2009 swine-origin influenza a virus (s-oiv) (dawood et al., 2009) and the 2003 sars (severe acute respiratory syndrome) coronavirus (rota et al., 2003) are the two most recent examples. s-oiv emerged in the spring of 2009 in mexico and was also discovered in specimens from two unrelated children in the san diego area in april 2009 (cdc, 2009; dawood et al., 2009). those samples were positive for influenza a but negative for both human h1 and h3 subtypes. the complete genome sequence and a real-time pcr-based diagnostic assay were released to the public in late april. the outbreak evolved rapidly and the who declared the highest phase 6 worldwide pandemic alert on june 11, 2009. s-oiv has three genome segments (ha, np, ns) from the classic north american swine (h1n1) lineage, two segments (pb2, pa) from the north american avian lineage, one segment (pb1) from the seasonal h3n2, and most notably, two segments (na, m) from the eurasian swine (h1n1) lineage (dawood et al., 2009). with the available influenza genome database, diagnostic assays to distinguish previous seasonal h1n1, h3n2, and s-oiv can be easily accomplished (lu et al., 2009). 
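the segment origins of s-oiv listed above are easier to check when written out as a small lookup table. the sketch below encodes exactly the eight segment-to-lineage assignments given in the text and groups them by lineage; the variable and function names are illustrative choices made here, not part of any published tool.

```python
from collections import defaultdict

# Lineage of each of the eight s-oiv genome segments, as stated in the text.
SEGMENT_LINEAGE = {
    "PB2": "north american avian",
    "PB1": "seasonal h3n2",
    "PA": "north american avian",
    "HA": "classic north american swine (h1n1)",
    "NP": "classic north american swine (h1n1)",
    "NA": "eurasian swine (h1n1)",
    "M": "eurasian swine (h1n1)",
    "NS": "classic north american swine (h1n1)",
}


def segments_by_lineage(segment_lineage):
    """Invert the segment -> lineage table into lineage -> list of segments."""
    grouped = defaultdict(list)
    for segment, lineage in segment_lineage.items():
        grouped[lineage].append(segment)
    return dict(grouped)


grouped = segments_by_lineage(SEGMENT_LINEAGE)
for lineage, segments in grouped.items():
    print(f"{lineage}: {', '.join(segments)}")
```

a table of this kind is also the natural starting point for a subtyping assay, since each segment can be matched against reference sequences from its expected lineage.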
a comprehensive pathogen genome database is not only useful for infectious disease diagnosis but also for novel pathogen discovery (liu, 2008). homologous sequences within the same family or among different family members are important for new pathogen identification even with the advent of third-generation sequencing technology (munroe and harris, 2010). de novo pathogen discovery may also be complicated by coexisting microorganisms, such as commensal bacteria in the human body. without prior knowledge of these microorganisms, one may be misled. in 2003, a microarray-based assay, designated virochip, was used to help discover the sars coronavirus (wang et al., 2003). the virochip contained the most highly conserved 70mer sequences from every fully sequenced reference viral genome in genbank. the computational search for conservation was performed across all known viral families. a microarray hybridized with a reaction derived from a viral isolate cultivated from a sars patient revealed that the strongest hybridizing array elements belonged to the families astroviridae and coronaviridae. alignment of the oligonucleotide probes having the highest signals showed that all four hybridizing oligonucleotides from the astroviridae and one oligonucleotide from avian infectious bronchitis virus, an avian coronavirus, shared a core consensus motif spanning 33 nucleotides. interestingly, it had been known previously through bioinformatic analyses that this sequence is present in the 3′ utr of all astroviruses, avian infectious bronchitis virus, and an equine rhinovirus (jonassen et al., 1998). therefore, a new member of the coronavirus family was identified through the unique hybridizing pattern and subsequent confirmations. the finding of the seventh human oncogenic virus, merkel cell polyomavirus (mcv) (feng et al., 2008), in 2008 is another example of why conserved sequences are important for novel pathogen discovery. 
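the role of conserved sequences in the virochip example can be illustrated with a toy search for subsequences shared by several genomes. the sketch below is a deliberately simplified stand-in: it looks for exact k-mers present in every input sequence, uses k = 6 on made-up sequences where the array used 70mers, and does not attempt to reproduce the actual probe design procedure, which the text does not describe. all names and sequences are hypothetical.

```python
def kmers(sequence, k):
    """Return the set of all length-k substrings of a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def shared_kmers(genomes, k):
    """Return k-mers that occur (exact match) in every genome of the mapping."""
    kmer_sets = [kmers(seq, k) for seq in genomes.values()]
    if not kmer_sets:
        return set()
    return set.intersection(*kmer_sets)


# Made-up toy sequences; real viral genomes are thousands of bases long.
toy_genomes = {
    "virus_a": "ACGTTGCAAGGCT",
    "virus_b": "TTTGCAAGGAC",
    "virus_c": "GGTGCAAGGTT",
}

conserved = shared_kmers(toy_genomes, k=6)
print(sorted(conserved))
```

in practice, conservation is usually scored with tolerance for mismatches, so an exact-match intersection like this one should be read only as the simplest possible version of the idea.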
mcv is the etiological agent of merkel cell carcinoma (mcc), which is a rare but aggressive skin cancer of neuroendocrine origin. two cdna libraries derived from mcc tumors were subjected to high-throughput sequencing by a next-generation roche/454 sequencer. nearly 400,000 sequence reads were generated. the majority (99.4%) of the sequences were of human origin and were removed from further analyses. only one of the remaining 2395 cdnas was homologous to the t antigen of two known polyomaviruses. one additional cdna was subsequently identified to be part of the mcv sequence when the complete viral sequence was known. later analyses showed that 80% (8/10) of the mcc tumors had mcv integrated in the human genome. monoclonal viral integration was revealed by the patterns of southern blot analysis. only 8-16% of control tissues carried mcv, and at low copy number. while we can expect that the efforts of a variety of genome projects may improve human health, the socioeconomic issues that are not discussed in this chapter may be substantial. in addition, the tremendous amount of information derived from these projects will also be a challenge for scientists as well as nonscientists to follow and understand. 
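the read-filtering step in the merkel cell polyomavirus example above amounts to discarding host-derived reads and inspecting what is left. the sketch below first checks the quoted numbers (99.4% of roughly 400,000 reads removed) and then shows the subtraction on toy data. the host test used here, exact substring matching against a single reference string, is an assumption made purely for illustration; the text does not state how human reads were identified, and all sequences are made up.

```python
def expected_non_host_reads(total_reads, host_fraction):
    """Number of reads expected to survive host subtraction."""
    return round(total_reads * (1 - host_fraction))


def is_host_read(read, host_reference):
    """Toy criterion: a read is host-derived if it occurs verbatim in the reference."""
    return read in host_reference


def subtract_host_reads(reads, host_reference):
    """Keep only the reads that do not match the host reference."""
    return [read for read in reads if not is_host_read(read, host_reference)]


# Arithmetic check against the figures in the text: "nearly 400,000" reads
# with 99.4% removed leaves about 2400, consistent with the 2395 reported.
approx_remaining = expected_non_host_reads(400_000, 0.994)
print(approx_remaining)

# Toy subtraction on made-up sequences.
toy_host_reference = "ACGTACGTTTGACCA"
toy_reads = ["ACGTAC", "TTGACC", "GGGCCC", "CGTTTG"]
non_host = subtract_host_reads(toy_reads, toy_host_reference)
print(non_host)
```

the surviving reads are the ones worth comparing against known viral sequences, which is where the single t antigen homolog in the example was found.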
human genetics of infectious diseases: between proof of principle and paradigm malaria control with genetically manipulated insect vectors eupathdb: a portal to eukaryotic pathogen databases dna sequence and expression of the b95-8 epstein-barr virus genome essentiality of fasii pathway for staphylococcus aureus inha, a gene encoding a target for isoniazid and ethionamide in mycobacterium tuberculosis the influenza virus resource at the national center for biotechnology information from tucson to genomics and transgenics: the vector biology network and the emergence of modern vector biology the genome of the african trypanosome trypanosoma brucei the genome of the blood fluke schistosoma mansoni effect of outer membrane vesicle vaccine against group b meningococcal disease in norway dissecting the genetic basis of resistance to malaria parasites in anopheles gambiae efficacy, safety, and immunogenicity of a meningococcal group b (15:p1.3) outer membrane protein vaccine in iquique, chile. chilean national committee for meningococcal disease helminth genomics: the implications for human health type ii fatty acid synthesis is not a suitable antibiotic target for gram-positive pathogens stable germline transformation of the malaria mosquito anopheles stephensi swine influenza a (h1n1) infection in two children-southern california, march-april the schistosoma japonicum genome reveals features of host-parasite interplay bacterial community variation in human body habitats across space and time possible use of translocations to fix desirable genes in insect pest populations the comprehensive microbial resource understanding our genetic inheritance, the u.s. 
human genome project: the first five years: fiscal years microbial genome program a turning point in cancer research: sequencing the human genome diversity of the human intestinal microbial flora the microbial rosetta stone database: a compilation of global and emerging infectious microorganisms and bioterrorist threat agents the genome sequence of trypanosoma cruzi, etiologic agent of chagas disease malaria management: past, present, and future the genome gets personal-almost clonal integration of a polyomavirus in human merkel cell carcinoma fungal genome initiative complete nucleotide sequence of sv40 dna an igg monoclonal antibody to group b meningococci cross-reacts with developmentally regulated polysialic acid units of glycoproteins in neural and extraneural tissues whole-genome random sequencing and assembly of haemophilus influenzae rd genome sequence of the human malaria parasite plasmodium falciparum large-scale sequencing of human influenza reveals the dynamic nature of viral genome evolution a universal vaccine for serogroup b meningococcus life with 6000 genes the global distribution and population at risk of malaria: past, present, and future funding for malaria genome sequencing the genome sequence of the malaria mosquito anopheles gambiae global transposon mutagenesis and a minimal mycoplasma genome transgenic anopheline mosquitoes impaired in transmission of a malaria parasite the genome of the kinetoplastid parasite, leishmania major phase ii meningococcal b vesicle vaccine trial in new zealand infants identification of critical staphylococcal genes using conditional phenotypes generated by antisense rna a common rna motif in the 3′ end of the genomes of astroviruses, avian infectious bronchitis virus and an equine rhinovirus dna sequencing. 
a plan to capture human diversity in 1000 genomes ensembl genomes: extending ensembl across the taxonomic space genome sequences of the human body louse and its primary endosymbiont provide insights into the permanent parasitic lifestyle a novel coronavirus associated with severe acute respiratory syndrome vectorbase: a data resource for invertebrate vector genomics molecular basis of triclosan activity the genomes online database (gold) in 2009: status of genomic and metagenomic projects and their associated metadata a technological update of molecular diagnostics for infectious diseases mechanisms of avoidance of host immunity by neisseria meningitidis and its effect on vaccine development detection in 2009 of the swine origin influenza a (h1n1) virus by a subtyping microarray the epidemiology and control of malaria identification of a universal group b streptococcus vaccine by multiple genome screen a new mechanism for anaerobic unsaturated fatty acid formation in streptococcus pneumoniae malaria control with transgenic mosquitoes effect of sequence variation in meningococcal pora outer membrane protein on the effectiveness of a hexavalent pora outer membrane vesicle vaccine genomic resources for invertebrate vectors of human pathogens, and the role of vectorbase the mosquito genome-a breakthrough for public health third-generation sequencing fireworks at marco island a catalog of reference genomes from the human microbiome genome sequence of aedes aegypti, a major arbovirus vector mapping and sequencing the human genome mapping our genes-genome projects: how big? how fast? 
tick genomics: the ixodes genome project and beyond the nih human microbiome project identification of vaccine candidates against serogroup b meningococcus by whole-genome sequencing leucine-rich repeat protein complex activates mosquito complement in defense against plasmodium parasites the genome of simian virus 40 the meaning and impact of the human genome sequence for microbiology natural malaria infection in anopheles gambiae is regulated by a single genomic control region vaccinology in the genome era characterization of a novel coronavirus associated with severe acute respiratory syndrome nucleotide sequence of bacteriophage phi x174 dna microbial ecology of the gastrointestinal tract database resources of the national center for biotechnology information gemina, genomic metadata for infectious agents, a geospatial surveillance pathogen database vaccine against group b neisseria meningitidis: protection trial and mass vaccination results in cuba history of microbial genomics characterization of the 1918 influenza virus polymerase genes complete genome sequence of neisseria meningitidis serogroup b strain mc58 a core gut microbiome in obese and lean twins viral discovery and sequence recovery using dna microarrays platensimycin is a selective fabf inhibitor with potent antibiotic properties the human genome project: past, present, and future antibacterial targets in fatty acid biosynthesis the application of computational methods to explore the diversity and structure of bacterial fatty acid synthase inhibiting bacterial fatty acid synthesis key: cord-266960-kyx6xhvj authors: temple, mark d. title: real-time audio and visual display of the coronavirus genome date: 2020-10-02 journal: bmc bioinformatics doi: 10.1186/s12859-020-03760-7 sha: doc_id: 266960 cord_uid: kyx6xhvj background: this paper describes a web based tool that uses a combination of sonification and an animated display to inquire into the sars-cov-2 genome. 
the audio data is generated in real time from a variety of rna motifs that are known to be important in the functioning of rna. additionally, metadata relating to rna translation and transcription has been used to shape the auditory and visual displays. together these tools provide a unique approach to further understand the metabolism of the viral rna genome. this audio provides a further means to represent the function of the rna in addition to traditional written and visual approaches. results: sonification of the sars-cov-2 genomic rna sequence results in a complex auditory stream composed of up to 12 individual audio tracks. each auditory motive is derived from the actual rna sequence or from metadata. this approach has been used to represent transcription or translation of the viral rna genome. the display highlights the real-time interaction of functional rna elements. the sonification of codons derived from all three reading frames of the viral rna sequence in combination with sonified metadata provides the framework for this display. functional rna motifs such as transcription regulatory sequences and stem loop regions have also been sonified. using the tool, audio can be generated in real time from either genomic or sub-genomic representations of the rna. given the large size of the viral genome, a collection of interactive buttons has been provided to navigate to regions of interest, such as cleavage regions in the polyprotein, untranslated regions or each gene. these tools are available through an internet browser and the user can interact with the data display in real time. conclusion: the auditory display in combination with real-time animation of the process of translation and transcription provides a unique insight into the large body of evidence describing the metabolism of the rna genome. furthermore, the tool has been used as an algorithm-based audio generator.
these audio tracks can be listened to by the general community without reference to the visual display to encourage further inquiry into the science. modern computers have had a great impact on biological experimentation and data analyses to reveal otherwise hidden patterns in complex data. this is apparent in the field of genomic data analyses. the viral genome of the first patient suffering from covid-19 was submitted to genbank [1] on 5 january 2020 some weeks after the first patient had been hospitalised in december 2019 [2]. within 4 months over 4.7 million people worldwide had tested positive for the virus, and approximately 315,000 deaths from the disease, referred to as covid-19, had been reported by johns hopkins university [3]. during this time a large body of evidence has arisen regarding rna sequence homology to other sars-like virus strains [4, 5] and these studies may help identify targets for immune recognition. this paper demonstrates that sonification of rna sequence data may also be useful to understand how the genome functions. the audio is generated using two approaches. the rules governing gene expression have been applied to the process of generating a linear audio stream similar to the expression of a linear sequence of amino acids. these methods are based on our previous approach to sonify dna sequences [6]. these methods have been improved upon to include multi-layering of related audio tracks, and the inclusion of audio that is representative of sequence metadata. additionally, a real-time animated display (as shown in fig. 1) of both the biological process and the notes being generated has been implemented. these displays are important since the ability to sequence dna has vastly outpaced tools for their visualisation [7]. the real-time visual animation is an important addition since with sonification data alone it is difficult to relate the auditory display to the underlying sequence information.
the combination of the auditory and visual displays is more informative than either display in isolation. in these displays the auditory and visual output are produced by the same events, since the sequence is processed in a linear fashion, and it is thought that the multisensory integration improves the perception of each [8]. the systematic and reproducible representation of data as sound is increasingly becoming an adjunct to the traditional visualization techniques of data inspection and analysis [9, 10]. in recent years auditory displays have become more popular to represent complex biological phenomena. a systematic review of over 150 sonification projects highlighted the importance of pitch and spatial auditory dimensions in the auditory display [11]. within the domain of molecular biology the properties of amino acid residues [12] and protein folding [13] have been sonified by a combination of musical techniques and sound effects. more recently researchers have generated musical scores representing amino acid sequences of protein structures and note sequences from short amino acid sequences [14]. recently these authors applied their approach to sonifying the amino acid sequence and structure of the spike protein of sars-cov-2. genomic data has also provided a good candidate for sonification. these studies include sonification of the spectral properties of dna, molecular analyses [15, 16] and a preliminary investigation into rna structures [17, 18]. gene expression data has been sonified to discriminate between differentially expressed genes [19, 20] and chip-seq data [21]. in the realm of cancer progression, epigenomic data has been sonified to investigate the importance of methylation [22]. it has also been suggested that audio may be useful to interpret tomography of human adipose and tumor tissue samples [23].
microbial ecology data has been sonified into musical style by mapping rows of numerical data to chords [24, 25], towards the end of communicating complex results to people not specialised in the field. previous studies into dna sonification for sequence analyses [6] demonstrated that mutations in repetitive dna sequences such as telomeric or alphoid dna could be detected by ear alone and that coding regions could be distinguished from noncoding regions. the sars-cov-2 rna genome does not contain extensive repetitive sequences except for the 3′ poly-a tail, hence this sequence provided more of a challenge for display. given that the rna genome is almost 30,000 nucleotides (approximately 30 kb) in length it would be abrasive and fatiguing to the ear to use harsh or dissonant tones for the entire auditory display. hence the decision was made to use more musical tones to generate the audio.
fig. 1 the animated display. panel a shows the sliding window of the animated display in translation mode. key features of the animated display are labelled, such as the translated peptide sequences and the frame in which they occur; the presence of start and stop codons is highlighted in green and red, respectively. the location of the audio play-head is represented to coincide with the peptidyl-transferase centre of the ribosome. the sonified audio is generated as the sars-cov-2 genome sequence passes through the play-head. the direction in which the ribosome moves relative to the rna sequence is indicated. panel b shows the animated display in transcription mode. the newly synthesised minus rna strand is shown below the genome sequence with the 3′ extended nucleotide shown in the play-head. the direction in which the replicase protein complex moves in relation to genome sequence is indicated.
the web tool described in this paper [26] operates in two modes that broadly represent translation or transcription.
the audio display is generated using algorithms based on biological rules to generate sound at the play-head. the play-head substitutes for a ribosome during translation mode or the rna replicase/polymerase during transcription mode. a complex auditory stream was generated by overlaying up to 12 layers of audio (as summarised in table 1). each layer of audio is derived from an rna motif directly or metadata was used to flag the region of sequence to be sonified. additionally, prior to the start of each gene sequence an ascending run of 8 notes is triggered. this scale pattern is independent of the raw sequence data and is based solely on metadata relating to sequence position. this provides an audio cue to anticipate the upcoming gene coding sequence. the most fundamental building block of rna is an individual nucleotide and these were sonified as one of four individual notes whereas di-nucleotides were sonified as one of 16 notes and together these were panned left and right in the auditory display. another characteristic of nucleic acid sequences which is often used as a metric of genome status is the gc content which is often represented as a ratio. typically in coronavirus the count of u is above average and c is below average whereas a is preferred over g [27] leading to a relatively low gc ratio. in our approach two gc ratios were determined within a sliding window of 10 or 100 nucleotides respectively across the entire genome. each time the gc ratio changed by an increment or decrement of 0.1 a note was generated and these were panned against each other in stereo. when there is no change between two adjacent features in an audio stream, the first instance of the feature was allowed to play for a longer period of time rather than generating another instance of the same note. this approach provides a brief pause in the audio layer and provides an opportunity for another layer to be distinguished in the auditory display.
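as a minimal sketch of the two rules just described (a note is triggered only when the windowed gc ratio moves by 0.1, and a repeated note is lengthened rather than replayed), the following python fragment may help. the tool itself is written in javascript and its source is not reproduced here; the function names, the reading of 'changed by 0.1' as a change relative to the level at which the last note was triggered, and the use of beat counts for note length are all assumptions made for illustration.

```python
def gc_ratio(window: str) -> float:
    """Fraction of G or C bases in a window of RNA."""
    return sum(base in "GC" for base in window) / len(window)


def gc_change_events(seq: str, window: int = 10, step: float = 0.1):
    """Return (position, +1 or -1) whenever the windowed GC ratio has moved
    by at least `step` from the level at which the last note was triggered.
    This interpretation of the 0.1 rule is an assumption."""
    events = []
    last_level = None
    for pos in range(len(seq) - window + 1):
        ratio = gc_ratio(seq[pos:pos + window])
        if last_level is None:
            last_level = ratio  # first window only sets the baseline
            continue
        # small tolerance guards against floating-point rounding
        if abs(ratio - last_level) >= step - 1e-9:
            events.append((pos, 1 if ratio > last_level else -1))
            last_level = ratio
    return events


def merge_repeats(notes):
    """Collapse identical adjacent notes into (note, length_in_beats)."""
    merged = []
    for note in notes:
        if merged and merged[-1][0] == note:
            merged[-1] = (note, merged[-1][1] + 1)  # extend, do not retrigger
        else:
            merged.append((note, 1))
    return merged
```

under this reading, a 10-nucleotide window moves by at most 0.1 per one-base shift, so a note fires on every change, whereas a 100-nucleotide window needs roughly ten net base changes before a note fires.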
together these four audio tracks create an audio landscape that can be heard across the entire auditory display of the genome. these rna features are not specific to either transcription or translation nor are they specific to a particular region of the genome. other sonified genome features were layered over this sonified landscape. in the translation mode, codons represent an important feature of rna and these were sonified as 20 notes when representing translation into amino acids. no distinction was made between the start methionine or that which occurs in the body of the peptide sequence. additionally, stop codons were sonified as an additional note since these are highly significant in the function of the genome. overlapping codons in each of the three reading frames were sonified during translation to detect orf's in either frame. an important consideration in the modelling of translation was to use the start and stop codons in each reading frame to trigger or halt the audio derived from other codons. additionally, in the visual display the audible codons were shown using the one letter amino acid representation. using this simple method all gene sequences reported in the genbank metadata were accurately represented in both the audio and visual displays. additionally, all open reading frames throughout the rna genome are shown and sonified. however, only open reading frames that correlate with the known metadata (gene sequences) were labelled in the visual display. this is consistent with prior approaches of mapping either individual bases [28] , codons [29] or amino acids [30, 31] to musical notes in a manner inspired by the genetic code or codon usage during translation. in the display representing transcription, codons per se were not considered. instead tri-nucleotide features were considered for sonification, however, these were considered to be positioned adjacent in the sequence rather than overlapping. 
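the start/stop gating of codon audio in translation mode, described above, can be sketched as follows. this is an illustrative python fragment rather than the tool's javascript code: the function name and the (position, symbol) event format are hypothetical, the input is assumed to be upper-case rna containing only a, c, g and u, and the standard genetic code is assumed.

```python
BASES = "UCAG"
# Standard genetic code, codons ordered UUU, UUC, UUA, UUG, UCU, ... GGG.
AMINO = ("FFLLSSSSYY**CC*W"
         "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR"
         "VVVVAAAADDEEGGGG")
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def audible_codons(rna: str, frame: int):
    """Return (position, symbol) for codons that would sound in one frame.

    Amino acids sound only from a start codon (AUG) up to the next stop;
    every stop codon sounds as '*', including isolated ones.
    """
    events = []
    in_orf = False
    for pos in range(frame, len(rna) - 2, 3):
        aa = CODON_TABLE[rna[pos:pos + 3]]
        if aa == "*":
            events.append((pos, "*"))
            in_orf = False          # a stop codon silences the stream
        elif in_orf or aa == "M":
            in_orf = True           # a start codon triggers the stream
            events.append((pos, aa))
    return events


def all_frames(rna: str):
    """Audible codons for the three overlapping reading frames."""
    return {frame: audible_codons(rna, frame) for frame in range(3)}
```

for example, `all_frames("CCAUGGCUUAACC")` leaves frames 0 and 1 silent and reports the short reading frame m, a, stop in frame 2.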
given that there are 64 different tri-nucleotides it is not possible to use a traditional scale. a traditional piano consists of 7 octaves plus a minor third (88 notes). given that there are 7 scale notes in an octave it would require over 9 octaves to accommodate 64 trinucleotides. using synthesised notes could overcome this limitation but this would entail playing shrill high pitched notes that would be grating to the ear. therefore, linear mapping of 64 codons to individual notes was avoided. in the transcription display, tri-nucleotides were mapped to 16 individual notes since only the first and third position in each was considered. since trinucleotides play no functional role in the process of transcription there was no loss of information content using this approach and the audio could be designed to complement the single nucleotides and di-nucleotides in the audio stream and avoid the mapping to shrill notes. additionally, tri-nucleotides were not mapped to start or stop functionality and these are audible throughout the entire genome. their occurrence had no further effect in the auditory display. metadata specific to the coronavirus sars-cov-2 sequence was used to supplement the audio generated from the intrinsic characteristics of the rna sequence. audio from un-translated sequences between the open reading frames were mapped to an audio stream at a reduced tempo so that they were more clearly distinguished from the coding regions. additionally, the viral genome is known to contain 10 transcriptional regulatory sequences (trs) and five known stem loop (sl) structures known to play a role in the function of the genome [32] and their occurrence was sonified. these conserved motifs were sonified and since they often occurred in the untranslated regions the audio from these two were panned in stereo. the genome codes for a large polyprotein from a large open reading frame. 
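the reduction of 64 tri-nucleotides to 16 notes by using only the first and third positions, described above for transcription mode, can be sketched in python as follows. the base ordering and the note numbering are assumptions for illustration; the source specifies only that the first and third positions determine the note and that tri-nucleotides are read adjacently rather than overlapping.

```python
BASES = "ACGU"  # ordering is an assumption; the source does not give one


def trinucleotide_note_index(tri: str) -> int:
    """Map a tri-nucleotide to one of 16 note indices (0..15),
    ignoring the middle base."""
    return 4 * BASES.index(tri[0]) + BASES.index(tri[2])


def adjacent_trinucleotides(seq: str):
    """Split a sequence into adjacent (non-overlapping) tri-nucleotides,
    dropping any incomplete trailing bases."""
    return [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]


# A seven-note scale would need 64 / 7 octaves (just over 9) for a
# one-note-per-tri-nucleotide mapping, which is the range the text rejects.
OCTAVES_FOR_64_NOTES = 64 / 7
```

with this mapping `ACG` and `AUG` share a note, which is the intended loss of resolution.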
this polyprotein is thought to be cleaved into 16 individual polypeptides (often referred to as nsp proteins) and the occurrence of the known cleavage sites was sonified. in addition to generating a short burst of notes, cleavage regions were also used to pause the progression of the play-head for a second or so by slowing the tempo to one tenth of the coding tempo. this effectively highlights the transition from one nsp sequence to the next. the occurrence of three or more identical nucleotides was also sonified since these are easy to detect by eye and their sound may help the user to keep track of where they are in the display. audio generated from each of these sequence motifs and metadata was combined to create a complex auditory display to represent either transcription or translation. as the audio is played a sliding window of 60 nucleotides is shown on the screen. at any point in time the first nucleotide in the visual play-head can be heard in the auditory display. other sequence features are determined relative to the position of this nucleotide. to play the entire genome takes approximately 96 min in translation mode which corresponds to approximately five nucleotides per second. this is slower than cellular translation which is thought to proceed at approximately 30 nucleotides per second [33]; however, playing it any faster makes it difficult to interpret due to the shortened duration between each note, and a different algorithm would need to be devised. in transcription mode the full display lasts 120 min since the number of nucleotides played per second is a little lower; this approach was taken to clearly distinguish it from translation mode. three sets of interactive buttons (summarised in table 2) have been provided for each sonified feature so that each can be selected directly; for example, a gene sequence or trs can be selected and played directly without having to play through the preceding sequence data.
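the playback rates quoted above can be checked with a few lines of arithmetic. the genome length of 29,903 nucleotides used here is an assumption taken from the sars-cov-2 reference sequence (the text says only 'almost 30,000'), and the calculation ignores the slower tempo of untranslated and cleavage regions, so the results are averages.

```python
GENOME_LENGTH_NT = 29_903  # assumed reference length; not stated in the text


def nucleotides_per_second(n_nucleotides: int, minutes: float) -> float:
    """Average playback rate for a display of the given duration."""
    return n_nucleotides / (minutes * 60)


translation_rate = nucleotides_per_second(GENOME_LENGTH_NT, 96)     # about 5.2
transcription_rate = nucleotides_per_second(GENOME_LENGTH_NT, 120)  # about 4.2

# Cellular translation at roughly 30 nucleotides per second is therefore
# close to six times faster than the translation display.
slowdown_vs_cell = 30 / translation_rate
```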
these buttons change to a red colour as the respective feature is displayed. in this study, auditory streams were paired and played as stereo layers. audio that plays consistently throughout the entire genome was played at low frequency and transient data was highlighted at a higher frequency register to make it more prominent. in addition to simply considering the basic construction of pitch and separation, the data was harmonised to make it more listenable. the root tone and third note of the scale were played across two octaves with the limited 4 note mono-nucleotide sonification to establish a strong harmonic landscape throughout the playback. the drone generated from the gc content (which is sometimes invariant for periods of time) was also used to reinforce the foundation of the basic scale harmony. the g or c bases, as nucleotides, di-nucleotides or tri-nucleotides, were each matched to higher octaves and a and u were mapped to lower octaves. this was done consistently between these audio streams in an attempt to harmonise the otherwise random note selection based on sequence information. an exception to this principle was made for start and stop codons which were mapped to higher pitches than gc rich codons so that their occurrence was easily perceived in the auditory display (since higher pitched notes are perceived to be louder). given that these codons are used to trigger and halt individual audio streams this approach further emphasises the occurrence of an open reading frame. the wider note range of the codons (20 notes) was used to introduce leading tones that often sound more dissonant and add complexity to the harmonic spectrum. this allows them to be easily discerned above the landscape tones of the simpler motifs. lastly, less frequent audio from dispersed regions of the genome, e.g. trs or stem-loop (sl) motifs, was pitched at the highest octave ranges or at more dissonant notes within the diatonic scale to highlight its occurrence.
all of this was done within a mode of the diatonic major scale. translation was played in bb aeolian (bb, c, db, eb, f, gb, ab) whereas transcription was played in c lydian (c, d, e, f#, g, a, b). the parameters for mapping of each rna feature into an audio stream are summarised in table 3. these choices are arbitrary and in later iterations of the tool it may be possible to choose the scale modes and key of choice. the ionian mode of the major scale was avoided since this is generally considered to be happy sounding and inappropriate for the data. each nucleotide generates a note on every beat whereas each di-nucleotide generates a note every second beat. each codon (in an orf) generates a note every third beat. together these notes are syncopated to create a characteristic sound during peptide translation that is distinct from the surrounding untranslated region. audio from the gc tracks is only triggered when the gc ratio changes by an increment of 0.1. if a note sequence has identical adjacent notes then the length of the note is extended rather than being repeated. this creates space and clarity for other notes layered in the auditory display. translation of the genomic rna leads to expression of a large polyprotein following ribosome binding to the 5′ untranslated region. however, from this genomic template the subsequent genes downstream from the polyprotein cannot be directly expressed presumably due to the stop codon at the end of the gene. in the display the sonification also stops at this point, however, play can be resumed to inspect the downstream sequence. additionally, the tempo of the untranslated regions is slower than that of the coding regions so that the tempo increases as the play-head (in place of the ribosome) reads into a gene sequence. this was implemented to help the user distinguish between different sequence types during the display of translation.
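the two modes named above (bb aeolian for translation and c lydian for transcription) can be derived from the interval pattern of the major scale by rotation. the python sketch below only illustrates that music-theory relationship; it is not taken from the tool, and the names in it are hypothetical.

```python
MAJOR_STEPS = [2, 2, 1, 2, 2, 2, 1]  # semitone steps of the major (ionian) scale
MODE_DEGREE = {"ionian": 0, "dorian": 1, "phrygian": 2, "lydian": 3,
               "mixolydian": 4, "aeolian": 5, "locrian": 6}
PITCH_CLASS = {"C": 0, "Db": 1, "D": 2, "Eb": 3, "E": 4, "F": 5, "F#": 6,
               "Gb": 6, "G": 7, "Ab": 8, "A": 9, "Bb": 10, "B": 11}


def mode_pitch_classes(root: str, mode: str):
    """Return the seven pitch classes (0..11) of a mode built on `root`."""
    shift = MODE_DEGREE[mode]
    steps = MAJOR_STEPS[shift:] + MAJOR_STEPS[:shift]  # rotate the pattern
    pitch_classes = [PITCH_CLASS[root]]
    for step in steps[:-1]:  # the last step only returns to the octave
        pitch_classes.append((pitch_classes[-1] + step) % 12)
    return pitch_classes


translation_scale = mode_pitch_classes("Bb", "aeolian")
transcription_scale = mode_pitch_classes("C", "lydian")
```

the resulting pitch classes match the note names listed in the text for each mode.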
one of the more interesting characteristics of the viral genome is the phenomenon of discontinuous transcription whereby a template switch occurs during the synthesis of sub-genomic negative-strand rna's [5]. various mechanisms have been proposed to explain how the transcription regulatory sequences (trs) are involved in the synthesis of positive strand sub-genomic rna from various negative strand intermediates [34]. trs sequences are located in the untranslated regions between the genes and one model suggests that these facilitate transcription skipping to the trs sequence located in the 5′ untranslated region. this process is driven by complementary interactions between trs regions to add a copy of the leader sequence to form sub-genomic rna species. in these sub-genomic rna's the polyprotein sequence has been omitted and ribosome binding at the 5′ end can read through and express the contiguous downstream gene sequence [35]. this functional behaviour of the rna has been built into the auditory and visual display. by default, the process of auditory translation runs from the 5′ end through to the stop codon at the end of the polyprotein, whereas transcription runs the full length of the rna beginning at the 3′ end. a toggle switch, labelled 'translate as sub-genomic rna', has been implemented to change these behaviours. when the toggle switch is selected during the transcription mode, the play-head will skip from any upcoming trs region to the leader trs1 located in the 5′ region (mimicking the behaviour of the rna replicase). subsequently in translation mode with the toggle activated the play-head will, by way of example, skip from the leader trs1 (omitting the polyprotein) through to the trs2 region adjacent to the start of the s protein. whilst the metadata used to drive this behaviour does not change the characteristics of the sound, it does change the selection of regions that are sonified.
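the effect of the toggle on which regions are played can be sketched as a choice of play-head ranges. the python fragment below is a simplified illustration under stated assumptions: the function names and coordinates are hypothetical, positions are 0-based, and whether the trs nucleotides themselves are included at each side of the jump is not specified in the text.

```python
def translation_segments(polyprotein_stop, genome_end, leader_trs_end,
                         body_trs_start, subgenomic=False):
    """Half-open (start, end) ranges read 5'->3' in translation mode.

    Default: from the 5' end to the stop codon that ends the polyprotein.
    Sub-genomic: leader up to TRS1, then a jump to a body TRS and onwards.
    """
    if not subgenomic:
        return [(0, polyprotein_stop)]
    return [(0, leader_trs_end), (body_trs_start, genome_end)]


def transcription_positions(genome_end, leader_trs_end, body_trs_start,
                            discontinuous=False):
    """Template positions visited 3'->5' in transcription mode.

    Default: the whole genome. Discontinuous: on reaching the body TRS the
    play-head skips to the leader TRS1 region near the 5' end.
    """
    positions = range(genome_end - 1, -1, -1)
    if not discontinuous:
        return list(positions)
    return [p for p in positions if p >= body_trs_start or p < leader_trs_end]
```

with toy coordinates (a 20-nucleotide genome, leader trs ending at 5, body trs starting at 15), the discontinuous transcription path visits positions 19 to 15 and then 4 to 0.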
the website does not rely on a server and instead the entire rna sequence is downloaded into the client browser when the page is loaded. all code is written in javascript and runs within the client browser. the react framework was used to create the environment state whereby each iteration of state represents a sliding window to the next base. redux was also used to help manage state. audio is generated in real time within the client browser using tone.js. the reactronica framework [36] was used to further manage audio within the environment state. translation of the viral polyprotein is known to be subject to a frameshift mutation and since this does not follow the normal rules of gene expression a conditional expression was used to change the display for that instance so that the translated protein in frame 2 shifts to frame 1 in both the visual and auditory display. to understand the function of the viral plus rna strand the information needs to be processed in the 5′ to 3′ direction during translation and in the reverse 3′ to 5′ direction for transcription (whereby nucleotides are extended to the newly synthesised minus strand at the 3′ end). in this study an auditory display of the sequence was generated with a sliding window moving in either direction. processing of information within the sliding window was used to generate a synchronised auditory and visual display. this is advantageous since it mimics the behaviour of biological processes within the cell. to further emulate translation the generation of audio was triggered by start codons and silenced by stop codons. furthermore, the visual display shows all possible peptide sequences and these are aligned with the rna sequence being processed. from the sequence data alone the tool was able to detect and display all known open reading frames and metadata was used solely to label these in the display. 
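the two reading directions described above can be sketched as a pair of generators. this python fragment is only an illustration: the tool is implemented in javascript with react state, the names here are hypothetical, and the play-head is assumed, for simplicity, to sit at the first nucleotide of the visible window.

```python
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def translation_windows(rna: str, width: int = 60):
    """Slide 5'->3' along the plus strand, one base per step.

    Yields (play-head position, visible window starting at that position).
    """
    for pos in range(len(rna)):
        yield pos, rna[pos:pos + width]


def transcription_steps(rna: str):
    """Walk the plus strand 3'->5', as in transcription mode.

    Yields (template position, template base, base added to the 3' end of
    the growing minus strand).
    """
    for pos in range(len(rna) - 1, -1, -1):
        yield pos, rna[pos], COMPLEMENT[rna[pos]]


def minus_strand(rna: str) -> str:
    """Minus strand written 5'->3', i.e. in the order it is synthesised."""
    return "".join(added for _, _, added in transcription_steps(rna))
```

for a plus strand `AUGC` the minus strand is built as `GCAU`, its reverse complement.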
other open reading frames were detected throughout the genome in the displays, however, since these are not downstream of an in-frame ribosome binding site no claim is made that these are actually translated. high resolution analysis of gene expression in coronavirus genomes has detected ribosome protected fragments which map to non-canonical orf's; these may be novel protein-coding orfs and short regulatory uorfs. the tool highlights the occurrence of one such uorf of 30 nucleotides (including the stop codon) in the 5′ untranslated region downstream of trs1 [35] that is not documented in the genbank metadata. an image of the raw wave files and their relationship to the sequence information for this region is shown in fig. 2. non-standard uorf's such as this have been detected as translation products in rna sequencing and ribosome profiling experiments which allude to the complexity of gene regulation [37]. for this reason, all open reading frames are included in the display. this uorf is clearly represented in additional file 1: example 1, supplementary file 'sonification untranslated ends' (mp3 file) whereby at 26 s into the auditory display of the 5′ untranslated region a high-pitched start codon introduces a short sequence of layered audio that is punctuated a few seconds later by another high-pitched note as the layered audio ends. this can also be observed in example 1 (mp4 file) as a nine amino acid residue sequence in reading frame 2. the 5′ untranslated region is also characterised by the distinctive sound of the trs1 sequence at 16 to 19 s into the audio display.
fig. 2 multitrack wave files representing a portion of an auditory display. these tracks play in unison to generate the auditory display and each represent approximately 80 nucleotides beginning at nucleotide position 65. this sequence is located in the 5′ untranslated region and includes a trs region and a uorf. each audio stream was generated from a different algorithm; only nucleotides that gave rise to audio are shown (the entire nucleotide sequence is shown in track 2). in track 1, each nucleotide generates a note for every beat unless it is a repeat of the previous in which case the length of the note is extended. in track 2, each di-nucleotide generates a note every second beat. in tracks 3 and 4, audio from the gc track is only triggered when the gc ratio changes by an increment of 0.1. each change in the gc ratio is indicated by a plus (+) or minus (−) symbol on the wave files. in track 5, only codon sequences beginning with a start codon (aug) are shown through to the next stop codon (e.g. uaa). isolated stop codons also give rise to a note. this track is a compilation of audio from three sub-tracks each representing a different reading frame and notes in this track are panned left, centre or right, respectively. track 6 represents the audio generated from metadata that indicates the location of a trs region. additionally, the consensus sequence within this region is coloured purple in the visual display. track 7 represents audio generated by the occurrence of three nucleotides of the same type. other data tracks are not represented since no audio was generated in these during processing of this sequence of the genome. additionally, the amino acid sequence of the orf is shown in the codon track 5.
similarly, three short orf's are apparent in the 3′ untranslated region of example 1 beginning at 1 min 31 s following the high-pitched repetitive pattern of the sl region. these two untranslated regions were manually played one after the other during the same auditory display using the navigation buttons. since they are both characterised by the absence of long open reading frames they provide a good introduction to the basic sound of the auditory display over which the highlighted notes from other rna features will be layered.
the additional file 2: example 2 'sonification utr to surface glycoprotein' (supplementary file) represents the sonification of a sub-genomic rna. for this run the 'translate as sub-genomic rna' checkbox was selected to mimic translation from one of the products of discontinuous transcription, a process upon which viral gene expression is reliant. sonification of the entire genome in either direction results in an auditory display lasting up to 2 h in duration. selecting the 'translate as sub-genomic rna' checkbox results in a shorter auditory display since shorter regions of rna are processed. example 2 again plays from the beginning of the plus strand sequence (as does example 1), however in this display the play-head skips from trs1 to trs2 and immediately into the orf of the surface glycoprotein (skipping a portion of the untranslated region and skipping all of the ~21,000 bp of the polyprotein sequence). the display highlights that the prior discussed uorf is skipped in the 5′ leader of the sub-genomic rna from which the genes downstream of the polyprotein are translated. the audio diverges from example 1 after 23 s or so since the layered sound of the surface glycoprotein (an open reading frame) begins and continues to play for the rest of the display. portions of the two stereo waveforms of the display from example 1 and 2 are shown in fig. 3 . to the left of the cursor both stereo waveforms are essentially the same whereas to the right of the cursor the audio displays have clearly diverged as different sequences were processed beyond trs1. the visual display shows that this pattern continues for approximately 100 amino acid residues. whilst this may only be an artefact of the analyses rather than an undocumented protein sequence it does demonstrate the auditory display is capable of detecting unusual features in the genome. 
it is also worth noting that frame shift mutations do occur in the polyprotein sequence through a process that is not fully understood giving rise to a protein sequence that does not follow canonical gene expression patterns. the tool highlights the position of other relatively long open reading frames within the display so that they can be considered in the analysis of genome function. the nucleocapsid phosphoprotein sequence is followed by the orf 10 sequence which is about one third the length of this parallel orf. this analysis also highlights one of the properties of the auditory display which is that data in the three possible reading frames give rise to a triplet note pattern whereas data in two reading frames give rise to a duplet note pattern (e.g. from 1 min 9 s through to 1 min 17 s). these note patterns make it easier to distinguish the features in the auditory display. in the last 1 min and 30 s of the display the genome alternates between gene sequences, transcription regulatory sequences, orf10, stem loop structures and untranslated regions. these features have been further annotated in the video file with circles and arrows to emphasise their occurrence in the combined visual and auditory display. in the additional file 4: supplementary example 'sonification sub-genomic rna' the auditory display represents the process of transcription. the tempo and scale patterns used for these displays are distinct from those used to represent translation. additionally, no attention was paid to the occurrence of open reading frames or codon usage patterns since these play no role in genome replication or transcription. metadata relating to sl and trs elements were sonified, however, cleavage information relating to the polyprotein modification did not seem relevant. the resulting auditory display is therefore simpler than that arising from translation. this can be heard in example 4 which begins with sonification of the poly-a tail.
in this example the play-head skips from trs10 through to trs1, which models the behaviour of discontinuous transcription. there is a check box on the page to switch between normal genome replication (whereby the entire genome would be sonified) and discontinuous transcription. additional files 5, 6 and 7 (examples 5 to 7 in the supplementary files) include regions already described in the previous auditory displays. however, in these examples various streams of audio that contribute to the auditory display have been toggled on and off. checkboxes on the left-hand side of the note display table on the web page facilitate this. the reason for this is two-fold. first, it provides a method to delineate how each feature of the rna genome contributes to the auditory display. for instance, the sound of a trs element or open reading frame could be highlighted (soloed) or excluded (muted) from the overall sound of the translation display. this provides a better understanding of how the auditory displays are constructed. secondly, for those who are less interested in the science of coronavirus and more interested in algorithmic music generation, these tools can be used to compose and modify the inherent audio stream. the first of these, example 5 'remix utr through to polyprotein', highlights the contribution that gc content makes to the audio stream since these features are soloed at the beginning of the display. at one minute into the display, audio from the translated amino acids is also toggled on or off to highlight its contribution. example 6 'remix orf10 to the poly-a tail' highlights the off-beat syncopation between the dinucleotides playing every second beat and the codons playing every third beat. lastly, example 7 highlights how important it is to continually sonify individual nucleotides across the sequence, since this provides a sonic landscape over which the other features are laid.
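the layering and the solo/mute checkboxes described above can be pictured as a beat schedule. the sketch below is a minimal python illustration, not the tool's implementation: the layer names, the assumption that a nucleotide sounds on every beat, and the assumption that all layers align on beat 0 are hypothetical, while the dinucleotide and codon periods follow the description of example 6.

```python
# hypothetical layer names mapped to how often (in beats) each layer sounds
LAYER_PERIODS = {"nucleotide": 1, "dinucleotide": 2, "codon": 3}


def active_layers(beat, soloed=None, muted=()):
    """return the layers that sound on a given beat.

    soloed: if given, only these layers may sound.
    muted: layers that are always silenced.
    """
    layers = [
        name for name, period in LAYER_PERIODS.items() if beat % period == 0
    ]
    if soloed:
        layers = [name for name in layers if name in soloed]
    return [name for name in layers if name not in muted]


# the combined pattern repeats every 6 beats (least common multiple of 1, 2, 3)
for beat in range(6):
    print(beat, active_layers(beat))
```

in this toy schedule the dinucleotide and codon layers coincide only on every sixth beat, which is one way to picture the off-beat syncopation heard in example 6; soloing or muting a layer simply filters the list returned for each beat.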
to emphasise the contribution of the individual nucleotides, the individual bases were soloed at the beginning and excluded at the end of the display. all previously mentioned example files have been uploaded as supplementary files. in addition to using the tool to navigate and inspect the function of the genome, the tool has been used to generate isolated audio in both translation and transcription modes. a playlist of four tracks has been uploaded to soundcloud. these audio tracks are to be listened to without reference to the visual display. the intention is to engage the non-specialised community with the concept of 'the sound of the coronavirus genome' and hopefully encourage people to delve a little deeper into the ideas behind the concept. without the context of the display and without a clear understanding of the molecular biology of rna viruses, the audio has to engage purely on its own sonic qualities, as an example of algorithmic music. in translation mode two auditory displays were prepared. the first (covid-19 translation polyprotein) plays through to the end of the polyprotein, lasting 1 h and 8 min and covering approximately 21,500 nucleotides. the second audio track, from a sub-genomic rna (covid-19 translation discontinuous), skips the polyprotein entirely to the untranslated region prior to the surface glycoprotein and then plays through to the 3′ poly-a tail. this piece covers approximately 8500 nucleotides and lasts 27 min. in addition, two audio tracks were generated representing transcription/rna synthesis. the track representing genome replication (covid-19 transcription) lasts 2 h. the track representing discontinuous transcription (covid-19 transcription discontinuous) skips between trs10 and trs1 and lasts only 1 min and 47 s. this paper extends prior work whereby dna was sonified using the rules of gene expression to generate an auditory display.
previously, an individual algorithm was used to produce an individual stream of audio from either a nucleotide, a di-nucleotide or codons, and it was concluded that the sonification of codons was the most useful to identify mutations in repetitive dna or to distinguish coding regions from non-coding regions [6]. here we combine up to 12 layers of audio, each relating to an rna feature of interest. these include metadata to layer rna features such as consensus sequences in trs regions, sl regions, cleavage sites in the polyprotein and interspersed untranslated regions between characterised orfs. this approach produces a more detailed and rich auditory display which acts as a viable complement to an animated visual display. metadata was also used to affect the behaviour of the display to mimic what is known to occur during the coronavirus life cycle. the polyprotein is the only product to be translated from the genomic rna since this is thought to be the only orf that has access to the ribosomal binding site in the 5′ untranslated region. to mimic this, the default behaviour of the tool is to stop at the in-frame stop codon at the end of the polyprotein. the tool can be restarted at the adjacent untranslated region or elsewhere using the navigation buttons.
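the default translation behaviour, stopping at the first in-frame stop codon, can be sketched as follows. this is a hedged python illustration rather than the tool's code: the function name and the toy sequence are hypothetical, and the three stop codons are those of the standard genetic code.

```python
# stop codons of the standard genetic code, written as rna
STOP_CODONS = {"UAA", "UAG", "UGA"}


def codons_until_stop(rna, start):
    """read codons in the frame set by `start` and halt at the first
    in-frame stop codon (the stop codon itself is not returned)."""
    codons = []
    # step in threes so that only in-frame triplets are examined
    for i in range(start, len(rna) - 2, 3):
        codon = rna[i:i + 3]
        if codon in STOP_CODONS:
            break
        codons.append(codon)
    return codons


# toy sequence (hypothetical): the out-of-frame "UGA" triplets at
# positions 1 and 4 are ignored; the in-frame "UAA" at position 9 halts reading
toy_rna = "AUGAUGAGCUAAGG"
print(codons_until_stop(toy_rna, 0))  # ['AUG', 'AUG', 'AGC']
```

only triplets in the chosen frame can halt the reader, which mirrors the statement that the play-head stops at the in-frame stop codon at the end of the polyprotein.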
the default behaviour in transcription mode is to read through the entire genome sequence from end to end to mimic genome replication to produce the complementary minus strand. a toggle switch has been implemented to mimic discontinuous transcription and in the first instance the polymerase will jump from trs10 to trs1 in the 5′ leader region. this can be overridden using the navigation buttons, but if another trs region is encountered by the play-head it will also jump to trs1 in the leader region. in translation mode the same toggle causes the ribosome play-head to skip the polyprotein, jumping from trs1 to trs2 and into the surface glycoprotein sequence. from this point the play-head will continue to the 3′ end, reading through the remainder of the genome. all other stop codons will be sonified but they will not influence the progression of the play-head. the auditory display in combination with real-time animation provides a unique insight into the large body of evidence describing the metabolism of the rna genome. this provides another useful tool in the domain of genome browsers to further understand the complex function of the viral genome.
project name: real-time audio and visual display of the covid-19 genome. project home page: https://coronavirus.dnasonification.org/. operating system: platform independent. programming language: javascript (ecmascript 6). other requirements: none. license: gnu gplv3. any restrictions to use by non-academics: no restrictions.
supplementary information accompanies this paper at https://doi.org/10.1186/s12859-020-03760-7. additional file 1. example 1 sonification untranslated ends. additional file 3. example 3 sonification of the nucleocapsid phosphoprotein. additional file 5. example 5 remix utr through to polyprotein. additional file 6. example 6 remix orf10 to the poly-a tail.
abbreviations sl: stem-loop; trs: transcription regulatory sequences; orf: open reading frame.
mdt devised the project and algorithms, wrote the code for the sonification website and put together all aspects of this manuscript. aidan temple set up the react and redux frameworks and helped with technical aspects of javascript programming. kaho chung developed the reactronica framework used to implement tone.js within react. this work was made possible through funding from the school of science, western sydney university. the funding body played no role in the design or conclusion of this study. source code is available on https://github.com/marktemple/coronavirus-sonification. audio tracks generated by the tool for a non-specialised audience are available on a soundcloud playlist. this playlist includes four tracks: covid-19 translation polyprotein, covid-19 translation discontinuous, covid-19 transcription, and covid-19 transcription discontinuous. this playlist is available on https://soundcloud.com/templemark/sets/the-sound-of-the-coronavirus. not applicable. the authors declare that they have no competing interests. received: 18 may 2020 accepted: 16 september 2020
references: a new coronavirus associated with human respiratory disease in china; a sequence homology and bioinformatic approach can predict candidate targets for immune responses to sars-cov-2; the molecular biology of coronaviruses; an auditory display tool for dna sequence analysis; org: a serverless web tool for dna sequence visualization; biases in visual, auditory, and audiovisual perception of space; polyphonic sonification of electrocardiography signals for diagnosis of cardiac pathologies; a novel sonification approach to support the diagnosis of alzheimer's dementia; a systematic review of mapping strategies for the sonification of physical quantities; using non-speech sounds to convey molecular properties in a virtual environment; melody discrimination and protein fold classification; sonification based de novo protein design using artificial intelligence, structure prediction, and analysis using molecular modeling; autoregressive modeling and feature analysis of dna sequences; molecular music: the acoustic conversion of molecular vibrational spectra; supplementary material for "browsing rna structures by interactive sonification"; browsing rna structures by interactive sonification; a short treatise concerning a musical approach for the interpretation of gene expression data; gene expression music algorithm-based characterization of the ewing sarcoma stem cell signature; chromas from chromatin: sonification of the epigenome; musical patterns for comparative epigenomics; sonification of optical coherence tomography data and images; microbial bebop: creating music from complex dynamics in microbial ecology; more of an art than a science: using microbial dna sequences to compose music; real-time audio and visual display of the covid-19 genome; on the biased nucleotide composition of the human coronavirus rna genome; basically musical; the sound of the dna language; snare dance: a musical interpretation of atg9 transport to the tubulovesicular cluster; conversion of amino-acid sequence in proteins to classical music: search for auditory patterns; the structure and functions of coronavirus genomic 3' and 5' ends; translatomics: the global view of translation; continuous and discontinuous rna synthesis in coronaviruses; viral and cellular mrna translation in coronavirus-infected cells; high-resolution analysis of coronavirus gene expression by rna sequencing and ribosome profiling
key: cord-014461-2ubh9u8r authors: nelson, oranmiyan w.; garrity, george m.
title: genome sequences published outside of standards in genomic sciences, july october 2012 date: 2012-10-10 journal: stand genomic sci doi: 10.4056/sigs.3416907 sha: doc_id: 14461 cord_uid: 2ubh9u8r the purpose of this table is to provide the community with a citable record of publications of ongoing genome sequencing projects that have led to a publication in the scientific literature. while our goal is to make the list complete, there is no guarantee that we have not omitted one or more publications appearing in this time frame. readers and authors who wish to have publications added to subsequent versions of this list are invited to provide the bibliographic data for such references to the sigs editorial office. aeromonas aquariorum, sequence accession bafl01000001 through bafl01000036, and ap012343 [78] aerococcus viridans ll1, sequence accession ajtg00000000 [79] bacillus anthracis h9401, sequence accession cp002091.1 (chromosome), cp002092.1.1 (plasmid pxo1), and cp002093.1 (plasmid pxo2) [80] bacillus atrophaeus c89, sequence accession ajrj00000000 [81] bacillus cereus nc7401, sequence accession ap007209 (chromosome), ap007210 (plasmid pnccld), ap007211 (plasmid pnc1, 48 kb), ap007212 (plasmid pnc2, 5 kb), ap007213 (plasmid pnc3, 4 kb), and ap007214 (plasmid pnc4, 3 kb) [82] bacillus siamensis kctc 13613 t , sequence accession ajvf00000000 [83] bacillus sp. strain 5b6, sequence accession ajst00000000 [84] bacillus sp. strain 916, sequence accession afsu00000000 [85] citreicella aestuarii strain 357, sequence accession ajkj00000000 [86] clostridium beijerinckii strain g117, sequence accession akwa00000000 [87] corynebacterium pseudotuberculosis strain 1/06-a, sequence accession cp003082 [88] enterobacter sp.
isolate ag1, sequence accession akxm00000000 [89] enterococcus faecalis d32, sequence accession cp003726 through cp003728 [90] enterococcus faecalis strain np-10011, sequence accession ab712291 [91] enterococcus hirae (streptococcus faecalis) atcc 9790, sequence accession cp003504 (chromosome), nc_015845 (plasmid ptg9790) [92] "geobacillus thermoglucosidans" tno-09.020, sequence accession ajjn00000000 [93] lactococcus garvieae ipla 31405, sequence accession akfo00000000 [94] lactobacillus mucosae lm1, sequence accession ahit00000000 [95] lactobacillus rossiae dsm 15814, sequence accession akzk00000000 [96] paenibacillus polymyxa osy-df, sequence accession aipp00000000 [97] pediococcus pentosaceus strain ie-3, sequence accession cahu01000001 through cahu01000091 [98] pelosinus fermentans a11, sequence accession akvm00000000 [99] pelosinus fermentans b4, sequence accession akvj00000000 [99] pelosinus fermentans jbw45, sequence accession akvo00000000 [100] pelosinus fermentans r7, sequence accession akvn00000000 [101] planococcus antarcticus dsm 14505, sequence accession ajyb00000000 [102] pseudomonas stutzeri strain jm300, sequence accession cp003725 [103] rhodococcus sp. strain dk17, sequence accession ajlq00000000 [104] staphylococcus aureus strain lct-sa112, sequence accession ajlp00000000 [105] staphylococcus capitis qn1, sequence accession ajtg00000000 [106] staphylococcus equorum subsp. equorum mu2, sequence accession cajl01000001 to cajl01000030 [107] staphylococcus hominis zbw5, sequence accession akgc00000000 [108] staphylococcus saprophyticus subsp. 
saprophyticus m1-1, sequence accession ahkb00000000 [109] streptococcus mutans gs-5, sequence accession cp003686 [110] streptococcus pyogenes m1 476, sequence accession ap012491 [111] streptococcus salivarius ps4, sequence accession ajfw00000000 [112] streptococcus thermophilus strain mn-zlw-002, sequence accession cp003499 [113] ureibacillus thermosphaericus strain thermo-bf, sequence accession ajik00000000 [114] phylum tenericutes mycoplasma leachii strain pg50 t , sequence accession cp002108.1 [115] mycoplasma mycoides subsp. mycoides, sequence accession cp002107.1 [115] mycoplasma wenyonii strain massachusetts, sequence accession cp003703 [116] phylum actinobacteria actinomyces massiliensis strain 4401292 t , sequence accession akio00000000 [117] bifidobacterium animalis subsp. lactis b420, sequence accesion cp003497 [118] bifidobacterium animalis subsp. lactis bi-07, sequence accesion cp003498 [118] bifidobacterium bifidum strain bgn4, sequence accession cp001361 [119] brevibacterium massiliense strain 541308 t , sequence accession cajd00000000 [120] corynebacterium bovis dsm 20582, sequence accession aenj00000000 [121] corynebacterium diphtheriae biovar intermedius nctc 5011, sequence accession ajvh00000000 [122] corynebacterium pseudotuberculosis strain 1/06-a, sequence accession cp003082 [123] corynebacterium pseudotuberculosis strain 3/99-5 sequence accession cp003152.1 [124] corynebacterium pseudotuberculosis strain 42/02-a, sequence accession cp003062 [124] microbacterium yannicii, sequence accession cajf01000001 through cajf01000067 [125] micromonospora lupini lupac 08, sequence accession caie01000001 [126] mycobacterium bolletii strain m24, sequence accession ajly00000000 [127] mycobacterium intracellulare clinical strain mott-36y, sequence accession cp003491 [128] mycobacterium massiliense m18, sequence accession ajsc00000000 [129] mycobacterium massiliense strain go 06, sequence accession cp003699 [130] mycobacterium massiliense strain m154, sequence 
accession ajma00000000 [131] mycobacterium tuberculosis rgtb327, sequence accession cp003233 [132] mycobacterium tuberculosis mtb423, sequence accession cp003234 [132] parascardovia denticolens ipla 20019, sequence accession akii00000000 [133] saccharothrix espanaensis dsm 44229 t , sequence accession he804045 [134] streptomyces auratus strain agr0001, sequence accession ajgv00000000 [135] "streptomyces cattleya" dsm46488 t , sequence accession fq859185 and fq859184 [136] streptomyces globisporus c-1027, sequence accession ajuo00000000 [137] streptococcus mutans gs-5, sequence accession cp003686 [138] streptomyces sp. strain aa1529, sequence accession alap00000000 [139] streptomyces sulphureus l180, sequence accession ajtq0000000 [140] phylum spirochaetes borrelia crocidurae, sequence accession cp003426 (chromosome), cp003427 to cp003465 (plasmids) [141] treponema sp. strain jc4, sequence accession jq783348 [142] phylum bacteroidetes flavobacterium sp. strain f52, sequence accession akzq00000000 [143] fusobacterium nucleatum subsp. 
fusiforme atcc 51190 t , sequence accession akxi00000000 [144] "imtechella halotolerans" k1 t , sequence accession ajju00000000 [145] standards in genomic sciences phage clp1, sequence accession jn051154 [158] pseudomonas aeruginosa siphophage mp1412, sequence accession jx131330 [159] pseudomonas aeruginosa temperate phage mp29, sequence accession eu272036 [160] pseudomonas aeruginosa temperate phage mp42, sequence accession jq762257 [160] pseudomonas phage φ-s1, sequence accession jx173487 [161] siphophage mp1412, sequence accession jx131330 [162] staphylococcus aureus bacteriophage gh15, sequence accession jq686190 [163] vibrio vulnificus bacteriophage ssp002, sequence accession jq692107 [164] african bovine rotaviruses rva/cowwt/zaf/1603/2007/g6p, sequence accession s9(vp7) jn831209, s4(vp4) jn831210, s6(vp6) jn831211, s1(vp1) jn831212, s2(vp2) jn831213, s3(vp3) jn831214, s5(nsp1) jn831204, s8(nsp2) jn831205, s7(nsp3) jn831206, s10(nsp4) jn831207, s11(nsp5) jn831208 [165] african bovine rotaviruses rva/cowwt/zaf/1604/2007/g8p, sequence accession s9(vp7) jn831220, s4(vp4) jn831221, s6(vp6) jn831222, s1(vp1) jn831223, s2(vp2) jn831224, s3(vp3) jn831225, s5(nsp1) jn831215, s8(nsp2) jn831216, s7(nsp3) jn831217, s10(nsp4) jn831218, s11(nsp5) jn831219 [165] african bovine rotaviruses rva/cowwt/zaf/1605/2007/g6p, sequence accession s9(vp7) jn831231, s4(vp4) jn831232, s6(vp6) jn831233, s1(vp1) jn831234, s2(vp2) jn831235, s3(vp3) jn831236, s5(nsp1) jn831226, s8(nsp2) jn831227, s7(nsp3) jn831228, s10(nsp4) jn831229, s11(nsp5) jn831230 [165] avian leukosis virus, sequence accession jx254901 [166] avian influenza virus h3n2, sequence accession jx175250 through jx175257 [167] avian influenza virus h5n2, sequence accession jq990145 through jq990152 [168] avian-like h4n8 swine influenza, sequence accession jx151007 through jx151014 [169] avian paramyxovirus, sequence accession jq886184 [170] avian tembusu-related virus strain wr, sequence accession jx196334 [171] bluetongue 
virus serotype 9, sequence accession jx003687 to jx003696 [172] bluetongue virus serotype 16, sequence accession [173] bombyx mori nucleopolyhedrovirus, sequence accession jq991009 [174] bovine viral diarrhea virus 2, sequence accession jf714967 [175] bovine foamy viruses, sequence accession jx307861 [176] canine noroviruses, sequence accession fj692500 and fj692501 [177] chicken anemia virus, sequence accession jx260426 [178] chikungunya virus, sequence accession jx088705 [179] chinese virulent avian coronavirus gx-yl5, sequence accession hq848267 [180] chinese virulent avian coronavirus gx-yl9, sequence accession hq850618 [180] coxsackievirus b4, sequence accession jx308222 [181] enterovirus c (hev-c117), sequence accession jx262382 [182] genotype 4 hepatitis e virus strain, sequence accession jq993308 [183] h10n8 avian influenza virus, sequence accession jq924786 to jq924793 [184] h9n2 subtype influenza virus fjg9, sequence accession jf715008.1, jn869514.1 through jn869520.1 [185] . herpes simplex virus 1 strain mckrae, sequence accession jx142173 [186] human coronavirus nl63, sequence accession jx104161 [187] human g10p rotavirus, sequence accession ab714258 through ab714268 [188] ikoma lyssavirus, sequence accession jx193798 [189] korean sacbrood viruses amsbv-kor19, sequence accession jq390592 [190] korean sacbrood viruses amsbv-kor21, sequence accession jq390591 [190] mitochondrion of frankliniella occidentalis, sequence accession jn835456 [191] new circular dna virus from grapevine, sequence accession jq901105 [192] novel porcine epidemic diarrhea virus, sequence accession jx112709 [193] pararetrovirus, sequence accession jq926983 [194] parechovirus, sequence accession jx050181 [195] peste des petits ruminants virus, sequence accession jx217850 [196] polyomavirus, sequence accession jq412134 [197] porcine circovirus 2b strain cc1, sequence accession jq955679 [198] porcine circovirus type 2 (pcv2), sequence accession jx294717 [199] porcine epidemic diarrhea 
virus strain aj1102, sequence accession jx188454 [200] porcine [204] waterfowl aviadenovirus goose adenovirus 4, sequence accession jf510462 [205] plant genomes plants cpdna of smilax china, sequence accession nc_015104 [206] elodea canadensis, sequence accession jq310743 [207] ogura-type mitochondrial genome, sequence accession ab694743 [208] fungus aspergillus oryzae strain 3.042, sequence accession akhy00000000 [209] rhodosporidium toruloides mtcc 457, sequence accession ajmj00000000 [210] animal genomes helicoverpa armigera, sequence accession hq613271 [211] plasmids plasmidincn plasmid prsb201, sequence accession jn102341 [212] plasmidincn plasmid prsb203, sequence accession jn102342 [212] plasmidincn plasmid prsb205, sequence accession jn102343 [212] plasmidincn plasmid prsb206, sequence accession jn102344 [212] complete genome sequence of the hyperthermophilic cellulolytic crenarchaeon "thermogladius cellulolyticus" 1633 complete genome sequence of methanomassiliicoccus luminyensis, the largest genome of a human-associated archaea species isolated from a deep-sea hydrothermal sulfide chimney on the juan de fuca ridge complete genome sequence of leptospirillum ferrooxidans strain c2-3, isolated from a fresh volcanic ash deposit on the island of miyake sequence analysis of a complete 1.66 mb prochlorococcus marinus med4 genome cloned in yeast genome sequence of acinetobacter sp. strain ha, isolated from the gut of the polyphagous insect pest helicoverpa armigera draft genome sequence of the hydrocarbon-degrading and emulsan-producing strain acinetobacter venetianus rag-1 t genome sequence and mutational analysis of plant-growth-promoting bacterium agrobacterium tumefaciens ccnwgs0286 isolated from a zinc-lead mine tailing draft genome sequence of alcaligenes faecalis subsp. 
faecalis ncib 8687 (ccug 2071) genome sequence of pectin-degrading alishewanella agri, isolated from landfill soil genome sequence of pectin-degrading alishewanella aestuarii strain b11t, isolated from tidal flat sediment genome sequence of thermotolerant bacillus methanolicus: features and regulation related to methylotrophy and production of l-lysine and l-glutamate from methanol draft genome sequence of the sulfur-oxidizing bacterium "candidatus sulfurovum sediminum" ar, which belongs to the epsilonproteobacteria genome sequence of bartonella birtlesii, a bacterium isolated from small rodents of the genus apodemus complete genome sequence of brucella abortus a13334, a new strain isolated from the fetal gastric fluid of dairy cattle complete genome sequence of brucella canis strain hsk a52141, isolated from the blood of an infected dog genome sequence of brucella melitensis s66, an isolate of sequence type 8, prevalent in china genome sequences of brucella melitensis 16m and its two derivatives 16m1w and 16m13w, which evolved in vivo complete genome sequence of the endophytic bacterium burkholderia sp. 
strain kj006 draft genome sequence of the soil bacterium burkholderia terrae strain bs001, which interacts with fungal surface structures revised genome sequence of burkholderia thailandensis msmb43 with improved annotation complete genome sequence of the opportunistic food-borne pathogen cronobacter sakazakii es15 genome sequence of the rice pathogen dickeya zeae strain zju1202 genome sequence of the plant growth-promoting bacterium enterobacter cloacae gs1 genome sequence of enterococcus faecium clinical isolate lct-ef128 genome sequence of enterobacter radicincitans dsm16656t, a plant growth-promoting endophyte genome sequence of escherichia coli j53, a reference strain for genetic studies draft genome sequence of escherichia coli lct-ec106 complete genome sequence of klebsiella oxytoca e718, a new delhi metallo-β-lactamase-1-producing nosocomial strain complete genome sequences of six strains of the genus methylobacterium genome sequence of methylobacterium sp. strain gxf4, a xylem-associated bacterium isolated from vitis vinifera l. grapevine complete genome sequences of methylophaga sp. strain jam1 and methylophaga sp. strain jam7 genome sequence of radiation-resistant modestobacter marinus strain bc501, a representative actinobacterium that thrives on calcareous stone surfaces genome sequence of mycobacterium massiliense m18, isolated from a lymph node biopsy specimen genome sequence of a neisseria meningitidis capsule null locus strain from the clonal complex of sequence type 198 genome sequence of novosphingobium sp strain rr 2-17, a nopaline crown gall-associated bacterium isolated from vitis vinifera l. 
grapevine complete genome sequence of providencia stuartii clinical isolate mrsn 2154 whole-genome shotgun sequence of the sulfur-oxidizing chemoautotroph pseudaminobacter salicylatoxidans kct001 genome sequence of pseudomonas aeruginosa strain sjtd-1, a bacterium capable of degrading long-chain alkanes and crude oil genome sequence of the lactate-utilizing pseudomonas aeruginosa strain xmg genome sequence of the rice pathogen pseudomonas fuscovaginae cb98818 draft genome sequence of arctic marine bacterium pseudoalteromonas issachenkonii pamc 22718 draft genome sequence of high-siderophore-yielding pseudomonas sp. strain hys genome sequence of the polychlorinated-biphenyl degrader pseudomonas pseudoalcaligenes kf707 complete genome sequence of the naphthalene-degrading pseudomonas putida strain nd6 genome sequence of pseudomonas putida strain sjte-1, a bacterium capable of degrading estrogens and persistent organic pollutants draft genome sequence of pseudomonas sp. strain m47t1, carried by bursaphelenchus xylophilus isolated from pinus pinaster genome sequence of the moderately halotolerant, arsenite-oxidizing bacterium pseudomonas stutzeri ts44 genome sequence of ralstonia sp. strain pba, a bacterium involved in the biodegradation of 4-aminobenzenesulfonate genome sequences for six rhodanobacter strains, isolated from soils and the terrestrial subsurface, with variable denitrification capabilities genome sequence of rickettsia conorii subsp. caspia, the agent of astrakhan fever genome sequence of rickettsia australis, the agent of queensland tick typhus draft genome sequence of rickettsia sp. strain meam1, isolated from the whitefly bemisia tabaci genome sequence of rickettsia conorii subsp. 
israelensis, the agent of israeli spotted fever draft genome sequence of the antagonistic rhizosphere bacterium serratia plymuthica strain pri-2c draft genome sequence of serratia marcescens strain lct-sm213 complete genome sequence of the broad-host-range strain sinorhizobium fredii usda257 genome sequence of sphingobium indicum b90a, a hexachlorocyclohexane-degrading bacterium genome sequence of stenotrophomonas maltophilia pml168, which displays baeyer-villiger monooxygenase activity draft genome sequences of eight salmonella enterica serotype newport strains from diverse hosts and locations whole-genome sequences and comparative genomics of salmonella enterica serovar typhi isolates from patients with fatal and nonfatal typhoid fever in papua new guinea draft genome sequence of serratia sp. strain m24t3, isolated from pinewood disease nematode bursaphelenchus xylophilus draft genome sequence of a psychrotolerant sulfur-oxidizing bacterium, sulfuricella denitrificans skb26, and proteomic insights into cold adaptation genome sequence of xanthomonas campestris jx, an industrially productive strain for xanthan gum draft genome sequence of yersinia pestis strain 2501, an isolate from the great gerbil plague focus in xinjiang genome sequence of a novel human pathogen, aeromonas aquariorum genome sequence of aerococcus viridans ll1 complete genome sequence of bacillus anthracis h9401, an isolate from a korean patient with anthrax draft genome sequence of the sponge-associated strain bacillus atrophaeus c89, a potential producer of marine drugs complete genome sequence of bacillus cereus nc7401, which produces high levels of the emetic toxin cereulide draft genome sequence of the plant growth-promoting bacterium bacillus siamensis kctc 13613t isolated from a cherry tree genome sequence of the plant growth-promoting rhizobacterium bacillus sp. 
key: cord-027316-echxuw74 authors: modarresi, kourosh title: detecting the most insightful parts of documents using a regularized attention-based model date: 2020-05-22 journal: computational science iccs 2020 doi: 10.1007/978-3-030-50420-5_20 sha: doc_id: 27316 cord_uid: echxuw74 every individual text or document is generated for specific purpose(s). sometimes, the text is deployed to convey a specific message about an event or a product. on other occasions, it may communicate a scientific breakthrough, a development, a new model and so on. given any specific objective, the creators and the users of documents may like to know which part(s) of the documents are more influential in conveying their specific messages or achieving their objectives. understanding which parts of a document have more impact on the viewer's perception would allow content creators to design more effective content. detecting the more impactful parts of the content would help content users, such as advertisers, to concentrate their efforts on those parts and thus avoid spending resources on the rest of the document. this work uses a regularized attention-based method to detect the most influential part(s) of any given document or text. the model uses an encoder-decoder architecture with an attention-based decoder and regularization applied to the corresponding weights. the main purpose of nlp (natural language processing) and nlu (natural language understanding) is to understand language.
more specifically, they focus not just on the context of a text but also on how humans use language in daily life. thus, among other uses, this understanding could help provide an optimal online experience that addresses the needs of users' digital experience. language processing and understanding is much more complex than many other applications in machine learning, such as image classification, because nlp and nlu involve deeper context analysis. this paper is written as a short paper and focuses on explaining only the parts that are its contribution to the state of the art. thus, this paper does not describe the state-of-the-art works in detail and uses those works [2, 4, 5, 8, 53, 60, 66, 70, 74, 84] to build its model as a modification and extension of the state of the art. therefore, a comprehensive set of reference works has been added for anyone interested in learning more details of the previous state-of-the-art research [3, 5, 10, 17, 33, 48, 49, 61-63, 67-73, 76, 77, 90, 91, 93]. deep learning has become a main model in natural language processing applications [6, 7, 11, 22, 38, 55, 64, 71, 75, 78-81, 85, 88, 94]. among deep learning models, rnn-based models like lstm and gru have often been deployed for text analysis [9, 13, 16, 23, 32, 39-42, 50, 51, 58, 59]. although modified versions of rnn (recurrent neural networks) like lstm and gru improve on the plain rnn in dealing with vanishing gradients and long-term memory loss, they still suffer from many deficiencies. as a specific example, an rnn-based encoder-decoder architecture uses the encoded vector (feature vector), computed at the end of the encoder, as the input to the decoder and uses this vector as a compressed representation of all the data and information from the encoder (input).
this ignores the possibility of looking at all previous sequences of the encoder and thus suffers from an information bottleneck leading to low precision, especially for texts of medium or long sequences. to address this problem, a global attention-based model [2, 5] is used, in which each decoder sequence uses all of the encoder sequences. figure 1 shows an attention-based model, where i = 1:n indexes the encoder sequences and t = 1:m indexes the decoder sequences. each decoder state looks into the data from all the encoder sequences, with specific attention measured by the weights. each weight, $w_{ti}$, indicates the attention that decoder network t pays to encoder network i. these weights depend on the previous decoder and output states and the present encoder state, as shown in fig. 2. given the complexity of these dependencies, a neural network model is used to compute these weights: two fully connected layers (1024) with a relu activation function. its inputs are h, the state of the encoder networks, $s_{t-1}$, the previous state of the decoder, and $v_{t-1}$, the previous decoder output. also, $w_t$ is the weight vector of decoder state t. since the $w_t$ are the output of a softmax function, $\sum_{i=1}^{n} w_{ti} = 1$ with $w_{ti} \ge 0$. this section gives an overview of the contribution of this paper and explains the extension made over the state-of-the-art model. a major point of attention in many text-related analyses is to determine which part(s) of the input text have had more impact in determining the output. the input text could be very long, potentially combining hundreds or thousands of words or sequences, i.e., n could be a very large number. thus, there are many weights ($w_{ti}$) involved in determining any part of the output $v_t$, and since many of these weights are correlated, it is difficult to determine the significance of any input sequence in computing any output sequence $v_t$.
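the attention weighting described in this section can be illustrated with a minimal python sketch. it assumes the alignment scores for one decoder step are already available (in the paper they come from the two-layer fully connected network, which is not reproduced here); the names `softmax` and `context_vector` and the toy numbers are hypothetical.

```python
import math


def softmax(scores):
    """Turn raw alignment scores into attention weights that sum to 1."""
    shift = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def context_vector(weights, encoder_states):
    """Weighted sum of encoder states: sum_i w_ti * h_i for one decoder step t."""
    dim = len(encoder_states[0])
    return [
        sum(w * h[d] for w, h in zip(weights, encoder_states))
        for d in range(dim)
    ]


# Toy example: three encoder states of dimension 2 and hypothetical scores.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
scores_for_step_t = [2.0, 0.5, 0.1]  # assumed output of the scoring network
w_t = softmax(scores_for_step_t)
c_t = context_vector(w_t, encoder_states)
```

with many encoder positions, most entries of `w_t` are small but non-zero, which is the situation the sparsity penalty in this work is meant to address.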
to make these dependencies clearer and to recognize the most significant input sequences for any output sequence, we apply a zero-norm penalty so that the corresponding weight vector becomes sparse. the zero-norm ($\ell_0$) makes any corresponding $w_t$ vector very sparse, as the penalty minimizes the number of non-zero entries in $w_t$. the process is implemented by imposing a constraint on $\|w_t\|_0$. since $\ell_0$ is computationally intractable, we could use surrogate norms such as the $\ell_1$ norm or the euclidean norm, $\ell_2$. to impose sparsity, the $\ell_1$ norm (lasso) [8, 14, 15, 18, 21] is used in this work as the penalty function on the weight vectors. this penalty, $\beta \|w_t\|_1$, is the first extension to the attention model [2, 5]. here, β is the regularization parameter, a hyperparameter whose value is set before learning. higher values of the regularization parameter lead to higher sparsity with higher added regularization bias, and lower values lead to lower sparsity and less regularization bias. the main goal of this work is to find out which parts of the encoder sequences are most critical in determining and computing any output. the output could be a word, a sentence or any other subsequence. the goal is especially critical in applications such as machine translation, image captioning, sentiment analysis, topic modeling and predictive modeling such as time series analysis and prediction. to add another layer of regularization, this work adds an embedding error penalty to the objective function (usually, cross entropy). this added penalty also helps to address the "coverage problem" (the often observed phenomenon of the network dropping or frequently repeating words or any other subsequence).
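to illustrate why an $\ell_1$ penalty of the form $\beta \|w\|_1$ produces exact zeros, the sketch below applies soft-thresholding, the proximal operator of the $\ell_1$ norm. this is an assumption for illustration only: the paper does not state which optimization procedure it uses, and the names `soft_threshold` and `l1_norm` are hypothetical.

```python
def l1_norm(weights):
    """Sum of absolute values, the quantity scaled by beta in the penalty."""
    return sum(abs(w) for w in weights)


def soft_threshold(weights, beta):
    """Proximal operator of beta * ||w||_1.

    Each entry is shrunk toward zero by beta; entries whose magnitude
    is at most beta become exactly zero, which yields a sparse vector.
    """
    shrunk = []
    for w in weights:
        magnitude = abs(w) - beta
        if magnitude <= 0:
            shrunk.append(0.0)
        else:
            shrunk.append(magnitude if w > 0 else -magnitude)
    return shrunk


# Toy example: a larger beta zeroes more entries (higher sparsity, more bias).
raw = [0.5, 0.05, -0.25, 0.1]
mild = soft_threshold(raw, 0.0625)
strong = soft_threshold(raw, 0.125)
```

the trade-off stated above is visible here: raising `beta` sets more entries to zero but also shrinks the surviving entries further from their unpenalized values.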
the embedding regularization is $\alpha \, \|\text{embedding error}\|_2$ (6). the input to any model has to be a number, and hence the raw input of words or text sequences needs to be transformed to continuous numbers. this is done by one-hot encoding the words and then applying an embedding, as shown in fig. 3, where the raw input text is first mapped to its one-hot encoding representation and then to $u_i$, the embedding of the i-th input or sequence. also, α is the regularization parameter. the idea is that embedding should preserve word similarities, i.e., words that are synonyms before embedding should remain synonyms after embedding. using this concept, the embedding error is defined, scaled, and then re-written using a regularization parameter (α), where l is the measure or metric of similarity of word representations. as similarity measures, both the euclidean norm and cosine similarity (dissimilarity) have been used. in this work, the embedding error using the euclidean norm is used. alternatively, we could include the embedding error of the output sequence in eq. (10). when the input sequence (or the dictionary) is too long, to avoid the high computational cost of computing the similarity of each specific word with all other words, we choose a random (uniform) sample of the input sequences to compute the embedding error. the regularization parameter, α, is computed using cross validation [26] [27] [28] [29] [30] [31]. alternatively, adaptive regularization parameters [82, 83] could be used. this model was applied to wikipedia datasets for english-german translation (one-way translation) with 1000 sentences. the idea was to determine which specific input word (in english) is the most important one for the corresponding german translation.
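reading the most important input word off a learned weight matrix can be sketched as follows. this is a minimal illustration under the assumption that row t of the matrix holds the weights $w_{ti}$ for output position t; the function name `most_influential_inputs` and the toy matrix are hypothetical and are not results from the paper.

```python
def most_influential_inputs(weight_matrix, input_words):
    """For each output position, return the input word with the largest weight.

    weight_matrix[t][i] is the attention weight of output t on input i.
    """
    picks = []
    for row in weight_matrix:
        best = max(range(len(row)), key=lambda i: row[i])
        picks.append((input_words[best], row[best]))
    return picks


# Toy example: a nearly diagonal, sparse weight matrix for a 3-word input.
words = ["the", "red", "house"]
weights = [
    [0.9, 0.1, 0.0],
    [0.0, 0.8, 0.2],
    [0.1, 0.0, 0.9],
]
picks = most_influential_inputs(weights, words)
```

when the weight rows are sparse, the largest entry in each row is easy to interpret, which is the motivation for the penalties described above.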
the results were often an almost diagonal weight matrix, with few non-zero off-diagonal entries, indicating the significance of the corresponding word(s) in the original language (english). since the model is an unsupervised approach, it is hard to evaluate its performance without using domain knowledge. the next step in this work would be to develop a unified and interpretable metric for automatic testing and evaluation of the model without using any domain knowledge and also to apply the model to other applications such as sentiment analysis. inverse problems: principles and applications in geophysics, technology, and medicine polosukhin: attention is all you need domain adaptation via pseudo in-domain data selection multiple object recognition with visual attention neural machine translation by jointly learning to align and translate the dropout learning algorithm deep learning an unsupervised feature learning nips 2012 workshop nesta, a fast and accurate first-order method for sparse recovery learning long-term dependencies with gradient descent is difficult a neural probabilistic language model theano: a cpu and gpu math expression compiler audio chord recognition with recurrent neural networks a singular value thresholding algorithm for matrix completion exact matrix completion via convex optimization compressive sampling long short-term memory-networks for machine reading learning phrase representations using rnn encoder-decoder for statistical machine translation framewise phoneme classification with bidirectional lstm and other neural network architectures generating sequences with recurrent neural networks the elements of statistical learning; data mining, inference and prediction handwritten digit recognition via deformable prototypes 'gene shaving' as a method for identifying distinct sets of genes with similar expression patterns matrix completion via iterative soft-thresholded svd package 'impute'.
cran multilingual distributed representations without word alignment advances in natural language processing long short-term memory gradient flow in recurrent nets: the difficulty of learning long-term dependencies regularization for applied inverse and ill-posed problems compositional attention networks for machine reasoning two case studies in the application of principal component principal component analysis rotation of principal components: choice of normalization constraints a modified principal component technique based on the lasso recurrent continuous translation models statistical machine translation structured attention networks statistical phrase-based translation learning phrase representations using rnn encoder-decoder for statistical machine translation conditional random fields: probabilistic models for segmenting and labeling sequence data neural networks: tricks of the trade a structured self-attentive sentence embedding effective approaches to attention-based neural machine translation learning to recognize features of valid textual entailments natural logic for textual inference encyclopedia of language & linguistics introduction to information retrieval the stanford corenlp natural language processing toolkit. computer science, acl computational linguistics and deep learning differentiating language usage through topic models effective approaches to attention based neural machine translation application of dnn for modern data with two examples: recommender systems & user recognition. 
deep learning summit standardization of featureless variables for machine learning models using natural language processing generalized variable conversion using k-means clustering and web scraping an efficient deep learning model for recommender systems effectiveness of representation learning for the analysis of human behavior an evaluation metric for content providing models, recommendation systems, and online campaigns combined loss function for deep convolutional neural networks a randomized algorithm for the selection of regularization parameter. inverse problem symposium a local regularization method using multiple regularization levels a decomposable attention model on the difficulty of training recurrent neural networks. in: icml on the difficulty of training recurrent neural networks how to construct deep recurrent neural networks fast curvature matrix-vector products for second-order gradient descent bidirectional recurrent neural networks continuous space translation models for phrase-based statistical machine translation continuous space language models for statistical machine translation sequence to sequence learning with neural networks google's neural machine translation system: bridging the gap between human and machine translation adadelta: an adaptive learning rate method key: cord-010499-yefxrj30 authors: yelverton, elizabeth; lindsley, dale; yamauchi, phil; gallant, jonathan a. title: the function of a ribosomal frameshifting signal from human immunodeficiency virus‐1 in escherichia coli date: 2006-10-27 journal: mol microbiol doi: 10.1111/j.1365-2958.1994.tb00310.x sha: doc_id: 10499 cord_uid: yefxrj30 a 15‐17 nucleotide sequence from the gag‐pol ribosome frameshift site of hiv‐1 directs analogous ribosomal frameshifting in escherichia coli. limitation for leucine, which is encoded precisely at the frameshift site, dramatically increased the frequency of leftward frameshifting. 
limitation for phenylalanine or arginine, which are encoded just before and just after the frameshift, did not significantly affect frameshifting. protein sequence analysis demonstrated the occurrence of two closely related frameshift mechanisms. in the first, ribosomes appear to bind leucyl‐trna at the frameshift site and then slip leftward. this is the 'simultaneous slippage' mechanism. in the second, ribosomes appear to slip before binding aminoacyl‐trna, and then bind phenylalanyl‐trna, which is encoded in the left‐shifted reading frame. this mechanism is identical to the 'overlapping reading' we have demonstrated at other bacterial frameshift sites. the hiv‐1 sequence is prone to frameshifting by both mechanisms in e. coli. ribosomes normally maintain a constant reading frame from aug to the finish, but they are capable of slipping into an alternative reading frame at an average frequency of the order of 10 " (atkins et al., 1972; j. a. gallant et al., unpublished). in certain special cases, much higher frequencies of ribosome frameshifting occur. these cases include production of polypeptide release factor 2 of escherichia coli, which depends upon a rightward frameshift within the coding sequence (craigen et al., 1985; craigen and caskey, 1987; weiss et al., 1987; curran and yarus, 1988); translation of the reverse transcriptase of the yeast ty element, which also depends upon a rightward frameshift (clare et al., 1988); and translation of the rna of several retroviruses, which express gag-pol and gag-pro-pol polyproteins by means of leftward frameshifts (reviewed by hatfield and oroszlan, 1990; cattaneo, 1989). ribosomal frameshifting in both rightward and leftward directions has also been shown to occur at certain 'hungry' codons whose cognate aminoacyl-trnas are in short supply (gallant and foley, 1980; weiss and gallant, 1983; 1986; gallant et al., 1985; kurland and gallant, 1986).
not all hungry codons are equally prone to shift: in a survey of 21 frameshift mutations of the riib gene of phage t4, weiss and gallant (1986) found that only a minority were phenotypically suppressible when challenged by limitation for any of several aminoacyl-trnas. the context rules governing ribosome frameshifting at hungry sites are under investigation, and have been defined in a few cases (weiss et al., 1988; gallant and lindsley, 1992; peter et al., 1992; kolor et al., 1993; lindsley and gallant, 1993). so far these sequences do not resemble any of the naturally occurring shifty sites summarized in the first paragraph above. in order to find out whether these two categories of ribosome frameshifting are mechanistically related, we have tested the susceptibility of a well-studied retroviral frameshift site to manipulation by aminoacyl-trna limitation in e. coli. we have directed our analysis to the shifty site at the gag-pol junction of hiv-1 both because of its clinical interest and because certain features render it convenient for analysis. in some viral systems, baroque secondary structures in the mrna downstream of the frameshift site are required to augment frameshifting levels (jacks et al., 1988b; brierley et al., 1989). in the case of hiv-1, however, although a stem-loop structure might exist downstream of the frameshift site (jacks et al., 1988a), direct modification or elimination of the stem-loop sequence has little effect on the rate of frameshifting (madhani et al., 1988; weiss et al., 1989). moreover, wilson et al. (1988) demonstrated that a short (21 nucleotide) sequence of hiv-1 without the stem-loop was sufficient to direct a high level of frameshifting in heterologous in vitro systems.
the site of ribosomal frameshifting at the slippery sequence u-uuu-uua has been directly established by amino acid sequencing of frameshifted proteins (jacks et al., 1988b), and the participation of certain aminoacyl-trnas has been clearly implicated by mutagenesis of the monotonous tract of uridines (jacks et al., 1988b; wilson et al., 1988). our purpose was to discover whether the ribosomal frameshifting directed by a very short sequence in hiv-1 could be reproduced by e. coli ribosomes in vivo, and, if so, whether we could alter the rate of frameshifting by regimens that change the relative abundance of key aminoacyl-trnas encoded at or near the frameshift site. weiss et al. (1989) have also reported that a 52 nucleotide fragment from hiv-1 is sufficient to direct ribosomal frameshifting in an e. coli system. in this report we present evidence that a much shorter 15-17 nucleotide sequence derived from hiv-1 is sufficient to direct the same ribosomal frameshift event in e. coli as in eukaryotes. we also show that in e. coli the rate of ribosomal frameshifting on that sequence can be increased by limitation for leucine, the amino acid encoded at the frameshift site. protein sequence analysis of the product indicates the occurrence of two slightly different mechanisms of shifting. the strategy behind the construction of our assay system for ribosomal frameshifting may be understood with reference to fig. 1. when eukaryotic ribosomes decode the hiv mrna sequence . . . uuuuuuaggg . . ., shown as nucleotides 1-10 in fig. 1a, the adenine at position 7 appears to be read twice: first, as the third position of a leucine codon (uua) and then again as the first position in the overlapping arginine codon (agg ratner et al. (1985). in a heterologous mammalian in vitro translation system, most of the frameshift product has the amino acid sequence . . . asn-phe-leu-arg (jacks et al., 1988b), where leu is encoded by positions 5-7 and arg is encoded by positions 7-9.
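the double reading of position 7 can be traced with a small python sketch. it is a bookkeeping model of reading frames only, not a model of trna binding: a leftward slip is represented as moving the read pointer back by one base. the function `decode`, its `slip_before` parameter and the four-entry codon table (taken from the standard genetic code) are assumptions introduced for illustration; the sequence is the one given as nucleotides 1-10, with the zero frame assumed to start at position 2 so that leu falls on positions 5-7.

```python
# Only the codons needed for this site (standard genetic code).
CODONS = {"UUU": "Phe", "UUA": "Leu", "AGG": "Arg", "GGG": "Gly"}


def decode(mrna, start, n_codons, slip_before=None):
    """Decode n_codons codons starting at 0-based index `start`.

    If slip_before == k, the read pointer moves one base toward the
    5' end (a leftward, -1 shift) just before codon number k is read.
    """
    pos = start
    peptide = []
    for k in range(n_codons):
        if slip_before == k:
            pos -= 1
        peptide.append(CODONS[mrna[pos:pos + 3]])
        pos += 3
    return peptide


SITE = "UUUUUUAGGG"  # positions 1-10; index 0 is position 1

no_shift = decode(SITE, start=1, n_codons=3)
# Slip after Leu has been read at positions 5-7, so Arg is read at 7-9.
slip_after_leu = decode(SITE, start=1, n_codons=3, slip_before=2)
# Slip before the second codon is read, so Phe is read in the shifted frame.
slip_before_leu = decode(SITE, start=1, n_codons=3, slip_before=1)
```

in this model, slipping after the leu codon gives phe-leu-arg, the major product described above, while slipping one codon earlier replaces leu with phe, which corresponds to the second mechanism named in the abstract.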
some mutations that result in increased or decreased expression of frameshift products in a heterologous test system are shown above and below the nucleotide sequence, respectively (wilson et al., 1988). 'n' signifies a mutation to any non-u base. double underlines at positions -10 and +16 mark the boundaries of a fragment that directs the synthesis of a frameshift protein product in a heterologous yeast system (wilson et al., 1988). the singly underlined g at position -6 is the 5' boundary of a fragment that directs the synthesis of a frameshift protein product in a mammalian in vitro system (jacks et al., 1988b). b. and c. a portion of the mrna sequences expressed from lacz frameshift alleles hiv13, hiv13-a3, hiv201, and hiv201-u7 are depicted. numbers above the nucleotide sequence correspond to analogous positions of the hiv-1 gag-pol junction. doubly underlined nucleotides mark the boundaries of sequence that is conserved with respect to hiv-1. synthesis of β-galactosidase from the alleles requires a leftward frameshift. amino acids are shown for the mature protein, after in vivo cleavage of the initiating n-terminal formyl-met residue. constructs phiv13 and phiv201, and their variants phiv13-a3 and phiv201-u7, are described in fig. 1. (the sequence of the critical heptanucleotide at the frameshift site is shown in parentheses after each construct's designation.) all constructs were transformed into a derivative of cp79 (rela2 thr- leu- his- arg- thi-) carrying a complete deletion of lacz. methods of cultivation, and enzyme and protein assay were as described previously (peter et al., 1992). cells were grown into exponential phase in m63-glucose medium supplemented with all required amino acids plus ile and val. the lac promoter was induced (2 mM iptg and 2.5 mM camp) for about one doubling. data are reported ± standard error of the mean, with the number of replicate induced cultures in parentheses.
these values include all the unstarved control cultures from the various starvation experiments. to take place in about 10% of ribosomal transits (jacks et al., 1988b). in hiv-1, the outcome of the leftward ribosomal frameshift is the successful production of the gag-pol fusion protein. in the assay system we have devised, the outcome of an analogous leftward frameshift by e. coli ribosomes will be the successful production of the enzyme β-galactosidase from genetically frameshifted alleles of the lacz gene. we have previously used an assay system of similar design to demonstrate that lysyl-trna starvation can amplify ribosomal frameshifting in either direction at lysine codons, given certain context rules (gallant and lindsley, 1992; peter et al., 1992; lindsley and gallant, 1993). alleles to be tested were constructed by the ligation of paired complementary oligonucleotides into the hindiii-bamhi site of pbwhoo, as described in gallant and lindsley (1992). figure 1 shows the sequence of the translated strand from the region of our constructs that reproduces the gag-pol frameshift signal from hiv-1. the lacz frameshift alleles carried on plasmids phiv13 and phiv201 are constructed so that a shift to the left by one base, as in the expression of the gag-pol fusion of hiv, is required to generate active enzyme. the two constructs both carry a short sequence identical to the region around the frameshift site in the gag-pol overlap of hiv-1 for 15 nucleotides in phiv13 and 17 nucleotides in phiv201 (see fig. 1); they differ slightly from one another several bases downstream of the frameshift site. host cells carrying either of these plasmids produce active enzyme at about 1% of the efficiency of cells carrying a control lacz+ plasmid (table 1). this basal value is close to the value (1.8%) observed by weiss et al. (1989) for frameshifting on a much longer hiv-derived sequence in a similar β-galactosidase reporter.
it is also much higher than the frequency of leftward frameshifting (0.03-0.2%) we observed previously at sequences unrelated to hiv (gallant and lindsley, 1992). the presence of the hiv sequence in our reporter thus leads to an unusually high frequency of leftward frameshifting. modification of the critical heptanucleotide sequence from u uuu uua to u uau uua in plasmid phiv13-a3 decreased frameshifting about fivefold, while modification of the heptanucleotide to u uuu uuu in phiv201-u7 increased frameshifting by two- to threefold (table 1). these genetic results are analogous to earlier findings in other reporter systems (jaci10 kb) and discrete clusters of genes, habitually possessing a characteristically atypical g/c content, at least when compared with the remainder of the genome, leads, in turn, to the individual identification within clusters of virulence-associated protein antigens. prokaryotic pais are frequently associated with trna-encoding genes, many are flanked by repeat structures, and many contain fragments of mobile genetic elements such as plasmids and phages. pais can be identified by combining analysis of nucleotide composition and phylogeny, amongst others. composition-based approaches rely on the natural variation between genome sequences from different species. regions of the genome with abnormal composition, as demonstrated by nucleotide or codon bias, may have been transferred horizontally. such methods are prone to inaccuracies; these result from inherent genomic sequence variation, such as is seen in highly expressed genes, and from the observation that over time the sequences of genomic islands alter to mirror the composition of host genomes. evolution-based approaches seek regions that may have been transferred horizontally by comparing related species. put at its simplest: a putative genomic island present in one species, but absent from several related species, is consistent with horizontal transfer.
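the composition-based idea described above can be illustrated with a minimal sketch. it is not one of the published predictors discussed in this chapter; it simply slides a window along a nucleotide sequence and flags windows whose g+c fraction deviates strongly from the mean of all windows. the function names, the window size, the step, and the z-score cut-off are all assumptions chosen for illustration.

```python
from statistics import mean, pstdev


def gc_fraction(seq):
    """Fraction of G and C bases in a nucleotide string (0.0 for an empty string)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return sum(1 for base in seq if base in "GC") / len(seq)


def atypical_windows(genome, window=1000, step=500, z_cutoff=2.0):
    """Return (start, end, gc) for windows whose G+C fraction is an outlier.

    Illustrative only: window, step and z_cutoff are arbitrary assumptions,
    and real predictors also use codon bias, mobility genes, and so on.
    """
    starts = list(range(0, len(genome) - window + 1, step))
    values = [gc_fraction(genome[s:s + window]) for s in starts]
    if len(values) < 2:
        return []
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # perfectly uniform composition, nothing stands out
    return [
        (s, s + window, v)
        for s, v in zip(starts, values)
        if abs(v - mu) / sigma >= z_cutoff
    ]


if __name__ == "__main__":
    # Toy genome: AT-rich background with a 1,000 bp GC-rich insert.
    toy = "AT" * 2000 + "GC" * 500 + "AT" * 2500
    for start, end, gc in atypical_windows(toy):
        print(start, end, round(gc, 2))
```

as the text notes, a flagged window is only a candidate: highly expressed genes can also show atypical composition, and old islands drift towards the host composition, so such a filter would miss them.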
of course, the island may have been present in the last common ancestor shared by the species compared and subsequently been lost from the other species. a less likely explanation would be that the island arose by mutation and selection in this species and no other. to decide, a body of extra evidence would need to be explored, such as the size of the pai, the mechanistic ease of deletion, the consistent presence of the island in more distantly related species, the relative pathogenicity of island-less species, and the divergence of the genome relative to that of other related species. many methods, which seek to quantify and leverage these somewhat vague notions, are now available [42] [43] [44]. such analysis at the nucleic acid level shares many features with approaches used to identify cpg islands in eukaryotic genomes [45] [46] [47] [48]. recently, langille et al. tested six sequence-composition genomic island prediction methods and found that islandpath-dimob and sigi-hmm had the greatest overall accuracy [49]. islandpath was designed to help identify prokaryotic pais, through the visualisation of common pai characteristics such as mobile element-associated genes or atypical sequence composition [50]. sigi-hmm is a very accurate sequence composition-based genomic island predictor, which combines a hidden markov model (hmm) and codon usage measurement to identify genomic islands [51]. in another work, yoon et al. coupled heuristic sequence searching methods, which aimed simultaneously to identify pais and individual virulence genes, with composition and codon-usage bias [52]. exploiting a machine learning approach, vernikos and parkhill sampled the structural features of genomic islands using a hypothesis-free, bottom-up search, with the objective of explicitly quantifying the contribution made by each feature to the overall structure of different genomic islands [53]. arvey et al.
sought to identify large chromosomal regions with atypical features using a general divergence measure able to quantify the compositional difference between genomic segments [54]. islandpick is a comparative genomic island predictor, rather than a composition-based approach, that can identify very probable genomic islands and very probable non-genomic islands within investigated genomes but does require that several phylogenetically related genomes are available [49]. observing pais as having a g + c composition closer to their host genome, wang et al. used so-called genomic barcodes to identify pais. these barcodes are based on the fact that the frequencies of 2-mers to 7-mers, and their reverse complements, are very stable across a whole genome when using a window size of over 1,000 bp and that this constitutes a characteristic signature for genomes [55]. the ready detection of pais, as a tool in computational reverse vaccinology, has been greatly aided by the deployment of several web-based resources. a key example of a server that successfully integrates several accurate genomic island predictors is islandviewer [56], which combines the methods islandpick [49], islandpath [50], and sigi-hmm [51] and is available at the url: http://www.pathogenomics.sfu.ca/islandviewer/query.php. the gui facilitates the visualisation of genomic islands and downloading of data at the gene and chromosome levels in a variety of formats. another important, web-accessible resource is paidb or the pai database. this is a wide-ranging database of pais, containing 112 distinct pais and 889 genbank accessions present in 497 strains of pathogenic bacteria [57]. paidb may be accessed via the url: http://www.gem.re.kr/paidb. thus, alternative techniques and methodologies are required in order to select and to rank proteins likely to be protective antigens and thus candidate vaccines.
below, we shall explore three key approaches: subcellular location prediction, alignment-dependent sequence similarity searching, and alignment-independent empirical statistical approaches. in this section, we consider what is perhaps the clearest and cleanest way to identify potential new antigens in any microbial genome: alignment-dependent sequence similarity searching. there are two complementary but distinct ways of identifying the immunogenicity of a protein from its sequence. one is to look for significant similarity to proteins of known immunogenicity. this idea seems so straightforward as to be almost facile. the other approach is somewhat less obvious conceptually but almost as straightforward logistically and involves seeking to identify antigens as proteins without discernible sequence similarity to any host protein. let us turn to the first of these two alternatives. let us begin by stating or rather reiterating the obvious. if we know the sequence of an existing antigen or antigens, we can use sequence searching to find similar sequences in the target genome [58, 59]. any candidate antigens identified by this process can then be selected for further verification and validation. the same old, familiar caveats apply here: are chosen thresholds appropriate? are high-scoring matches an artefact or are they real and meaningful? the litany of such conditions is all too familiar to anyone well versed in sequence similarity searching. clearly, when a sequence search is run, using blast or fasta3, for example, an enormously long list of nearly identical proteins might ensue, or no hits at all, or almost any intervening result might be obtained. as reflective practitioners, we must judge which result can be classified as useful and which cannot, and in so doing, identify sets of suitable thresholds, above which we expect usefulness and below which we might anticipate little or no utility.
thresholds are contingent upon the sequence family studied, as well as being dependent on the problem investigated. thus heuristically identified cut-offs are desirable, but much thinking and empirical investigation are required to select appropriate values. of course, the process adumbrated above presupposes that sufficient antigenic protein sequences are known. compilation of this data is the role of the database. recently, extensive literature mining, coupled with factory-scale experimentation, has created many functional immunology databases, although databases such as syfpeithi [60, 61], focussing on cellular immunology (primarily mhc processing, presentation, and t cell recognition), have existed for 15-20 years. arguably, the best extant database is the hiv molecular immunology database [62], although clearly the depth of the database is at the expense of generality and breadth. other recent databases include mhcbn [63, 64] and epimhc [65], amongst many others. two databases warrant particular attention: antijen [66], formerly known as jenpep [67, 68]; and iedb [69]. implemented as a relational postgresql database, antijen integrates a wide-ranging set of data items, much of which is not stored by other databases. in addition to the kind of cellular immunological information familiar from syfpeithi, such as mhc binding and t cell data, antijen archives b cell epitopes and also includes a significant stockpile of quantitative data: kinetic, thermodynamic, as well as functional, including measurements of immunological peptide-protein and protein-protein interactions. the iedb database is considerably more extensive than other equivalent database systems, benefiting from the input of 13 dedicated epitope sequencing projects. iedb has come to eclipse other work in this area. although both antijen and iedb are full of epitope-focussed information of many flavours, they remain incomplete concerning immunogenic antigens.
fortuitously, specific antigen-orientated (rather than epitope-focussed) databases are starting to be available. arguably, the most obvious and most unambiguous example of an antigen is the virulence factor (vf): proteins, such as toxins, able to induce disease directly by attacking a host. analysis of known pathogens has allowed the identification of recurring vf systems of 40+ distinct proteins. often, sets of vfs exist as discrete, distinct genome-encoded pais, as well as being more widely spread through the genome. clearly, antigens do not need to be vfs in order to be immunogenic and thus candidates for subunit vaccines. instead, they need only be accessible to the immune system. they do not need to directly or indirectly mediate infection. thus, other databases are needed which capture, collate, and archive the burgeoning plethora of antigen-orientated data. recently, we have helped develop a very different database: antigendb [70]. it contains over 500 antigens collated from the primary scientific literature, as well as other sources. another related database system has been christened violin (vaccine investigation and online information network) [71], which allows straightforward curation and the analysis and comparison of research data across diverse pathogens in the context of human medicine, animal models, laboratory model systems, and natural hosts. as we outline above, in addition to identifying sequence similarity to known antigens, another idea gaining ground is that the immunogenicity of an antigen is solely determined by the absence of similarity to host proteins. some think this is the prime determinant of potential protein immunogenicity [72, 73]. such ideas are supported by the belief that immune systems are actively educated to lack reactivity to self-proteins [74], a process, often termed "immune tolerance", which is generated via epitope-specific mechanisms [75, 76].
what we really want is a meaningful measure of the "foreignness" of a protein correlating with its immunogenicity. usually, "evolutionary distance" substitutes for "foreignness." clearly, such an evolutionary distance must be specified in terms of biomacromolecular structures or sequences. but is this practically useful for selecting candidate vaccines? another way to formulate this idea is to say that the probability that a protein is immunogenic is exclusively a product of its dissimilarity, at the whole-sequence or sequence-fragment level, to each and every protein contained within the host proteome. most search software is well matched to this problem. in terms of fragment length, the typical length of an epitope might seem logical, since the epitope is the molecular moiety typically recognised during the initial phase of an immune response. yet, even at the epitope level (say a peptide of 8-16 amino acid residues), even a single conservative mutation or mismatch in an otherwise identical match might prove significant. single sequence alterations may totally abrogate or significantly enhance neutralising antibody binding or recognition by the machinery of cellular immunology. we have attempted to benchmark sequence similarity and correlate it with immunogenicity in order to explore the potential of this idea in a quantitative fashion. to that end, we examined the differences between sets of antigens and non-antigens using sequence similarity scores. we looked specifically at sets of 100 known non-antigenic and 100 antigenic protein sequences from six sources: bacteria, viruses, fungi, and parasites, as well as allergens and tumours [77] [78] [79], comparing pathogen sequences to those from humans and mice using blast [80].
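the fragment-level formulation of "foreignness" above can be made concrete with a deliberately simple sketch. it is not the blast-based benchmark just described; it only counts exact peptide fragments (k-mers) of a candidate protein that do not occur anywhere in a set of host proteins. the function names and the default fragment length of 9 residues are assumptions for illustration, and exact matching ignores the conservative substitutions that the text warns may still matter.

```python
def kmers(sequence, k):
    """Set of all overlapping fragments of length k in a protein sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def foreign_fraction(candidate, host_proteins, k=9):
    """Fraction of the candidate's k-mers that are absent from every host protein.

    1.0 means no exact k-mer is shared with the host set; 0.0 means every
    k-mer is shared (or the candidate is shorter than k).
    """
    host_kmers = set()
    for protein in host_proteins:
        host_kmers |= kmers(protein, k)
    candidate_kmers = kmers(candidate, k)
    if not candidate_kmers:
        return 0.0
    return len(candidate_kmers - host_kmers) / len(candidate_kmers)


if __name__ == "__main__":
    # Made-up sequences, purely to exercise the function.
    host = ["MKTAYIAKQRQISFVK"]
    candidate = "MKTAYIAKQWWWWWWWWW"
    print(foreign_fraction(candidate, host))
```

a score of this kind still needs a threshold before it can rank candidates, which is exactly the step that the benchmark discussed in the text found difficult to set.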
most non-antigenic and antigenic sequences were non-redundant, implying a lack of homologues between pathogens and host proteomes, although certain parasite antigens, such as catalases and heat shock proteins, had a much greater level of similarity. we were not able to determine a suitable and appropriate threshold based on the hypothesis of non-redundancy to the host's proteome, suggesting that this is not a viable solution to vaccine antigen identification. however, rather than looking at nucleic acid sequences, or at protein sequences using an alignment-based approach, a new set of methods, based upon alignment-free techniques, has been and is being developed; as this approach begins to show significant potential, we shall examine it next. proteins accessible to immune system surveillance are assumed to lie external to the microbial organism or be attached to its surface rather than being sequestered within the cell. for bacteria, this means being located on or in the outer membrane surface or being secreted. thus, being able to accurately predict the physical location of a putative antigen can provide considerable insight into the likelihood that a particular protein will prove to be immunogenic and possibly protective. there are two basic kinds of prediction method for identifying subcellular location: manual rule construction and the application of data-driven machine learning methods. data used to discriminate between compartments include sequence-derived features of the protein, such as hydrophobic regions; the amino acid composition of the whole protein; the presence of certain specific motifs; or a combination thereof. accuracy differs significantly between different methods and different compartments, mostly resulting from the deficiency and inconsistency of data used to derive models. gross overall sequence similarity is unable to predict protein sub-cellular location reliably or accurately.
even nearly identical protein sequences may be found in distinct locations, while there are many proteins which exist simultaneously at several distinct locations within the cell, often having equally distinct functions at these different sites [81]. eukaryotes and prokaryotes have quite distinct subcellular compartments. the number of such compartments used in prediction studies varies. a common schema reduces prokaryotic cells to three compartments (cytoplasmic, periplasmic, and extracellular) and eukaryotic cells to four compartments (nuclear, cytoplasmic, mitochondrial, and extracellular). other structural classifications evince in excess of ten eukaryotic compartments. ten compartments may be a conservative estimate, such is the complex richness of sub-cellular structure. any prediction method must account for permanent, transient, and multiple locations, and, in addition, multi-protein complexes and membrane-bound organelles as possible sites. numerous signal sequences exist. several methods predict lipoproteins. the prediction of proteins translocated via the tat-dependent pathway is important but has yet to be addressed properly. however, amongst binary, single-outcome approaches, signalp is probably the most accurate and reliable method available. it uses neural networks to predict the presence and probable cleavage sites of type ii or n-terminal spase-i-cleaved secretion signal peptides [82] [83] [84]. this signal is common to both prokaryotic and eukaryotic organisms. signalp has recently been enhanced with an hmm intended to discriminate cleaved from uncleaved signal anchors. a limitation of signalp is its proclivity to over-predict: it cannot reliably discriminate between a number of very similar yet functionally different signal sequences, regularly predicting lipoproteins and integral membrane proteins as type ii signals.
many methods have been devised capable of dividing a genome or virtual proteome between the various subcellular locations of a eukaryotic or prokaryotic cell. psort is a good example; it is a multicategory prediction procedure, comprising many different programmes [85] [86] [87] [88]. psort i predicts 17 subcellular compartments, while psort ii predicts ten different locations. ipsort deals with several compartments: chloroplast, mitochondrial, and proteins secreted from the cell, while psort-b focuses solely on predicting bacterial sub-cellular locations. another effective programme is hensbc [89]. hensbc can assign gene products to one of four different types (nuclear, mitochondrial, cytoplasmic, or extracellular) with an accuracy of about eight out of ten for gram-negative bacteria. another programme, subloc [90], predicts prokaryotic subcellular location divided between three compartments. another programme is gpos-ploc [91], which integrates several basic classifiers. other methods include phobius [92], lipop 1.0 [93], and tatp 1.0 [94]. a comparison of several such programmes, using 272 mycobacterial proteins as a gold standard [95], showed that subcellular localisation prediction possessed high predictive specificity. we have developed a set of methods which predict bacterial subcellular location. using a set of methods for lipoprotein, tat secretion, and membrane protein prediction [96] [97] [98] [99] [100] [101] [102], three different bayesian network architectures were implemented as software pipelines able to predict specific subcellular locations: two serial implementations using a hierarchical decision structure, and a parallel implementation with a confidence-level-based decision engine [103]. the soluble-rooted serial pipeline performed better than the membrane-rooted predictor. the parallel pipeline outperformed the serial pipeline but was significantly less efficient.
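the difference between a serial (hierarchical) and a parallel (confidence-based) way of combining per-compartment predictors can be shown with a toy sketch. this is not the bayesian network implementation of [103], whose internals are not described here; the function names, the 0.5 threshold, and the idea that each predictor returns a single confidence between 0 and 1 are assumptions made purely for illustration.

```python
def serial_decision(protein, ordered_predictors, threshold=0.5):
    """Ask predictors one after another; the first confident answer wins.

    ordered_predictors is a list of (location, predict) pairs, where
    predict(protein) returns a confidence between 0 and 1 (hypothetical API).
    Later predictors are never run once an earlier one fires.
    """
    for location, predict in ordered_predictors:
        if predict(protein) >= threshold:
            return location
    return "unassigned"


def parallel_decision(protein, predictors, threshold=0.5):
    """Run every predictor, then pick the location with the highest confidence."""
    scores = {location: predict(protein) for location, predict in predictors}
    if not scores:
        return "unassigned"
    best = max(scores, key=scores.get)
    return best if scores[best] >= threshold else "unassigned"


if __name__ == "__main__":
    # Stub predictors returning fixed confidences, for demonstration only.
    stubs = [
        ("secreted", lambda protein: 0.6),
        ("membrane", lambda protein: 0.9),
    ]
    print(serial_decision("MKKLLA", stubs))
    print(parallel_decision("MKKLLA", stubs))
```

in this toy form the trade-off noted in the text is visible: the parallel scheme always evaluates every predictor, so it costs more, while the serial scheme depends on the order in which predictors are consulted.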
genomic test sets proved more ambiguous: the serial implementation identified 22 more of the 74 proteins of known location, yet more accurate predictions are made overall by the parallel implementation. the implications of this work are clear. the complexity of subcellular structures must be integrated fully into sub-cellular location prediction. in extant studies, many important cellular organelles are not considered; different routes by which proteins can reach the same compartment are ignored; and proteins existing simultaneously at several locations are likewise discounted. clearly, combining high specificity predictors for each compartment appropriately must be the way forward [103]. many difficulties, problems, and quandaries persist; the most keenly felt is the lack of high-quality, verified, and validated datasets which unambiguously establish the location of well-characterised proteins. this dearth is particularly serious for certain types of secreted protein, such as those exported by type iii secretion. in a similar manner, considerably more work is required to accurately predict the locations of proteins of viral origin; while certain studies are encouraging [104, 105], the complexity of viral interaction with host organisms continues to confound attempts at analysis. the in silico prediction of antigens typically utilises bioinformatics tools. such tools can identify signal peptides or membrane proteins or lipoproteins successfully, yet the majority of algorithms tend to depend on motifs characteristic of antigens or, more generally, sequence alignment as the principal arbiter of definitive and meaningful sequence relationships. this is potentially a problem of some magnitude, particularly given the wide range of evolutionary rates and mechanisms amongst microbial proteins. certain protein families do not, however, show obvious or significant sequence similarity, despite having common biological properties, functions, and three-dimensional structures [106, 107].
thus alignment-based approaches may not always produce useful and unequivocal results, since they assume a direct sequence relationship that can be identified by simple sequence search techniques. immunogenicity, as a signature characteristic, may be encrypted within the structure and/or sequence instead. this may be encoded so cryptically or so subtly as to completely confound or at least mislead conventional sequence alignment protocols. discovery of utterly novel and previously unknown antigens will be totally stymied by the absence of similarity to known antigenic proteins. alignment-dependent methods tend to dominate bioinformatics and, by extension, immunoinformatics. several authors have chosen to look at alternative strategies, implementing so-called alignment-independent or alignment-free techniques. the first authors to do so were mayer et al., who reported that protective antigens had a different amino acid composition compared to control groups of non-antigens [108]. such a result is unsurprising since it has long been known that the structure and sequence composition of proteins are adapted to the different redox environments of different sub-cellular compartments [109]. mayer's analysis was formulated primarily in terms of univariate comparisons of antigens versus controls for different properties. subsequently, we explored bivariate comparison in terms of easily comprehensible scatter-plots. see fig. 3.3 for representative examples. what their results ably demonstrate is the potential for the discrimination of antigens and non-antigens by the appropriate selection of orthogonal descriptors. the challenge, of course, is to identify a robust choice of descriptors which are capable of extrapolating as well as interpolating when used predictively.
progressing beyond this type of analysis, and synergising with our other work on alignment-independent representation [110] [111] [112] [113] [114], we have initiated the development of new methods to differentiate antigens (and thus potential vaccine candidates) from non-antigens, using a more sophisticated alignment-free approach to sequence representation [115, 116]. rather than focus on epitope versus non-epitope, our approach utilises data on protective antigens derived from diverse pathogens to create statistical models capable of predicting whole-protein antigenicity. our alignment-independent method for antigen identification uses the auto cross covariance (acc) transformation originally devised by wold et al. [117, 118] to transform protein sequences into uniform vectors. the acc transform has found much application in peptide prediction and protein classification [119] [120] [121] [122] [123] [124] [125] [126]. in our method, amino acid residues are represented by the well-known and well-used z descriptors [127] [128] [129], which characterise the hydrophobicity, molecular size, and polarity of residues. our method also accounts for the absence of complete independence between distinct sequence positions. we initially applied our approach to groups of known viral, bacterial, and tumour antigens, developing models capable of identifying antigens. extra models were subsequently added for fungal and parasite antigens. for bacterial, viral, and tumour antigens, models had prediction accuracies in the 70-89% range [115, 116, 130]. for the parasite and fungal antigens, models had good predictive ability with 78-97% accuracy. these models were incorporated into a server for protective antigen prediction called vaxijen [115] (url: http://www.darrenflower.info/vaxijen). vaxijen is an imperfect but encouraging start; future research will yield significantly more insight as well-characterised protective antigens increase significantly in number [70].
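to make the idea of transforming sequences of different lengths into uniform vectors more tangible, the following is a minimal sketch of an auto cross covariance calculation. it assumes the commonly used form in which, for descriptors j and k and a lag l, the term is the sum over positions i of z(j, i) * z(k, i + l), divided by (n - l), where n is the sequence length. the function name, the maximum lag, and especially the two-valued descriptor table are hypothetical: the toy table below is not the published z scale, which a real application would need to supply.

```python
def acc_transform(sequence, descriptors, max_lag):
    """Auto cross covariance vector for one protein sequence.

    descriptors maps each residue letter to a tuple of numeric values.
    The result has max_lag * d * d entries (d = descriptors per residue),
    regardless of sequence length, provided len(sequence) > max_lag.
    """
    n = len(sequence)
    if n <= max_lag:
        raise ValueError("sequence must be longer than max_lag")
    rows = [descriptors[residue] for residue in sequence]
    d = len(rows[0])
    vector = []
    for lag in range(1, max_lag + 1):
        for j in range(d):
            for k in range(d):
                # Covariance-like term between descriptor j at position i
                # and descriptor k at position i + lag, averaged over i.
                total = sum(rows[i][j] * rows[i + lag][k] for i in range(n - lag))
                vector.append(total / (n - lag))
    return vector


if __name__ == "__main__":
    # Hypothetical two-descriptor table for two residues; NOT the z scale.
    toy_descriptors = {"A": (1.0, 0.0), "K": (0.0, 1.0)}
    short = acc_transform("AKAK", toy_descriptors, max_lag=2)
    longer = acc_transform("AKAKKAAK", toy_descriptors, max_lag=2)
    print(len(short), len(longer))  # same length despite different inputs
```

because the lagged terms pair each position with its neighbours, the resulting vector carries some information about the dependence between sequence positions, which is the property the text refers to. the fixed-length vectors could then be passed to any standard statistical classifier.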
as we have said, a number of bioinformatics problems are unique to the discipline of immunology: the greatest of these is the accurate quantitative prediction of immunogenicity. this chapter has in its totality been suffused and pervaded by the idea of immunogenicity and the challenge of predicting this property in silico. such an endeavour is confounding, yet exciting, and, as a key instrument in developing better, safer, more effective vaccines, is also of undisputed practical utility. successful immunogenicity prediction is at its simplest made manifest through the identification of b cell or t cell epitopes. epitope recognition, when seen as a chemical event, may be understood in terms of the relationships between apparent biological function or activity and basic physicochemical properties. delineating structure-activity or property-activity relationships of this kind is a key concern of immunoinformatics. at the other end of the spectrum, immunogenicity can be viewed as a cohesive, integrated, system property: a property of the entire and complete immune system and not a series of individual and isolated molecular recognition events. thus, the task of predicting systems-level immunogenicity is in all likelihood many times more demanding than, say, predicting peptide binding. the clinical manifestation of vaccine immunogenicity arises from the complex amalgam of many contributing extrinsic and intrinsic factors, which include pathogen-side and host-side properties, as well as those coming directly from proteins themselves. see fig. 3.2. protein-side properties include the aggregation state of candidate vaccines and the possession of pamps. pathogen-side properties are clearly properties intrinsic to the pathogen, including expression levels of the antigen, the time-course of this expression, as well as its subcellular location.
so-called host-side properties are innate recognition properties of host immunity, and most obviously include t cell epitopes or b cell epitopes. a bona fide candidate antigen should be available for immune surveillance and thus highly expressed, constitutively or transiently, as well as having several epitopes. a protein without immunogenicity would logically lack all or some of these characteristics. as a prediction problem, this is, to say the least, not uncomplicated, clearly consisting of a great variety of difficult-to-compute stages. in terms of mechanism, many of these stages are poorly understood. yet, each can be addressed using standard computational and statistical tools. they can all be predicted, presupposing, of course, the presence of relevant data in sufficient quantity. one of the strongest messages to emerge from this review is that immunogenicity is a strongly multi-factorial property: some protein antigens are immunogenic for one reason, or set of reasons, and other immunogenic proteins will be so for another possibly tangential reason or set of reasons. each such causal manifold is itself complex and potentially confusing. thus, the prediction of immunogenicity is a problem in multi-factorial prediction, and the search for new antigens is a search through a multi-factorial landscape of contingent causes and discombobulating decoys. some of the evidence will be highly precise and quantitative: the kind provided by predictive immunoinformatics, for example. this typically yields exact values for, say, the binding affinity of a peptide to a protein component of the immune system, or an unequivocal yes or no answer to the question: is this peptide sequence an epitope? however, for each such exact prediction, we have some notional associated probability concerning how reliable we regard this result to be.
different methods evince a range of accuracy, which, in practice, equates to probabilities of reliability: we naturally have more confidence in, and assume a greater reliability for, a highly accurate predictor versus one of average predictive ability, though even an accurate predictor can still give wrong predictions, and generally inaccurate predictors may work well for a specific subset of the data. other forms of evidence will have a distinctly more anecdotal flavour. take, for example, the case of bacterial exotoxins. together with endotoxins, such as lps, and so-called superantigens, exotoxins form the principal varieties of toxin secreted by pathogenic bacteria. exotoxins have evolved to be the most toxic substances known to science: in terms of the median lethal dose, botulinum toxin (the active ingredient of botox and causative agent of botulism, amongst others) is about ten times as lethal as the radioactive isotope polonium-210 and a million times more deadly than mainline poisons, such as arsenic or potassium cyanide. virtually all such potent bacterial exotoxins comprise two functionally distinct subunits, either separate proteins or distinct domains, usually denoted a and b. the a subunit is habitually an enzyme, such as a protease, which modifies specific protein targets, thus disrupting key cellular processes within host cells. the b subunit is a protein which binds to host cell surface lipids or proteins, enabling the toxin to be internalised efficiently. the high specificity of this dual action lends exotoxins much of their remarkable lethality. exotoxins are also extremely immunogenic, inducing the immune system to produce high-affinity neutralising antibodies against them, and thus make excellent targets for vaccinology. a toxoid (a toxin which has been treated or inactivated, often by formaldehyde) is in essence a form of subunit vaccine and, as such, requires adjuvant to induce adequate immune responses.
vaccines targeting tetanus and diphtheria, which usually need boosting every decade, are based on toxoids, albeit typically combined with pertussis toxin acting as an adjuvant. poisoning by exotoxins, on the other hand, requires treatment with antitoxin comprising preformed antibodies. however, say that we were offered a newly sequenced pathogen genome: is such a classification for ab toxins helpful when trying to identify potential exotoxins? the answer is neither yes nor no, but lies somewhere between these extremes. assuming we had extant knowledge or a reliable method predicting the presence of structurally and functionally distinct domains, this very simple rule-of-thumb would become a useful tool for eliminating large numbers of possible toxin molecules. it would not directly identify an antigen but would enormously reduce the workload inherent in their discovery. as well as needing more and more reliable predictors, we also need a way of combining the information we gather from any set of reliable predictors to which we have access. thus, when analysing a pathogen genome, what we seem to need, at least in order to identify immunogenic proteins, is both a set of reliable and robust tools and a cohesive expert system within which to embed them. such systems, albeit still at a relatively crude and faltering level, do exist. because there is an implicit hierarchy of one prediction being based on others, there is a need to balance and judge different pieces of probabilistic evidence. an effective expert system should be capable of such a feat. to a first approximation, an expert system is a computer programme that undertakes tasks that might otherwise be prosecuted by a human expert, ostensibly by simulating the apparent judgement and behaviour of an individual or organization with expertise and experience within a particular discipline.
an expert system might make financial forecasts, or play chess; it might diagnose human illnesses or schedule the routes of delivery vehicles. to create an expert system, one first needs to analyse human experts and how they make decisions, before translating this into rules that a computer can follow. such a system leverages both a knowledge base of accumulated expertise and a set of rules for applying such distilled knowledge to particular situations in order to solve problems. sophisticated expert systems can be updated with new knowledge and rules and can also learn from the success of their predictions, again mirroring the behaviour of properly performing experts. at the heart then of an expert system is the need to combine evidence in order to reach decisions. combining evidence, and reaching a decision based on that combined evidence, is no easier in the laboratory, be that virtual or actual, than it is in the court room. the problem of combining evidence is encountered across the disciplines, and various solutions have arisen in these different areas. within bioinformatic prediction, a particular variety of evidence combination, so-called meta-prediction, is now a well-established strategy [131, 132]. this approach seeks to amalgamate the output of various predictors, typically internet servers, in an intelligent way so that the combined result is more accurate than any of those coming from a single predictor. indeed, combining results from multiple prediction tools does often increase overall accuracy. a consensus strategy was first proposed by mallios [133], who combined syfpeithi [60, 61, 134], propred [135, 136], and the iterative stepwise discriminant analysis meta-algorithm [137] [138] [139]. multipred [140] integrates hmms and artificial neural networks (ann). six mhc class ii predictors were combined by dai and co-workers [141] [142] [143], basing the overall prediction on the probability distributions of the different scores. trost et al. 
have used a heuristic method to address class i peptide-mhc binding [144]. wang et al. [145] applied a consensus method to calculate the median rank of the top three predictive methods for each mhc class ii protein initially evaluated so as to rank all possible 8-, 9-, and 10-mers from one protein. this rank was used to identify the top 1% of peptides from each protein. in probabilistic reasoning, or reasoning with uncertainty, there are many ways to represent espoused beliefs (or, in our domain, predictions) that effectively encode the uncertainty of propositions. these include fuzzy logic and the evidential method, among many others. for quantitative data, information fusion, in its various guises [146], is one robust route to effective combination. another requires us to enter the world of bayesian statistics, or, at least, a special thread within it. bayes theory, and the ever-expanding strand of statistics devolving from it, is concerned primarily with updating or revising belief in the light of new evidence, while so-called dempster-shafer theory [147] is concerned not with the conditional probabilities of bayesian statistics but with the direct combination of evidence. it extends the bayesian theory of subjective probability by replacing bayesian probabilities with belief functions that describe degrees of belief for one question in terms of probabilities for another, and then combines these using dempster's rule for merging degrees of belief when based on independent lines of evidence. such belief functions may or may not have the mathematical properties of probabilities but are seemingly able to combine the rigor of probability theory with the flexibility of rule-based approaches. several expert systems of different flavours and hues have now become available within the vaccinology arena. sundaresh et al. 
developed a specialist software package for the analysis of microarray experiments that could easily be classified as an expert system and used it in the area of reverse vaccinology. this package, which was written in the open-source statistical package r, was used to help analyse a variety of complex microarray experiments on the bacterium f. tularensis, a category a bio-defense pathogen [148]. this programme implements a two-stage process for diagnostic analysis: selection of antigens based on significant immune responses coupled with differential expression analysis, followed by classification of measured antigen responses using a combination of k-means clustering, support vector machines, and k-nearest neighbours. we have already discussed vaxijen [115, 116, 130], and the related server epijen [149], which combines various methods for identifying epitopes within extant proteins. these two servers can also be classified as vaccine-related expert systems. nerve is another expert system, which has been developed to help automate aspects of reverse vaccinology [150]. using nerve, the prioritisation of potential candidate antigens consists of several stages: prediction of subcellular localisation; is the antigen an adhesin?; identification of membrane-crossing domains; and comparison to pathogen and human proteomes. candidates are filtered then ranked, and putative antigens are graded by provenance and their predicted immunogenicity. the web-based expert system, dynavacs [151], was developed to facilitate the efficient design of dna vaccines and is available at the url: http://miracle.igib.res.in/dynavac. it takes a structured approach for vaccine design, leveraging various key design parameters, including the choice of appropriate expression vectors, safeguarding efficient expression through codon optimization, ensuring high levels of translation by adding specific sequence signals, and engineering of cpg motifs as adjuvant mechanisms exacerbating immune responses. 
it also allows restriction enzyme mapping, the design of primers, and lists vectors in use for known dna vaccines. vaxign is another expert system developed to help facilitate vaccine design [152]. vaxign undertakes dynamic vaccine target prediction from sequence. methodologically, it combines protein subcellular location prediction with prediction of transmembrane helices and adhesins, analysis of the conservation to human and/or mouse proteins with sequence exclusion from the genomes of nonpathogenic strains, and prediction of peptide binding to class i and class ii mhc. as a test, vaxign has been used to predict vaccine candidates against uropathogenic escherichia coli. however, nerve and its various and varied siblings are tasked with such a confounding and difficult undertaking that they are obliged to fall somewhat short of what is required. an obvious first step in tackling the greater problem is to address subcellular location prediction. then, we can look at antigen presentation, modelling for each component step, before building these into a fully functional model. we can also develop empirical approaches, such as vaxijen [115, 116, 130]. we must also factor in antibody-mediated issues, properly address pamps, post-translational danger signals, expression levels, the role of aggregation, and the capacity of molecular adjuvants to enhance the innate immunogenicity to usable levels. see fig. 3.2. the value of vaccines is not yet unchallenged. however, most reasonable people would, in all probability, agree that they are a good thing, albeit with a few minor provisos. the idea underlying all vaccines is a strong and robust one: it is in the reification (that is, the realisation, manifestation, and instantiation) of this abstract concept that the trouble lies, if indeed trouble there is. existing vaccines are by no means perfect; again, most sensible and well-informed people would no doubt acknowledge this also. 
one might argue that their intrinsic complexity, the highly empirical nature of their discovery over decades, and the fraught nature of their manufacture have much to answer for in this regard. why should this be? in part, it is due to the extreme complexity of the immune response to an administered vaccine, which is largely specific to each individual or at least is different in different sub-groups within the totality of the vaccinated population. the immune response comprises, at least for whole-pathogen vaccines, the adaptive immune response to multiple b cell and t cell epitopes as well as the responses made by the innate immune system to diverse molecular structures, principally pamps. one must also consider the degree to which such a repertoire of responses is augmented and modified by the action of additives, be they designed to increase the durability and stability of vaccines or be they adjuvants, which are intended to raise the level of immune reactions. add in stochastic and coincidental phenomena, such as reversion to pathogenicity, and we can see immediately that navigating our way through the vaccine minefield is no easy task. all such problems engendered by this intrinsic complexity are themselves compounded by our comparatively weak understanding of immunological mechanisms, since, if we understood the mechanism of responses well enough, we could and would have designed our vaccines to circumvent these issues. part of the answer to this cacophony of conflicting and confounding quandaries is the newly emergent discipline of vaccinomics. a proper understanding of the relationships between gene variants and vaccine-specific immune responses may help us to design the next generation of personalised vaccines. vaccinomics addresses this issue directly. it seeks to identify genetic factors mediating or moderating vaccine-induced immune responses, which are known to be extremely variable within populations. 
much data indicate that host genetic polymorphisms are key determinants of innate and adaptive responses to vaccination. hla genes, non-hla genes, and genes of the innate immunity all contribute, and do so in many ways, to the variation observed between individuals for immune responses to microbial vaccines. vaccinomics offers many techniques that can help illuminate these diverse phenomena. principal amongst these are population-based gene/snp association studies between allele or snp variation and specific responses, supplemented by the application of next-generation sequencing technology and microarray approaches. yet, and for all this nay-saying and gainsaying, vaccines and vaccination have demonstrated their worth time after time; yet, to justify the continuing faith we invest in them, new and better ways of making safer and more focussed vaccines must be found. most current vaccines work via antibody-mediated mechanisms; and most target viruses and the diseases they cause. unfortunately, the stock of such disease targets is dwindling. low-hanging fruit has long since been cut down. only fruit that is well out of reach remains. vaccines based on apcs and peptides are new but unproven strategies; most modern vaccine development relies instead on effective searches for vaccine antigens. one of the clearest points to emerge from such work is that there are many competing concepts, thoughts, and ideas that may confound or help efficient identification of immune reactive proteins. certain such ideas we have outlined. some are indisputably persuasive, even compelling, yet many strategies, and the technical approaches upon which they are based, have singly failed to deliver on their promise. long ago, and based on his lifetime's experience of all things immunological, professor peter cl beverley sketched out a paradigm for protein-focussed vaccine development, which we have formalised further, and which schema is summarised in fig. 3.4. 
some of his factors overlap with the factors from fig. 3.2. he identified many of the factors that potentially contribute to the immunogenicity of proteins, be they of pathogen origin or another source entirely, and also other features which might make proteins particularly suitable for becoming candidate vaccines. of these, some are as yet beyond prediction, such as the attractiveness for apcs or the inability to down-regulate immune responses. the status of proteins as evasins is currently addressable, if at all, only through sequence similarity-based approaches; the same holds for the attractiveness for uptake by apcs, though possibly there exist motifs, structural or sequence, which could be identified. currently, the dearth of relevant data precludes prediction of such properties; and, while it is possible to predict some of these properties with some assurance of success, and others are predictable but only incidentally, overall, we are still some way from realising the dream embodied in fig. 3. failure occurs for simple reasons: we deal with simplified abstractions and cannot hope to capture all that is required for prediction by looking superficially at a single factor. protein immunogenicity comes instead from the dynamic combination of innumerable contributing factors. this is by no means a facile or easily solved informatics conundrum. a vaccine candidate should have epitopes that the host recognises, be available for immune surveillance, and be highly expressed. factors mediating protein immunogenicity are many: possession of b or t cell epitopes, post-translational danger signals, sub-cellular location, protein expression levels, and aggregation state amongst them. predicting such diverse, complex, confounding properties is, and remains, a challenge. 
vaccine antigens, once discovered, should, ultimately, and with appropriate manipulation, together with an apt, apposite, and appropriate delivery system and the right choice of adjuvant, become first a candidate for clinical trials, before, hopefully, progressing to regulatory approval. we require an integrative, systems biology approach to solve this problem. no single approach can be applied universally and with success; what we crave is the full integration of numerous equally
wakefield's article linking mmr vaccine and autism was fraudulent computer-aided biotechnology: from immuno-informatics to reverse vaccinology harnessing bioinformatics to discover new vaccines new vaccines against tuberculosis bioinformatics for vaccinology lessons learned concerning vaccine safety vaccines: the real issues in vaccine safety target the fence-sitters an american tragedy'. the cutter incident and its implications for the salk polio vaccine in new zealand 1955-1960 the cutter incident, 50 years later poliomyelitis following formaldehyde-inactivated poliovirus vaccination in the united states during the spring of 1955. ii. 
relationship of poliomyelitis to cutter vaccine bioinformatics for vaccinology vaccine-derived poliovirus (vdpv): impact on poliomyelitis eradication advances in predicting and manipulating the immunogenicity of biotherapeutics and vaccines the use of genomics in microbial vaccine development post-genomic vaccine development microbial genomes and vaccine design: refinements to the classical reverse vaccinology approach biotechnology and vaccines: application of functional genomics to neisseria meningitidis and other bacterial pathogens complete genome sequence of neisseria meningitidis serogroup b strain mc58 identification of vaccine candidates against serogroup b meningococcus by whole-genome sequencing a universal vaccine for serogroup b meningococcus identification of vaccine candidate antigens from a genomic analysis of porphyromonas gingivalis use of a whole genome approach to identify vaccine molecules affording protection against streptococcus pneumoniae infection identification of a universal group b streptococcus vaccine by multiple genome screen functional selection of vaccine candidate peptides from staphylococcus aureus whole-genome expression libraries in vitro discovery of a novel class of highly conserved vaccine antigens using genomic scale antigenic fingerprinting of pneumococcus with human antibodies immunodominant francisella tularensis antigens identified using proteome microarray a burkholderia pseudomallei protein microarray reveals serodiagnostic and cross-reactive antigens antibody-protein interactions: benchmark datasets and prediction tools evaluation benchmarking b cell epitope prediction: underperformance of existing methods prediction of mhc-peptide binding: a systematic and comprehensive overview in silico tools for predicting peptides binding to hlaclass ii molecules: more confusion than conclusion on evaluating mhc-ii binding peptide prediction methods evaluation of mhc-ii peptide binding prediction servers: applications for vaccine 
research a critical cross-validation of high throughput structural binding prediction methods for pmhc limitations of ab initio predictions of peptide binding to mhc class ii molecules t cell receptor recognition of a 'super-bulged' major histocompatibility complex class i-bound peptide high resolution structures of highly bulged viral epitopes bound to major histocompatibility complex class i. implications for t-cell receptor engagement and t-cell immunodominance have we cut ourselves too short in mapping ctl epitopes? a long, naturally presented immunodominant epitope from ny-eso-1 tumor antigen: implications for cancer vaccine design identification and characterization of pathogenicity and other genomic islands using base composition analyses a novel strategy for the identification of genomic islands by comparative analysis of the contents and contexts of trna sites in closely related bacteria mobilomefinder: web-based tools for in silico and experimental discovery of bacterial genomic islands cpgcluster: a distance-based algorithm for cpg-island detection cpgif: an algorithm for the identification of cpg islands identifying cpg islands by different computational techniques cpg_mi: a novel approach for identifying functional cpg islands in mammalian genomes evaluation of genomic island predictors using a comparative genomics approach islandpath: aiding detection of genomic islands in prokaryotes score-based prediction of genomic islands in prokaryotic genomes using hidden markov models a computational approach for identifying pathogenicity islands in prokaryotic genomes resolving the structural features of genomic islands: a machine learning approach detection of genomic islands via segmental genome heterogeneity prediction of pathogenicity islands in enterohemorrhagic escherichia coli o157:h7 using genomic barcodes islandviewer: an integrated interface for computational identification and visualization of genomic islands towards pathogenomics: a web-based 
resource for pathogenicity islands identification and characterization of a novel family of pneumococcal proteins that are protective against sepsis functional genomics of pathogenic bacteria syfpeithi: database for searching and tcell epitope prediction syfpeithi: database for mhc ligands and peptide motifs hiv sequence databases mhcbn 4.0: a database of mhc/tap binding peptides and t-cell epitopes mhcbn: a comprehensive database of mhc binding and non-binding peptides epimhc: a curated database of mhcbinding peptides for customized computational vaccinology antijen: a quantitative immunology database integrating functional, thermodynamic, kinetic, biophysical, and cellular data jenpep: a novel computational information resource for immunobiology and vaccinology jenpep: a database of quantitative functional peptide data for immunology the immune epitope database 2.0 antigendb: an immunoinformatics database of pathogen antigens violin: vaccine investigation and online information network epitopic peptides with low similarity to the host proteome: towards biological therapies without side effects peptimmunology: immunogenic peptides and sequence redundancy primer: mechanisms of immunologic tolerance recent advances in immune modulation cutting edge: contributions of apoptosis and anergy to systemic t cell tolerance discriminating antigen and non-antigen using proteome dissimilarity iii: tumour and parasite antigens discriminating antigen and non-antigen using proteome dissimilarity ii: viral and fungal antigens discriminating antigen and non-antigen using proteome dissimilarity: bacterial antigens gapped blast and psi-blast: a new generation of protein database search programs single proteins might have dual but related functions in intracellular and extracellular microenvironments locating proteins in the cell using targetp, signalp and related tools improved prediction of signal peptides: signalp 3.0 a comprehensive assessment of n-terminal signal peptides 
prediction methods wolf psort: protein localization predictor secreted protein prediction system combining cj-sphmm, tmhmm, and psort psort-b: improving protein subcellular localization prediction for gram-negative bacteria psort: a program for detecting sorting signals in proteins and predicting their subcellular localization predicting protein subcellular locations using hierarchical ensemble of bayesian classifiers based on markov chains subloc: a server/client suite for protein subcellular location based on soap gpos-ploc: an ensemble classifier for predicting subcellular localization of gram-positive bacterial proteins advantages of combined transmembrane topology and signal peptide prediction-the phobius web server prediction of lipoprotein signal peptides in gram-negative bacteria prediction of twin-arginine signal peptides validating subcellular localization prediction tools with mycobacterial proteins toward bacterial protein sub-cellular location prediction: single-class discrimminant models for all gram-and gram+ compartments multi-class subcellular location prediction for bacterial proteins alpha helical trans-membrane proteins: enhanced prediction using a bayesian approach beta barrel trans-membrane proteins: enhanced prediction using a bayesian approach a predictor of membrane class: discriminating alpha-helical and beta-barrel membrane proteins from non-membranous proteins tatpred: a bayesian method for the identification of twin arginine translocation pathway signal sequences lippred: a web server for accurate prediction of lipoprotein signal sequences and cleavage sites combining algorithms to predict bacterial protein sub-cellular location: parallel versus concurrent implementations predicting the subcellular localization of viral proteins within a mammalian host cell virus-ploc: a fusion classifier for predicting the subcellular localization of viral proteins within host and virus-infected cells structure and sequence relationships in the 
lipocalins and related proteins structural relationship of streptavidin to the calycin protein superfamily analysis of known bacterial protein vaccine antigens reveals biased physical properties and amino acid composition adaptation of protein surfaces to subcellular location hierarchical classification of g-protein-coupled receptors with data-driven selection of attributes and classifiers gpcrtree: online hierarchical classification of gpcr function optimizing amino acid groupings for gpcr classification on the hierarchical classification of g protein-coupled receptors proteomic applications of automated gpcr classification vaxijen: a server for prediction of protective antigens, tumour antigens and subunit vaccines identifying candidate subunit vaccines using an alignment-independent method based on principal amino acid properties dna and peptide sequences and chemical processes multivariately modeled by principal component analysis and partial least-squares projections to latent structures principal property-values for 6 nonnatural amino-acids and their application to a structure activity relationship for oxytocin peptide analogs peptide binding to the hla-drb1 supertype: a proteochemometrics analysis proteochemometrics mapping of the interaction space for retroviral proteases and their substrates proteochemometrics analysis of substrate interactions with dengue virus ns3 proteases generalized modeling of enzyme-ligand interactions using proteochemometrics and local protein substructures rough set-based proteochemometrics modeling of g-protein-coupled receptor-ligand interactions improved approach for proteochemometrics modeling: application to organic compound-amine g protein-coupled receptor interactions melanocortin receptors: ligands and proteochemometrics modeling proteochemometrics modeling of the interaction of amine g-protein coupled receptors with a diverse set of ligands peptide quantitative structureactivity-relationships, a multivariate approach 
multivariate parametrization of 55 coded and non-coded amino-acids new chemical descriptors relevant for the design of biologically active peptides. a multivariate characterization of 87 amino acids bioinformatic approach for identifying parasite and fungal candidate subunit vaccines jafa: a protein function annotation meta-server metamqap: a meta-server for the quality assessment of protein models a consensus strategy for combining hla-dr binding algorithms prediction of hla-a2-restricted ctl epitope specific to hcc by syfpeithi combined with polynomial method propred analysis and experimental evaluation of promiscuous t-cell epitopes of three major secreted antigens of mycobacterium tuberculosis propred: prediction of hla-dr binding sites predicting class ii mhc/peptide multi-level binding with an iterative stepwise discriminant analysis meta-algorithm class ii mhc quantitative binding motifs derived from a large molecular database with a versatile iterative stepwise discriminant analysis meta-algorithm iterative stepwise discriminant analysis: a meta-algorithm for detecting quantitative sequence motifs neural models for predicting viral vaccine targets building a meta-predictor for mhc class ii-binding peptides a probabilistic meta-predictor for the mhc class ii binding peptides a meta-predictor for mhc class ii binding peptides based on naive bayesian approach strength in numbers: achieving greater accuracy in mhc-i binding prediction by combining the results from multiple prediction tools a systematic assessment of mhc class ii peptide binding predictions and evaluation of a consensus approach combination of fingerprint-based similarity coefficients using data fusion connectionist-based dempster-shafer evidential reasoning for data fusion from protein microarrays to diagnostic antigen discovery: a study of the pathogen francisella tularensis epijen: a server for multistep t cell epitope prediction nerve: new enhanced reverse vaccinology environment dynavacs: 
an integrative tool for optimized dna vaccine design vaxign: the first web-based vaccine design program for reverse vaccinology and applications for vaccine development enzymes, metabolites and fluxes
key: cord-011565-8ncgldaq authors: elworth, r a leo; wang, qi; kota, pavan k; barberan, c j; coleman, benjamin; balaji, advait; gupta, gaurav; baraniuk, richard g; shrivastava, anshumali; treangen, todd j title: to petabytes and beyond: recent advances in probabilistic and signal processing algorithms and their application to metagenomics date: 2020-06-04 journal: nucleic acids res doi: 10.1093/nar/gkaa265 sha: doc_id: 11565 cord_uid: 8ncgldaq as computational biologists continue to be inundated by ever increasing amounts of metagenomic data, the need for data analysis approaches that keep up with the pace of sequence archives has remained a challenge. in recent years, the accelerated pace of genomic data availability has been accompanied by the application of a wide array of highly efficient approaches from other fields to the field of metagenomics. for instance, sketching algorithms such as minhash have seen a rapid and widespread adoption. these techniques handle increasingly large datasets with minimal sacrifices in quality for tasks such as sequence similarity calculations. here, we briefly review the fundamentals of the most impactful probabilistic and signal processing algorithms. we also highlight more recent advances to augment previous reviews in these areas that have taken a broader approach. we then explore the application of these techniques to metagenomics, discuss their pros and cons, and speculate on their future directions. thanks to advances in sequencing technology, the amount of next-generation sequencing data for genomics has increased at an exponential pace over the last decade. while this explosion of data has yielded unprecedented opportunities to answer previously unanswered questions in biology, it also creates new challenges. 
for instance, a key challenge is in designing new algorithms and data structures that are capable of handling analyses on such large and numerous datasets (table 1). one approach for solving this big data problem is the development and adoption of probabilistic algorithms and data structures. when applying probabilistic methods to genomic analyses, input sequences are frequently decomposed into sets of overlapping subsequences with length k, referred to as k-mers. this large set of k-mers is then compressed into matrices using techniques from compressed sensing and sketching. genomic analyses such as clustering and taxonomic classification can be performed directly on the compact matrices (figure 1). in this paper, we review the great strides that have already been made in these areas and look forward to future possibilities. many novel probabilistic and signal processing approaches for handling these massive amounts of genetic data have been previously reviewed (1) (2) (3) (4) (5). for instance, in (1) a comprehensive review was performed covering probabilistic algorithms and data structures such as minhash (6) and locality sensitive hashing (lsh) (7), count-min sketch (cms) (8), hyperloglog (9) and bloom filters (10). this review includes extensive details of how these data structures work, supporting theory behind each of them, as well as a brief discussion of their applications. however, the genomics applications for each approach were not thoroughly covered. other more biologically motivated reviews include a review of compressive algorithms in (2) and sketching approaches in (3).
table 1 . dataset (reference): value. (125): <0.001; metahit (126): 0.001; tara oceans (127): 0.007; terragenome (128): 0.010; jgi img (129): 0.017; human microbiome project (130): 0.043; the european nucleotide archive (ena) (131): 0.379; ncbi sequence read archive (132): 33.848.
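the k-mer decomposition described above can be illustrated with a minimal python sketch. the function name `kmers` and the example read are hypothetical and chosen only for illustration; real pipelines typically also handle reverse complements and ambiguous bases, which this sketch does not.

```python
def kmers(sequence, k):
    """Return the set of overlapping length-k substrings (k-mers) of a sequence."""
    if k <= 0:
        raise ValueError("k must be positive")
    sequence = sequence.upper()
    # Slide a window of width k one base at a time; a sequence shorter
    # than k yields an empty set because the range is empty.
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


if __name__ == "__main__":
    read = "ATGCGATG"  # hypothetical read, for illustration only
    print(sorted(kmers(read, 3)))
```

because the result is a set, repeated k-mers (here `ATG`, which occurs twice in the read) are stored once; methods that need abundances would keep counts instead.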
in (2), techniques are covered such as the burrows-wheeler transform (bwt) (11), the fm-index (12), and other techniques based around exploiting redundancy in large datasets. a more in-depth discussion of many of these topics can also be found in (3). (4) includes a thorough review of compressed string indexes, lsh via sketches, cms, bloom filters, and minimizers (13), with accompanying applications in genomics for each. while many techniques focus on efficient ways to represent a dataset, the compressed sensing (cs) technique from signal processing exploits the sparsity of signals for their efficient acquisition and interpretation. cs's measurement efficiency often translates to significant reductions in cost and time. cs has previously found biomedical applications in microscopy (14) and rapid mri acquisition (15). in this review, we summarize the essentials of cs, relate the technique to the other probabilistic data structures and algorithms, discuss relevant recent advances, and highlight corresponding applications in metagenomics. we direct interested readers to (16) for further discussion of the core concepts of cs and to the seminal works of (17) and (18) for more thorough analyses. most recently, a comprehensive review of sketching algorithms in genomics was performed in (5). this review covers approaches like minhash, bloom filters, cms, hyperloglog, the biological applications and implementations of each, and even includes a set of live, interactive notebooks with code examples of each approach. given the wealth of previously performed reviews on these topics, we refer readers to the works above for more in-depth explanations of these approaches along with their applications, implementations, and theory. instead, we include only a brief review of these fundamental methodologies, followed by more recent advances in these areas, and finally their applications to metagenomics. 
previous studies have often neglected more novel applications in metagenomic data given the new challenges it poses. metagenome sequencing and analysis not only complicates established fundamental problems in comparative genomics but also adds entirely new problems. therefore, we focus on how the aforementioned techniques can overcome unique hurdles in metagenomics. recently, more attention has been given to the study of probabilistic algorithms (19) as a means to circumvent the widening gap between the explosion of data and our computing capabilities. algorithms based on hashing and sketching (20) (21) (22) (23) (24) (25) have been extensively used in the theoretical computer science and database literature for reducing the computations associated with processing massive webscale datasets (26) (27) (28) (29) (30) . hashing algorithms are typically associated with a random hash function that takes the input (usually the data vector) and outputs a discrete value. usually, this output serves as a (small memory) fingerprint which, being discrete, can be used for 'smart' indexing. these indices are most notably used for sub-linear time near-neighbor searches (31, 32) . sketching algorithms work by creating a dynamic probabilistic data structure popularly known as a sketch (33) . the sketch is a small memory summary of a given set of items, which typically requires logarithmic memory for summarizing them (34) . these sketches can support dynamic updates (35) and the dynamic query operation which returns an approximate estimate for a quantity of interest. to begin, we perform a concise overview of core probabilistic data structures and algorithms ( figure 2 ). we then include a review of a wide array of more recent variations, extensions, and recent advancements of these fundamental methodologies. finally, we include a more in depth discussion on promising applications to genomic and metagenomic data. 
(1) locality sensitive hashing (lsh) was first introduced to solve the nearest neighbor search (nns) problem in high dimensions (7). lsh functions are a subset of hash functions that seek to hash similar input values to the same hash values. essentially, for an lsh function f, if two input items x_1 and x_2 are very similar to each other, then applying the lsh function to both should cause them to collide (f(x_1) = f(x_2)) with high probability. the main idea behind efficient retrieval is to use f to structure the data as an efficient dictionary or hash table by indexing data point x_i with key f(x_i). given any query q, f(q) naturally becomes a favorable key for lookup. this is because any x_j with the same key will have f(q) = f(x_j), and hence, is likely to have high similarity with query q. (2) minhash is arguably one of the most popular lsh functions for genomic and metagenomic data. minhash takes a set as input and outputs a set of integer hash values. specifically, minhash applies p different hash functions to each element in a set and returns the minimal hash values from each of the p hash functions as the sketch of the set. the probability that two sets have the same minimal hash values is equal to the percentage of common elements in the union of both sets. as a consequence, we can quickly approximate the similarity between two sets by simply computing the ratio of the number of minhash collisions between the sets and the total number of minhashes. with minhash we can compute a small approximate summary of each set, referred to as a sketch, and then calculate the similarity of any two sets as the distance between their sketches. sequencing data are often conveniently represented as sets of tokens (or k-mers). as a result, minhash is frequently used to quickly compare the similarity between two large sequencing datasets by applying the p hash functions to their k-mers.
figure 1. overview of applying probabilistic data structures and compressed sensing in metagenomic sequence analysis. given a set of sequences, each sequence is usually first decomposed into a series of consecutive k-mers. then the probabilistic algorithm compresses the k-mers into sketches. the sketches can be analyzed to evaluate characteristics of the input sequences, such as sequence similarity. in compressed sensing (cs), the aggregate k-mer frequencies for the whole sample are treated as measurements. elements of a database (e.g. microbial genomes) have individual k-mer frequency distributions that are stored in columns of a matrix. cs finds the elements of the database that comprise the sample measurements.
(3) minimizers are another widely used technique within the family of lsh-algorithms to reduce the total number of k-mers for sequence comparison applications. a minimizer is a representative sequence of a group of adjacent k-mers in a string and can help memory efficiency by storing a single minimizer in lieu of a large number of highly similar k-mers. minimizers will sample the sequence by choosing the smallest (lexicographically, for instance) k-mer within a sliding window. in figure 2, the minimizer portion demonstrates the sliding window that moves across the sequence, creating the set of minimizer k-mers for the sequence by taking the smallest k-mers within the window as it slides. the choice of the window length w and k-mer size k of the minimizers are parameters that can be adjusted for the application. several techniques employ hashing to compress the representation of a dataset. from these new representations, information can be rapidly queried. (1) bloom filter (bf) is a data structure that compresses a set while still being able to query if an element exists in the set. the sketch for a bf is a bit array of w bits. the bits are given an initial value of 0.
to record an element into the sketch, p different hash functions are used to map the input element to p different positions in the array. after evaluating the hash functions, the bf sets the bits to 1 at all mapped positions. to search for an element, the query element is hashed by the same p hash functions. then, every bit that the hash values map to in the bf is checked. if any of the mapped bits is not equal to 1, the input element is definitely not in the set. if all the mapped bits are 1, the element is likely in the set. however, this result can also be caused by random hash collisions while inserting other elements. thus, the bf can have false positives. ultimately, bfs can quickly evaluate the presence of a given element using very little memory. (2) hyperloglog is designed to estimate the number of distinct elements in a set using minimal memory. the essence of hyperloglog is to keep track of the maximum number of leading zeros observed in the binary representations of the elements in the set. if the maximum number of leading zeros observed is n, a crude estimate for the number of distinct elements in the set is 2^n. this style of cardinality estimation only works for data distributed uniformly at random, so each element passes through a hash function before being evaluated and incorporated into an extremely compact sketch for the set. the process of cardinality estimation based on leading zeros can have a high variance, so the hyperloglog sketch distributes the hashed elements into multiple counters, whose harmonic mean yields a final cardinality estimation (after correcting for using multiple counters and hash collisions). even with multiple counters, the memory required is still logarithmic in the total number of distinct elements. on the other hand, calculating the exact cardinality requires an amount of memory proportional to the cardinality, which is impractical for very large data sets.
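as an illustration of the bloom filter insert and query steps described above, the following python sketch records k-mers in a w-bit array using p hash functions. it is a minimal sketch under stated assumptions, not a reference implementation: the class name `BloomFilter`, the `kmers` helper, and the use of salted blake2b digests from python's hashlib as stand-ins for p independent hash functions are choices made only for this example.

```python
import hashlib


def kmers(seq, k):
    """Decompose a sequence into its overlapping k-mers (illustrative helper)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


class BloomFilter:
    """Bit array of w bits probed by p hash functions (w and p as in the text)."""

    def __init__(self, w, p):
        self.w = w
        self.p = p
        self.bits = bytearray(w)  # one byte per bit, chosen for clarity

    def _positions(self, item):
        # Assumption: p hash functions are simulated by salting one base hash.
        for i in range(self.p):
            salt = i.to_bytes(8, "little")
            digest = hashlib.blake2b(item.encode(), digest_size=8, salt=salt).digest()
            yield int.from_bytes(digest, "little") % self.w

    def add(self, item):
        # Set every mapped bit to 1.
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        # Any 0 bit means "definitely absent"; all 1s means "probably present".
        return all(self.bits[pos] for pos in self._positions(item))


bf = BloomFilter(w=1024, p=3)
for kmer in kmers("ACGTACGGTCA", 4):
    bf.add(kmer)
print("ACGT" in bf)  # an inserted k-mer is always reported as present
```

a query for a k-mer that was never inserted usually returns false, but may return true when all of its p positions were set by other k-mers, which is the false positive case discussed in the text.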
alternatively, condensed representations may summarize the structure of the dataset by analyzing the frequency of components of the set. new datapoints that are assumed to exhibit the same structure can be efficiently acquired.
figure 2. count-min sketch: three pairwise independent hash functions are applied to each k-mer. each hash function is responsible for a row in the sketch and maps the hash values to the bins in its row. to encode an element into the sketch, the count-min sketch increases the numeric value in the mapped bins. to return the number of occurrences of a given k-mer, it hashes the k-mer using the same hash functions and returns the smallest value. bloom filter: it initiates all the values in the array as 0. to record the presence of a k-mer in the dataset, it maps the k-mer to the bits in the bloom filter using three pairwise independent hash functions, and then it changes the mapped bits from 0 to 1. minimizer: given a sequence, it can be compressed into a list of minimizers. to do that, a window slides across the sequence. in each window, the sequence inside the window is decomposed into k-mers. a minimizer is selected among the list of k-mers for the window at each position. hyperloglog: each k-mer is represented by a hash value with length 9. the first three bits of a hash value are used to locate a register and the last 6 bits are saved in the corresponding register. the maximum number of leading zeros among all the values that are stored in the register is used to estimate the cardinality of each register.
(1) compressed sensing is a signal processing technique that enables the acquisition of high-dimensional signals from low-dimensional measurements by leveraging the sparsity of many natural signals (16) (17) (18). sparse signals have only a few nonzero elements. in metagenomics, a signal of interest may be the relative abundance of microbes in a sample. these signals are sparse because only a small fraction of all known species are present (i.e.
have nonzero abundance) in any given sample. figure 3 illustrates the process of cs in this context. the cs problem can be represented concisely with linear algebra: y = Φx, where an m × n sensing matrix Φ captures an n-dimensional signal x with m linear measurements that are stored in y. sparse recovery algorithms find the sparsest x that obeys y = Φx either through a convex relaxation (e.g. a lasso regression (16)) or a greedy algorithm (e.g. matching pursuit (36) (37) (38) (39)). theory shows that cs can make very efficient use of linear measurements; m scales logarithmically with n (17,18). (2) count-min sketch (cms) is a specialized cs algorithm where the projection matrix Φ is a structured (0-1) random matrix derived from cheap universal hash functions. due to this carefully designed matrix, it is possible to compute the projection y = Φx as well as perform recovery of x from y without materializing the matrix in memory and instead only use a few universal hash functions, each of which needs only two integers. as a result, we get a provably logarithmic memory algorithm for compressing x and recovering its heavy elements. the cms is popular for estimating the frequencies of different elements in a data set or stream. the cms algorithm is remarkably simple and has a striking similarity with the bloom filter. the cms is a matrix with w columns and d rows. it can be thought of as a collection of d bloom filters, one for each row, each using a single hash function. the only difference is that we use counters in cms instead of bits in bloom filters. given an input data element x to the cms, it is hashed by d independent hash functions. the i-th hash function generates a hash value hash_i(x) within range w and increments the counter stored at column hash_i(x) of row i. querying the count of an element consists of simply taking the minimum of the counters that the element hashes to in the cms.
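the cms update and query steps described above can be sketched in a few lines of python. this is a minimal sketch rather than a reference implementation: the class name `CountMinSketch` and the use of salted blake2b digests from python's hashlib to stand in for the d independent hash functions are assumptions made for this example.

```python
import hashlib


class CountMinSketch:
    """Table of d rows and w columns of counters (d and w as in the text)."""

    def __init__(self, w, d):
        self.w = w
        self.d = d
        self.table = [[0] * w for _ in range(d)]

    def _column(self, item, row):
        # Assumption: one salted hash per row simulates d independent hash functions.
        salt = row.to_bytes(8, "little")
        digest = hashlib.blake2b(item.encode(), digest_size=8, salt=salt).digest()
        return int.from_bytes(digest, "little") % self.w

    def add(self, item, count=1):
        # Increment one counter in every row.
        for row in range(self.d):
            self.table[row][self._column(item, row)] += count

    def query(self, item):
        # Collisions can only inflate counters, so the minimum is the best estimate.
        return min(self.table[row][self._column(item, row)] for row in range(self.d))


cms = CountMinSketch(w=256, d=4)
for kmer in ["ACGT", "ACGT", "CGTA", "ACGT"]:
    cms.add(kmer)
print(cms.query("ACGT"))  # never smaller than the true count of 3
```

because every counter only grows, the returned value can overestimate but never underestimate the true frequency; larger w reduces the chance of collisions and larger d reduces the chance that all rows are inflated at once.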
a tremendous amount of study and followup work has been performed by the scientific community to improve the fundamental probabilistic data structures and algorithms. here, we give a brief overview of relevant variations, extensions, and recent advancements to the methodologies described above. there has been a significant advancement in improving the computing cost of minhash, which became a central tool in bioinformatics after the introduction of mash (40) and other toolkits that then followed (41, 42) . minhash requires p hash functions, and p passes over the data to compute p signatures. recently, using a novel idea of densification (43) (44) (45) , densified-minhash was developed. densified-minhash only requires one hash function and one pass over the set to generate all the p signatures of the data with identical statistical properties as p independent minhash, for any given p. several improvements have been made for efficiently computing weighted minhash as well (46) , where the elements of sets are allowed to have importance weight. these recent advances have made it possible to convert data into minhashes in the same cost as data reading, which, otherwise, was the main bottleneck step. genomic applications also use many lsh functions beyond minhash. simhash (47) was invented by google to find near-duplicates over large string inputs using cosine similarity. it was shown in (48) that for sequence and string datasets minhash is provably and empirically superior to simhash, even for cosine similarity. b-bit minwise hashing is a variation of minhash that saves only the lowest b bits of each hashed value (49) . it requires less memory to store each hash code and can be used to accurately estimate the similarities among high-dimensional binary data. sectional minhash (s-minhash) (50) includes information about the location of k-mers or tokens in a string to improve duplicate detection performance. 
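for reference, the basic minhash estimator that these variants build on can be sketched as follows. the code is a minimal sketch of the classic scheme with p hash functions (not densified, weighted, b-bit or sectional minhash); the function names and the use of salted blake2b digests from python's hashlib as the p hash functions are assumptions made for this example.

```python
import hashlib


def kmer_set(seq, k):
    """Set of overlapping k-mers of a sequence (illustrative helper)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def minhash_sketch(items, p):
    """Return the minimum hash value under each of p salted hash functions."""
    sketch = []
    for i in range(p):
        salt = i.to_bytes(8, "little")
        sketch.append(min(
            int.from_bytes(
                hashlib.blake2b(x.encode(), digest_size=8, salt=salt).digest(),
                "little")
            for x in items))
    return sketch


def estimate_jaccard(sketch_a, sketch_b):
    """Fraction of positions where the two sketches collide."""
    matches = sum(a == b for a, b in zip(sketch_a, sketch_b))
    return matches / len(sketch_a)


def exact_jaccard(set_a, set_b):
    return len(set_a & set_b) / len(set_a | set_b)


a = kmer_set("ACGTACGGTCATTGACCGTA", 4)
b = kmer_set("ACGTACGGTCATTGACGGTA", 4)
print(estimate_jaccard(minhash_sketch(a, 64), minhash_sketch(b, 64)),
      exact_jaccard(a, b))
```

this version makes p passes over the set, which is the cost that densification removes; the estimate approaches the exact jaccard index as p grows.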
universal (or random) hash functions seek to quickly and uniformly map inputs to hash codes. universal hash functions are important building blocks for the cms, bloom filter, hash table, and other fundamental data structures. murmurhash (https://sites.google.com/site/murmurhash, accessed march 2020) is a very well-known universal hash that has been widely used in many bioinformatic software packages, including mash (40). although previous murmurhash versions were vulnerable to hash collision, murmurhash3 (https://github.com/aappleby/smhasher/wiki/murmurhash3, accessed march 2020) is a good general-purpose function that is particularly well-suited to large binary inputs. however, there are other options such as xxhash (https://github.com/cyan4973/xxhash, accessed march 2020), which can be faster than murmurhash, and cityhash (https://opensource.googleblog.com/2011/04/introducing-cityhash.html, accessed march 2020). cityhash is relevant to genomics because it is optimized for strings. it outperforms murmurhash for short string inputs but is appropriate for any length input. farmhash is the successor to cityhash and also focuses on improved string hashing performance (https://opensource.googleblog.com/2014/03/introducing-farmhash.html, accessed march 2020). nthash (51) is a specialized dna hashing function. it recursively calculates the hash values for the consecutive k-mers in a given sequence. while nthash can be faster than xxhash, cityhash and murmurhash, it is only appropriate for sequence data. minimal perfect hash functions (mphf) and perfect hash functions (phf) map inputs to a set of hash codes without any collisions. a phf maps n inputs, or keys, to a set of more than n hash codes, some of which are unused. an mphf maps n inputs to n codes. although mphfs have been used to improve many bioinformatics applications, such as the quasi-dictionary (52), the mphf construction process is often resource-intensive.
critically, all of the inputs must be known in advance to construct an mphf, and many construction methods based on hypergraph peeling fail to scale. bbhash is an mphf construction method that was introduced to scale to massive key sets (53) . bbhash is constructed by a simple procedure that maps each key to a fixed-size bit array using a universal hash. if two keys collide in the bit array, the corresponding location is set to 1. otherwise, the bit remains 0. this recursive process is repeated with all of the colliding keys until there are no more collisions. due to the simplicity of the algorithm, bbhash construction is much faster at the scale typically encountered in genomics. mphfs are usually used to implement fast, read-only hash tables with constant-time lookups. however, clever open addressing schemes can also be used to achieve similar query performance without knowing the key set in advance. rather than avoid hash collisions, open addressing attempts to rearrange elements in the hash table for optimal performance. for instance, hopscotch hashing (54) ensures that a key pair is always found within a small neighborhood of its hash code. since only a small collection of consecutive buckets need to be searched when a query is issued, hopscotch hashing has very strong query-time performance. robin hood hashing (55) is another open addressing method. the key feature of this algorithm is that it minimizes the distance between the hash code location and the actual key-value pair, reducing worst-case query time. cuckoo hashing (56) uses two hash functions and guarantees that the element will always be found at one of the two hash indices. some fundamental advances in lsh have also been seen with minimizers. traditionally, minimizer selection is executed according to lexicographic order. however, this procedure may cause 'over-selection' where more k-mers than necessary become minimizers. 
instead, researchers recently proposed to select minimizers from a set of k-mers based on a universal hitting set or a randomized ordering (57). if minimizers are picked from the universal hitting sets, which are the minimum sets of k-mers that cover every possible sequence of length l (58), the expected number of minimizers in a given sequence would decrease. there is also recent progress in techniques to rapidly characterize datasets. hyperloglog has risen to prominence recently thanks to its ability to efficiently count distinct elements in large data sets and databases. many new algorithms have since been developed based on hyperloglog to adapt to different scenarios. for instance, hyperloglog++ (59) was introduced to reduce the memory usage and increase the estimation accuracy for an important cardinality range. sliding hyperloglog (60) adds a sliding window to the original algorithm for more flexible queries, but it requires more memory storage. bloom filters are attractive because they can substantially compress a dataset, but this approach can return false positive answers. cascading bloom filters (61,62) improve the accuracy of the standard bloom filter. a cascading bloom filter recursively creates child bloom filters to store the false positives from a parent bloom filter. this reduces the false positive rate (fpr) of the overall system at a small memory cost. an alternative fpr reduction strategy is the k-mer bloom filter (kbf) (63). each k-mer in a sequence overlaps with its adjacent k-mers by k − 1 base pairs. therefore, the existence of two k-mers in a sequence is not independent, and the presence of a particular k-mer in the bloom filter can be verified by the co-occurrences of its neighbors. based on this information, kbf lowers the fpr by checking, for instance, the query's eight possible neighboring k-mers (four to the left and four to the right). if none of the query's neighbors exist in the bloom filter, kbf rejects the query as a false positive.
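to make the minimizer selection discussed earlier in this section concrete, the following python sketch picks one minimizer per sliding window, under either a lexicographic ordering or a randomized (hash-based) ordering. it is a minimal sketch under stated assumptions: the function names are hypothetical, the window is taken to span w consecutive k-mers, ties are broken by leftmost position, and a blake2b digest from python's hashlib stands in for a random ordering. it does not implement universal hitting sets.

```python
import hashlib


def random_order(kmer):
    """Pseudo-random ordering key for a k-mer (assumption: a hash digest as rank)."""
    return hashlib.blake2b(kmer.encode(), digest_size=8).digest()


def minimizers(seq, k, w, order=None):
    """Return sorted (position, k-mer) minimizers over windows of w consecutive k-mers."""
    key = order or (lambda kmer: kmer)  # default: lexicographic ordering
    kmer_list = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    selected = set()
    for start in range(len(kmer_list) - w + 1):
        window = range(start, start + w)
        # Smallest k-mer in the window; ties go to the leftmost position.
        best = min(window, key=lambda i: (key(kmer_list[i]), i))
        selected.add((best, kmer_list[best]))
    return sorted(selected)


print(minimizers("ACGTTGCA", k=3, w=3))
# -> [(0, 'ACG'), (1, 'CGT'), (2, 'GTT'), (5, 'GCA')]
print(minimizers("ACGTTGCA", k=3, w=3, order=random_order))
```

adjacent windows often share the same minimizer, which is why the selected set is smaller than the full list of k-mers; swapping the `order` argument changes which k-mers are chosen without changing the windowing logic.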
there are also many algorithms built around the generalized bloom filter data structure. these methods give the bloom filter different functions, but maintain its simplicity and memory-efficiency. the counting bloom filter (cbf), for instance, was developed to detect whether the count of an element is below a certain threshold (64) . the only difference between the bf and cbf is that when adding an element, all the counters for that element increase by 1. the spectral bloom filter (sbf) (65) functions similarly to a cbf, but the sbf only increases the minimum value in the table when inserting an element. this modification causes sbf to have a lower error rate when compared to the cbf. nucleic acids research, 2020, vol. 48, no. 10 5223 in addition to extensions and variations of fundamental methods, recent advances have developed by combining several core data structures and techniques. for instance, race (66) is an algorithm to downsample sets of genetic sequences while preserving metagenomic diversity. race replaces the universal hash function in the cms with an lsh function. using minhash, race can identify frequent clusters of sequences rather than frequent elements. since race is robust to sequence perturbations, it can be used to implement diversity sampling. by adjusting the lsh collision properties, race can create a sampled set of sequences that retains metagenomic diversity while substantially downsampling a data stream. the race diversity sampling algorithm is attractive because it can downsample accurately with high throughput, low memory overhead, and only one online pass through the dataset. for each sequence in an input stream, race checks to see whether the sequence belongs to a frequent cluster. this is done by replacing the minimum operation in the cms with an average over the count values. 
due to a deep connection between race and kernel density estimation, the average is a measure of the number of nearby sequences in the dataset, otherwise known as a density estimate. if the density is low, then race has not seen many similar sequences and the sequence is kept. otherwise, the sequence is discarded. in theory and practice, race attempts to select a constant number of sequences from each cluster. when minhash is properly tuned to differentiate between species, the clusters in the race algorithm correspond to different species in the dataset. as a result, race provides a fast, online and robust way to downsample sequence datasets while retaining important metagenomic properties. another important development comes from the cms and bloom filters. rambo (repeated and merged bloom filter) (67) is a recent development in multiple set compression for fast k-mer and genetic sequence search. the rambo data structure is inspired by the cms, but the goal is to report the sequence containment status rather than sequence frequency. rambo consists of a set of b × r bloom filters. rather than maintain one bloom filter for each set of k-mers, rambo uses a 2-universal hash function to randomly merge k datasets into b groups (2 ≤ b ≪ k) so that each group has approximately k/b datasets. each partition is compressed using a bloom filter. this process is independently repeated r times with different partitions. to determine which sets contain a query sequence, rambo queries each bloom filter. because the groupings are random, each repetition reduces the number of candidates by the factor 1/b until only the correct datasets are reported at the end of the algorithm. the key insight is that with this approach, rambo can determine which datasets contain a given k-mer or sequence using far fewer bloom filter queries, yielding a very fast sublinear-time sequence search algorithm (68). rambo also inherits many desirable features from the cms and the bloom filter.
this includes a low false positive rate, zero false negative rate, cheap update process for streaming inputs, fast query time, and a simple systems-friendly data structure that is straightforward to parallelize. in addition to methods that enable the scalable processing of high dimensional data, there are fundamental extensions of and considerations for cs that enable its efficient acquisition. while applications of cs are constrained to those where the sparsity assumption is appropriate, seemingly irrelevant signals may have a hidden sparse representation in some basis. for example, jpeg image compression exploits the fact that natural images can be sparsely represented (or at least approximated) in a discrete cosine basis (a cousin of the fourier transform). when the sparsity basis is known in advance, the canonical cs problem with sensing matrix Φ can be reformulated from y = Φx to y = ΦΨs, where s is the sparse representation of x in the basis defined by the columns of Ψ. this transformation was recently demonstrated in transcriptomics (69) and may soon find an analogous application in metagenomics. aside from signal sparsity, cs also imposes constraints on the sensing matrix. specifically, Φ must adequately preserve signals' separation distances; highly distinct n-dimensional signals should not be forced into close proximity in m-dimensional space once projected by Φ (70, 71). while gaussian and other classes of random matrices have been shown to work well in the general case, recent techniques indicate that Φ can be iteratively optimized for a given task by simulating measurements and sparse recovery of signals (72). however, as we discuss below, practitioners generally do not have full control of Φ in most applications. in metagenomics, the values in Φ are constrained by the nucleic acid content of natural organisms. because each chosen sensor makes up a row of Φ, a new algorithm can select m sensors (e.g. k-mers or probes) from a set of options to optimize the properties of Φ for cs (73).
very recent techniques in cs are also exploring how to merge machine learning with cs. given a dataset, recent work indicates that both the sensing matrix Φ and the procedure that recovers x from y = Φx can be learned from specially designed deep neural networks (74) (75) (76) (77), even in cases where the signal's sparsity structure is nonlinear. datasets in metagenomics are known to be highly structured and could thus be positively impacted by these recent advances in cs in the near future. most, if not all, of the approaches described above have found their way into previously published bioinformatics methods. however, method development to date has been primarily focused on genome sequencing for a single individual or isolate genome. findings suggesting links between microbiomes, such as the human gut microbiome, and human disease (78, 79) have led to increased metagenomic sequencing. the rapid growth of this type of sequencing, where the set of reads is from a complex community of organisms, adds additional complexity and new challenges to fundamental comparative genomics problems. here we list a core set of these fundamental problems faced when performing metagenomic sequence analysis: (i) sequence resemblance, (ii) sequence containment, (iii) sequence classification, (iv) sequence downsampling, (v) sequence profiling, (vi) sequence probe design. for each problem, we discuss the role of the previously described approaches and newer tools incorporating recent advances (table 2). one of the recent breakthroughs in the area of large-scale biological sequence comparison is in the use of locality-sensitive hashing, or specifically minhash and minimizers, for efficient average nucleotide identity estimation, clustering, genome assembly, and metagenomic similarity analyses. mash.
in response to the high computational expense of large-scale sequence similarity calculations, researchers have begun to apply probabilistic approaches such as using minhash to approximate the similarity between sequences (6). in the seminal work of mash (40) , it was shown that minhash could be used as an extremely efficient estimator for genome similarities in both speed and resource use. it was also shown how mash could be applied to similarity estimates between entire metagenomes. in addition, mashtree has experimented with building phylogenetic trees based on the genomic similarity estimated using mash (80) . these and other applications led to a quick and widespread adoption of mash throughout the research community for rapid sequence similarity calculations. despite representing a paradigm shift, one of the shortcomings of minhash is that its similarity estimation is most accurate when the two sets have similar sizes and their intersection region is large (81) . in the paper (82), the authors also point out that the genomic similarity estimated via jaccard distance is sensitive to the data set size. another limitation of minhash applied to metagenomics is that large amounts of rare k-mers can dominate the sample sketches. these k-mers which only occur a few times could be the result of sequencing errors as well as being actual rare species present in a metagenome. we will now review several other recent bioinformatic tools that have accelerated sequence similarity in the era of terabyte-scale datasets. bindash (83) , like mash, takes in sequences, compresses them into sketches and then compares sketches to estimate the genome similarities. specifically, bindash focuses on accelerating the sketch construction and sketch comparison time. to do this, bindash uses the b-bit onepermutation minhash algorithm to compress sequences. given a sequence, bindash first decomposes the sequence into k-mers. 
each k-mer of the sequence is hashed by one predefined hash function. the hash values of k-mers are then pooled together into B buckets. after all the k-mers are hashed and then grouped into B buckets, bindash selects the smallest hash value from each bucket and stores the b lowest bits of each selected hash value as the sketch of a sequence. to account for potentially empty buckets, the sketch process is optimized by the densification operation as mentioned in the previous section. the sketch similarities are then estimated using jaccard indices based on the B · b bit sketch. the experiments show that, compared with mash, bindash can characterize the same data set with less error, less memory and higher speed. dashing. the recently introduced work of dashing uses hyperloglog (hll) sketching to approximate genomic distances (84). one main motivation behind dashing is to improve the similarity estimation accuracy across input sequence datasets with different sizes. dashing represents the first time that hll has been applied to estimate the overall similarity between sequence samples. given that hll is used to estimate set cardinality, to use hll to estimate genomic sequence similarities you must estimate the intersection of the two sequence data sets' k-mers, then estimate the cardinality of this intersection set. dashing first sketches the k-mers of each given sequence data set using hll. it then creates a union sketch using basic register maximum operations between the two hll sketches. now, having access to the set cardinality of both independent sets, as well as the union set size, the inclusion-exclusion principle yields the set cardinality of the intersection between the two sequence datasets. the hll set cardinality calculations of dashing are estimated using a maximum-likelihood-based approach, which has higher accuracy than the traditional corrected harmonic mean estimation approach.
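the union-by-register-maximum and inclusion-exclusion steps described above can be illustrated with a small hyperloglog in python. this is a minimal sketch and not dashing's implementation: it uses the traditional corrected harmonic mean estimator rather than the maximum-likelihood estimator that dashing uses, and the class name, the `jaccard` helper, the choice of 2^12 registers and the blake2b hash from python's hashlib are all assumptions made for this example.

```python
import hashlib
import math


class HyperLogLog:
    """HyperLogLog with m = 2**b registers over a 64-bit hash."""

    def __init__(self, b=12):
        self.b = b
        self.m = 1 << b
        self.registers = [0] * self.m

    def add(self, item):
        h = int.from_bytes(
            hashlib.blake2b(item.encode(), digest_size=8).digest(), "big")
        index = h >> (64 - self.b)                 # first b bits select a register
        rest = h & ((1 << (64 - self.b)) - 1)      # remaining 64 - b bits
        rank = (64 - self.b) - rest.bit_length() + 1  # leading zeros + 1
        self.registers[index] = max(self.registers[index], rank)

    def union(self, other):
        # The sketch of a union is the register-wise maximum of the two sketches.
        merged = HyperLogLog(self.b)
        merged.registers = [max(x, y) for x, y in zip(self.registers, other.registers)]
        return merged

    def cardinality(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)      # bias constant, valid for m >= 128
        raw = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:          # small-range (linear counting) correction
            return self.m * math.log(self.m / zeros)
        return raw


def jaccard(sketch_a, sketch_b):
    """Jaccard estimate via inclusion-exclusion: |A and B| = |A| + |B| - |A or B|."""
    union = sketch_a.union(sketch_b).cardinality()
    inter = sketch_a.cardinality() + sketch_b.cardinality() - union
    return max(inter, 0.0) / union


a, b = HyperLogLog(), HyperLogLog()
for i in range(3000):
    a.add(f"kmer{i}")
for i in range(1000, 4000):
    b.add(f"kmer{i}")
print(a.cardinality(), b.cardinality(), jaccard(a, b))
```

because the intersection is obtained by subtracting three noisy estimates, its relative error grows when the true overlap is small compared with the set sizes, which is one reason more accurate cardinality estimators matter in this setting.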
dashing is able to sketch metagenomes faster than previous approaches, but it requires more cpu time to calculate the genomic distances. in the end, compared with mash, dashing has faster speed, higher accuracy and a lower memory footprint. finch. rare k-mers can distort the estimation of sequence comparisons and inter-metagenomic distances. to solve this problem, finch (85) uses minhash with a larger sketch size in order to evaluate the abundance of each k-mer. it then decides thresholds based on estimated abundances to filter out low abundance k-mers. it also removes k-mers with unequal frequencies of forward and reverse sequences. by deleting erroneous or rare k-mers, finch can estimate the distances between metagenomic samples robustly. finch also reports including correction for sequencing depth biases. hulk estimates the similarities among metagenomic samples while taking k-mer frequencies into account (86). in hulk, a metagenomic sample is sketched via histogram sketching (87) into a final histosketch, which preserves k-mer frequency information. to build a histosketch for a given metagenome, reads are first decomposed into k-mers and then streamed in a distributed fashion into independent count-min sketch counters. once a large number of reads have been counted, hulk sends the cms data to be histosketched and resets the cms counts to initial values. in order to create the final histosketch, hulk first summarizes the count-min sketch counters into a k-mer spectrum and then applies consistent weighted sampling (https://www.microsoft.com/en-us/research/publication/consistent-weighted-sampling/, accessed march 2020) methods. hulk can successfully cluster metagenome samples based on similarity between histosketches and is faster than naive k-mer counting. kwip is yet another recent approach that tries to improve the accuracy of estimating sequence dataset similarity via k-mer weighted inner product (kwip) (88).
nucleic acids research, 2020, vol. 48, no. 10 5225

table 2. metagenomics software based on probabilistic and signal processing algorithms. six main application areas are highlighted: containment, downsampling, probe design, profiling, resemblance and taxonomic classification. speed indicates the relative computational speed of cpu operations, memory the relative maximum ram used during index construction/query steps, and year the publication year. more stars mean better time and memory efficiency; fewer stars indicate more resource-intensive tools. performance estimates using only literature-based comparison are marked in gray. the stars (1-5) correspond roughly to time (days, hours, minutes, seconds and milliseconds) and memory (>64gb (server), >16gb (workstation), >1gb, >16mb and <16mb). datasets used were shakya et al. (133). biobloom tools and opal were indexed using the training data provided by opal, which is much smaller than the dbs other tools use. metamaps is a classifier specifically for long read sequences, unlike the other tools in the category. the datasets and results for each tool can be found at https://gitlab.com/treangenlab/hashreview

kwip first uses khmer (89), which is a k-mer counting software relying on count-min sketch, to compress each metagenomic read sample into a sketch. each sketch is an array consisting of m bins. each bin is responsible for counting the number of occurrences of some of the k-mers (with collisions) in the sample. to calculate the distance between two samples, each of the m bins is assigned a weight to be used in a weighted inner product. in order to assign weights to individual bins, kwip first counts, for each bin, the number of samples among all n samples in which that bin is non-zero. an m-length vector containing these frequencies is then used by kwip to create another m-length vector, converting each frequency value to a new value based on shannon entropy.
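the bin weighting described for kwip can be sketched as follows. this is a minimal illustration under stated assumptions: the weight is taken to be the binary shannon entropy of the fraction of samples in which a bin is non-zero, the function names are hypothetical, and the normalization that kwip applies to turn the weighted inner product into a distance is not reproduced.

```python
import math

def entropy_weights(sketches):
    """One weight per bin from the fraction of samples with a non-zero count.
    Assumption: weight = binary Shannon entropy of that fraction."""
    n = len(sketches)
    m = len(sketches[0])
    weights = []
    for j in range(m):
        f = sum(1 for s in sketches if s[j] > 0) / n
        if f in (0.0, 1.0):
            weights.append(0.0)  # bin occupied in none or all samples
        else:
            weights.append(-f * math.log2(f) - (1 - f) * math.log2(1 - f))
    return weights

def weighted_inner_product(x, y, weights):
    """Inner product of two count sketches with per-bin weights."""
    return sum(w * a * b for w, a, b in zip(weights, x, y))

# Four toy samples, each sketched into m = 4 bins of k-mer counts.
samples = [[3, 0, 1, 0], [2, 0, 0, 0], [1, 0, 4, 0], [5, 0, 0, 0]]
w = entropy_weights(samples)
score = weighted_inner_product(samples[0], samples[2], w)
```

in this toy input only the third bin is occupied in exactly half of the samples, so it is the only bin that contributes to the weighted inner product.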
this entropy conversion causes bins that have k-mers present in roughly half of the samples to be heavily weighted versus bins that have k-mers present in all or none of the samples (which get a weight of zero). genetic similarity is then approximated by the kwip distance. the kwip distance is calculated using the inner product between two sample sketches, with each bin weighted by the shannon entropy for that bin. the authors show that kwip can produce more accurate results than mash, especially for metagenomic samples with low divergence. of note, kwip is specifically designed to create a distance matrix from multiple samples, using all samples in the sketching process, as opposed to comparing individual sketches for individual samples like most other methods discussed here. order min hash (omh) introduces a new way of sketching a sequence that estimates the edit distance of the sequences. (90) unlike most other hashing based techniques for similarity calculations, which treat all the k-mers without respect to the order in which they occur, omh preserves the k-mer ordering in its sketching process. the sketch for a given sequence consists of n vectors of length l. each of the n vectors contains l representative k-mers, which are selected according to a pre-defined permutation function, and whose relative ordering is maintained from the original sequence. the distance calculation uses the weighted jaccard distance, where the number of appearances of a k-mer are taken into account. sourmash (42) is closely related to mash and based on minhash. it modifies the sketching procedure such that the sketch size can be of variable length for different sequences. in their approach, the size of the sketch is based around the number of unique k-mers unlike the fixed size min-hash sketch. additionally, sourmash includes functionalities such as k-mer frequency calculations as well as a sequence containment method that combines the sequence bloom tree and minhash methodologies. 
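the difference between a fixed-size minhash sketch and a sketch whose size follows the number of unique k-mers can be sketched as below. this is an illustrative sketch only: keeping every hash below a fixed fraction of the hash space is one common way to obtain a variable-size sketch, and whether it matches sourmash's internals exactly is an assumption, as are all function and parameter names. no sourmash api is used.

```python
import hashlib

MAX_HASH = 2 ** 64

def hash_kmer(kmer):
    """64-bit hash of a k-mer (SHA-1 prefix; an arbitrary choice)."""
    return int.from_bytes(hashlib.sha1(kmer.encode()).digest()[:8], "big")

def kmers(seq, k):
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def bottom_s_sketch(seq, k, s):
    """Fixed-size MinHash sketch: the s smallest k-mer hashes."""
    return set(sorted(hash_kmer(x) for x in kmers(seq, k))[:s])

def scaled_sketch(seq, k, scaled):
    """Variable-size sketch: keep every hash below MAX_HASH / scaled, so
    the sketch grows with the number of unique k-mers."""
    limit = MAX_HASH // scaled
    return {h for h in map(hash_kmer, kmers(seq, k)) if h < limit}

def containment(query_sketch, target_sketch):
    """Fraction of the query sketch that is also present in the target."""
    if not query_sketch:
        return 0.0
    return len(query_sketch & target_sketch) / len(query_sketch)

genome = "ACGTTGCAAGGCTTAACCGG"
fragment = genome[3:12]
c = containment(scaled_sketch(fragment, 4, 1), scaled_sketch(genome, 4, 1))
```

because the same threshold is applied to every sequence, the sketch of a fragment is a subset of the sketch of the sequence that contains it, which is what makes containment queries possible with this kind of sketch.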
searching for the containment of a read, gene fragment, gene, operon, or genome within a metagenomic sample or sequence database is a frequent computational task in bioinformatics. this is an open challenge for two key reasons. first, the size of metagenomic and sequence repositories is on the scale of terabytes to petabytes. thus, methods able to quickly eliminate all the non-matching sequences in the database are crucial. second, sequences evolve over time and will rarely, if ever, be an exact match, especially as metagenomes and sequence databases contain a huge amount of sequence diversity. methods that tolerate mismatches and indels have much improved sensitivity compared to methods that require strictly exact matches to satisfy containment.

despite the breakthroughs made via bloom tree inspired structures in sequence search, these approaches are not without drawbacks. first, they have to make a trade-off between false positives and the filter size due to the inherent limitations of the bloom filter. second, they commonly lack flexibility; once the filter size is determined, it cannot be changed based on the size of the input sequences. no matter how many k-mers a sequence has, they all have to be sketched into a fixed-size array. finally, as the size of the input data increases, the precision of the bloom filter-based sequence search typically declines. we will now review a few recent approaches that have tackled this important task in computational biology.

sequence bloom tree (sbt) (91) is a binary tree where each node in the tree is a bloom filter. an sbt is used to index large sequence databases for efficient containment checks of a query sequence within the database sequences or datasets. to construct an sbt, each sequence or dataset is added one by one, beginning with adding the first dataset as the root of the sbt.
for each additional sequence or dataset, you first compute the bloom filter for the contained k-mers, and then scan from the root of the sbt to the leaves, inserting the dataset's representative bloom filter at the bottom of the tree. at each bifurcation, the insertion traversal follows the child whose bloom filter is closest in hamming distance to the bloom filter for the current dataset. after insertion is finished, the new dataset's bloom filter is added as a leaf node, and each node in the sbt contains the union of the bloom filters of its children. to be specific, if a k-mer is present in node u, it should also exist in the bloom filters of all ancestor nodes from u to the root. therefore, as a bloom filter gets closer to the root, it becomes more populated and the false-positive rate of the bloom filter is higher (a process known as saturation). querying for sequence containment proceeds by querying each node's bloom filter, starting with the root, and determining whether enough of the query's k-mers are contained. if the bloom filter contains enough of the query's k-mers, then each child node's bloom filter is queried for containment. the process proceeds until each sequence or dataset containing the query at the leaves of the sbt is determined.

split sequence bloom tree (ssbt) (92) was implemented to quickly search short transcripts within a large database. although the ssbt was originally designed for rna-seq data, it can be adapted to other sequence containment problems just like sbts. the ssbt is an improvement over the sequence bloom tree (sbt) data structure (91). similar to sbts, each sequence or dataset in the database is inserted into the ssbt by traversing from the root of the tree to the bottom. the ssbt is also a binary tree, but each node has two bloom filters instead of one. the first filter, called the similarity filter, saves k-mers shared by all the datasets in the subtree under a particular node.
the second filter, named the remainder filter, stores the k-mers that are not universally shared among all the datasets but are specific to at least one dataset in the subtree for a node. the union of the similarity filter and the remainder filter is a single bloom filter for the node similar to the nodes of an sbt. ssbt is a clever re-organization of sbt resulting in accuracy similar to an sbt but with reduced space occupancy and search time. bigsi represents a significant advance in sequence containment search; bigsi was introduced to allow efficient search for a query sequence among a large bacterial and viral genome database (93) . it also relies on bloom filters to solve this problem. but, instead of using a tree-like structure (e.g. sbt), bigsi employs a flat bloom filter-based data structure. bigsi first indexes the reference datasets, where these datasets are raw fastq read datasets or assemblies from which to search for the presence of a query sequence. to index the reference datasets, bigsi first extracts a set of non-redundant k-mers from each dataset, and then builds a corresponding bloom filter. after this initial step, bigsi then concatenates all the bloom filters together. bigsi compresses the whole database into a matrix, in which each column is a bloom filter for a given dataset. to conduct an exact search of a sequence, bigsi is expected to find the index of all the k-mers of the query sequence inside the matrix. for inexact search, as referenced above, bigsi just needs to find the index for a subset of the k-mers present in a sequence of interest. bigsi can also dynamically update the size of the sketch based on the amount of input datasets. when new datasets arrive, bigsi can add a new column to the matrix for each new dataset. rambo (68) is a very recent method which also allows indexing new sequences and new datasets in a streaming fashion. 
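the flat layout described above for bigsi, one bloom filter per dataset stored as a column of a bit matrix, can be illustrated with the following toy sketch. it is not bigsi's code: the hash construction, the default sizes and all names are assumptions, and each matrix row is stored as a python integer whose bit j belongs to dataset j.

```python
import hashlib

def kmers(seq, k):
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def bloom_positions(kmer, m, num_hashes):
    """Bit positions a k-mer sets in an m-bit Bloom filter (hypothetical hashes)."""
    return [
        int.from_bytes(hashlib.sha1(f"{i}:{kmer}".encode()).digest()[:8], "big") % m
        for i in range(num_hashes)
    ]

def build_index(datasets, k, m=256, num_hashes=3):
    """rows[r] has bit j set when the Bloom filter of dataset j has bit r set,
    i.e. column j of the matrix is the Bloom filter of dataset j."""
    rows = [0] * m
    for j, seq in enumerate(datasets):
        for kmer in kmers(seq, k):
            for r in bloom_positions(kmer, m, num_hashes):
                rows[r] |= 1 << j
    return rows

def query(rows, seq, k, n_datasets, num_hashes=3):
    """Exact search: datasets whose column contains every k-mer of the query."""
    hits = (1 << n_datasets) - 1
    for kmer in kmers(seq, k):
        for r in bloom_positions(kmer, len(rows), num_hashes):
            hits &= rows[r]  # AND the rows selected by this k-mer
    return [j for j in range(n_datasets) if (hits >> j) & 1]

datasets = ["ACGTACGTGG", "TTTTGGGGCC", "ACGTACGTTT"]
index = build_index(datasets, k=5)
matches = query(index, "ACGTACG", k=5, n_datasets=len(datasets))
```

adding a dataset only sets one more bit position in each row, which corresponds to appending a column to the matrix. as with any bloom filter, a reported match can be a false positive, but a dataset that truly contains all query k-mers is never missed.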
contrary to bigsi, which has o(k) query time (where k is the number of datasets), rambo is sublinear in query time with a slight increase in memory.

mash screen (94) was developed to determine which reference sequences are contained within a metagenomic sample using minhash, though the methodology is also presented as a method for sequence similarity. similar to metapalette (described below), it uses references found to be contained in a metagenome to describe the metagenome's taxonomic composition, but does not classify individual reads. mash screen first converts a reference sequence and a given metagenomic sample into two sets of k-mers a and b. following that, mash screen compresses the set of reference k-mers a into a minhash sketch s(a). the ratio |s(a) ∩ b| / |s(a)| represents the fraction of k-mers in the sketch of a contained in b, and is referred to as the containment index. finally, the containment index is converted to a score that approximates sequence similarity. this final score is referred to as the mash containment score. the presence or absence of one or more reference sequences in a metagenomic sample is then determined by this mash containment score. the authors give the example of searching for a set of reference viral sequences in hundreds of metagenomes by calculating the mash containment score between each reference and each metagenome.

metagenomic sequence classification software typically uses reads to search against known genomes and perform lowest common ancestor based taxonomic classification. as the size of the reference databases (terabytes to petabytes) and the number of reads (10s of millions to billions) in metagenomic samples increase, it becomes computationally intractable to perform an exhaustive comparison of all k-mers in the reads against all k-mers within the reference databases, opening the door for efficient new tools. kraken (95) and diamond (96) were two of the first ultra-efficient tools for fast metagenomic classification.
we now review a few recently developed approaches for metagenomic sequence classification.

krakenuniq is built on kraken and its main goal is to decrease the false-positive read classification rate (97). compared to kraken, one of the additional features of krakenuniq is that the number of unique k-mers of each taxon is recorded while processing all reads of a metagenomic data set. krakenuniq uses hyperloglog to efficiently estimate these unique k-mer counts. by tracking the number of unique k-mers for a taxon alongside the coverage for that taxon across all the reads in a metagenome, krakenuniq can identify likely false-positive read classifications caused by events such as sample contamination, low-complexity regions, and contaminated database sequences.

kraken2 substantially reduces memory usage, while simultaneously gaining a significant boost in classification speed, when compared with kraken 1 (98). this advancement in memory use and speed comes from using a compacted hash table that stores lca assignments for hashed minimizers of k-mers instead of a table storing lca assignments for all k-mers as in kraken 1. while this hash table saves significant memory, it comes at a small specificity and accuracy cost given that it only stores pairs of minimizers and lcas, which are further subsampled through hashing. this hashing process includes adding spaced seed masking to the minimizer before hashing. the size of this new compact hash table can be specified by the user, with smaller sizes reducing the memory footprint and increasing speed but lowering classification accuracy. when compared with other state-of-the-art tools, kraken2 ultimately provides similar or better classification accuracy alongside its memory and speed improvements.

biobloom tools (bbt) (99) is novel in that it applies a multi-index bloom filter (mibf) to the sequence classification problem. the mibf is a bloom filter-like data structure that consists of three arrays.
the first array serves as a traditional bloom filter, recording the existence of hashed items in a set. the second array, named the rank array, tracks the number of non zero bits stored in the first bloom filter array at certain intervals (by default, the number of non zeros every 512 bits in the bloom filter is stored). to reduce memory usage, the rank array is ultimately interleaved with the first bloom filter. the third array, also referred as the id array, saves the integer identifiers (ids) for reference sequences inserted into the mibf. these ids allow the mibf to additionally store associated taxonomic classification information for entries so as to be used as a classifier. for each reference sequence, bbt hashes spaced seeds into the mibf rather than contiguous k-mers. spaced seeds, unlike k-mers, allow mismatches between the references and the queries which can increase the sensitivity of approximate sequence search (100) . to classify a given read, spaced seeds from the read are looked up in the bloom filter. the rank array is then used to help retrieve ids from the id array. ultimately, the retrieved ids lead to a final taxonomic classification. to reduce the false positive rate, bbt makes use of nearby spaced seeds within adjacent sliding windows, referred to as frames, when performing its classifications. bbt also intelligently populates the id array in multiple passes such that the effects of data loss from hash collisions is minimized. ganon (101) focuses on quick database indexing in order to ensure usage of the most up to date sequence database data to accurately classify reads. many existing tools apply static, out-of-date versions of databases to assign reads. this approach can miss, for instance, classifications for species that have been newly sequenced and very recently added to existing databases. to overcome this problem, ganon employs interleaved bloom filters (ibf) (102) to index up-to-date reference genomes efficiently. 
an ibf is an array of length b · n. it encompasses b bloom filters of length n. to index the references, ganon first groups the sequences into clusters. these clusters should roughly mirror different groups for a given taxonomic rank, such as different species or strains. it then sketches each cluster into a single bloom filter. lastly, all the bloom filters are interleaved into one ibf. a read is classified if it passes a minimum threshold for the number of matches found between the read and the references. if a given read can map to multiple references, an optional lowest common ancestor (lca) approach can be applied.

metamaps was designed to perform classification on noisy long read data, including making both classifications and abundance estimates down to the strain level (103). metamaps classifies long reads by mapping them to reference genomes. given that reads could map to many closely related references, metamaps simultaneously performs mapping and estimates the community composition of a metagenome sample. thus, when determining the probability of mapping a read to a reference, the probability is a combination of both a probabilistic mapping quality to the reference and the estimated abundance of the reference's taxonomic unit in the sample. to quickly find mapping locations for reads across all reference genomes, an efficient probabilistic approach is used that generates initial candidate mappings using minimizers, followed by a winnowed-minhash statistical modelling approach for further ani estimation (104). the read mappings and metagenome abundance estimates are then iteratively updated through an expectation-maximization (em) algorithm.

metaothello (105) is one of the latest efforts in improving the speed of metagenomic classification. similar to kraken2, metaothello reports significant improvements in both memory use and speed when compared to, for instance, kraken 1.
metaothello applies the recently developed l-othello data structure, a hashing-based classifier, to speed up the process. metaothello uses k-mers that act as signatures for taxa to make its classifications. a k-mer is a signature for a taxon if it is only present in that taxon or that taxon's subtree, and nowhere else in the tree of life (it is taxon specific). metaothello indexes all reference sequences, finds all taxon signature k-mers and their taxonomic mappings, and populates an l-othello data structure that efficiently maps from signature k-mers to taxa. the l-othello, once built, maintains two arrays a and b populated with binary values. when looking up a k-mer's taxon mapping in the l-othello, the k-mer is hashed by two hash functions h_a and h_b that map to the matching positions in a and b. the final corresponding taxon value t for the k-mer is calculated through a bit-wise xor operation of the two values found in a and b. the classification step of metaothello then operates similarly to other approaches. a query sequence is decomposed into its constituent k-mers and the corresponding taxon for each k-mer is looked up using the l-othello data structure. then, differing from other approaches, metaothello uses a windowed approach to make the final classification. for a given taxonomic rank, the classification takes into account the maximum number of taxon assignments that occur consecutively within the query sequence.

opal (106) is an lsh-based metagenomic classifier that uses low density parity check (ldpc) codes. the rationale for using an ldpc lsh approach is to ensure even coverage for all of the positions in the k-mer while using as few hash functions as possible. the authors highlight that this is the first application of low-density lsh in bioinformatics. low-density lsh is also expected to avoid coverage bias issues and offer increased accuracy when using long k-mers.
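the two-array xor lookup described above for the l-othello structure can be illustrated with a small sketch. this is a simplified illustration and not the metaothello implementation: the construction below simply retries with a new hash seed whenever the value assignment becomes inconsistent, the hash functions are built from sha-1, and all names and sizes are assumptions.

```python
import hashlib
from collections import defaultdict

def _h(key, salt, size):
    """Hypothetical hash of a k-mer into an array position."""
    d = hashlib.sha1(f"{salt}:{key}".encode()).digest()
    return int.from_bytes(d[:8], "big") % size

def build(mapping, size=None, max_tries=100):
    """Find arrays a, b with a[h_a(x)] ^ b[h_b(x)] == mapping[x] for every key x."""
    size = size or max(4, 2 * len(mapping))
    for seed in range(max_tries):
        adj = defaultdict(list)  # bipartite graph: one edge per k-mer
        for key, taxon in mapping.items():
            u = ("a", _h(key, f"a{seed}", size))
            v = ("b", _h(key, f"b{seed}", size))
            adj[u].append((v, taxon))
            adj[v].append((u, taxon))
        values, ok = {}, True
        for start in list(adj):
            if start in values:
                continue
            values[start] = 0
            stack = [start]
            while stack and ok:
                node = stack.pop()
                for nxt, taxon in adj[node]:
                    want = values[node] ^ taxon
                    if nxt not in values:
                        values[nxt] = want
                        stack.append(nxt)
                    elif values[nxt] != want:
                        ok = False  # conflicting constraints: try a new seed
                        break
            if not ok:
                break
        if ok:
            a, b = [0] * size, [0] * size
            for (side, idx), val in values.items():
                (a if side == "a" else b)[idx] = val
            return a, b, seed
    raise RuntimeError("no consistent assignment found")

def lookup(kmer, a, b, seed):
    """Taxon value t = a[h_a(kmer)] xor b[h_b(kmer)]."""
    return a[_h(kmer, f"a{seed}", len(a))] ^ b[_h(kmer, f"b{seed}", len(b))]

signatures = {"ACGTA": 3, "CGTAC": 3, "TTGCA": 7, "GGCAT": 1}
a, b, seed = build(signatures)
taxon = lookup("TTGCA", a, b, seed)
```

a k-mer that was never inserted still returns some value, which is why the structure is populated only with taxon-specific signature k-mers and why a windowed vote over consecutive k-mers is useful for the final call.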
in addition to newer more efficient methods for analyzing large metagenomic data sets, a parallel effort has been emerging that instead reduces the data set size first before running further downstream analyses. intelligently down sampling, for instance, a read data set can dramatically speed up any further computations performed, while ideally preserving the important characteristics of the metagenome. another alternative approach to analyze less data than a full metagenome would be to restrict sequencing to a small subset of regions in the metagenome such as the 16s rrna. this sequencing approach, referred to as metabarcoding (107) or amplicon sequencing, can help to simplify other downstream tasks such as community profiling and taxonomic assignments of reads. here, however, we consider only the recent computational approaches that shrink large metagenomic datasets previously generated or in an online streaming fashion. diginorm (108) is a cms-based method for downsampling shotgun sequencing data. diginorm is a streaming algorithm that can select a small set of reads from a large dataset using relatively few computational resources without substantial information loss. this improves the speed of downstream tasks. diginorm begins by finding the frequencies of all k-mers in a sequence using a cms. if the median frequency value is larger than a threshold, usually 20, the sequence is discarded. this process discards reads with k-mers that have already been observed in other reads. since rare reads have many rare k-mers, they will have a lower median count than common reads and will be kept. an easy-to-use python implementation is provided in the khmer package. bignorm (109) is an extension of the ideas behind diginorm. bignorm obtains better downsampling performance by including additional information, such as quality scores and common error modalities, when determining whether to accept a read. 
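the diginorm acceptance rule described above (discard a read when the median count of its k-mers reaches a threshold) can be sketched as follows. this is a minimal sketch and not the khmer implementation: the small count-min sketch class, its dimensions, and the choice to count k-mers only for reads that are kept are assumptions made for illustration.

```python
import hashlib
from statistics import median

class CountMinSketch:
    """Small count-min sketch; counts may be overestimated, never underestimated."""
    def __init__(self, width=1024, depth=4):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _cells(self, item):
        for row in range(self.depth):
            d = hashlib.sha1(f"{row}:{item}".encode()).digest()
            yield row, int.from_bytes(d[:8], "big") % self.width

    def add(self, item):
        for r, c in self._cells(item):
            self.table[r][c] += 1

    def count(self, item):
        return min(self.table[r][c] for r, c in self._cells(item))

def normalize_by_median(reads, k=5, cutoff=20):
    """Streaming downsampling: keep a read only while its median k-mer count
    is below the cutoff (assumption: only kept reads update the counts)."""
    cms = CountMinSketch()
    kept = []
    for read in reads:
        kms = [read[i:i + k] for i in range(len(read) - k + 1)]
        if not kms:
            continue
        if median(cms.count(x) for x in kms) < cutoff:
            kept.append(read)
            for x in kms:
                cms.add(x)
    return kept

reads = ["ACGTACGTAC"] * 50 + ["TTGGCCAATT"]
kept = normalize_by_median(reads, k=5, cutoff=3)
```

with this toy input, the fifty identical reads are reduced to a handful while the single rare read is retained, which mirrors the behaviour described in the text.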
while bignorm is still based on k-mer abundance counts and the cms, the decision threshold is based on a weighted summary of k-mer counts rather than simply the median. the decision process attempts to remove bias in diginorm that may incorrectly accept a read. for instance, bignorm attempts to differentiate between rare k-mers caused by single substitution errors and authentic uncommon reads. while diginorm and bignorm are both efficient streaming algorithms, bignorm is implemented in c++ and uses parallelism to achieve faster processing times.

race (66) is a recent downsampling method based on lsh and the cms. rather than consider explicit k-mer abundance statistics, race is based on jaccard similarity. diginorm and bignorm both discard reads which contain many k-mers that have already been observed. race discards reads that have a high jaccard similarity with many observed reads. while these decision criteria are similar, density estimation with jaccard similarity is highly efficient using the race algorithm.

quikr/wgsquikr (110, 111) are cs-based approaches that leverage differences in bacterial k-mer frequencies to recover the relative abundances of bacteria in complex samples. the setup of the cs problem is similar to our depiction in figure 3. in quikr, each column of the sensing matrix is populated with the 6-mer frequency profile of a bacterial species' 16s gene. sequence measurements across a whole sample are converted to raw 6-mer frequencies (y), from which the sparse combination of species can be recovered using cs with sparsity-based optimization. quikr was soon followed up with wgsquikr (110), which leveraged the same core method except with 7-mer analysis of whole-genome shotgun sequencing data. at the time of publication, these techniques achieved competitive accuracy with orders of magnitude improvement in speed over state-of-the-art read-by-read classifiers.
however, they were limited to genus-level taxonomic depth and exhibited difficulty in recovering rare organisms.

metapalette (112) takes a cs-inspired approach similar to wgsquikr for metagenomic community reconstruction, with a few subtle but significant differences. the authors define a matrix a created from k-mers of database reference genomes, known as the common k-mer training matrix. this matrix is analogous to the sensing matrix in cs, but a stores pairwise similarities of reference genomes based on shared k-mers. a can be constructed efficiently for long k-mers by using bloom count filters. ultimately, the relative taxa abundances x are recovered from the aggregate sample k-mer counts y by solving ax = y for a sparse x. while we only discuss a single a, x and y here, metapalette in fact creates multiple a and x for different values of k (30 and 50). the authors also augment a with artificial 'hypothetical organisms' of similar k-mer profiles. the use of long k-mers and the mathematical representation of unknown organisms enables metapalette to classify even novel organisms at the strain level.

mission (113) is a hybrid compressed sensing and hashing-based approach. specifically, mission uses a count-sketch data structure, acquires the heavy hitters from the data, and applies stochastic gradient descent to update the data structure. the sparsity constraint keeps the top heavy hitters among the features while setting the rest to zero. this algorithm was used for metagenomic classification on the dataset from (114), where the authors examined how many features of the data are adequate relative to performance.

metagenomic sequencing has opened the gate for biologists to detect novel or rare organisms in different environments. however, detection with high sensitivity can demand extensive sequencing runtimes to capture novel fragments among the innumerable metagenomic background data (115).
to circumvent these challenges, single stranded nucleic acid probes can enrich or sense dna fragments by binding to intended target strands. many software packages have been developed for designing probes for a specific target genome, but generating probes for metagenomic analysis is difficult because of the uneven and diverse sequences in metagenomic samples. capturing rare sequences while excluding highly similar sequences is challenging. therefore, metagenomics requires probe design techniques that scale well with the number of organisms found in samples.

catch is a newly developed method to design optimal probes for targeted microbial enrichment to facilitate downstream detection in sequencing (116). this approach is particularly important for viral detection in samples with low titers; without probe-based enrichment, low abundance viruses may evade detection. moreover, catch pursues a set of probes that can scalably capture the full diversity of viruses. catch first yields a set of candidate probes from the input sequences and then collapses the probes with high similarity using lsh. specifically, it detects nearly-identical probes through either hamming distance or minhash, and then removes the similar candidate probes. to make sure that the final set of probes encapsulates the diversity of the input sequences, catch computes the smallest set of probes needed to cover the whole set of target sequences. catch treats this as a set cover problem and solves it using the canonical greedy solution (117). ultimately, thousands of probes are chosen to cover the targets based upon the optimization criteria.

insense. while catch focuses on probe design for enrichment of target sequences in a complex sample before metagenomic sequencing, applying cs permits another workflow with orders of magnitude fewer probes at the cost of some taxonomic depth. if a sample is known to be v-sparse, i.e. it contains a subset of v or fewer of the n possible microbes, cs can be applied with m = o(v log(n/v)) mismatch-tolerant dna probes. the sensing matrix is populated by the expected number of binding events between each probe (in rows) and each target organism (in columns). these nonspecific probes can be thought of as directly measuring the abundance of soft-matching k-mers. proof-of-concept work was first explored in a cs microarray (csm) format (118). the same principle has also been demonstrated for sensing bacterial pathogen genomes at species resolution in bulk solution with less than a dozen fluorescent, random dna probes (119). although fewer probes can be resolved in bulk solution compared to a microarray (m is limited), such an approach may find applications in rapid infection diagnostics where the species library is constrained to pathogens (n is much smaller) and patient samples are very sparse, with at most a few unique species (120). given a set of possible microbes (library), a set of probes, and the simulated hybridization behavior between them, a subset of probes can be selected with the insense algorithm (73). insense optimizes for the incoherence of the sensing matrix, a common quality metric for cs sensing matrices, with a convex relaxation. this cs approach bypasses sequencing by capturing information directly from probe-target hybridization events, and it will be exciting to see how it performs in real patient and environmental samples. if the sensing matrix can be accurately predicted from probe and target sequences, it is plausible that future applications can synergize with sequencing databases by automatically updating based on known trends in microbial evolution. however, nonspecific hybridization mandates a thorough understanding of the library of possible species and perhaps careful sample processing; out-of-library, unexpected nucleic acids that interact with nonspecific probes would corrupt the measurements and downstream sparse recovery.
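the notion of incoherence used to judge a sensing matrix can be made concrete with a short sketch. the code below computes mutual coherence, one standard definition (the largest absolute normalized inner product between two distinct columns). whether insense optimizes exactly this quantity is an assumption, and the function name and toy matrices are hypothetical.

```python
import math

def mutual_coherence(matrix):
    """Largest |cosine similarity| between two distinct columns of a
    row-major matrix (rows = probes, columns = target organisms)."""
    cols = list(zip(*matrix))
    norms = [math.sqrt(sum(x * x for x in c)) for c in cols]
    worst = 0.0
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            dot = sum(a * b for a, b in zip(cols[i], cols[j]))
            worst = max(worst, abs(dot) / (norms[i] * norms[j]))
    return worst

# Two toy sensing matrices: expected binding events of 3 probes to 3 organisms.
distinct_probes = [[5, 0, 1],
                   [0, 4, 1],
                   [1, 1, 6]]
redundant_probes = [[5, 5, 1],
                    [2, 2, 1],
                    [1, 1, 6]]
low = mutual_coherence(distinct_probes)
high = mutual_coherence(redundant_probes)
```

a lower value means that the organisms' probe responses are easier to tell apart. in the second matrix two organisms bind all probes identically, so their columns are indistinguishable and the coherence reaches its maximum.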
despite the nascent state of metagenomic sequencing and analysis, its accelerated adoption has led to both an explosion in available data and an ever increasing demand for new data analysis methodologies. in this survey, we have covered what we believe to be a core set of fundamental probabilistic data structures and algorithms that are uniquely positioned to tackle the burgeoning growth of metagenomic data, as well as the added nuances of analyses involving a diverse community contained inside of a metagenome. despite the relative youth of the field of metagenomics, many fast methods have already emerged that can be used or were designed for this area. for instance, as seen in table 2, methods like bindash and dashing are being developed in an effort to further accelerate sequence similarity estimations beyond the speed of the seminal mash tool. similarly, recent advances like bigsi, rambo, and ssbt are opening the door to petabyte-scale sequence searches among vast sequencing datasets. however, continued breakthroughs are still needed to better handle metagenomic-specific intricacies such as sequencing error, low abundance community members, and uneven coverage.

in addition, probabilistic approaches as discussed in this paper generally come with an accompanying set of pros and cons. for instance, most bloom filter algorithms involve a fundamental trade-off between memory, query cost, and quality. standard bloom filters balance the size of the bit array with the possibility of false positives. the trade-off is implicit for any algorithm using this data structure. the false-positive rate (fpr) can be reduced by choosing the right number of hash functions, which may increase query time, or by making assumptions about the input data, as with k-mer bloom filters. cascading bloom filters provide an alternative way to trade query time and memory for fpr at the expense of a more complex hierarchical structure. additionally, cs approaches come with their own set of trade-offs.
while cs confers measurement efficiency for cost and time savings, it is inherently database-dependent. for instance, in some of the applications we discussed, the sensing matrix was precomputed by leveraging a sequence database (sequences at a specified position, k-mer frequencies, response to a set of probes etc.). similarly, the discovery of sparse representations requires a training set of signals. this requirement for a dataset becomes limiting in chaotic applications such as the identification of rapidly evolving organisms through either vertical or horizontal gene transfer. such novel differences that real-world samples may exhibit would likely be treated as noise in sparse recovery and ignored until the database is updated. cs is therefore likely limited to applications that exhibit an acceptable level of stability in the dataset. more generally, while the cs technique is provably robust to errors (noise) in the low-dimensional measurements y, any errors in the signal x are amplified by the factor n/m (121). in metagenomics, measurement noise may be attributed to whether an expected nucleic acid fragment in the sample generates a read during sequencing, and signal noise could be the result of unforeseen mutations or contamination. in applications featuring significant signal noise, the ratio n/m controls the trade-off between the efficiency of the measurement process and signal-to-noise ratio degradation.

in addition to all of the considerations directly involved in the inner workings of the discussed methods, there are many considerations surrounding these methods that can also greatly affect both their accuracy and scale. while we have discussed various trade-offs involved in probabilistic approaches, many of these trade-offs involve carefully selected hyperparameters.
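as one concrete instance of such a hyperparameter choice, the bloom filter trade-off mentioned earlier can be worked through numerically. the sketch below uses the standard textbook approximations for a bloom filter with m bits, n inserted items and k hash functions, namely fpr ≈ (1 - e^(-kn/m))^k and an optimal k of about (m/n) ln 2. the function names and example sizes are illustrative assumptions and are not taken from any of the reviewed tools.

```python
import math

def bloom_fpr(m_bits, n_items, k_hashes):
    """Approximate false-positive rate of a standard Bloom filter."""
    return (1 - math.exp(-k_hashes * n_items / m_bits)) ** k_hashes

def optimal_hashes(m_bits, n_items):
    """Number of hash functions that approximately minimizes the FPR."""
    return max(1, round(m_bits / n_items * math.log(2)))

# Example: 1,000 k-mers stored in a 10,000-bit filter (10 bits per k-mer).
m, n = 10_000, 1_000
k_best = optimal_hashes(m, n)
for k in (1, k_best, 20):
    # Too few or too many hash functions both raise the false-positive rate.
    print(k, bloom_fpr(m, n, k))
```

each extra hash function also adds one memory probe per query, so the value of k that minimizes the false-positive rate is not necessarily the best choice when query time matters.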
to a non expert user of the methods, it may not be obvious how to set the various parameters for each method, and even advanced users may struggle to find the truly optimal parameter settings derived from underlying theory. another consideration is in the modeling of processes such as natural genome evolution. many k-mer based approaches and hashing techniques are initially developed in a way that is blind to underlying biological processes such as evolutionary drift which gradually introduces point mutations, insertions, and deletions into closely related genomes that otherwise might be identical. conversely, phylogenetic methods which explicitly model events like drift and recombination have been slow to incorporate recent advances discussed in this survey. considerations can also be given to the actual data collection procedures, such as how the dna sequencing is performed. one new advance on the sequencing side of metagenomics is the concept of genome skimming (122) , which is a technique to lightly sequence metagenomic samples. similarly, metabarcoding (107) or amplicon sequencing can reduce metagenomic data by only sequencing a small set of amplified regions, potentially speeding up and simplifying downstream analyses. a final consideration surrounding newer methodologies is that of the sequence databases that nearly all metagenomics tools rely on for sequence classification. while recent advances in probabilistic data structures and algorithms may drastically shrink computational requirements, these speedups can be easily offset and even outpaced by exponential growth in sequence databases that these algorithms must interact with. new methods should also seek to overcome challenges such as database quality issues such as misassembled or mislabelled genomes or sets of reads. 
following methodologies such as simple uniform random downsampling and more intelligent downsampling like diginorm (123) , recent advances like the race method (66) attempt to address the need to shrink databases and remove contaminants and error, while preserving biologically important characteristics like diversity. probabilistic data structures for big data analytics: a comprehensive review computational biology in the 21st century: scaling with compressive algorithms sketching and sublinear data structures in genomics computational solutions for omics data when the levee breaks: a practical guide to sketching algorithms for processing the flood of genomic data on the resemblance and containment of documents approximate nearest neighbors: towards removing the curse of dimensionality an improved data stream summary: the count-min sketch and its applications hyperloglog: the analysis of a near-optimal cardinality estimation algorithm space/time trade-offs in hash coding with allowable errors fast and accurate short read alignment with burrows-wheeler transform opportunistic data structures with applications reducing storage requirements for biological sequence comparison compressive fluorescence microscopy for biological and hyperspectral imaging sparse mri: the application of compressed sensing for rapid mr imaging compressive sensing decoding by linear programming compressed sensing randomized algorithms the random projection method sampling techniques for kernel methods a random sampling based algorithm for learning the intersection of half-spaces adaptive sampling methods for scaling up knowledge discovery algorithms randnla: randomized numerical linear algebra finding structure with randomness: probabilistic algorithms for constructing approximate matrix decompositions an algorithmic theory of learning: robust concepts and random projection dimensionality reduction by random projection and latent semantic indexing random projection trees and low dimensional 
manifolds experiments with random projection linear regression with random projections on the resemblance and containment of documents approximate nearest neighbors: towards removing the curse of dimensionality the space complexity of approximating the frequency moments data streams: models and algorithms mining data streams: a review signal recovery from random measurements via orthogonal matching pursuit iterative thresholding for sparse approximations cosamp: iterative signal recovery from incomplete and inaccurate samples from denoising to compressed sensing mash: fast genome and metagenome distance estimation using minhash viral coinfection analysis using a minhash toolkit large-scale sequence comparisons with sourmash optimal densification for fast and accurate minwise hashing densifying one permutation hashing via rotation for fast near neighbor search improved asymmetric locality sensitive hashing (alsh) for maximum inner product search (mips) simple and efficient weighted minwise hashing similarity estimation techniques from rounding algorithms in defense of minhash over simhash hashing algorithms for large-scale learning sectional minhash for near-duplicate detection nthash: recursive nucleotide hashing a resource-frugal probabilistic dictionary and applications in bioinformatics fast and scalable minimal perfect hashing for massive key sets hopscotch hashing robin hood hashing improving the performance of minimizers and winnowing schemes designing small universal k-mer hitting sets for improved analysis of high-throughput sequencing hyperloglog in practice: algorithmic engineering of a state of the art cardinality estimation algorithm sliding hyperloglog: estimating cardinality in a data stream over a sliding window using cascading bloom filters to improve the memory usage for de brujin graphs fast lossless compression via cascading bloom filters improving bloom filter performance on sequence data using k-mer bloom filters an improved construction for 
counting bloom filters spectral bloom filters diversified race sampling on data streams applied to metagenomic sequence analysis repeated and merged bloom filter for multiple set membership testing (msmt) in sub-linear time sub-linear sequence search via a repeated and merged bloom filter (rambo): indexing 170 tb data in 14 hours efficient generation of transcriptomic profiles by random composite measurements the restricted isometry property and its implications for compressed sensing a simple proof of the restricted isometry property for random matrices adaptive compressed sensing mri with unsupervised learning insense: incoherent sensor selection for sparse signals a data-driven and distributed approach to sparse signal representation and recovery the sparse recovery autoencoder learned d-amp: principled neural network based compressive image recovery deepcodec: adaptive sensing and recovery via deep convolutional neural networks nanopore metagenomics enables rapid clinical diagnosis of bacterial lower respiratory infection clinical metagenomics generating wgs trees with mashtree variant tolerant read mapping using min-hashing beware the jaccard: the choice of similarity measure is important and non-trivial in genomic colocalisation analysis bindash, software for fast genome distance estimation on a typical personal laptop dashing: fast and accurate genomic distances with hyperloglog finch: a tool adding dynamic abundance filtering to genomic minhashing streaming histogram sketching for rapid microbiome analytics histosketch: fast similarity-preserving sketching of streaming histograms with concept drift kwip: the k-mer weighted inner product, a de novo estimator of genetic similarity the khmer software package: enabling efficient nucleotide sequence analysis locality-sensitive hashing for the edit distance fast search of thousands of short-read sequencing experiments improved search of large transcriptomic sequencing databases using split sequence bloom trees 
ultrafast search of all deposited bacterial and viral genomic data mash screen: high-throughput sequence containment estimation for genome discovery kraken: ultrafast metagenomic sequence classification using exact alignments fast and sensitive protein alignment using diamond krakenuniq: confident and fast metagenomics classification using unique k-mer counts improved metagenomic analysis with kraken 2 improving on hash-based probabilistic sequence classification using multiple spaced seeds and multi-index bloom filters efficient computation of spaced seeds ganon: precise metagenomics classification against large and up-to-date sets of reference sequences dream-yara: an exact read mapper for very large databases with short update time strain-level metagenomic assignment and compositional estimation for long reads with metamaps a fast approximate algorithm for mapping long reads to large reference databases a novel data structure to support ultra-fast taxonomic classification of metagenomic sequences with k-mer signatures metagenomic binning through low-density hashing the ecologist's field guide to sequence-based identification of biodiversity a reference-free algorithm for computational normalization of shotgun sequencing data an improved filtering algorithm for big read datasets and its application to single-cell assembly wgsquikr: fast whole-genome shotgun metagenomic classification quikr: a method for rapid reconstruction of bacterial communities via compressive sensing metapalette: a k-mer painting approach for metagenomic taxonomic profiling and quantification of novel strain variation mission: ultra large-scale feature selection using count-sketches large-scale machine learning for metagenomics sequence classification how much metagenomic sequencing is enough to achieve a given goal? 
capturing sequence diversity in metagenomes with comprehensive and scalable probe design a greedy heuristic for the set-covering problem compressive sensing dna microarrays universal microbial diagnostics using random dna probes polymicrobial interactions: impact on pathogenesis and human disease the pros and cons of compressive sensing for wideband signal acquisition: noise folding versus dynamic range genome skimming: a rapid approach to gaining diverse biological insights into multicellular pathogens tackling soil diversity with the assembly of large, complex metagenomes oceanic metagenomics: the sorcerer ii global ocean sampling expedition: northwest atlantic through eastern tropical pacific the ocean sampling day consortium. gigascience, 4 a human gut microbial gene catalogue established by metagenomic sequencing ecogenomics and potential biogeochemical impacts of globally abundant ocean viruses terragenome: a consortium for the sequencing of a soil metagenome img/m v. 5.0: an integrated data management and comparative analysis system for microbial genomes and microbiomes the human microbiome project the european nucleotide archive in 2019 the sequence read archive comparative metagenomic and rrna microbial diversity characterization using archaeal and bacterial synthetic communities ncbi reference sequence (refseq): a curated non-redundant sequence database of genomes, transcripts and proteins the views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the odni, iarpa, aro or the us government. key: cord-035033-osjy88rc authors: aydin, berkay; boubrahimi, soukaina filali; kucuk, ahmet; nezamdoust, bita; angryk, rafal a. 
title: spatiotemporal event sequence discovery without thresholds date: 2020-11-09 journal: geoinformatica doi: 10.1007/s10707-020-00427-6 sha: doc_id: 35033 cord_uid: osjy88rc spatiotemporal event sequences (stess) are the ordered series of event types whose instances frequently follow each other in time and are located close-by. an stes is a spatiotemporal frequent pattern type, which is discovered from moving region objects whose polygon-based locations continuously evolve over time. previous studies on stes mining require significance and prevalence thresholds for the discovery, which are usually unknown to domain experts. the quality of the discovered sequences is of great importance to the domain experts who use these algorithms. we introduce a novel algorithm to find the most relevant stess without threshold values. we tested the relevance and performance of our threshold-free algorithm with a case study on solar event metadata, and compared the results with the previous stes mining algorithms. in traditional itemset mining, frequent sequence (or sequential pattern) mining refers to discovering a set of attributes persistently appearing over time among a large number of objects [50]. a major category of sequences is event sequences, which represent the implicit relations among the categories of objects [16]. classical event sequence mining can be useful for understanding user behavior (by mining sequences from weblogs or system traces) [37], shopping routines of customers (by mining transaction sequences) [44], or the efficiency of business processes (by mining time-ordered managerial and operational activities) [52]. sequential pattern mining from spatiotemporal data has received much attention in recent years due to its broad application domains such as targeted advertising, prediction for taxi services, and urban planning [15, 29, 40, 53].
the characteristics of spatiotemporal sequences vary widely depending on the discovered knowledge type. most of the recent approaches focus on the point-based spatiotemporal data presumably due to its availability. however, the region-based spatiotemporal data, primarily obtained from scientific resources, has not received much attention. in this work, we focus on spatiotemporal event sequences (stes) from event datasets that contain instances with region-based geometric representations. stess are the ordered series of event types whose instances frequently demonstrate sequence generating behavior. the sequence generating behavior is characterized by the spatiotemporal follow relationship among instances, and it refers to the temporal follow relationship with spatial proximity constraints. the spatiotemporal sequences can be categorized into three classes based on the fundamental data type: sequences of trajectories from uniform groups, sequences of spatiotemporal points from mixed groups, and sequences of trajectories from mixed groups. our work considers the last category, and we mine the stess from event instances formed by the evolving region trajectories. the discovery of stess is potentially critical for the large-scale verification and prediction of scientific phenomena in a broad range of scientific fields including meteorology, geophysics, epidemiology, and astronomy [18]. the scientific phenomena such as tornadoes, propagation of epidemics, clouds, and solar events can be modeled as trajectories of continuously evolving regions. stes mining can be used for modeling the spatial and temporal relationships among different types of phenomena. later, the discovered sequence patterns can be utilized for performing large-scale verification of current knowledge, as well as the prediction of unknown spatiotemporal relationships among different event types.
(author affiliation: berkay aydin, baydin2@cs.gsu.edu, georgia state university, atlanta, ga, usa)
an application area for stes mining is solar physics and space weather forecasting. studies from government agencies [10, 28] and independent institutions [30, 39] show that extreme space weather events can impact radiation in space, reduce the safety of space and air travel, disrupt intercontinental communication and gps, and even damage power grids. while much work has been done on prediction of solar flares using physical characteristics of source active regions, the mixed impact of different types of solar events on eruptive activity (such as flares or coronal mass ejections) has not been fully explored. for example, in fig. 1, we demonstrate an active region event with a large sunspot followed by a flare. such relationships are known to exist among solar events [3, 25, 32, 36] although, to our knowledge, many studies are confined to a limited number of examples. one way to understand the exhaustive factors and conditions leading to an extreme space weather event is to determine the frequently occurring sequences of events which lead to substantially large flares, smaller flares, and non-flaring instances. the discriminating event sequences among these can shed light on typical conditions leading to eruptions or alternatively help forecasters when issuing all-clear forecasts. public health researchers, particularly epidemiologists, can also benefit from stes mining for understanding the frequently occurring activities leading to the spread of infectious diseases. contact tracing applications [1] are among the very few tools we can deploy to stop the spread of viral diseases with great epidemic potential such as the novel coronavirus (sars-cov-2) [19]. such applications can be used to trace the movements of individuals, and the paths of individuals can be split into activity event types.
mining stess occurring among different activities in the outbreak zones can help epidemiologists understand which activities or sequences of activities lead to outbreaks and which are relatively safe. identifying these sequences can provide crucial information for prevention such as the relative contributions of different activities and ways of transmission. previous efforts have been devoted to solving the problem of mining the most prevalent spatiotemporal event sequences using apriori [5], pattern growth-based [4], or top-k algorithms [9]. while these three approaches achieve the expected results, they heavily rely on user-defined significance and prevalence threshold parameters, which define the cut-off points for sequences. the previous algorithms assume that the user has prior knowledge of the optimal threshold parameters or a k value, which, in some cases, should be discovered from the datasets. another issue that surfaces with the previous stes mining approaches is that the prevalence threshold is highly dependent on the events taking part in a sequence. for example, algorithms should be more tolerant in the case of stess whose event types occur rarely, yet they should also be informative about that rareness. as a matter of fact, defining the same threshold for all the sequences may not be accurate in this context, as the threshold parameter should depend on the event types participating in a given event sequence. in contrast to threshold-based approaches, we focus on overcoming the limitations of providing a user-defined threshold when discovering the stess and on improving the relevance of our results. here, we introduce a novel algorithm, rand-esminer, which, by repeating the mining process on random subsets of instances and follow relationships, finds an estimated participation index for event sequences.
the rand-esminer uses our pattern growth-based esgrowth algorithm [4] as the backbone, where the follow relationships are translated into a directed acyclic graph structure, and randomly permutes the edges of this graph to mine the event sequences. mining the random subset multiple times does not necessarily improve the overall running time as shown in our experiments; however, it increases the robustness of the results and lets us understand the distribution of participation indices. in our experiments, we compare the rand-esminer with earlier approaches and evaluate the efficiency and robustness of the algorithm using datasets from the field of solar astronomy. in the spatiotemporal frequent pattern mining literature, the term sequence (or its derivatives such as sequence patterns, sequential patterns) is used for identifying different types of knowledge from spatiotemporal data. these include sequences of locations frequently visited by spatiotemporal objects [14, 20], partially or totally ordered sequences of event types whose instances follow each other [5, 13, 24, 34, 35, 49] (these are also referred to as couplings in [42]), sequences of semantic annotations from semantic trajectories [51], temporal sequences of ordered spatial itemsets (called spatio-sequences) [40], and sequences of spatiotemporal association rules [47]. cao et al. described the spatiotemporal sequential patterns as the routes frequently followed by objects [14]. namely, a set of frequently visited locations is discovered from a dataset of spatiotemporal trajectory segments. spatiotemporal sequential patterns are related to the movement patterns of spatiotemporal objects in the form of trajectory segments. similarly, giannotti et al. introduced the trajectory patterns, and presented an algorithm for mining trajectory patterns [20]. trajectory patterns represent a set of trajectories frequently visiting similar locations with similar visiting times.
while trajectory patterns are concerned with the behavioral aspect of spatiotemporal objects, the term sequence refers to visited locations. verhein introduced complex spatiotemporal sequence pattern mining [47] that focuses on sequences of spatiotemporal association rules. spatiotemporal association rules represent frequent movements of objects appearing between two regions during a time interval. apart from those, zhang et al. proposed the splitter algorithm, which discovers fine-grained sequential patterns from semantic trajectories [51]. the splitter first retrieves spatially coarse patterns, and later reduces them to fine-grained patterns. the discovered patterns are sequences of categorized locations (deduced from semantic trajectories). another example of spatiotemporal sequences, called spatio-sequences, is presented by salas et al. [40]. the spatio-sequences are the temporal sequences of ordered spatial itemsets that are used for coupling geographically neighboring phenomena. huang et al. presented a framework for mining sequential patterns from spatiotemporal event datasets in [24]. the sequential patterns in [24] refer to a sequence of event types from point-based event instances. they defined a follow relation between the point-based event instances of two different event types, presented significance measures for sequences, and introduced two pattern-growth based algorithms for the mining task. both of the algorithms create a pattern tree and expand the nodes of the pattern tree by recursively calling tree expansion procedures (i.e., follow joins). moreover, mohan et al. [34] introduced cascading spatiotemporal pattern mining, where the patterns are partially ordered subsets of spatiotemporal event types whose instances are located together and occur in stages. in [4, 5], aydin et al. introduced stes mining from evolving regions. stes mining is also concerned with the sequences of event types, and the instances are trajectories of evolving regions.
hence, the follow relationship is more complex when compared to [24]. the earlier event sequence mining algorithms [4, 5, 24] operate using a set of user-defined thresholds. here, we will concentrate on discovering the sequences without these threshold values, which are often not available to domain experts. in stes mining, our focus is mining patterns from evolving region trajectories, and we are not particularly interested in point trajectory data or stationary spatiotemporal data. our scope is mining the most relevant stess. our primary objective is to improve the quality of the results and alleviate the issues associated with using preset, and usually arbitrary, thresholds. the rest of this paper is organized as follows. in section 2, we present background information on stes mining. in section 3, we introduce our novel stes mining algorithm. in section 4, we present our experimental evaluation. lastly, in section 5, we present our conclusion and possible future work. spatiotemporal event instances ($ins_i$) are the chronologically ordered lists of timestamp-geometry pairs ($tg_{i_k}$). the geometries are region-based and represented with polygons. the instances are evolving region trajectories, and each of them is identified by a unique identifier and has an associated event type. an event type signifies the class of its associated instances. a timestamp-geometry pair consists of a timestamp value ($t_i$) and a region geometry ($g_i$). the event type of an instance is represented with $ins_i.e$. an event type is denoted by $e_j$. the set of instances of type $e_j$ is represented as $I_{e_j}$. in the upper portion of fig. 2, we show the organizational structure of instances and events. note here that the set of all instances ($I$) is essentially the union of instances from all event types in the dataset ($\bigcup I_{e_j}$ for all $e_j$). let $E = \{e_1, e_2, \ldots, e_k\}$ be the set of all event types, and $I$ be a database of all event instances.
the problem of stes mining is finding frequently occurring sequences of event types (i.e., event sequences) in the form of $(e_{i_1}\ e_{i_2} \ldots e_{i_k})$, such that the instances of $e_{i_1}$ are followed by the instances of $e_{i_2}$, ..., and the instances of $e_{i_{k-1}}$ are followed by the instances of $e_{i_k}$. the event sequences are denoted as $esq_i$, and are derived from instance sequences. an instance sequence (denoted as $isq_i$ in eq. 2) represents a chain of spatiotemporal follow relationships that occur between its participating instances. the number of participating instances in an $isq$ is the length of the sequence. a length-k instance sequence is alternatively called a k-sequence. an $isq_i$ is of-type an event sequence $es_j$ (as shown in eq. 3) if and only if the event types of the participating instances of $isq_i$ are identical to, and in the same order as, the event types in $es_j$. in the lower portion of fig. 2, we schematically depict the follow relation between the instances and show an example of a length-3 instance sequence and its associated spatiotemporal event sequence. given this information, the task of stes mining, in general, is discovering spatiotemporal event sequences whose instance sequences are frequently repeated. the instance sequences are discovered by finding significant follow relationships outlined in section 2.1, and event sequences are derived from the types of these instance sequences. the prevalence of event sequences is measured by the participation index measure, which is described in section 2.2. in this paper, we will focus on mining stess using a randomization approach, which takes a set of spatiotemporal event instances as input and returns all the discovered stess together with a list of estimated participation index values for each stes, obtained from randomized trials. the instance sequences are formed by two or more instances.
between each two consecutive instances in an instance sequence, there exists a spatiotemporal follow relationship. the simplest form of follow relationship occurs between two event instances. the relationship is characterized by two predicates: temporal continuity and spatial proximity. to actualize these predicates, we present two concepts: the head and tail windows of instances. we determine the head and tail with head and tail ratio parameters (denoted as hr and tr). the hr is the ratio between the head segment's lifespan and the instance's lifespan. the tr is the ratio between the tail segment's lifespan and the instance's lifespan. the tail window is derived from the tail of the instance. the first operation for obtaining the tail window is the spatial buffer. secondly, the spatially buffered geometries are propagated temporally. the amount of buffering is determined by the buffer distance parameter (denoted as bd), while the period of temporal propagation is called the tail validity period (denoted as tv). head windows are created similarly: first the head segment is buffered using the buffer distance parameter, and then the head is propagated using the head validity (hv) parameter. the difference is that the tail window is propagated forward (i.e., towards a later time step) while the head window is propagated backward (i.e., towards an earlier time step). we show an example head and tail segment generation in fig. 3a and the creation of the tail window in fig. 3b. formalization: given two instances $ins_i$ and $ins_j$, there exists a spatiotemporal follow relationship between $ins_i$ and $ins_j$ ($ins_i$ is followed by $ins_j$) if and only if (1) the start time of $ins_i$ is less than the start time of $ins_j$, and (2) there exists a spatiotemporal co-occurrence between the tail window of $ins_i$ and the head window of $ins_j$. under these conditions, $ins_i$ is the followee and $ins_j$ is the follower in the relationship.
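the follow predicate formalized above can be sketched in a few lines. this is a simplified illustration, not the authors' implementation: it assumes integer timestamps, replaces polygons with axis-aligned bounding boxes, takes head and tail segments as a fraction of the timestamp-geometry pairs, and interprets temporal propagation as keeping each buffered geometry valid for a number of extra time steps. all names (`Instance`, `follows`, `window`) are hypothetical.

```python
from dataclasses import dataclass
from math import ceil


@dataclass
class Instance:
    event_type: str
    track: list  # chronologically ordered (timestamp, (xmin, ymin, xmax, ymax))


def buffer_box(box, bd):
    """Spatial buffer: grow a box by the buffer distance bd on every side."""
    xmin, ymin, xmax, ymax = box
    return (xmin - bd, ymin - bd, xmax + bd, ymax + bd)


def boxes_intersect(a, b):
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def window(segment, bd, validity, direction):
    """Buffer each geometry and propagate it for `validity` extra time steps,
    forward (direction=+1, tail window) or backward (direction=-1, head window)."""
    out = {}
    for t, box in segment:
        for dt in range(validity + 1):
            out.setdefault(t + direction * dt, []).append(buffer_box(box, bd))
    return out


def follows(a, b, hr, tr, bd, hv, tv):
    """True if instance a is followed by instance b under the two predicates."""
    if not a.track[0][0] < b.track[0][0]:  # (1) a must start before b
        return False
    tail = a.track[-max(1, ceil(tr * len(a.track))):]
    head = b.track[:max(1, ceil(hr * len(b.track)))]
    tw = window(tail, bd, tv, +1)
    hw = window(head, bd, hv, -1)
    # (2) co-occurrence: windows overlap in space at some shared time step
    return any(
        boxes_intersect(p, q)
        for t in tw.keys() & hw.keys()
        for p in tw[t]
        for q in hw[t]
    )
```

the sketch returns only a boolean; the strength of the co-occurrence (the chain index) would need a measure such as j*, which is not reproduced here.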
to form a 2-sequence, there must be one spatiotemporal follow relationship between two instances. more generally, to form a k-sequence, there must be k − 1 spatiotemporal follow relationships, one between each pair of consecutive participating instances. to measure how frequent an stes is in a given dataset, we use the participation index. the participation index is defined in [23], and shows the minimum relative frequency of the participating event types. for an event sequence, $es_j = (e_{j_1} \ldots e_{j_k})$, the participation index of an stes ($es_j$) is the minimum of the participation ratios (pr) of its event types:
$$pi(es_j) = \min_{e_i \in es_j} pr(e_i, es_j)$$
the participation ratio of an event type ($e_i$) on an stes ($es_j$) is the ratio of the number of unique participators of $e_i$'s instances to the total number of event instances of $e_i$ in the dataset:
$$pr(e_i, es_j) = \frac{|\{ins \in I_{e_i} : ins \text{ participates in an instance sequence of type } es_j\}|}{|I_{e_i}|}$$
where $|\cdot|$ shows the set size. another aspect of significance is the strength of the follow relationships. the significance assessment is important as the accuracy and reliability of resulting event sequences are dependent on the discovered significance of the instance sequences. for assessing the strength of the follow relationships, we use the chain index (denoted as ci). the ci for 2-sequences is defined as the significance of spatiotemporal co-occurrence between the tail window (tw) of the followee and the head window (hw) of the follower. in this work, we measure the significance of spatiotemporal co-occurrence using the $J^*$ measure [6, 8]. formally, for a 2-sequence, $isq_r = (ins_{r_1}\ ins_{r_2})$, the strength of the follow relationship is assessed as the $J^*$ value of the co-occurrence between $tw(ins_{r_1})$ and $hw(ins_{r_2})$, given that $t_s(ins_{r_1}) < t_s(ins_{r_2})$, where $t_s$ represents the starting time of an instance. for a k-sequence where k > 2, the significance is assessed as the minimum chain index of each follow relationship in the instance sequence (eq. 7):
$$ci(isq_r) = \min_{1 \le l < k} ci\big((ins_{r_l}\ ins_{r_{l+1}})\big) \tag{7}$$
in the threshold-based approaches, the instance sequences are considered significant if their chain index (ci) value is greater than a user-defined chain index threshold ($ci_{th}$).
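as a small worked illustration of the two measures defined above, the following sketch computes participation ratios, the participation index, and the chain index of a k-sequence. the data layout (instance sequences as tuples of `(instance_id, event_type)` pairs, instance counts as a dictionary) and the function names are assumptions made for this example; participators are counted per position in the event sequence.

```python
def participation_ratios(event_sequence, instance_sequences, instance_counts):
    """Participation ratio of each event type in event_sequence.

    instance_sequences: iterable of tuples of (instance_id, event_type) pairs.
    instance_counts: total number of instances per event type in the dataset.
    """
    event_sequence = tuple(event_sequence)
    unique = [set() for _ in event_sequence]
    for isq in instance_sequences:
        # keep only instance sequences that are of-type event_sequence
        if tuple(etype for _, etype in isq) != event_sequence:
            continue
        for pos, (ins_id, _) in enumerate(isq):
            unique[pos].add(ins_id)
    return [
        len(ids) / instance_counts[etype]
        for ids, etype in zip(unique, event_sequence)
    ]


def participation_index(event_sequence, instance_sequences, instance_counts):
    """Minimum of the participation ratios of the participating event types."""
    return min(participation_ratios(event_sequence, instance_sequences, instance_counts))


def chain_index(pairwise_ci):
    """Chain index of a k-sequence: minimum over its k - 1 follow relationships."""
    return min(pairwise_ci)
```

for example, if two of four active-region instances and two of eight flare instances participate in sequences of type (ar, fl), the ratios are 0.5 and 0.25 and the participation index is 0.25.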
similarly, the event sequences are considered frequent if their participation index (pi) value is greater than a user-defined participation index threshold ($pi_{th}$). the pattern-growth based esgrowth algorithm is introduced in [4]. the algorithm first identifies the follow relationships and creates the event sequence graph (esg). later, the algorithm recursively discovers the stess from the graph structure. the esgrowth algorithm is outlined in algorithm 1. in the initialization steps, as explained in section 2, the algorithm creates head and tail windows of instances and identifies the follow relationships between those instances (algorithm 1, lines 2 to 4). the follow relationships are discovered by a spatiotemporal join procedure where the head and tail window trajectories of the instances are joined and filtered based on the $ci_{th}$ for significance testing. later, these significant follow relationships are inserted into a directed acyclic graph structure, the esg (line 5). for any discovered follow relationship in which $ins_i$ is followed by $ins_j$, the transform procedure adds two vertices that represent $ins_i$ and $ins_j$ (if they are not present already), and a directed edge from $ins_i$ to $ins_j$. the graph only stores the identifiers ($i$ and $j$) and event types ($ins_i.e$ and $ins_j.e$) of the instances in its vertices. after creating the esg, esgrowth recursively discovers the stess. the algorithm starts by iterating the event types ($e_i$) in the set of all events ($E$). for each event type ($e_i$), it identifies the non-leaf vertices which correspond to the instances of $e_i$ (step 4). then a participation ratio (pr) is calculated from those vertices to check the prevalence using $pi_{th}$. if pr is greater than the $pi_{th}$, the recursive growsequence procedure is called. the growsequence procedure is shown in the second part of algorithm 1.
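a minimal sketch of the esgrowth procedure, covering the graph construction described above and the recursive growsequence step detailed in the following paragraph, is given below. it is written from the prose description only and is not the authors' algorithm 1: the follow relationships are assumed to be already filtered by the chain index threshold, vertices are hypothetical `(instance_id, event_type)` tuples, and the names `build_esg` and `es_growth` are illustrative.

```python
from collections import defaultdict


def build_esg(follow_pairs):
    """Build the event sequence graph from significant follow relationships.

    follow_pairs: iterable of (followee, follower), each an
    (instance_id, event_type) tuple. Returns (vertices, successors).
    """
    vertices = set()
    successors = defaultdict(set)
    for followee, follower in follow_pairs:
        vertices.update((followee, follower))
        successors[followee].add(follower)
    return vertices, successors


def es_growth(follow_pairs, instance_counts, pi_th):
    """Return {event_sequence: participation value} for prevalent sequences."""
    vertices, successors = build_esg(follow_pairs)
    prevalent = {}

    def grow_sequence(prefix, pr, v_pre):
        # successors (immediate neighbors) of the last vertices of the prefix
        suc_v = set()
        for v in v_pre:
            suc_v |= successors.get(v, set())
        for e_j, total in instance_counts.items():
            suc_v_ej = {v for v in suc_v if v[1] == e_j}
            pr_new = len(suc_v_ej) / total
            if pr_new > pi_th:  # otherwise: base case, stop extending
                new_prefix = prefix + (e_j,)
                new_pr = min(pr, pr_new)
                prevalent[new_prefix] = new_pr
                grow_sequence(new_prefix, new_pr, suc_v_ej)

    for e_i, total in instance_counts.items():
        # non-leaf vertices of type e_i can start an instance sequence
        starts = {v for v in vertices if v[1] == e_i and successors.get(v)}
        pr = len(starts) / total
        if pr > pi_th:
            grow_sequence((e_i,), pr, starts)
    return prevalent
```

because a follower always starts later than its followee, the graph is acyclic and the recursion terminates even for a threshold of zero.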
the procedure takes a prefix event sequence (prefix), the current minimum pr for the prefix sequence, and a set of pointers to the vertices (v pre ) as parameters. the vertices in v pre correspond to the last discovered vertices in the paths, which virtually form the instance sequences of type prefix (event sequence) in the esg. the procedure proceeds as follows. first, the successors (immediate neighbors) of the vertices in v pre are found and added to the successor vertex set (sucv ). then, for each event type e j , a subset of successor vertex set (containing the instances of e j ) is created (line 4 of growsequence procedure in algorithm 1-denoted as sucv e j ). after identifying successors, a temporary participation ratio value (pr') is calculated for the extended event sequence (line 5 of growsequence). if pr' is greater than the pi th , the prefix is extended with e j , and a new prefix event sequence (denoted as prefix') is created (line 7 of growsequence procedure). at this point, prefix' is guaranteed to be prevalent, and is thus inserted into the set of prevalent event sequences, es. lastly, the growsequence procedure is called with the newly created event sequence, prefix'. along with the prefix', the minimum of old and new participation ratio (min(pr, pr')), and the vertex subset formed by the successors (sucv e j ) are passed as parameters. note that the base case in the recursion occurs when the temporary participation ratio is less than the participation index threshold. in this case, the prefix', which is created by appending the new event type to prefix, is not prevalent. therefore, there is no need to check the longer length event sequences generated from prefix'. previous approaches [4, 5, 9] use threshold-based stes mining algorithms that rely heavily on the domain experts' knowledge to choose appropriate threshold parameters, which is usually not available.
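the esg construction and the recursive growsequence procedure described above can be sketched in python as follows. this is a simplified illustration rather than algorithm 1 itself: the follow relationships are assumed to be already filtered by ci th , the names build_esg, esgrowth and grow are hypothetical, and reporting min(pr, pr') as the value of each extended sequence is an assumption.

```python
from collections import Counter, defaultdict


def build_esg(follows):
    """directed graph as successor sets: one edge i -> j per follow relationship."""
    succ = defaultdict(set)
    for i, j in follows:
        succ[i].add(j)
    return succ


def esgrowth(follows, etype, pi_th):
    """follows: iterable of (i, j) instance id pairs; etype: id -> event type."""
    succ = build_esg(follows)
    totals = Counter(etype.values())  # instances per event type in the dataset
    prevalent = {}

    def grow(prefix, pr, v_pre):
        # successors (immediate neighbors) of the last vertices of the paths
        suc_v = set()
        for v in v_pre:
            suc_v |= succ.get(v, set())
        for e_j in totals:
            suc_e = {v for v in suc_v if etype[v] == e_j}
            pr_new = len(suc_e) / totals[e_j]
            if not suc_e or pr_new <= pi_th:
                continue  # base case: the extension is not prevalent
            new_prefix = prefix + (e_j,)
            new_pr = min(pr, pr_new)
            prevalent[new_prefix] = new_pr
            grow(new_prefix, new_pr, suc_e)

    for e_i in totals:
        # non-leaf vertices of type e_i, i.e. vertices with an outgoing edge
        heads = {v for v, out in succ.items() if out and etype[v] == e_i}
        pr = len(heads) / totals[e_i]
        if pr > pi_th:
            grow((e_i,), pr, heads)
    return prevalent


etype = {1: "ar", 2: "ar", 3: "fl", 4: "fl", 5: "sg"}
follows = [(1, 3), (2, 3), (3, 5)]
result = esgrowth(follows, etype, pi_th=0.4)
```

because the esg is acyclic, the recursion ends once a vertex set has no successors of any type.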
the thresholds in the earlier approaches are necessary to understand whether a discovered sequence is spurious or sufficiently frequent. however, often times, determining generic threshold values for all the event sequences is impractical, even for the domain experts. to tackle these issues, we propose a novel algorithm, rand-esminer (randomized spatiotemporal event sequence miner) that can help domain experts understand the intrinsic characteristics of the spatiotemporal event sequences better. the threshold-based approach is essentially a constraint-based data mining approach, where the instance sequences are filtered based on the ci th , and the event sequences are filtered based on the pi th . the usefulness of these thresholds for the efficiency of the mining algorithms is indisputable, primarily because of the exponential time and space complexity of these algorithms. however, the imbalance of the spatiotemporal instances in these datasets and their characteristics such as lifespan, total area, or the areal evolution make it very difficult for domain or data mining experts to create meaningful threshold values [7] . because of these, we created an algorithm that does not take participation index or chain index thresholds as input, but outputs stess along with a distribution of pi values. threshold-based algorithms output a set of prevalent stess (often coupled with a single pi value). from a practical point of view, event sequences which include rarely occurring events or instances with significantly different spatiotemporal characteristics are often overlooked. with the randomization approach, we do not consider any significance or prevalence threshold and perform the mining on the resampling subsets of the follow relationships. the randomized algorithm is inspired from the permutation tests (random resamples without replacement) and outputs the participation index values of all the discovered patterns. 
the primary task of the statistical descriptors is to summarize a characteristic of the given data and generalize the finding to the larger population. basic sample statistics such as sample mean or median give information about that particular sample; however, their values would fluctuate from sample to sample and magnitude of these fluctuations around the corresponding parameter is also important for the relevance of these results. statistical bootstrapping is an alternative to the traditional statistical methodology of assuming a particular probability distribution for a sample. bootstrap is a random resampling technique that estimates the distribution of a statistic and allows measures of accuracy to sample estimates [17] , it is especially useful when there is no analytical form to help estimate the distribution of the statistics of interest such as mean or variance. an increasingly common statistical tool for constructing sampling distributions is the randomization test (also referred to as permutation test). similar to the idea of bootstrapping, a permutation test builds sampling distribution, which is the permutation distribution, by resampling the observed data. specifically, we can permute the observed data. unlike bootstrapping, permutation tests resamples the data without replacement, which is more appropriate for our tasks. in our application area of solar data mining, the lack of accurate data is a common problem and one way to tackle with these noisy data is to randomize the mining process and obtain uncertainties with confidence intervals. our primary data source for stes mining is the solar event metadata obtained from the feature finding teams (fft) of solar dynamics observatory [33] . firstly, most of the solar event instances (representing the regions of solar events) have become available only after the launch of the solar dynamics observatory (sdo). hence, we only have the slices of the data. 
additionally, the sdo only captures the images of one side of the sun that is visible to us, and we currently do not have the reports of the solar events occurring on the opposite side of our line of sight. in our randomization approach, we treat the follow relationships as the sample dataset, and the participation index (pi) values of stess as a complex statistic to be obtained from the esg structure. by applying a random resampling without replacement (i.e., applying a set of permutation test) of the follow relationships (i.e., edges) we have the opportunity to explain the prevalence of stess as a distribution. note here that we do not perform a traditional permutation test which would require a null model and a statistical hypothesis testing. given the characteristics of solar event datasets, this approach is very promising due to its power of estimating the confidence intervals of the participation index for stess. this is to say, for each randomized experiment, we can obtain a participation index value for a discovered pattern and estimate its likelihood and see the variance of these values. in this part, we will explain the details of rand-esminer algorithm. the algorithm makes use of random resampling of follow relationships and outputs all the discovered stess along with their participation index values, which shows the prevalence of a particular stes, in each random trial. in algorithm 2, we show the overview of the randomized stes mining algorithm. the algorithm initially discovers all the follow relationships as in the threshold-based esgrowth algorithm and creates the esg (lines 2 to 5). in a nutshell, it firstly randomly resamples the edges in the esg, then discovers the stess from the resampled event sequence graphs, and eventually collects the results. the algorithm takes the set of instances (i), resampling ratio (rr), and the number of random trials to be performed (ν) as input. 
resampling ratio (rr) is the ratio between the number of edges to be resampled and the total number of edges in the esg. note here that neither the rr nor the ν parameter requires expert knowledge; they are used for regulating the randomized trials and are not necessarily dependent on the intrinsic characteristics of data. the esg structure allows us to create random resamples of the follow relationships. the rand-esminer algorithm performs ν randomized runs to estimate the pi value for all the discovered stess (lines 7 to 15). the resampling is applied on the edges of the graph, which correspond to the follow relationships among instances. edge resampling creates a subgraph of the esg, that is resg, by randomly sampling from the edge set of esg without replacement (line 8). note that our graph structure is not a multigraph; therefore, we opt for a permutation test (resampling without replacement) instead of bootstrapping (resampling with replacement). next, we discover the stess from the resampled subgraphs of esg. for each resampled subgraph, we perform a recursive procedure similar to the esgrowth algorithm. for each event type e i , we find the non-leaf vertices of e i , and grow the sequences (see growrandsequences procedure in the second part of algorithm 2). this can be considered as running the esgrowth algorithm with pi th = 0.0. lastly, we append them to the map structure (es rand ) for every iteration, and return the es rand , which contains the discovered stess and a size-(ν) list of pi values for each sequence. if an stes is not discovered during a trial, the pi value for that stes is recorded as 0. if an stes is recorded for the first time during the kth trial, we create a new list of pi values (length-k) backfilled with zeroes for each previous trial in which it was not discovered.
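the outer loop of the randomized procedure (edge resampling without replacement, mining, and collecting one pi value per trial with zero backfilling) can be sketched as below. this is a minimal python illustration under stated assumptions: the function rand_esminer and its signature are hypothetical, the mining step is passed in as a callable, and toy_miner is only a stand-in that returns a dummy pi of 1.0 for every event type pair present in the resample.

```python
import random


def rand_esminer(edges, rr, trials, mine, seed=None):
    """edges: follow relationships of the esg; rr: resampling ratio in (0, 1];
    mine: callable mapping a list of edges to {event sequence: pi value}."""
    rng = random.Random(seed)
    edges = list(edges)
    k = round(rr * len(edges))  # number of edges kept in each resample
    es_rand = {}
    for t in range(trials):
        resg = rng.sample(edges, k)  # sampling without replacement
        found = mine(resg)
        for seq, pi in found.items():
            if seq not in es_rand:
                es_rand[seq] = [0.0] * t  # backfill the earlier trials
            es_rand[seq].append(pi)
        for values in es_rand.values():
            if len(values) == t:  # known sequence, not discovered this trial
                values.append(0.0)
    return es_rand


def toy_miner(edges):
    # hypothetical stand-in for the recursive mining step; vertices are
    # (event type, id) pairs and every observed type pair gets a dummy pi
    return {(u[0], v[0]): 1.0 for u, v in edges}


toy_edges = [
    (("ar", 1), ("fl", 1)),
    (("ar", 2), ("fl", 2)),
    (("ss", 1), ("sg", 1)),
    (("ss", 2), ("sg", 2)),
]
es_rand = rand_esminer(toy_edges, rr=0.5, trials=20, mine=toy_miner, seed=0)
```

passing the miner as a parameter keeps the sketch independent of how the sequences are grown; in the algorithm described above this step corresponds to the recursive growth with pi th = 0.0.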
it is worth mentioning that unlike the threshold-based approaches which return a list of prevalent stess, the output of the rand-esminer algorithm is a map structure whose keys are discovered patterns and the values are list of calculated pi values in each of the (ν) random trials. in this section, we present our experimental evaluations of the randomization-based stes mining approach. we used real-life datasets from solar astronomy field. we evaluated the runtime performance of graph transformation procedures and compared the esgrowth and rand-esminer algorithms. our algorithms are implemented in java programming language, datasets were stored in text files, and experiments were conducted on an ubuntu virtual machine with 1tb ram with an intel xeon processor. all the event instances and graph elements are stored in memory for fair comparison. to analyze the performance of our proposed algorithm, we used real-life solar event datasets. these are monthly datasets from 2012, which include the spatiotemporal instances of seven different solar event types that are: active regions (ar), coronal holes (ch), emerging flux (ef ), filaments (f i), flares (f l), sigmoids(sg), and sunspots (ss). each instance consists of region polygons, downloaded from heliophysics event knowledgebase (hek) [31] , and the regions are tracked and interpolated using the algorithms presented in [11, 26] . the characteristics of our real-life datasets can be seen in table 1 . the datasets are in tabular format, where the instances of a particular event type is stored as a table. each row shows a particular time-geometry pair with four attributes: instance identifier, start time of the time-geometry pair, end time of the time-geometry pair, and spatial geometry. the spatial geometry is a polygon object formatted as well-known text (wkt) format. 
the goal of the experiments is to examine the performance of our randomized algorithm, both in terms of relevance and efficiency, on these datasets and compare it with the threshold-based esgrowth algorithm. we will compare the discovered stess in our relevance analysis and later evaluate the running time efficiency. a preliminary step of the algorithm is the initialization (head and tail window generation) and the graph transformation. for that reason, we kept the global parameters that are used in the esg creation constant throughout all the experiments. these parameters represent independent variables that should not alter the performance of the algorithms within this set of experiments. head and tail ratio parameters were selected as 0.2 and 0.5, respectively. the value used for the tail validity (tv) is 4 hours. head validity (hv) was set to zero for consistency with our earlier works. we chose 25 arcsec as the buffer distance (bd) parameter. for the case of the threshold-based algorithm, we conducted 16 experiments for each dataset with varying ci th and pi th values. the ci th value was set to 0.10, 0.15, 0.2, and 0.25, while pi th value was set to 0.04, 0.08, 0.12, and 0.16. we ran the esgrowth algorithm with the above-mentioned combinations of ci th and pi th values to discover the frequent sequences. eventually, a total of 192 experiments were performed on the 12 datasets for the threshold-based approach. on the other hand, for the randomization approach, we resampled the data 100 times (ν = 100) for every dataset and estimated the distribution of the pi values of all the event sequences. the size of each sample is 10% the size of its respective original dataset (rr = 0.1). thus, we generated 100 pi values for all the discovered sequences to estimate prevalence of event sequences. in this part, we will discuss the relevance of the mining results from rand-esminer algorithm.
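a list of pi values per event sequence, such as the 100 values generated above, can be condensed into a few summary statistics. the following python sketch is illustrative only: the function name summarize is hypothetical, and the nearest-rank percentile interval is an assumed choice, since the text does not state how the intervals are computed.

```python
import math
import statistics


def summarize(pi_values, lower=2.5, upper=97.5):
    """median, share of trials with a discovery, and a percentile interval."""
    n = len(pi_values)
    ordered = sorted(pi_values)

    def percentile(p):
        # nearest-rank percentile on the sorted values
        rank = max(1, math.ceil(p / 100 * n))
        return ordered[rank - 1]

    return {
        "median": statistics.median(ordered),
        # a pi value of 0 marks a trial in which the sequence was not found
        "discovered": sum(v > 0 for v in pi_values) / n,
        "interval": (percentile(lower), percentile(upper)),
    }


summary = summarize([0.0, 0.1, 0.2, 0.3, 0.4])
```

the median and the share of trials with a discovery correspond to the kind of quantities that are averaged across the monthly datasets in the relevance analysis.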
for brevity, we chose to illustrate the length-2, length-3, and length-4 stess with the top-15 mean pi values from jan, feb, mar, and apr datasets. the comprehensive results for length-2, 3, and 4 stess from all datasets can be found in the appendix of this paper. figure 4 illustrates the most prevalent length-2 (top row), length-3 (middle row), length-4 (bottom row) stess from four datasets. the distribution of pi values from rand-esminer are demonstrated with the box plots. we also demonstrate the discovered pi values from the threshold-based approach with varying size scatter elements. each scatter represents a different experimental run, and when the event sequence is not found to be prevalent on an esgrowth experiment (meaning pi was less than pi th ), the result from that experiment is omitted. from fig. 4 , we can see that the discovered top-15 stess are consistent throughout the four datasets. we also present the number of top-15 occurrences for length-2 stess from all datasets in fig. 5 . the results for the length-3 and length-4 stess are available in the appendix for completeness. eight of the top-15 length-2 stess are discovered in all twelve datasets and 13 of them were discovered at least ten times. we further analyzed these 13 stess. table 2 shows the number of times each stes was discovered by esgrowth as well as the averaged (across 12 monthly datasets) median pi values and the averaged percentage of randomized trials in which the stes was discovered. we can observe that averaged median pi values for these stess are generally over 0.04 (pi th for the comparable esgrowth experiment) and they are discovered in almost all of the randomized trials (that is to say the average percentage of randomized trials these stess were discovered is above 97% for the aforementioned stess). this is not the case for threshold-based runs, where some of these well-known stess were not discovered, even for relatively low threshold values.
(see a non-exhaustive list of observations found in the literature for some of these patterns: 'ss sg' [12, 45] ; 'ef ef' [41] ; 'ar ar' [21, 46] ; 'ef fl' [2, 41, 48] ; or 'ss ss' [22, 43] ). another aspect of our evaluation is the relevance comparison with threshold-based approach in terms of varying frequencies of stess. one observation we can make is the variation of the pi values when using different ci th values in the threshold-based approach. the variation is two-fold: (1) the variation of the pi values for a particular stes and (2) the variation of the pi values across different stess. the latter is much expected as the natural phenomena may or may not be spatiotemporally following each other. however, the former variation poses a challenge that is difficult to solve with trial and error. for example, for (f l f l) sequences ci th = 0.2 and pi th = 0.12 can be accurate cut-off points for thresholds. however, if we set the ci th to 0.2 and pi th to 0.12 for the entire dataset, we miss practically all (ar f l) sequences, as well as the sequences including the (ar f l) subsequence. it is well-known to solar physics experts that flares can occur anywhere on the sun's surface, from active regions (ar) to the boundaries of the magnetic network of the quiet sun [27] . however, large area flares (f l) have preferred locations. they originate from the large active regions showing a complex geometry of the 3d magnetic field [38] .
table 2. the stess are selected based on their top-15 occurrences (see fig. 5 ). averaged median participation index (pi) values and average percentage an stes was discovered in randomized trials are reported for rand-esminer. number of months an stes was discovered by esgrowth (with ci th = 0.1 and pi th = 0.04) is also reported. rand-esminer discovered these stess for all monthly datasets.
to capture (ar f l) event sequence, we can use ci th = 0.1 and pi th = 0.04 (see the results in table 2 ).
however, this time majority of (ar ar), (ef ef ), and (ss sg) would be missed. these examples can be extended to include those three sequences leading to a never-ending cycle of pattern importance discussion. even for the simplistic cases of (f l f l) and (ar f l), or (ar f l) and (ss sg) creating user-defined thresholds is difficult, primarily because of the unbalanced spatiotemporal characteristics of the natural phenomena. therefore, we can suggest that mining a distribution of pi values using random edge resampling from the sample esg is a better approach, because outputing a mere pi value based on set thresholds cannot properly represent the characteristics of the population. in this part of our evaluation, we will show the runtime complexity of the initialization phase of rand-esminer algorithm. in fig. 6 , we demonstrate the running times in the initialization phase of the rand-esminer algorithm for all the datasets. we measured the running times in the initialization phase into two categories: (1) head and tail window generation time, which is denoted as ht generation time in fig. 6 and corresponds to line 2 and 3 in algorithm 2 and (2) follow relation and graph transformation time, which is denoted as follow time in fig. 6 and corresponds to line 4 and 5 in algorithm 2. the head and tail window generation requires complex spatial buffer and union operations. similarly, the follow relationships are discovered with a computationally expensive spatiotemporal join operation on evolving region trajectories. creating the esg is significantly less complex in terms of computation time. along with the running times, in fig. 6 , we illustrated the vertex and edge counts in the created esg for each dataset. the number of vertices correspond to the number of instances in the dataset, while the number of edges show the number of follow relationships. 
from the results, we can see that the head and tail window generation time varies significantly for each monthly dataset. we can observe that part of this stems from the number of instances (vertices) in the dataset, and another factor is the number of individual region polygons in the datasets. we observe the highest head and tail window generation times are recorded for may and june datasets where we have highest number of region polygons in the datasets. similarly, the lowest ht generation times are recorded for february and april datasets where we have the lowest number of region polygons. the follow time also greatly varies across our datasets. the follow time depends on the number of spatiotemporal follow relationships among the instances in the dataset. while we cannot assert a total correlation, the number of edges in the generated graph is a good indicator for the follow time. another factor that impacts the follow time is the number of instances and region polygons, because we get 20% and 50% of the instance trajectories as heads and tails (as hr = 0.2&tr = 0.5). for the case of the head and tail-window generation, our algorithm iterates through all the instances in the dataset and compute the time propagated and spatially buffered timegeometry pairs (representing the region trajectories). this process is done in linear time which explains the relation between the running time and the number of instances and region polygons. on the other hand, the esg generation algorithm iterates through the tail windows and performs a spatiotemporal join on overlap predicate with the head window of instances. this makes the complexity of the follow relationship identification quadratic; however, since we apply a two-step filter based on time-overlap and spatial-overlap predicates the complexity becomes subquadratic (and very close to linear) with respect to the number of region polygons in follow time. 
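the two-step filter mentioned above (time overlap first, spatial overlap second) can be sketched in python as follows. the sketch makes several simplifying assumptions: each window is a single record with a start time, an end time and a bounding box that stands in for the buffered region polygons, the field names and the function candidate_follows are hypothetical, and the significance assessment of the surviving pairs is left out.

```python
from bisect import bisect_left


def bbox_overlaps(a, b):
    """true if two (xmin, ymin, xmax, ymax) boxes intersect."""
    ax0, ay0, ax1, ay1 = a
    bx0, by0, bx1, by1 = b
    return ax0 <= bx1 and bx0 <= ax1 and ay0 <= by1 and by0 <= ay1


def candidate_follows(tails, heads):
    """return (followee id, follower id) pairs that pass both filters."""
    heads = sorted(heads, key=lambda h: h["start"])
    starts = [h["start"] for h in heads]
    pairs = []
    for t in tails:
        # step 1a: only heads that start before the tail window ends
        hi = bisect_left(starts, t["end"])
        for h in heads[:hi]:
            # step 1b: the head must also end after the tail window starts
            if h["id"] == t["id"] or h["end"] <= t["start"]:
                continue
            # step 2: spatial overlap on the pairs that overlap in time
            if bbox_overlaps(t["bbox"], h["bbox"]):
                pairs.append((t["id"], h["id"]))
    return pairs


tails = [{"id": "a", "start": 0, "end": 4, "bbox": (0, 0, 2, 2)}]
heads = [
    {"id": "b", "start": 3, "end": 6, "bbox": (1, 1, 3, 3)},
    {"id": "c", "start": 5, "end": 8, "bbox": (1, 1, 3, 3)},
    {"id": "d", "start": 2, "end": 5, "bbox": (10, 10, 11, 11)},
]
pairs = candidate_follows(tails, heads)
```

sorting the head windows by start time lets the temporal filter discard most non-overlapping pairs before any geometry is compared, which is the intuition behind the close-to-linear behavior reported above.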
it should be noted that, in the situation where there is a time requirement constraint, the user can shrink the size of the head and tail windows to decrease the amount of overlapping; thus, reducing the number of follow relationships. in this part of our experiments, the running time requirements of our rand-esminer algorithm will be compared to the esgrowth algorithm. in fig. 7 , we demonstrate the total running times of our algorithms for each dataset (in (a)), as well as the average time spent on mining stess from esg (in (b)). in fig. 7a , the blue bars show the average running time of esgrowth algorithm with 19 different ci th values. the red bars show the total running time of rand-esminer algorithm, which consists of 100 randomized runs on the esg. in fig. 7b , we demonstrate the running times required for mining stess from esg. the initialization steps (ht generation and follow times shown in fig. 6 ) are omitted here for a better comparison, and we report the average running times of threshold-based runs (with 16 different ci th and pi th combinations), and the average running time of 100 randomized runs for each dataset. from the results shown in fig. 7a , we can notice that the total running times required for mining the stess follow a very similar pattern to the initialization, and it can be observed that for threshold-based approach, the total running time is dominated by the initialization. the higher ci th values we use for filtering the insignificant follow relationships (or edges) extensively prune the esg, leading to very low graph mining times. nevertheless, it is difficult to make conclusions about the trustworthiness of the stess or overall generality of stes mining process with high ci th values. when we analyze the performance of the rand-esminer algorithm, we see that for sparse esgs (such as feb, apr, and oct datasets) the total running time of the rand-esminer is similar to the esgrowth. 
on the other hand, for denser esgs (such as may, jun, and jul datasets), we observe greater differences.
figure caption: comparison of the average run on the randomized and threshold-based approaches.
this can be explained well with the algorithmic setup of randomization approach and the observations from fig. 7b . the average esg mining time of randomization approach for may, jun, and jul datasets is relatively higher than for the other datasets. in our experiments, the esg is resampled 100 times, and total running time of rand-esminer includes all the 100 randomized runs, whereas, for the threshold-based approach, the esg is mined only once. in summary, the total running time required for rand-esminer is approximately 27% more than esgrowth. the running time required for the rand-esminer is primarily dependent on the resampling ratio and the number of trials. to increase the trustworthiness of the results, one can increase the number of trials and resampling ratio. in addition to that, the trustworthiness of the results can be traded off with the running time. choosing a lower resampling ratio or number of trials would decrease the running time, as well as the trustworthiness of the results. in this work, we have introduced a novel spatiotemporal event sequence mining algorithm -rand-esminer, specifically designed for discovering stess without user-defined thresholds. our method differs from the conventional threshold-based methods, which can be inaccurate and thus inapplicable for large-scale data analysis. our novel randomized algorithm relies on applying permutation tests to the edges in the event sequence graph generated by spatiotemporal follow relationships. unlike the traditional techniques which discover stess with one pi value [4, 5] , our algorithm discovers a distribution of pi values, and estimates a confidence interval for stess without any thresholds. mining stess without thresholds is significant for scientific fields, as it can be easily applied for explorative tasks.
our future work lies in the parallelization of rand-esminer algorithm. as the number of random resamplings and resampling ratio increases, the rand-esminer can be less efficient and exploiting parallel computation can leverage the efficiency issues and provide us with highly robust outcomes. a survey of covid-19 contact tracing apps magnetic flux emergence and associated dynamic phenomena in the sun spatiotemporal frequent pattern mining on solar data: current algorithms and future directions a graph-based approach to spatiotemporal event sequence mining spatiotemporal event sequence mining from evolving regions time-efficient significance measure for discovering spatiotemporal co-occurrences from data with unbalanced characteristics spatiotemporal frequent pattern discovery from solar event metadata measuring the significance of spatiotemporal cooccurrences top-(r%, k) spatiotemporal event sequence mining severe space weather events-understanding societal and economic impacts. a workshop report spatio-temporal interpolation methods for solar events metadata observations of rotating sunspots from trace discovering tight space-time sequences mining frequent spatio-temporal sequential patterns analysing spatiotemporal sequences in bluetooth tracking data sequence data mining nonparametric estimates of standard error: the jackknife, the bootstrap and other methods toward spatio-temporal patterns quantifying sars-cov-2 transmission suggests epidemic control with digital contact tracing trajectory pattern mining properties and emergence patterns of bipolar active regions measurement of kodaikanal white-light images-v. 
tilt-angle and size variations of sunspot groups discovering colocation patterns from spatial data sets: a general approach a framework for mining sequential patterns from spatio-temporal event data sets on the relation between filament eruptions, flares, and coronal mass ejections tracking solar events through iterative refinement x-ray network flares of the quiet sun highlights of the space weather risks and society workshop mining probabilistic frequent spatio-temporal sequential patterns with gap constraints from uncertain databases solar storm risk to the north american electric grid heliophysics event knowledgebase triggering an eruptive flare by emerging flux in a solar active-region complex computer vision for the solar dynamics observatory (sdo) cascading spatio-temporal pattern discovery: a summary of results cascading spatio-temporal pattern discovery onset of the magnetic explosion in solar flares and coronal mass ejections mining access patterns efficiently from web logs evolution of magnetic fields and energetics of flares in active region 8210 on the probability of occurrence of extreme space weather events the pattern next door: towards spatio-sequential pattern discovery magnetic flux emergence along the solar cycle spatiotemporal data mining: a computational perspective sunspots: an overview mining sequential patterns: generalizations and performance improvements role of sunspot and sunspot-group rotation in driving sigmoidal active region eruptions evolution of active regions mining complex spatio-temporal sequence patterns flares associated with efr's (emerging flux regions) normalized-mutual-information-based mining method for cascading patterns spade: an efficient algorithm for mining frequent sequences splitter: mining fine-grained sequential patterns in semantic trajectories data mining applications in social security spatiotemporal event forecasting in social media acknowledgements this project has been supported in part by funding from the 
division of advanced cyberinfrastructure within the directorate for computer and information science and engineering, the division of astronomical sciences within the directorate for mathematical and physical sciences, and the division of atmospheric and geospace sciences within the directorate for geosciences, under nsf award #1443061. it was also supported in part by funding from the heliophysics living with a star science publisher's note springer nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. berkay aydin, ph.d is a research assistant professor in the department of computer science at georgia state university (gsu), as part of the next generation-astroinformatics research cluster program. he was a postdoctoral research associate at gsu prior to his current position, where his research was sponsored by nsf and nasa grants. dr. aydin's research is interdisciplinary and focused on management, retrieval, and analysis of solar astronomy big data. currently, he works on creating algorithms and data models for multivariate time series prediction and spatiotemporal frequent pattern discovery, which is helpful for understanding the implicit temporal and spatial relationships appearing among objects with spatial and temporal extents as well as forecasting extreme spaceweather events. soukaina filali boubrahimi is a ph.d. student at the computer science department, georgia state university. her research interests are in the problem of ensemble learning applied to time series data, a common machine learning method used to maximize the learning accuracy and popular in many domains. she is also involved in projects in the data-mining lab consisting of: (1) interpolation of spatio-temporal objects representing solar event trajectories, (2) clustering and visualization of decision trees of coronal mass ejection data, and (3) mining discriminative patterns from fmri-based networks. 
, and information retrieval (text and image data). he has published over 100 journal articles, book chapters, and peer-reviewed conference papers in these areas. his research has been sponsored by several federal agencies: nasa (major contributor), nsf, nga, as well as industry partners: intergraph corporation and rightnow technologies (now oracle), with the successful grant history exceeding $10m. key: cord-003316-r5te5xob authors: balloux, francois; brønstad brynildsrud, ola; van dorp, lucy; shaw, liam p.; chen, hongbin; harris, kathryn a.; wang, hui; eldholm, vegard title: from theory to practice: translating whole-genome sequencing (wgs) into the clinic date: 2018-12-17 journal: trends microbiol doi: 10.1016/j.tim.2018.08.004 sha: doc_id: 3316 cord_uid: r5te5xob hospitals worldwide are facing an increasing incidence of hard-to-treat infections. limiting infections and providing patients with optimal drug regimens require timely strain identification as well as virulence and drug-resistance profiling. additionally, prophylactic interventions based on the identification of environmental sources of recurrent infections (e.g., contaminated sinks) and reconstruction of transmission chains (i.e., who infected whom) could help to reduce the incidence of nosocomial infections. wgs could hold the key to solving these issues. however, uptake in the clinic has been slow. some major scientific and logistical challenges need to be solved before wgs fulfils its potential in clinical microbial diagnostics. in this review we identify major bottlenecks that need to be resolved for wgs to routinely inform clinical intervention and discuss possible solutions.
thanks to progress in high-throughput sequencing technologies over the last two decades, generating microbial genomes is now considered neither particularly challenging nor expensive. as a result, whole-genome sequencing (wgs) (see glossary) has been championed as the obvious and inevitable future of diagnostics in multiple reviews and opinion pieces dating back to 2010 [1] [2] [3] [4] . despite enthusiasm in the community, wgs diagnostics has not yet been widely adopted in clinical microbiology, which may seem at odds with the current suite of applications for which wgs has huge potential, and which are already widely used in the academic literature. common applications of wgs in diagnostic microbiology include isolate characterization, antimicrobial resistance (amr) profiling, and establishing the sources of recurrent infections and between-patient transmissions. all of these have obvious clinical relevance and provide case studies where wgs could, in principle, provide additional information and even replace the knowledge obtained through standard clinical microbiology techniques. this review reiterates the potential of wgs for clinical microbiology, but also its current limitations, and suggests possible solutions to some of the main bottlenecks to routine implementation.
in particular, we argue that applying existing wgs pipelines developed for fundamental research is unlikely to produce the fast and robust tools required, and that new dedicated approaches are needed for wgs in the clinic.

in principle, wgs can provide highly relevant information for clinical microbiology in near-real-time, from phenotype testing to tracking outbreaks. however, despite this promise, the uptake of wgs in the clinic has been limited to date, and future implementation is likely to be a slow process. the increasing information provided by wgs can cause conflict with traditional microbiological concepts and typing schemes. decreasing raw sequencing costs have not translated into decreasing total costs for bacterial genomes, which have stabilised. existing research pipelines are not suitable for the clinic, and bespoke clinical pipelines should be developed.

at the most basic level, wgs can be used to characterize a clinical isolate, informing on the likely species and/or subtype and allowing phylogenetic placement of a given sequence relative to an existing set of isolates. wgs-based strain identification gives a far superior resolution compared to genetic marker-based approaches such as multilocus sequence typing (mlst) and can be used when standard techniques such as pulsed-field gel electrophoresis (pfge), variable-number tandem repeat (vntr) profiling, and maldi-tof are unable to accurately distinguish lineages [5]. wgs-informed strain identification could be of particular significance for bacteria with large accessory genomes, which encompass many of the clinically most problematic bacteria, where much of the relevant genetic diversity is driven by differences in the accessory genome on the chromosome and/or plasmid carriage. somewhat ironically, the extremely rich information of wgs data, with every genome being unique, generates problems of its own.
clinical microbiology tends to rely on often largely ad hoc taxonomical nomenclature, such as biochemical serovars for salmonella enterica or mycobacterial interspersed repetitive units (mirus) for mycobacterium tuberculosis. while the rich information contained in wgs should in principle allow superseding traditional taxonomic classifications [6, 7] , defining an intuitive, meaningful and rigorous classification for genome sequences represents a major challenge. for strictly clonal species, which undergo no horizontal gene transfer (hgt), such as m. tuberculosis, it is possible to devise a 'natural' robust phylogenetically based classification [8] . unfortunately, organisms undergoing regular hgt, and with a significant accessory genome, do not fall neatly into existing classification schemes. in fact, it is even questionable whether a completely satisfactory classification scheme could be devised for such organisms, as classifications based on the core genome, accessory genome, housekeeping genes (mlst), genotypic markers, plasmid sequence, virulence factors or amr profile may all produce incompatible categories ( figure 1 ). beyond species identification and characterization, genome sequences provide a rich resource that can be exploited to predict the pathogen's phenotype. the main microbial traits of clinical relevance are amr and virulence, but may also include other traits such as the ability to form biofilms or survival in the environment. sequence-based drug profiling is one of the pillars of hiv treatment and has to be credited for the remarkable success of antiretroviral therapy (art) regimes. prediction of amr from sequence data has also received considerable attention for bacterial pathogens but has not led to comparable success at this stage. resistance against single drugs can be relatively straightforward to predict in some instances. 
for example, the presence of the sccmec cassette is a reliable predictor for broad-spectrum beta-lactam resistance in staphylococcus aureus, with strains carrying this element referred to as methicillin-resistant s. aureus (mrsa). in principle, wgs offers the possibility to predict the full resistance profile to multiple drugs (the 'resistome'). possibly the first real attempt to predict the resistome from wgs data was a study by holden et al. in 2013, showing that, for a large dataset of s. aureus st22 isolates, 98.8% of all phenotypic resistances could be explained by at least one previously documented amr element or mutation in the sequence data [9]. since then, several tools have been developed for the prediction of resistance profiles from wgs. these include those designed for prediction of resistance phenotype from acquired amr genes, including resfinder [10] and abricate (https://github.com/tseemann/abricate), together with those also taking into account point mutations in chromosome-borne genes such as arg-annot [11], the sequence search tool for antimicrobial resistance (sstar) [12], and the comprehensive antibiotic resistance database (card) [12]. of these, resfinder and card can be implemented as online methods that, dependent on user traffic, can be considerably slower than most other tools that only use the command-line. they are, however, superior in terms of broad usability and are more intuitive than, for example, the graphical user interface of sstar.

glossary

accessory genome: the variable genome consisting of genes that are present only in some strains of a given species. many of the organisms representing the most severe amr threats are characterised by large accessory genomes containing important components of clinically relevant phenotypic diversity.

antimicrobial resistance (amr): the ability of a microorganism to reproduce in the presence of a specific antimicrobial compound. also referred to as antibiotic resistance (abr or ar). the sum of the detected amr genes in a sequenced isolate is sometimes referred to as the resistome.

horizontal gene transfer (hgt): the transmission of genetic material laterally between organisms outside 'vertical' parent-to-offspring inheritance, including across species boundaries. genetic elements related to clinically relevant phenotypes such as amr and virulence are often transmitted via hgt.

k-mer: a string of length k contained within a larger sequence. for example, the sequence 'attgt' contains two 4-mers: 'attg' and 'ttgt'. the analysis of the k-mer content of raw sequencing reads allows for rapid characterization of the genetic difference between isolates without the need for genome assembly.

multilocus sequence typing (mlst): a scheme used to assign types to bacteria based on the alleles present at a defined set of chromosome-borne housekeeping genes. also referred to as sequence typing (st).

phylogenetic tree: a representation of inferred evolutionary relationships based on the genetic differences between a set of sequences. also referred to as a phylogeny.

transmission chain: the route of transmission of a pathogen between hosts during an outbreak. this can often be characterized using wgs compared to traditional epidemiological inference based on, for example, tracing contacts between patients.

virulence: broadly, a pathogen's ability to cause damage to its host, for example through invasion, adhesion, immune evasion, and toxin production. however, virulence is currently loosely defined by indirect proxies either phenotypically (e.g., through serum-killing assays) or genetically (e.g., by the presence of genes involved in capsule synthesis or hypermucoviscosity).

whole-genome sequencing (wgs): the process of determining the complete nucleotide sequence of an organism's genome. this is generally achieved by 'shotgun' sequencing of short reads that are either assembled de novo or mapped onto a high-quality reference genome.

other tools exist for richer species-specific characterization such as phyresse [13] and patric-rast [14]. further tools have been developed to predict phenotype directly from unassembled sequencing reads, bypassing genome assembly [15, 16]. it has been proposed that wgs-based phenotyping might, in some instances, be equally, if not more, accurate than traditional phenotyping [16] [17] [18] [19]. however, it is probably no coincidence that the most successful applications to date have primarily been on m. tuberculosis and s. aureus, which are characterised by essentially no, or very limited, accessory genomes, respectively. other successful examples include streptococcal pathogens, where wgs-based predictions and measured phenotypic resistance show good agreement even in large and diverse samples of isolates [20, 21]. on the whole, however, predicting comprehensive amr profiles in organisms with open genomes, such as escherichia coli, where only 6% of genes are found in every single strain [22], is challenging and requires extremely extensive and well curated reference databases. the transition to wgs might appear relatively straightforward if viewed as merely replacing pcr panels which are already used when traditional phenotyping can be cumbersome and unreliable. however, to put the problem in context, there are over 2000 described β-lactamase gene sequences responsible for multiresistance to β-lactam antibiotics such as penicillins, cephalosporins, and carbapenems [23]. whilst β-lactam resistance in some pathogens, including s. pneumoniae, can be predicted through, for example, penicillin-binding protein (pbp) typing and machine-learning-based approaches [24], the general problem of reliably assigning resistance phenotype based on many described gene sequences is commonplace. at this stage, many of the amr reference databases are not well integrated or curated and have no minimum clinical standard.
they often have varying predictive ranges and biases and produce fairly inaccessible output files with little guidance on how to interpret or utilise this information for clinical intervention. perhaps because of these limitations, although of obvious benefit as part of a diagnostics platform, both awareness and uptake in the clinic have been limited. additionally, with some notable exceptions, such as the pneumococci [24], most amr profile predictions from wgs data are qualitative, simply predicting whether an isolate is expected to be resistant or susceptible against a compound despite amr generally being a continuous and often complex trait. the level of resistance of a strain to a drug can be affected by multiple epistatic amr elements or mutations [25], the copy number variation of these elements [26], the function of the genetic background of the strain [27-29], and modulating effects by the environment [30]. the level of resistance is generally well captured by the semiquantitative phenotypic measurement minimum inhibitory concentration (mic), even if clinicians often use a discrete interpretation of mics into resistant/susceptible based on fairly arbitrary cut-off values. quantitative resistance predictions are not just of academic interest. in the clinic, low-level resistance strains can still be treated with a given antibiotic but the standard dose should be increased, which can be the best option at hand, especially for drugs with low toxicity. the majority of efforts to predict phenotypes from bacterial genomes have been on amr profiling. yet, some tools have also been developed for multispecies virulence profiling: the virulence factors database (vfdb) [31] or virulencefinder [32] as well as the bespoke virulence prediction tool for klebsiella pneumoniae, kleborate [33]. one major challenge is that virulence is often a context-dependent trait. for example, in k. pneumoniae various imperfect proxies for virulence are used.
these include capsule type, hypermucoviscosity, biofilm and siderophore production, or survival in serum-killing assays. while all of these traits are quantifiable and reproducible, and could thus in principle be predicted using wgs, it remains unclear how well they correlate with virulence in the patient. given that virulence is one of the most commonly studied phenotypes, yet lacks a clear definition, the general problem of predicting bacterial phenotype from genotype may be substantially more complex than the special case of amr, which is itself far from solved for all clinically relevant species. beyond phenotype prediction for individual isolates, wgs has allowed reconstructing outbreaks within hospitals and the community across a diversity of taxa ranging from carbapenem-resistant k. pneumoniae [34] [35] [36] and acinetobacter baumannii [37] to mrsa [38, 39], streptococcal disease [40], and neisseria gonorrhoeae [41], amongst others. wgs can reveal which isolates are part of an outbreak lineage and, by integrating epidemiological data with phylogenetic information, detect direct probable transmission events [42] [43] [44] [45]. timed phylogenies, for example generated through beast [46, 47], can provide likely time-windows on inferred transmissions, as well as dating when an outbreak lineage may have started to expand. approaches based on transmission chains can also be used to identify sources of recurrent infections (so called 'super-spreaders'), and do not necessarily rely on all isolates within the outbreak having been sequenced, allowing for partial sampling and analyses of ongoing outbreaks [48]. in this way wgs-based inference can elucidate patterns of infection which are impossible to recapitulate from standard sequence typing alone [35]. however, wgs-informed outbreak tracking is usually performed only retrospectively.
typically, the publication dates of academic literature relating to outbreak reconstruction lag greatly, often in the order of at least 5 years since the initial identification of an outbreak [49, 50] . even analyses published more rapidly are generally still too slow to inform on real-time interventions [38] . some attempts have been made to show that near-real-time hospital outbreak reconstruction is feasible retrospectively [51, 52] or have performed analyses for ongoing outbreaks in close to real-time [53, 54] , but these studies are still in a minority and remain largely within the academic literature. some of this time-lag probably relates to the difficulty of transmission-chain reconstruction at actionable time-scales. this can be relatively straightforward for viruses with high mutation rates, small genomes, and fast and constant transmission times, such as ebola [55] and zika virus [56] , but conversely, reconstructing outbreaks for bacteria and fungi poses a series of challenges. available tools tend to be sophisticated and complex to implement, and the sequence data needs extremely careful quality control and curation. unfortunately, in some cases insufficient genetic variation will have accumulated over the course of an outbreak, and a transmission chain simply cannot be inferred without this signal [57, 58] . furthermore, extensive within-host genetic diversity (typical in chronic infections) can render the inference of transmission chains intractable [59] . these complexities mean that a 'one-size fits all' bioinformatics approach to outbreak analyses simply does not exist. one of the key promises of wgs is in molecular surveillance and real-time tracking of infectious disease. this relies on transparent and standardized data sharing of the millions of genomes sequenced each year, together with accompanying metadata on isolation host, date of sampling, and geographic location. 
with enough data, surveillance initiatives have the potential to identify the likely geographic origin of emerging pathogens and amr genes, group seemingly unrelated cases into outbreaks, and clearly identify when sequences are divergent from other circulating strains. in a hospital setting, surveillance can help to detect transmission within the hospital and inflow from the community, optimize antimicrobial stewardship, and inform treatment decisions; at national and global scales, it can highlight worldwide emerging trends for which collated evidence can direct both retrospective but also anticipatory policy decisions. amongst the most successful global surveillance initiatives and analytical frameworks are those relating specifically to the spread of viruses. influenza surveillance is arguably the most developed, with large sequencing repositories such as the gisaid database (gisaid.org) and online data exploration and phylodynamics available through web tools such as nextflu [60] and nextstrain (http://nextstrain.org), which also allows examination of other significant viruses including zika, ebola, and avian influenza. another popular tool for the sharing of data and visualization of phylogenetic trees and their accompanying meta-data is microreact (microreact.org) [61] , which also allows for interactive data querying and includes bacteria and fungi. a further tool, predominately for bacterial data, is wgsa (www.wgsa.net). wgsa allows the upload of genome assemblies through a drag-and-drop web browser, allowing for a quick characterization of species, mlst type, resistance profile, and phylogenetic placement in the context of the existing species database based on core genes. at the time of writing wgsa comprises 20 649 genomes predominantly from s. aureus, n. gonorrhoeae, and salmonella enterica serovar typhi, together with ebola and zika viruses, all with some associated metadata. 
although an exciting initiative, wgsa and associated platforms are still a reasonably long way off characterizing all clinically relevant isolates and often rely entirely on the sequences uploaded already being assembled. more generally, the success of any wgs surveillance is dependent on the timely and open sharing of information from around the globe. while sequence data from academic publications is near systematically deposited on public sequence databases (at least upon publication), such data are near useless if the accompanying metadata (see above) are not also released, as remains the case far too often. additionally, as more genomes are routinely sequenced in clinical settings as part of standard procedures, ensuring that the culture of sharing sequence data persists beyond academic research will become increasingly important. for wgs to be routinely adopted in clinical microbiology, it needs to be cost-effective. it is commonly accepted that sequencing costs are plummeting, with the national human genome research institute (nhgri) estimating the cost per raw megabase (mb) of dna sequence at 0.12 usd (www.genome.gov/sequencingcostsdata). this has led to claims that a draft bacterial genome can currently cost less than 1 usd to generate [62]. this is a misunderstanding, as one cannot simply extrapolate the cost of a bacterial genome by multiplying a high-throughput per-megabase (mb) dna sequencing cost by the size of its genome. for microbial sequencing, multiple samples must be multiplexed for cost efficiency, which is easier to achieve in large reference laboratories with high sample turnover. excluding indirect costs such as salaries for personnel, preparation of sequencing libraries now makes up the major fraction of microbial sequencing costs (figure 2). the precipitous drop in the cost of producing raw dna sequences in recent years (figure 2a) mostly reflects a massive increase in output with new iterations of illumina production machines.
these numbers ignore all other costs and simply reflect output relative to the cost of the sequencing kits/cartridges. realistic cost estimates for a microbial genome including library preparation on the best available platforms give a different picture ( figure 2b ). since the introduction of the illumina miseq platform in 2011, new sequencing kits generating higher output have only marginally affected true microbial genome sequencing costs, as library preparation makes up a significant portion of the total (60 usd of a total of 74 usd for a typical bacterial genome in 2018). these costs have remained stable over time and are unlikely to go down significantly in the near future. indeed, the market seems to be consolidating in fewer hands (e.g., represented by the procurement of kapa by roche in 2015), which economic theory predicts will not favor price decrease. it is also important to keep in mind that these costs are massive underestimates which do not include indirect costs such as salaries for laboratory personnel and downstream bioinformatics. such indirect costs are difficult to estimate precisely in an academic setting but are far from trivial. per-genome sequencing and analysis costs are likely to be even higher in a clinical diagnostics environment due to the need for highly standardised and accredited procedures. however, a micro-costing analysis covering laboratory and personnel costs estimated the cost of clinical wgs to £481 per m. tuberculosis isolate versus £518 applying standard methods, representing relatively marginal cost savings but with significant time savings [63] . wgs does indeed represent a potentially cost-effective and highly informative tool for clinical diagnostics, but for microbiology-scale sequencing we seem to be in a post-plummeting-costs age. one key feature of useful diagnostics tools is their ability to rapidly inform treatment. most applications of wgs so far have been for lab-cultured organisms (bacteria and fungi). 
traditional culture methods require long turnaround time, with most bacterial cultures taking 1-5 days, fungal cultures 7-30 days, and mycobacterial cultures up to 14-60 days. in this scenario, wgs is used as an adjunct technology primarily to provide information on the presence of amr and virulence genes, which is particularly useful for mechanisms that are difficult to determine phenotypically (e.g. carbapenem resistance). this use of wgs, whilst solving some of the current clinical problems, does not speed up the diagnosis of infection; it is more the case that new technology is replacing some of the more cumbersome laboratory techniques whilst providing additional information. wgs is more appealing as a microbiological fast diagnostics solution when combined with procedures that circumvent (or shorten) the traditional culture step. this can be achieved through direct sampling of clinical material (box 1) or by using a protocol enriching for sequences of specific organism(s). such enrichment methods, generally based on the capture of known sequences through hybridization, are a particularly tractable approach for viruses due to their small genome size. for example, the vircap virome capture method targets all known viruses and can even enrich for novel sequences [64]. similar methods targeting specific organisms have been developed and successfully deployed, representing an attractive option for unculturable organisms [16, [65] [66] [67] [68]. relative to the time required for culture and downstream analysis of the data, variation in the speed of different sequencing technologies is relatively modest. there is considerable enthusiasm for the oxford nanopore technology (ont) which outputs data in real time, although the ont requires a comparable amount of time to the popular illumina miseq sequencer to generate the same volume of sequence data.
sequencing on the miseq sequencer takes between 13 and 56 hours, but as run time correlates with sequence output and read length, researchers tend to systematically favour runs of longer duration.

box 1. wgs beyond single genomes

wgs in the strict sense usually refers to sequencing the genome of a single organism, and it is common to distinguish between the sample (the material that has actually been taken from the patient) and the isolate (an organism that has been cultured and isolated from that sample). wgs methods traditionally sequence a cultured isolate to reduce contamination from other organisms, or sometimes rely on enrichment strategies targeting sequences from a specific organism [66, 67]. however, this represents only a small fraction of the total microbial diversity present in a clinical sample. in contrast, metagenomic approaches sequence samples in an untargeted way. this approach is particularly relevant for clinical scenarios where the pathogen of interest cannot be predicted and/or is fastidious (i.e., has complex culturing requirements). example applications of clinical metagenomics include: when the disease causing agent is unexpected [74, 75]; investigating the spread of amr-carrying plasmids across species [35]; and characterizing the natural history of the microbiome [76]. the removal of the culture requirement can drastically decrease turn-around time from sample to data and enable identification of both rare and novel pathogens. different samples however present different challenges. easy-to-collect sample sites (e.g., faeces and sputum) typically also have a resident microbiota, so it can be challenging to distinguish the etiological agent of disease from colonizing microbes. conversely, sites that are usually sterile (e.g., cerebrospinal fluid, pleural fluid) present a much better opportunity for metagenomics to contribute to clinical care.

metagenomic data are more complex to analyze than single species wgs data and tend to rely on sophisticated computational tools, such as the desman software allowing inference of strain-level variation in a metagenomic sample [77]. such approaches can be difficult to implement, are computationally very demanding, and are unlikely to be deployable in clinical microbiology in the near future, although cloud-based platforms may circumvent the need for computational resources in diagnostic laboratories. furthermore, some faster approaches for rapid strain characterization from raw sequence reads, such as mash [78] and kmerfinder [10, 79], could find a use in diagnostics microbiology, with the latter having been shown to identify the presence of pathogenic strains even in culture-negative samples [10].

however, the differences between these methods should not obscure their fundamental similarities. obtaining single-species genomes from culture is one end of a continuum of methods that stretches all the way to full-blown metagenomics of a sample. in principle, all methods produce the same kind of data: strings of bases. furthermore, in all cases what is clinically relevant represents only a small fraction of these data. integrating sequencing data from different methods into a single diagnostics pipeline is therefore an attractive prospect to quickly identify the genomic needles in the metagenomic haystack in a species-agnostic manner. for example, the presence of a particular antibiotic-resistance gene in sequencing data may recommend against the use of that antibiotic; whether the gene is present in data from a single-species isolate or from metagenomes is irrelevant. as an example, leggett et al. used minion metagenomic profiling to identify pathogen-specific amr genes present in a faecal sample from a critically ill infant all within 5 h of taking the initial sample [80].

in the context of this review, genetic material from the human patient present in clinical samples represents contamination, a major obstacle to obtaining a high yield of microbial dna. protocols exist to deplete human dna prior to sequencing [69, 70] but these are not completely problem-free as the depletion protocol is likely to bias estimates of the microbial community, and some human reads will likely remain. in particular, levels of human dna are significantly higher in faecal samples from hospitalized patients compared to healthy controls [71], suggesting that the problem is exacerbated in clinical settings. therefore, the ethical and legal issues raised by introducing human wgs into routine healthcare [72] cannot be avoided by microbially focused clinical metagenomics. dismissing these concerns as minor may be an option for academic researchers uninterested in these human data, but it is naive to think that hospital ethics committees will share this view. even in the absence of human dna, metagenomic samples from multiple body sites can be used to identify individuals in datasets of hundreds of people [73]. managing clinical metagenomics data in light of these concerns should be taken seriously, not only as a barrier to implementation but because of the real risks to patient privacy.

a major problem in the analysis of wgs data is that there are currently very few (if any) accepted gold standards.
the fundamental steps of wgs analyses in microbial genomics tend to be similar across applications and usually consist of the following steps: sequence data quality control; identification/confirmation of the sequenced biological material; characterization of the sequenced isolate (including typing efforts as well as characterization of virulence factors and putative amr elements/mutations); epidemiologic analysis; and finally, storage of the results ( figure 3 ). however, how these analyses are implemented varies widely, both between microbial species and human labs. despite some commercial attempts at one-stop analysis suites such as ridom seqsphere+ (http://www.ridom.com/seqsphere/), most laboratories use a collection of open-source tools to perform particular subanalyses. typically, these tools are then woven together into a patchwork of software (a 'pipeline'). the idea of a pipeline is to allow within-laboratory standardized analysis of batches of isolates with relatively little manual bioinformatics work. such pipelines can be highly customizable for a wide range of questions. there are also some communal efforts at streamlining workflows across laboratories. as an example, galaxy (https://usegalaxy.org) is a framework that allows nonbioinformaticians to use a wide array of bioinformatics tools through a web interface. one major limitation to rapidly attaining useful information in a clinical setting is that analysis pipelines for microbial genomics have generally been developed for fundamental research or public health epidemiology [81] . this usually means that the pipeline permits a very thorough and sophisticated workflow with a large number of options and moving parts. for example, at the time of writing (may, 2018), the 'qc and manipulation' step in galaxy alone consists of 35 different tools, tests, and workflows that can be applied to an input sequence. 
while this is desirable from a researcher's perspective, it is clearly prohibitive for real-time analysis in a clinical setting. a user requires in-depth knowledge about the purpose each tool serves, the relative strengths and weaknesses of each approach, and a functional understanding of the important parameters. furthermore, most analysis pipelines require proficiency in linux systems and navigating the command line, something clinical microbiologists are rarely trained for. the road to stringent, exhaustive analysis of wgs data is long and paved with good intentions. in order to move towards real-time interpretable results for clinics it will be necessary to take certain shortcuts. the focus should be on rapid, automated analysis and clear, unambiguous results. some steps in the pipeline can simply be omitted for clinical purposes. as an example, genome assembly might appear to be a bottleneck for real-time wgs diagnostics, but is probably rarely required; sufficient characterization of an isolate can be made by analysis of the k-mers in the raw sequence data, which is orders of magnitude faster. accurate identification of an isolate can be made rapidly with minhash-based k-mer matching methods such as mash [78] , and amr elements can be identified from k-mers alone [14] . another example of a computationally intensive step that could be omitted from a default pipeline is sophisticated phylogenetic inference. best practice for the creation of phylogenetic trees may involve evaluating the individual likelihood of a very wide range of possible trees given a sequence alignment or other distance metric, repeated for thousands of bootstrapped replicates, giving a tree with high confidence but with extreme computational time costs. a clinical pipeline could use much faster approaches and still provide an informative phylogenetic tree [82] . in figure 3 we outline our schematic vision of a computational pipeline specific to diagnostics in clinical microbiology. 
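the minhash idea behind tools such as mash can be illustrated with a small sketch. this is not mash's implementation: the function names `kmers`, `sketch` and `jaccard_estimate` are hypothetical, sha-1 is used only to get a deterministic hash, and reverse complements are ignored. it only shows how a fixed-size "bottom-s" sketch of k-mer hashes can stand in for the full k-mer set when estimating similarity.

```python
import hashlib


def kmers(seq, k):
    """Return the set of all substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def sketch(seq, k=5, s=50):
    """Bottom-s MinHash sketch: the s smallest 64-bit hash values of the k-mers."""
    hashes = {
        int.from_bytes(hashlib.sha1(km.encode()).digest()[:8], "big")
        for km in kmers(seq, k)
    }
    return sorted(hashes)[:s]


def jaccard_estimate(sketch_a, sketch_b, s=50):
    """Estimate the Jaccard similarity of two k-mer sets from their sketches."""
    set_a, set_b = set(sketch_a), set(sketch_b)
    merged = sorted(set_a | set_b)[:s]  # bottom-s sketch of the union
    if not merged:
        return 0.0
    shared = sum(1 for h in merged if h in set_a and h in set_b)
    return shared / len(merged)


if __name__ == "__main__":
    a = "ACGTACGTAC"
    b = "ACGTTTTT"
    similarity = jaccard_estimate(sketch(a, k=3), sketch(b, k=3))
    print("estimated jaccard similarity:", similarity)
    print("jaccard distance:", 1.0 - similarity)
```

when a sequence has fewer than s distinct k-mers, the sketch contains all of them and the estimate equals the exact jaccard similarity; for whole genomes the sketch is far smaller than the k-mer set, which is where the speed-up comes from.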
figure 3 caption (partial): steps on the right marked with an asterisk represent simplified versions optimised for speed. cgmlst, core genome multilocus sequence typing; snp, single-nucleotide polymorphism; wgmlst, whole genome multilocus sequence typing.
the clinical pipeline would only encompass a small subset of the research pipeline aimed at generating rapid and interpretable output. for epidemiological inference, pairwise distances between strains would be computed as a matrix of jaccard distances on the shared proportion of k-mers as outputted by mash [78] . this matrix could be used to generate a phylogenetic tree using a computationally inexpensive method (e.g., neighbor-joining). additionally, a correlation between pairwise genetic distance and sampling date could be performed to test for evidence of temporal signal in the data (i.e., accumulation of a sufficient number of mutations over the sampling period). in the presence of temporal signal, the user would be provided with a transmission chain based on a fast algorithm such as seqtrack [83] . any bespoke pipeline for clinical diagnostics would need to be linked with regularly updated multispecies databases containing information about the latest developments in typing schemes, as well as clinically important factors such as amr determinants. results would have to be continuously validated, and international accreditation standards met at regular intervals. at a national level, accreditation bodies (e.g., ukas in the uk) may lack the expertise required. in our experience, many promising databases have collapsed after funding expired or the responsible postdoc left for another job. if wgs is ever to make it into the clinic it will be necessary to secure indefinite funding of both infrastructure and personnel for such databases.
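the test for temporal signal mentioned above can be read in more than one way. the sketch below takes one simple reading, which is an assumption rather than the authors' stated procedure: correlate each pairwise genetic distance with the absolute difference in sampling dates of the two isolates. the function names and the toy numbers are made up for illustration.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def temporal_signal(distances, sampling_day):
    """Correlate pairwise distance with pairwise difference in sampling day.

    distances:    dict mapping frozenset({a, b}) to a genetic distance
    sampling_day: dict mapping isolate name to a sampling day (number)
    """
    pairs = list(combinations(sorted(sampling_day), 2))
    time_gaps = [abs(sampling_day[a] - sampling_day[b]) for a, b in pairs]
    genetic = [distances[frozenset((a, b))] for a, b in pairs]
    return pearson(time_gaps, genetic)


if __name__ == "__main__":
    # Toy data (invented): three isolates sampled ten days apart.
    days = {"A": 0, "B": 10, "C": 20}
    dist = {
        frozenset(("A", "B")): 0.01,
        frozenset(("B", "C")): 0.01,
        frozenset(("A", "C")): 0.02,
    }
    print("correlation:", temporal_signal(dist, days))
```

because the pairs share isolates, the points are not independent, so a permutation-based test (for example a mantel-type test) would usually be needed before reading a p-value into such a correlation.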
the lack of uptake of wgs-based diagnostics may also be in part due to an understandable desire to maintain the 'status quo' in a busy hospital environment with already established treatment and intervention systems. additionally, and perhaps significantly, it also highlights the difficulty to communicate the potential benefits of wgs to the day-to-day life of a clinic. the main proponents of wgs tend to be based in the public health/research environment and are rarely actively involved in clinical decision-making. this in itself can present something of a language barrier, challenging meaningful dialogue over how adoption of new approaches can lead to quantifiable improvements in existing systems. further, the physical planning, implementation and integration of wgs diagnostics may be unlikely to succeed without carefully planned introduction and continued training of its user base. this is of course challenged by the already resource-limited infrastructure of many clinical settings. despite its immense promise and some early successes, it is difficult to predict if and when wgs will completely supersede current standards in clinical microbiology. there are several major bottlenecks to its implementation as a routine approach to diagnose and characterise microbial infections (see outstanding questions). these include, among others: the current costs of wgs, which remain far from negligible despite a common belief that sequencing costs have plummeted; a lack of training in, and possible cultural resistance to, bioinformatics among clinical microbiologists; a lack of the necessary computational infrastructure in most hospitals; the inadequacy of existing reference microbial genomics databases necessary for reliable amr and virulence profiling; and the difficulty of setting up effective, standardized, and accredited bioinformatics protocols. 
focusing in the near future on wgs applications that fulfil unmet diagnostic needs and demonstrate clear benefits to patients and healthcare professionals will help to drive the cultural changes required for the transition to wgs in clinical microbiology. however, irrespective of how this transition occurs and how complete it is, it is likely to feel highly disruptive for many clinical microbiologists. there is also a genuine risk that precious knowledge in basic microbiology will be lost after the transition to wgs, particularly if investment prioritises new technology at the expense of older expertise. more positively, irrespective of the future implementation of wgs in clinical microbiology, we should not forget that the availability of extensive genomic data has been instrumental in the development of a multitude of routine non-wgs typing schemes. efforts to develop wgs-based microbial diagnostics have unsurprisingly focused on high-resource settings. though, we can see an opportunity for low-/medium-income countries to get up to speed with the latest wgs-based developments in real-time clinical diagnostics, rather than adopting classical microbiological phenotyping which might eventually be largely phased out in high-income countries. one precedent for the successful adoption of a technology without transitions through its acknowledged historical predecessors is the widespread use of mobile phones in africa. this has greatly increased communication and allowed access to e-banking, despite the fact that many people previously had no traditional bank account and only limited access to landlines. most hospitals in the developing world do not currently benefit from a clinical microbiology laboratory. the installation of a molecular laboratory based around a standard sequencer, such as a benchtop miseq, might constitute an ideal investment, as it is neither far more expensive nor more complex than setting up a standard clinical microbiology laboratory.
outstanding questions
can wgs be used to develop robust classification schemes that account for the genetic diversity of organisms with open genomes?
which clinically relevant phenotypes can be reliably predicted using wgs, and for which organisms?
how can phylogenetic analyses of outbreaks be speeded up to meaningfully contribute to infection control at actionable time scales?
how can publicly available databases be reliably maintained to the required clinical accreditation standards over long time periods?
will the true cost of generating a bacterial genome remain stable as the sequencing market consolidates in fewer hands?
how can clinical metagenomic data be managed safely in line with the ethical considerations applicable to identifiable human dna?
how can unwieldy bioinformatics pipelines developed with academic research in mind be adapted for a clinical setting?
can current expertise in traditional clinical microbiology be maintained in the transition to wgs?
high-throughput sequencing and clinical microbiology: progress, opportunities and challenges transforming clinical microbiology with bacterial genome sequencing routine use of microbial whole genome sequencing in diagnostic and public health microbiology bacterial genome sequencing in the clinic: bioinformatic challenges and solutions utility of matrix-assisted laser desorption ionization-time of flight mass spectrometry following introduction for routine laboratory bacterial identification armed conflict and population displacement as drivers of the evolution and dispersal of mycobacterium tuberculosis multilocus sequence typing as a replacement for serotyping in salmonella enterica a robust snp barcode for typing mycobacterium tuberculosis complex strains a genomic portrait of the emergence, evolution, and global spread of a methicillin-resistant staphylococcus aureus pandemic benchmarking of methods for genomic taxonomy arg-annot, a new bioinformatic tool to discover antibiotic resistance genes in bacterial genomes. 
antimicrob sstar, a stand-alone easy-to-use antimicrobial resistance gene predictor phyresse: a web tool delineating mycobacterium tuberculosis antibiotic resistance and lineage from whole-genome sequencing data antimicrobial resistance prediction in patric and rast rapid determination of anti-tuberculosis drug resistance from whole-genome sequences rapid antibiotic-resistance predictions from genome sequence data for staphylococcus aureus and mycobacterium tuberculosis wgs accurately predicts antimicrobial resistance in escherichia coli prediction of staphylococcus aureus antimicrobial resistance by whole-genome sequencing whole-genome sequencing and epidemiological analysis do not provide evidence for cross-transmission of mycobacterium abscessus in a cohort of pediatric cystic fibrosis patients short-read whole genome sequencing for determination of antimicrobial resistance mechanisms and capsular serotypes of current invasive streptococcus agalactiae recovered in the usa using whole genome sequencing to identify resistance determinants and predict antimicrobial resistance phenotypes for year 2015 invasive pneumococcal disease isolates recovered in the united states comparison of 61 sequenced escherichia coli genomes in silico serine beta-lactamases analysis reveals a huge potential resistome in environmental and pathogenic species validation of beta-lactam minimum inhibitory concentration predictions for pneumococcal isolates with newly encountered penicillin binding protein (pbp) sequences evolutionary mechanisms shaping the maintenance of antibiotic resistance multicopy plasmids potentiate the evolution of antibiotic resistance in bacteria spatiotemporal microbial evolution on antibiotic landscapes vfdb 2016: hierarchical and refined dataset for big data analysis -10 years on real-time whole-genome sequencing for routine typing, surveillance, and outbreak detection of verotoxigenic escherichia coli genetic diversity, mobilisation and spread of the 
yersiniabactin-encoding mobile element icekp in klebsiella pneumoniae populations tracking a hospital outbreak of kpcproducing st11 klebsiella pneumoniae with whole genome sequencing nested russian doll-like genetic mobility drives rapid dissemination of the carbapenem resistance gene bla(kpc) evolution and transmission of carbapenem-resistant klebsiella pneumoniae expressing the bla(oxa-232) gene during an institutional outbreak associated with endoscopic retrograde cholangiopancreatography utility of whole-genome sequencing in characterizing acinetobacter epidemiology and analyzing hospital outbreaks rapid whole-genome sequencing for investigation of a neonatal mrsa outbreak whole-genome sequencing for the investigation of a hospital outbreak of mrsa in china prolonged and large outbreak of invasive group a streptococcus disease within a nursing home: repeated intrafacility transmission of a single strain genomic analysis and comparison of two gonorrhea outbreaks simultaneous inference of phylogenetic and transmission trees in infectious disease outbreaks impact of hiv co-infection on the evolution and transmission of multidrug-resistant tuberculosis bayesian inference of infectious disease transmission from whole-genome sequence data microevolutionary analysis of clostridium difficile genomes to investigate transmission beast 2: a software platform for bayesian evolutionary analysis bayesian phylogenetics with beauti and the beast 1.7 genomic infectious disease epidemiology in partially sampled and ongoing outbreaks transmission of staphylococcus aureus between health-care workers, the environment, and patients in an intensive care unit: a longitudinal cohort study based on wholegenome sequencing whole-genome sequencing to determine transmission of neisseria gonorrhoeae: an observational study a pilot study of rapid benchtop sequencing of staphylococcus aureus and clostridium difficile for outbreak detection and surveillance whole-genome sequencing for analysis 
of an outbreak of methicillin-resistant staphylococcus aureus: a descriptive study real time application of whole genome sequencing for outbreak investigation -what is an achievable turnaround time? translating genomics into practice for real-time surveillance and response to carbapenemase-producing enterobacteriaceae: evidence from a complex multi-institutional kpc outbreak real-time, portable genome sequencing for ebola surveillance multiplex pcr method for minion and illumina sequencing of zika and other virus genomes directly from clinical samples inferences from tip-calibrated phylogenies: a review and a practical guide when are pathogen genome sequences informative of transmission events? within-host bacterial diversity hinders accurate reconstruction of transmission networks from genomic distance data nextflu: real-time tracking of seasonal influenza virus evolution in humans microreact: visualizing and sharing data for genomic epidemiology and phylogeography insights from 20 years of bacterial genome sequencing rapid, comprehensive, and affordable mycobacterial diagnosis with whole-genome sequencing: a prospective study virome capture sequencing enables sensitive viral diagnosis and comprehensive virome analysis deep sequencing of viral genomes provides insight into the evolution and pathogenesis of varicella zoster virus and its vaccine in humans specific capture and whole-genome sequencing of viruses from clinical samples same-day diagnostic and surveillance data for tuberculosis via whole-genome sequencing of direct respiratory samples rapid whole genome sequencing of m. 
tuberculosis directly from clinical samples depletion of human dna in spiked clinical specimens for improvement of sensitivity of pathogen detection by next-generation sequencing a method for selectively enriching microbial dna from contaminating vertebrate host dna excretion of host dna in feces is associated with risk of clostridium difficile infection the ethical introduction of genomebased information and technologies into public health identifying personal microbiomes using metagenomic codes astrovirus va1/hmo-c: an increasingly recognized neurotropic pathogen in immunocompromised patients human coronavirus oc43 associated with fatal encephalitis natural history of the infant gut microbiome and impact of antibiotic treatment on bacterial strain diversity and stability desman: a new tool for de novo extraction of strains from metagenomes mash: fast genome and metagenome distance estimation using minhash rapid whole-genome sequencing for detection and characterization of microorganisms directly from clinical samples rapid minion metagenomic profiling of the preterm infant gut microbiota to aid in pathogen diagnostics whole genome sequencing in clinical and public health microbiology evaluation of phylogenetic reconstruction methods using bacterial whole genomes: a simulation based study reconstructing disease outbreaks from genetic data: a graph approach we are grateful to nadia debech and jan oksens for their help with digging up historic pricing information for sequencing key: cord-103297-4stnx8dw authors: widrich, michael; schäfl, bernhard; pavlović, milena; ramsauer, hubert; gruber, lukas; holzleitner, markus; brandstetter, johannes; sandve, geir kjetil; greiff, victor; hochreiter, sepp; klambauer, günter title: modern hopfield networks and attention for immune repertoire classification date: 2020-08-17 journal: biorxiv doi: 10.1101/2020.04.12.038158 sha: doc_id: 103297 cord_uid: 4stnx8dw a central mechanism in machine learning is to identify, store, and 
recognize patterns. how to learn, access, and retrieve such patterns is crucial in hopfield networks and the more recent transformer architectures. we show that the attention mechanism of transformer architectures is actually the update rule of modern hopfield networks that can store exponentially many patterns. we exploit this high storage capacity of modern hopfield networks to solve a challenging multiple instance learning (mil) problem in computational biology: immune repertoire classification. accurate and interpretable machine learning methods solving this problem could pave the way towards new vaccines and therapies, which is currently a very relevant research topic intensified by the covid-19 crisis. immune repertoire classification based on the vast number of immunosequences of an individual is a mil problem with an unprecedentedly massive number of instances, two orders of magnitude larger than currently considered problems, and with an extremely low witness rate. in this work, we present our novel method deeprc that integrates transformer-like attention, or equivalently modern hopfield networks, into deep learning architectures for massive mil such as immune repertoire classification. we demonstrate that deeprc outperforms all other methods with respect to predictive performance on large-scale experiments, including simulated and real-world virus infection data, and enables the extraction of sequence motifs that are connected to a given disease class. source code and datasets: https://github.com/ml-jku/deeprc transformer architectures (vaswani et al., 2017) and their attention mechanisms are currently used in many applications, such as natural language processing (nlp), imaging, and also in multiple instance learning (mil) problems. in mil, a set or bag of objects is labelled rather than objects themselves as in standard supervised learning tasks (dietterich et al., 1997).
examples for mil problems are medical images, in which each sub-region of the image represents an instance, video classification, in which each frame is an instance, text classification, where words or sentences are instances of a text, point sets, where each point is an instance of a 3d object, and remote sensing data, where each sensor is an instance (carbonneau et al., 2018; uriot, 2019). attention-based mil has been successfully used for image data, for example to identify tiny objects in large images (ilse et al., 2018; pawlowski et al., 2019; tomita et al., 2019; kimeswenger et al., 2019) and transformer-like attention mechanisms for sets of points and images. however, in mil problems considered by machine learning methods up to now, the number of instances per bag is in the range of hundreds or few thousands (carbonneau et al., 2018; lee et al., 2019) (see also tab. a2). at the same time the witness rate (wr), the rate of discriminating instances per bag, is already considered low at 1% − 5%.
figure caption (partial): a pooling function f is used to obtain a repertoire-representation z for the input object. finally, an output network o predicts the class label ŷ. b) deeprc uses stacked 1d convolutions for a parameterized function h due to their computational efficiency. potentially, millions of sequences have to be processed for each input object. in principle, also recurrent neural networks (rnns), such as lstms (hochreiter et al., 2007), or transformer networks (vaswani et al., 2017) may be used but are currently computationally too costly. c) attention-pooling is used to obtain a repertoire-representation z for each input object, where deeprc uses weighted averages of sequence-representations. the weights are determined by an update rule of modern hopfield networks that allows retrieval of exponentially many patterns.
we will tackle the problem of immune repertoire classification with hundreds of thousands of instances per bag without instance-level labels and with extremely low witness rates down to 0.01% using an attention mechanism. we show that the attention mechanism of transformers is the update rule of modern hopfield networks (krotov & hopfield, 2016; demircigil et al., 2017) that are generalized to continuous states in contrast to classical hopfield networks (hopfield, 1982). a detailed derivation and analysis of modern hopfield networks is given in our companion paper (ramsauer et al., 2020). these novel continuous state hopfield networks can store and retrieve exponentially (in the dimension of the space) many patterns (see next section). thus, modern hopfield networks with their update rule, which are used as an attention mechanism in the transformer, enable immune repertoire classification in computational biology. immune repertoire classification, i.e. classifying the immune status based on the immune repertoire sequences, is essentially a text-book example for a multiple instance learning problem (dietterich et al., 1997; maron & lozano-pérez, 1998; wang et al., 2018). briefly, the immune repertoire of an individual consists of an immensely large bag of immune receptors, represented as amino acid sequences. usually, the presence of only a small fraction of particular receptors determines the immune status with respect to a particular disease (christophersen et al., 2014; emerson et al., 2017). this is because the immune system has already acquired a resistance if one or few particular immune receptors that can bind to the disease agent are present. therefore, classification of immune repertoires bears a high difficulty since each immune repertoire can contain millions of sequences as instances with only a few indicating the class.
further properties of the data that complicate the problem are: (a) the overlap of immune repertoires of different individuals is low (in most cases, maximally low single-digit percentage values) (greiff et al., 2017; elhanati et al., 2018), (b) multiple different sequences can bind to the same pathogen (wucherpfennig et al., 2007), and (c) only subsequences within the sequences determine whether binding to a pathogen is possible (dash et al., 2017; glanville et al., 2017; akbar et al., 2019; springer et al., 2020; fischer et al., 2019). in summary, immune repertoire classification can be formulated as multiple instance learning with an extremely low witness rate and large numbers of instances, which represents a challenge for currently available machine learning methods. furthermore, the methods should ideally be interpretable, since the extraction of class-associated sequence motifs is desired to gain crucial biological insights. the acquisition of human immune repertoires has been enabled by immunosequencing technology (georgiou et al., 2014; brown et al., 2019) which allows to obtain the immune receptor sequences and immune repertoires of individuals. each individual is uniquely characterized by their immune repertoire, which is acquired and changed during life. this repertoire may be influenced by all diseases that an individual is exposed to during their lives and hence contains highly valuable information about those diseases and the individual's immune status. immune receptors enable the immune system to specifically recognize disease agents or pathogens. each immune encounter is recorded as an immune event into immune memory by preserving and amplifying immune receptors in the repertoire used to fight a given disease. this is, for example, the working principle of vaccination. each human has about $10^7$-$10^8$ unique immune receptors with low overlap across individuals and sampled from a potential diversity of $> 10^{14}$ receptors (mora & walczak, 2019).
the ability to sequence and analyze human immune receptors at large scale has led to fundamental and mechanistic insights into the adaptive immune system and has also opened the opportunity for the development of novel diagnostics and therapy approaches (georgiou et al., 2014; brown et al., 2019) . immunosequencing data have been analyzed with computational methods for a variety of different tasks (greiff et al., 2015; shugay et al., 2015; miho et al., 2018; yaari & kleinstein, 2015; wardemann & busse, 2017) . a large part of the available machine learning methods for immune receptor data has been focusing on the individual immune receptors in a repertoire, with the aim to, for example, predict the antigen or antigen portion (epitope) to which these sequences bind or to predict sharing of receptors across individuals (gielis et al., 2019; springer et al., 2020; jurtz et al., 2018; moris et al., 2019; fischer et al., 2019; greiff et al., 2017; sidhom et al., 2019; elhanati et al., 2018) . recently, jurtz et al. (2018) used 1d convolutional neural networks (cnns) to predict antigen binding of t-cell receptor (tcr) sequences (specifically, binding of tcr sequences to peptide-mhc complexes) and demonstrated that motifs can be extracted from these models. similarly, konishi et al. (2019) use cnns, gradient boosting, and other machine learning techniques on b-cell receptor (bcr) sequences to distinguish tumor tissue from normal tissue. however, the methods presented so far predict a particular class, the epitope, based on a single input sequence. immune repertoire classification has been considered as a mil problem in the following publications. a deep learning framework called deeptcr (sidhom et al., 2019) implements several deep learning approaches for immunosequencing data. the computational framework, inter alia, allows for attention-based mil repertoire classifiers and implements a basic form of attention-based averaging. ostmeyer et al. 
(2019) already suggested a mil method for immune repertoire classification. this method considers 4-mers, fixed sub-sequences of length 4, as instances of an input object and trains a logistic regression model with these 4-mers as input. the predictions of the logistic regression model for each 4-mer are max-pooled to obtain one prediction per input object. this approach is characterized by (a) the rigidity of the k-mer features as compared to convolutional kernels (alipanahi et al., 2015; zhou & troyanskaya, 2015; zeng et al., 2016), (b) the max-pooling operation, which constrains the network to learn from a single, top-ranked k-mer for each iteration over the input object, and (c) the pooling of prediction scores rather than representations (wang et al., 2018). our experiments also support that these choices in the design of the method can lead to constraints on the predictive performance (see table 1). our proposed method, deeprc, also uses a mil approach but considers sequences rather than k-mers as instances within an input object and a transformer-like attention mechanism. deeprc sets out to avoid the above-mentioned constraints of current methods by (a) applying transformer-like attention-pooling instead of max-pooling and learning a classifier on the repertoire rather than on the sequence-representation, (b) pooling learned representations rather than predictions, and (c) using less rigid feature extractors, such as 1d convolutions or lstms. in this work, we contribute the following: we demonstrate that continuous generalizations of binary modern hopfield networks (krotov & hopfield, 2016; demircigil et al., 2017) have an update rule that is known as the attention mechanism in the transformer. we show that these modern hopfield networks have exponential storage capacity, which allows them to extract patterns among a large set of instances (next section).
based on this result, we propose deeprc, a novel deep mil method based on modern hopfield networks for large bags of complex sequences, as they occur in immune repertoire classification (section "deep repertoire classification"). we evaluate the predictive performance of deeprc and other machine learning approaches for the classification of immune repertoires in a large comparative study (section "experimental results").
exponential storage capacity of continuous state modern hopfield networks with transformer attention as update rule
in this section, we show that modern hopfield networks have exponential storage capacity, which will later allow us to approach massive multiple-instance learning problems, such as immune repertoire classification. see our companion paper (ramsauer et al., 2020) for a detailed derivation and analysis of modern hopfield networks. we assume patterns $x_1, \ldots, x_N \in \mathbb{R}^d$ that are stacked as columns to the matrix $X = (x_1, \ldots, x_N)$ and a query pattern $\xi$ that also represents the current state. the largest norm of a pattern is $M = \max_i \|x_i\|$. the separation $\Delta_i$ of a pattern $x_i$ is defined as its minimal dot product difference to any of the other patterns: $\Delta_i = \min_{j, j \neq i} \left( x_i^T x_i - x_i^T x_j \right)$. we consider a modern hopfield network with current state $\xi$ and the energy function $E = -\beta^{-1} \log \left( \sum_{i=1}^{N} \exp(\beta x_i^T \xi) \right) + \frac{1}{2} \xi^T \xi + \beta^{-1} \log N + \frac{1}{2} M^2$. for energy $E$ and state $\xi$, the update rule
$\xi^{\mathrm{new}} = f(\xi; X, \beta) = X p = X \, \mathrm{softmax}(\beta X^T \xi)$   (1)
is proven to converge globally to stationary points of the energy $E$, which are local minima or saddle points (see ramsauer et al. (2020), appendix, theorem a2). surprisingly, the update rule eq. (1) is also the formula of the well-known transformer attention mechanism. to see this more clearly, we simultaneously update several queries $\xi_i$. furthermore the queries $\xi_i$ and the patterns $x_i$ are linear mappings of vectors $y_i$ into the space $\mathbb{R}^d$. for matrix notation, we set $x_i = W_K^T y_i$, $\xi_i = W_Q^T y_i$ and multiply the result of our update rule with $W_V$. using $Y = (y_1, \ldots, y_N)^T$, we define the matrices $X^T = K = Y W_K$, $Q = Y W_Q$, and $V = Y W_K W_V = X^T W_V$, where $W_K \in \mathbb{R}^{d_y \times d_k}$, $W_Q \in \mathbb{R}^{d_y \times d_k}$, $W_V \in \mathbb{R}^{d_k \times d_v}$, and the patterns are now mapped to the hopfield space with dimension $d = d_k$. we set $\beta = 1/\sqrt{d_k}$ and change softmax to a row vector. the update rule eq. (1) multiplied by $W_V$ performed for all queries simultaneously becomes in row vector notation:
$\mathrm{att}(Q, K, V; \beta) = \mathrm{softmax}(\beta Q K^T) V = \mathrm{softmax}\left( \frac{1}{\sqrt{d_k}} Q K^T \right) V$   (2)
this formula is the transformer attention. if the patterns $x_i$ are well separated, the iterate eq. (1) converges to a fixed point close to a pattern to which the initial $\xi$ is similar. if the patterns are not well separated the iterate eq. (1) converges to a fixed point close to the arithmetic mean of the patterns. if some patterns are similar to each other but well separated from all other vectors, then a metastable state between the similar patterns exists. iterates that start near a metastable state converge to this metastable state. for details see ramsauer et al. (2020), appendix, sect. a2. typically, the update converges after one update step (see ramsauer et al. (2020), appendix, theorem a8) and has an exponentially small retrieval error (see ramsauer et al. (2020), appendix, theorem a9). our main concern for application to immune repertoire classification is the number of patterns that can be stored and retrieved by the modern hopfield network, equivalently to the transformer attention head. the storage capacity of an attention mechanism is critical for massive mil problems. we first define what we mean by storing and retrieving patterns from the modern hopfield network.
definition 1 (pattern stored and retrieved). we assume that around every pattern $x_i$ a sphere $S_i$ is given. we say $x_i$ is stored if there is a single fixed point $x_i^* \in S_i$ to which all points $\xi \in S_i$ converge, and $S_i \cap S_j = \emptyset$ for $i \neq j$. we say $x_i$ is retrieved if the iteration eq. (1) converged to the single fixed point $x_i^* \in S_i$.
for randomly chosen patterns, the number of patterns that can be stored is exponential in the dimension $d$ of the space of the patterns ($x_i \in \mathbb{R}^d$).
theorem 1. we assume a failure probability $0 < p \leq 1$ and randomly chosen patterns on the sphere with radius $M = K \sqrt{d - 1}$.
we define $a := \frac{2}{d-1} \left( 1 + \ln(2 \beta K^2 p (d-1)) \right)$, $b := \frac{2 K^2 \beta}{5}$, and $c = \frac{b}{W_0(\exp(a + \ln(b)))}$, where $W_0$ is the upper branch of the lambert w function, and ensure $c \ge \left( \frac{2}{\sqrt{p}} \right)^{\frac{4}{d-1}}$. then with probability $1 - p$, the number of random patterns that can be stored is

$$N \ge \sqrt{p} \, c^{\frac{d-1}{4}}.$$

examples are $c \ge 3.1546$ for $\beta = 1$, $K = 3$, $d = 20$ and $p = 0.001$ ($a + \ln(b) > 1.27$) and $c \ge 1.3718$ for $\beta = 1$, $K = 1$, $d = 75$, and $p = 0.001$ ($a + \ln(b) < -0.94$). see ramsauer et al. (2020), appendix, theorem a5 for a proof. we have established that a modern hopfield network or a transformer attention mechanism can store and retrieve exponentially many patterns. this allows us to approach mil with massive numbers of instances from which we have to retrieve a few with an attention mechanism.

deep repertoire classification

problem setting and notation. we consider a mil problem, in which an input object $X$ is a bag of $N$ instances $X = \{s_1, \ldots, s_N\}$. the instances have neither dependencies nor orderings between them, and $N$ can be different for every object. we assume that each instance $s_i$ is associated with a label $y_i \in \{0, 1\}$, assuming a binary classification task, to which we do not have access. we only have access to a label $Y = \max_i y_i$ for an input object or bag. note that this poses a credit assignment problem, since the sequences that are responsible for the label $Y$ have to be identified, and that the relation between instance-label and bag-label can be more complex (foulds & frank, 2010). a model $\hat{y} = g(X)$ should be (a) invariant to permutations of the instances and (b) able to cope with the fact that $N$ varies across input objects (ilse et al., 2018), which is a problem also posed by point sets (qi et al., 2017). two principled approaches exist. the first approach is to learn an instance-level scoring function $h : \mathcal{S} \to [0, 1]$, which is then pooled across instances with a pooling function $f$, for example by average-pooling or max-pooling (see below).
the second approach is to construct an instance representation $z_i$ of each instance by $h : \mathcal{S} \to \mathbb{R}^{d_v}$ and then encode the bag, or the input object, by pooling these instance representations (wang et al., 2018) via a function $f$. an output function $o : \mathbb{R}^{d_v} \to [0, 1]$ subsequently classifies the bag. the second approach, the pooling of representations rather than scoring functions, is currently best performing (wang et al., 2018). in the problem at hand, the input object $X$ is the immune repertoire of an individual that consists of a large set of immune receptor sequences (t-cell receptors or antibodies). immune receptors are primarily represented as sequences $s_i$ from a space $s_i \in \mathcal{S}$. these sequences act as the instances in the mil problem. although immune repertoire classification can readily be formulated as a mil problem, it is yet unclear how well machine learning methods solve the above-described problem with a large number of instances $N \gg 10{,}000$ and with instances $s_i$ being complex sequences. next we describe currently used pooling functions for mil problems. pooling functions for mil problems. different pooling functions equip a model $g$ with the property to be invariant to permutations of instances and with the ability to process different numbers of instances. typically, a neural network $h_\theta$ with parameters $\theta$ is trained to obtain a function that maps each instance onto a representation, $z_i = h_\theta(s_i)$, and then a pooling function $z = f(\{z_1, \ldots, z_N\})$ supplies a representation $z$ of the input object $X = \{s_1, \ldots, s_N\}$. the following pooling functions are typically used: average-pooling: $z = \frac{1}{N} \sum_{i=1}^{N} z_i$, max-pooling: $z = \sum_{m=1}^{d_v} e_m \max_{i, 1 \le i \le N} \{z_{im}\}$, where $e_m$ is the standard basis vector for dimension $m$, and attention-pooling: $z = \sum_{i=1}^{N} a_i z_i$, where $a_i$ are non-negative ($a_i \ge 0$), sum to one ($\sum_{i=1}^{N} a_i = 1$), and are determined by an attention mechanism. these pooling functions are invariant to permutations of $\{1, \ldots, N\}$ and are differentiable.
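the three pooling functions above can be illustrated with a minimal sketch in plain python. the function names and the toy bag are illustrative assumptions and are not taken from the deeprc code; in particular, the attention weights are passed in by hand here, whereas in a model they would come from an attention mechanism.

```python
from typing import List, Sequence

Vector = List[float]


def average_pooling(z: Sequence[Vector]) -> Vector:
    """z = (1/N) * sum_i z_i, computed per feature dimension."""
    n = len(z)
    return [sum(column) / n for column in zip(*z)]


def max_pooling(z: Sequence[Vector]) -> Vector:
    """z_m = max_i z_im, computed per feature dimension m."""
    return [max(column) for column in zip(*z)]


def attention_pooling(z: Sequence[Vector], a: Sequence[float]) -> Vector:
    """z = sum_i a_i * z_i with non-negative weights a_i that sum to one."""
    if len(a) != len(z):
        raise ValueError("need exactly one weight per instance")
    if any(w < 0 for w in a) or abs(sum(a) - 1.0) > 1e-9:
        raise ValueError("weights must be non-negative and sum to one")
    return [sum(w * v for w, v in zip(a, column)) for column in zip(*z)]


# toy bag with N = 3 instance representations of dimension d_v = 2
bag = [[1.0, 0.0], [3.0, 2.0], [2.0, 4.0]]
weights = [0.5, 0.25, 0.25]  # hypothetical attention weights

print(average_pooling(bag))             # [2.0, 2.0]
print(max_pooling(bag))                 # [3.0, 4.0]
print(attention_pooling(bag, weights))  # [1.75, 1.5]
```

reordering the instances (together with their weights) leaves all three results unchanged, which is the permutation invariance stated above.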
therefore, they are suited as building blocks for deep learning architectures. we employ attention-pooling in our deeprc model as detailed in the following. modern hopfield networks viewed as transformer-like attention mechanisms. the modern hopfield networks, as introduced above, have a storage capacity that is exponential in the dimension of the vector space and converge after just one update (see ramsauer et al. (2020), appendix). additionally, the update rule of modern hopfield networks is known as key-value attention mechanism, which has been highly successful through the transformer (vaswani et al., 2017) and bert (devlin et al., 2019) models in natural language processing. therefore, using modern hopfield networks with the key-value-attention mechanism as update rule is the natural choice for our task. in particular, modern hopfield networks are theoretically justified for storing and retrieving the large number of vectors (sequence patterns) that appear in the immune repertoire classification task. instead of using the terminology of modern hopfield networks, we explain our deeprc architecture in terms of key-value-attention (the update rule of the modern hopfield network), since it is well known in the deep learning community. the attention mechanism assumes a space of dimension $d_k$ in which keys and queries are compared. a set of $N$ key vectors are combined to the matrix $K$. a set of $d_q$ query vectors are combined to the matrix $Q$. similarities between queries and keys are computed by inner products, therefore queries can search for similar keys that are stored. another set of $N$ value vectors are combined to the matrix $V$. the output of the attention mechanism is a weighted average of the value vectors for each query $q$. the $i$-th vector $v_i$ is weighted by the similarity between the $i$-th key $k_i$ and the query $q$. the similarity is given by the softmax of the inner products of the query $q$ with the keys $k_i$.
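the weighted average described above can be sketched for a single query in plain python. this is a minimal illustration, not the deeprc implementation: the function names, the toy keys and values, and the choice $\beta = 8$ are assumptions made for the example. setting the values equal to the keys turns the same function into one update step of a modern hopfield network over the stored patterns.

```python
import math
from typing import List, Sequence, Tuple

Vector = List[float]


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(query: Vector, keys: Sequence[Vector],
              values: Sequence[Vector], beta: float) -> Tuple[Vector, List[float]]:
    """Single-query key-value attention: softmax(beta * q K^T) V.

    Returns the weighted average of the value vectors and the weights.
    """
    scores = [beta * sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    output = [sum(w * v for w, v in zip(weights, column))
              for column in zip(*values)]
    return output, weights


# toy example: three keys, each with its own value vector
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
values = [[10.0, 0.0], [0.0, 10.0], [-10.0, 0.0]]

output, weights = attention([1.0, 0.0], keys, values, beta=8.0)
print(weights)  # almost all weight on the first key
print(output)   # close to the first value vector

# hopfield view: values == keys, so a noisy query is pulled towards
# the stored pattern it is most similar to
retrieved, _ = attention([0.9, 0.1], keys, keys, beta=8.0)
print(retrieved)
```

a larger $\beta$ sharpens the softmax and moves the output closer to a single stored vector, while a small $\beta$ moves it towards the mean of all value vectors.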
all queries are calculated in parallel via matrix operations. consequently, the attention function $\mathrm{att}(Q, K, V; \beta)$ maps queries $Q$, keys $K$, and values $V$ to $d_v$-dimensional outputs: $\mathrm{att}(Q, K, V; \beta) = \mathrm{softmax}(\beta Q K^T) V$ (see also eq. (2)). while this attention mechanism was originally developed for sequence tasks (vaswani et al., 2017), it can be readily transferred to sets (ye et al., 2018). this type of attention mechanism will be employed in deeprc. the deeprc method. we propose a novel method deep repertoire classification (deeprc) for immune repertoire classification with attention-based deep massive multiple instance learning and compare it against other machine learning approaches. for deeprc, we consider immune repertoires as input objects, which are represented as bags of instances. in a bag, each instance is an immune receptor sequence and each bag can contain a large number of sequences. note that we will use $z_i$ to denote the sequence-representation of the $i$-th sequence and $z$ to denote the repertoire-representation. at the core, deeprc consists of a transformer-like attention mechanism that extracts the most important information from each repertoire. we first give an overview of the attention mechanism and then provide details on each of the sub-networks $h_1$, $h_2$, and $o$ of deeprc. attention mechanism in deeprc. this mechanism is based on the three matrices $K$ (the keys), $Q$ (the queries), and $V$ (the values) together with a parameter $\beta$. values. deeprc uses a 1d convolutional network $h_1$ (lecun et al., 1998; hu et al., 2014; kelley et al., 2016) that supplies a sequence-representation $z_i = h_1(s_i)$, which acts as the values $V = Z = (z_1, \ldots, z_N)$ in the attention mechanism (see figure 2). keys. a second neural network $h_2$, which shares its first layers with $h_1$, is used to obtain keys $K \in \mathbb{R}^{N \times d_k}$ for each sequence in the repertoire.
this network uses 2 self-normalizing layers (klambauer et al., 2017) with 32 units per layer (see figure 2). query. we use a fixed $d_k$-dimensional query vector $\xi$ which is learned via backpropagation. for more attention heads, each head has a fixed query vector. with the quantities introduced above, the transformer attention mechanism (eq. (2)) of deeprc is implemented as follows:

$$z = \mathrm{att}\left( \xi^T, K, Z; \tfrac{1}{\sqrt{d_k}} \right) = \mathrm{softmax}\left( \frac{\xi^T K^T}{\sqrt{d_k}} \right) Z,$$

where $Z \in \mathbb{R}^{N \times d_v}$ are the sequence-representations stacked row-wise, $K$ are the keys, and $z$ is the repertoire-representation and at the same time a weighted mean of sequence-representations $z_i$. the attention mechanism can readily be extended to multiple queries; however, computational demand could constrain this depending on the application and dataset. theorem 1 demonstrates that this mechanism is able to retrieve a single pattern out of several hundreds of thousands. attention-pooling and interpretability. each input object, i.e. repertoire, consists of a large number $N$ of sequences, which are reduced to a single fixed-size feature vector of length $d_v$ representing the whole input object by an attention-pooling function. to this end, a transformer-like attention mechanism adapted to sets is realized in deeprc, which supplies $a_i$, the importance of the sequence $s_i$. this importance value is an interpretable quantity, which is highly desired for the immunological problem at hand. thus, deeprc allows for two forms of interpretability methods. (a) a trained deeprc model can compute attention weights $a_i$, which directly indicate the importance of a sequence. (b) deeprc furthermore allows for the usage of contribution analysis methods, such as integrated gradients (ig) (sundararajan et al., 2017) or layer-wise relevance propagation (montavon et al., 2018; arras et al., 2019). see sect. a8 for details. classification layer and network parameters.
the repertoire-representation z is then used as input for a fully-connected output networkŷ = o(z) that predicts the immune status, where we found it sufficient to train single-layer networks. in the simplest case, deeprc predicts a single target, the class label y, e.g. the immune status of an immune repertoire, using one output value. however, since deeprc is an end-to-end deep learning model, multiple targets may be predicted simultaneously in classification or regression settings or a mix of both. this allows for the introduction of additional information into the system via auxiliary targets such as age, sex, or other metadata. table 1 with sub-networks h 1 , h 2 , and o. d l indicates the sequence length. network parameters, training, and inference. deeprc is trained using standard gradient descent methods to minimize a cross-entropy loss. the network parameters are θ 1 , θ 2 , θ o for the sub-networks h 1 , h 2 , and o, respectively, and additionally ξ. in more detail, we train deeprc using adam (kingma & ba, 2014) with a batch size of 4 and dropout of input sequences. implementation. to reduce computational time, the attention network first computes the attention weights a i for each sequence s i in a repertoire. subsequently, the top 10% of sequences with the highest a i per repertoire are used to compute the weight updates and prediction. furthermore, computation of z i is performed in 16-bit, others in 32-bit precision to ensure numerical stability in the softmax. see sect. a2 for details. in this section, we report and analyze the predictive power of deeprc and the compared methods on several immunosequencing datasets. the roc-auc is used as the main metric for the predictive power. methods compared. we compared previous methods for immune repertoire classification, (ostmeyer et al., 2019) ("log. mil (kmer)", "log. mil (tcrb)") and a burden test (emerson et al., 2017) , as well as the baseline methods logistic regression ("log. 
regr."), k-nearest neighbour ("knn"), and support vector machines ("svm") with kernels designed for sets, such as the jaccard kernel ("j") and the minmax ("mm") kernel (ralaivola et al., 2005). for the simulated data, we also added baseline methods that search for the implanted motif either in binary or continuous fashion ("known motif b.", "known motif c.") assuming that this motif was known (for details, see sect. a4). datasets. we aimed at constructing immune repertoire classification scenarios with varying degrees of difficulty and realism in order to compare and analyze the suggested machine learning methods. to this end, we either use simulated or experimentally-observed immune receptor sequences and we implant signals, specifically, sequence motifs or sets thereof (weber et al., 2020), at different frequencies into sequences of repertoires of the positive class. these frequencies represent the witness rates and range from 0.01% to 10%. overall, we compiled four categories of datasets: (a) simulated immunosequencing data with implanted signals, (b) lstm-generated immunosequencing data with implanted signals, (c) real-world immunosequencing data with implanted signals, and (d) real-world immunosequencing data with known immune status, the cmv dataset (emerson et al., 2017). the average number of instances per bag, which is the number of sequences per immune repertoire, is ≈300,000 except for category (c), in which we consider the scenario of low-coverage data with only 10,000 sequences per repertoire. the number of repertoires per dataset ranges from 785 to 5,000. in total, all datasets comprise ≈30 billion sequences or instances. this represents the largest comparative study on immune repertoire classification (see sect. a3). hyperparameter selection. we used a nested 5-fold cross-validation (cv) procedure to estimate the performance of each of the methods.
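a nested 5-fold cross-validation can be sketched as follows. the text does not specify how the inner folds are assigned, so this sketch assumes one common scheme: for every held-out test fold, each remaining fold serves once as validation set and the other three folds form the training set. the function names and the fixed seed are illustrative assumptions.

```python
import random
from typing import Iterator, List, Tuple


def make_folds(n_samples: int, n_folds: int = 5, seed: int = 0) -> List[List[int]]:
    """Shuffle sample indices and deal them into n_folds disjoint folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::n_folds] for i in range(n_folds)]


def nested_cv_splits(n_samples: int, n_folds: int = 5, seed: int = 0
                     ) -> Iterator[Tuple[List[int], List[int], List[int]]]:
    """Yield (train, validation, test) index lists.

    Outer loop: the fold held out for testing.
    Inner loop: the fold used to select hyperparameters (assumed scheme).
    """
    folds = make_folds(n_samples, n_folds, seed)
    for test_id in range(n_folds):
        for val_id in range(n_folds):
            if val_id == test_id:
                continue
            train = [i for fold_id, fold in enumerate(folds)
                     if fold_id not in (test_id, val_id) for i in fold]
            yield train, folds[val_id], folds[test_id]


# 785 repertoires is the smallest dataset size mentioned above
for train, val, test in nested_cv_splits(785):
    pass  # fit on train, select hyperparameters on val, report on test
```

the test fold is never used for hyperparameter selection, so the reported score is not biased by the tuning.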
all methods could adjust their most important hyperparameters on a validation set in the inner loop of the procedure. see sect. a5 for details. table 1 : results in terms of auc of the competing methods on all datasets. the reported errors are standard deviations across 5 cross-validation (cv) folds (except for the column "simulated"). real-world cmv: average performance over 5 cv folds on the cmv dataset (emerson et al., 2017) . real-world data with implanted signals: average performance over 5 cv folds for each of the four datasets. a signal was implanted with a frequency (=witness rate) of 1% or 0.1%. either a single motif ("om") or multiple motifs ("mm") were implanted. lstm-generated data: average performance over 5 cv folds for each of the 5 datasets. in each dataset, a signal was implanted with a frequency of 10%, 1%, 0.5%, 0.1%, or 0.05%, respectively. simulated: here we report the mean over 18 simulated datasets with implanted signals and varying difficulties (see tab. a9 for details). the error reported is the standard deviation of the auc values across the 18 datasets. results. in each of the four categories, "real-world data", "real-world data with implanted signals", "lstm-generated data", and "simulated immunosequencing data", deeprc outperforms all competing methods with respect to average auc. across categories, the runner-up methods are either the svm for mil problems with minmax kernel or the burden test (see table 1 and sect. a6). results on simulated immunosequencing data. in this setting the complexity of the implanted signal is in focus and varies throughout 18 simulated datasets (see sect. a3). some datasets are challenging for the methods because the implanted motif is hidden by noise and others because only a small fraction of sequences carries the motif, and hence have a low witness rate. these difficulties become evident by the method called "known motif binary", which assumes the implanted motif is known. 
the performance of this method ranges from a perfect auc of 1.000 in several datasets to an auc of 0.532 in dataset '17' (see sect. a6). deeprc outperforms all other methods with an average auc of 0.846 ± 0.223, followed by the svm with minmax kernel with an average auc of 0.827 ± 0.210 (see sect. a6). the predictive performance of all methods suffers if the signal occurs only in an extremely small fraction of sequences. in datasets, in which only 0.01% of the sequences carry the motif, all auc values are below 0.550. results on lstm-generated data. on the lstm-generated data, in which we implanted noisy motifs with frequencies of 10%, 1%, 0.5%, 0.1%, and 0.05%, deeprc yields almost perfect predictive performance with an average auc of 1.000 ± 0.001 (see sect. a6 and a7). the second best method, svm with minmax kernel, has a similar predictive performance to deeprc on all datasets but the other competing methods have a lower predictive performance on datasets with low frequency of the signal (0.05%). results on real-world data with implanted motifs. in this dataset category, we used real immunosequences and implanted single or multiple noisy motifs. again, deeprc outperforms all other methods with an average auc of 0.980 ± 0.029, with the second best method being the burden test with an average auc of 0.883 ± 0.170. notably, all methods except for deeprc have difficulties with noisy motifs at a frequency of 0.1% (see tab. a11) . results on real-world data. on the real-world dataset, in which the immune status of persons affected by the cytomegalovirus has to be predicted, the competing methods yield predictive aucs between 0.515 and 0.825 (see table 1 ). we note that this dataset is not the exact dataset that was used in emerson et al. (2017) . it differs in pre-processing and also comprises a different set of samples and a smaller training set due to the nested 5-fold cross-validation procedure, which leads to a more challenging dataset. 
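the auc values reported here can be read as the probability that a randomly chosen positive repertoire receives a higher score than a randomly chosen negative one. a minimal sketch of this pairwise definition follows; it is an illustration of the metric, not the evaluation code used in the study, and the function name and toy scores are assumptions.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve via pairwise comparisons.

    Counts the fraction of (positive, negative) pairs in which the positive
    example has the higher score; ties count as one half.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# toy example: 3 of the 4 positive/negative pairs are ranked correctly
print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

an auc of 0.5 corresponds to random ranking and 1.0 to a perfect separation of the two classes.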
the best performing method is deeprc with an auc of 0.831 ± 0.002, followed by the svm with minmax kernel (auc 0.825 ± 0.022) and the burden test with an auc of 0.699 ± 0.041. the top-ranked sequences by deeprc significantly correspond to those detected by emerson et al. (2017) , which we tested by a mann-whitney u-test with the null hypothesis that the attention values of the sequences detected by emerson et al. (2017) would be equal to the attention values of the remaining sequences (p-value of 1.3 · 10 −93 ). the sequence attention values are displayed in tab. a14. we have demonstrated how modern hopfield networks and attention mechanisms enable successful classification of the immune status of immune repertoires. for this task, methods have to identify the discriminating sequences amongst a large set of sequences in an immune repertoire. specifically, even motifs within those sequences have to be identified. we have shown that deeprc, a modern hopfield network and an attention mechanism with a fixed query, can solve this difficult task despite the massive number of instances. deeprc furthermore outperforms the compared methods across a range of different experimental conditions. impact on machine learning and related scientific fields. 
we envision that with (a) the increasing availability of large immunosequencing datasets (kovaltsuk et al., 2018; corrie et al., 2018; christley et al., 2018; zhang et al., 2020; rosenfeld et al., 2018; shugay et al., 2018), (b) further fine-tuning of ground-truth benchmarking immune receptor datasets (weber et al., 2020; olson et al., 2019; marcou et al., 2018), (c) accounting for repertoire-impacting factors such as age, sex, ethnicity, and environment (potential confounding factors), and (d) increased gpu memory and increased computing power, it will be possible to identify discriminating immune receptor motifs for many diseases, potentially even for the current sars-cov-2 (covid-19) pandemic (minervina et al., 2020; galson et al., 2020). such results would greatly benefit ongoing research on antibody and tcr-driven immunotherapies and immunodiagnostics as well as rational vaccine design (brown et al., 2019). in the course of this development, the experimental verification and interpretation of machine-learning-identified motifs could receive additional focus, as for most of the sequences within a repertoire the corresponding antigen is unknown. nevertheless, recent technological breakthroughs in high-throughput antigen-labeled immunosequencing are beginning to generate large-scale antigen-labeled single-immune-receptor-sequence data, thus resolving this longstanding problem (setliff et al., 2019). from a machine learning perspective, the successful application of deeprc on immune repertoires with their large number of instances per bag might encourage the application of modern hopfield networks and attention mechanisms on new, previously unsolved or unconsidered, datasets and problems. impact on society. if the approach proves itself successful, it could lead to faster testing of individuals for their immune status w.r.t. a range of diseases based on blood samples. this might motivate changes in the pipeline of diagnostics and tracking of diseases, e.g.
automated testing of the immune status in regular intervals. it would furthermore make the collection and screening of blood samples for larger databases more attractive. in consequence, the improved testing of immune statuses might identify individuals that do not have a working immune response towards certain diseases to government or insurance companies, which could then push for targeted immunisation of the individual. similarly to compulsory vaccination, such testing for the immune status could be made compulsory by governments, possibly violating privacy or personal self-determination in exchange for increased overall health of a population. ultimately, if the approach proves itself successful, the insights gained from the screening of individuals that have successfully developed resistances against specific diseases could lead to faster targeted immunisation, once a certain number of individuals with resistances can be found. this might strongly decrease the harm done by e.g. pandemics and lead to a change in the societal perception of such diseases. consequences of failures of the method. as is common with methods in machine learning, potential danger lies in the possibility that users rely too much on our new approach and use it without reflecting on the outcomes. however, the full pipeline in which our method would be used includes wet lab tests after its application, to verify and investigate the results, gain insights, and possibly derive treatments. failures of the proposed method would lead to unsuccessful wet lab validation and negative wet lab tests. since the proposed algorithm does not directly suggest treatment or therapy, human beings are not directly at risk of being treated with a harmful therapy. substantial wet lab and in-vitro testing would indicate wrong decisions by the system. leveraging of biases in the data and potential discrimination.
as for almost all machine learning methods, confounding factors, such as age or sex, could be used for classification. this might lead to biases in predictions or uneven predictive performance across subgroups. as a result, failures in the wet lab would occur (see paragraph above). moreover, insights into the relevance of the confounding factors could be gained, leading to possible therapies or counter-measures concerning said factors. furthermore, the amount of data available with respect to relevant confounding factors could lead to better or worse performance of our method. e.g. a dataset consisting mostly of data from individuals within a specific age group might yield better performance for that age group, possibly resulting in better or exclusive treatment methods for that specific group. here again, the application of deeprc would be followed by in-vitro testing and development of a treatment, where all target groups for the treatment have to be considered accordingly. all datasets and code are available at https://github.com/ml-jku/deeprc. the cmv dataset is publicly available at https://clients.adaptivebiotech.com/pub/emerson-2017-natgen. in section a2 we provide details on the architecture of deeprc, in section a3 we present details on the datasets, in section a4 we explain the methods that we compared, in section a5 we elaborate on the hyperparameters and their selection process. then, in section a6 we present detailed results for each dataset category in tabular form, in section a7 we provide information on the lstm model that was used to generate antibody sequences, in section a8 we show how deeprc can be interpreted, in section a9 we show the correspondence of previously identified tcr sequences for cmv immune status with attention values by deeprc, and finally we present variations and an ablation study of deeprc in section a10. input layer. for the input layer of the cnn, the characters in the input sequence, i.e.
the amino acids (aas), are encoded in a one-hot vector of length 20. to also provide information about the position of an aa in the sequence, we add 3 additional input features with values in range [0, 1] to encode the position of an aa relative to the sequence. these 3 positional features encode whether the aa is located at the beginning, the center, or the end of the sequence, respectively, as shown in figure a1 . we concatenate these 3 positional features with the one-hot vector of aas, which results in a feature vector of size 23 per sequence position. each repertoire, now represented as a bag of feature vectors, is then normalized to unit variance. since the cytomegalovirus dataset (cmv dataset) provides sequences with an associated abundance value per sequence, which is the number of occurrences of a sequence in a repertoire, we incorporate this information into the input of deeprc. to this end, the one-hot aa features of a sequence are multiplied by a scaling factor of log(c a ) before normalization, where c a is the abundance of a sequence. we feed the sequences with 23 features per position into the cnn. sequences of different lengths were zero-padded to the maximum sequence length per batch at the sequence ends. 1d cnn for motif recognition. in the following, we describe how deeprc identifies patterns in the individual sequences and reduces each sequence in the input object to a fixed-size feature vector. deeprc employs 1d convolution layers to extract patterns, where trainable weight kernels are convolved over the sequence positions. in principle, also recurrent neural networks (rnns) or transformer networks could be used instead of 1d cnns, however, (a) the computational complexity of the network must be low to be able to process millions of sequences for a single update. additionally, (b) the learned network should be able to provide insights in the recognized patterns in form of motifs. 
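the input encoding described above (a one-hot vector of 20 amino acids, 3 positional features, and an optional scaling by the log-abundance) can be sketched as follows. the exact shape of the positional features is only given in figure a1, so the piecewise-linear ramps used here are an assumption, as are the function names and the example sequence. the per-repertoire normalization to unit variance and the zero-padding are omitted.

```python
import math
from typing import List, Optional

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def position_features(t: int, length: int) -> List[float]:
    """Three features in [0, 1] for beginning, center, and end of a sequence.

    Assumed form: linear ramps over the relative position r in [0, 1].
    """
    r = 0.5 if length == 1 else t / (length - 1)
    start = max(0.0, 1.0 - 2.0 * r)
    center = 1.0 - abs(2.0 * r - 1.0)
    end = max(0.0, 2.0 * r - 1.0)
    return [start, center, end]


def encode_sequence(sequence: str, abundance: Optional[int] = None) -> List[List[float]]:
    """Return one 23-dimensional feature vector per sequence position."""
    # literal reading of the text: one-hot features are scaled by log(c_a);
    # note that an abundance of 1 would give a scaling factor of 0
    scale = 1.0 if abundance is None else math.log(abundance)
    features = []
    for t, aa in enumerate(sequence):
        one_hot = [0.0] * len(AMINO_ACIDS)
        one_hot[AA_INDEX[aa]] = scale
        features.append(one_hot + position_features(t, len(sequence)))
    return features


encoded = encode_sequence("CASSL")  # hypothetical receptor fragment
print(len(encoded), len(encoded[0]))  # 5 23
print(encoded[0][20:])  # [1.0, 0.0, 0.0]  (beginning)
print(encoded[2][20:])  # [0.0, 1.0, 0.0]  (center)
print(encoded[4][20:])  # [0.0, 0.0, 1.0]  (end)
```

keeping the positional features separate from the one-hot part means that the abundance scaling only affects the amino acid channels, as described in the text.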
both properties (a) and (b) are fulfilled by 1d convolution operations that are used by deeprc. we use one 1d cnn layer (hu et al., 2014) with selu activation functions (klambauer et al., 2017) to identify the relevant patterns in the input sequences with a computationally light-weight operation. the larger the kernel size, the more surrounding sequence positions are taken into account, which influences the length of the motifs that can be extracted. we therefore adjust the kernel size during hyperparameter search. in prior works (ostmeyer et al., 2019) , a k-mer size of 4 yielded good predictive performance, which could indicate that a kernel size in the range of 4 may be a proficient choice. for d v trainable kernels, this produces a feature vector of length d v at each sequence position. subsequently, global max-pooling over all sequence positions of a sequence reduces the sequence-representations z i to vectors of the fixed length d v . given the challenging size of the input data per repertoire, the computation of the cnn activations and weight updates is performed using 16-bit floating point values. a list of hyperparameters evaluated for deeprc is given in table a3 . a comparison of rnn-based and cnn-based sequence embedding for motif recognition in a smaller experimental setting is given in sec. a10. regularization. we apply random and attention-based subsampling of repertoire sequences to reduce over-fitting and decrease computational effort. during training, each repertoire is subsampled to 10, 000 input sequences, which are randomly drawn from the respective repertoire. this can also be interpreted as random drop-out (hinton et al., 2012) on the input sequences or attention weights. during training and evaluation, the attention weights computed by the attention network are furthermore used to rank the input sequences. based on this ranking, the repertoire is reduced to the 10% of sequences with the highest attention weights. 
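the two subsampling steps described above can be sketched in plain python. this is a minimal illustration under stated assumptions: the function names are hypothetical, the attention weights are given as a plain list rather than computed by a network, and rounding the number of kept sequences to the nearest integer is a choice made for the example, since the text does not specify it.

```python
import random
from typing import List, Optional, Sequence


def random_subsample(sequences: Sequence[str], max_size: int = 10_000,
                     rng: Optional[random.Random] = None) -> List[str]:
    """Randomly draw at most max_size sequences from a repertoire (training only)."""
    rng = rng or random.Random()
    if len(sequences) <= max_size:
        return list(sequences)
    return rng.sample(list(sequences), max_size)


def top_fraction_by_attention(attention_weights: Sequence[float],
                              fraction: float = 0.1) -> List[int]:
    """Return the indices of the sequences with the highest attention weights."""
    n_keep = max(1, round(len(attention_weights) * fraction))
    ranked = sorted(range(len(attention_weights)),
                    key=lambda i: attention_weights[i], reverse=True)
    return sorted(ranked[:n_keep])


# toy repertoire of 50,000 placeholder sequences
repertoire = [f"seq{i}" for i in range(50_000)]
subsample = random_subsample(repertoire, rng=random.Random(0))
print(len(subsample))  # 10000

# hypothetical attention weights from a first pass over 100 sequences
weights = [0.01 * i for i in range(100)]
print(top_fraction_by_attention(weights))  # [90, 91, ..., 99]
```

only the selected indices need to be kept in memory for the weight update, which is where the reduction in computational effort comes from.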
these top 10% of sequences are then used to compute the weight updates and the prediction for the repertoire. additionally, one might employ further regularization techniques, which we only partly investigated further in a smaller experimental setting in sec. a10 due to high computational demands. such regularization techniques include l1 and l2 weight decay, noise in the form of random aa permutations in the input sequences, noise on the attention weights, or random shuffling of sequences between repertoires that belong to the negative class. the last regularization technique assumes that the sequences in positive-class repertoires carry a signal, such as an aa motif corresponding to an immune response, whereas the sequences in negative-class repertoires do not. hence, the sequences can be shuffled randomly between negative class repertoires without obscuring the signal in the positive class repertoires. hyperparameters. for the hyperparameter search of deeprc for the category "simulated immunosequencing data", we only conducted a full hyperparameter search on the more difficult datasets with motif implantation probabilities below 1%, as described in table a3 . this process was repeated for all 5 folds of the 5-fold cross-validation (cv) and the average score on the 5 test sets constitutes the final score of a method. table a3 provides an overview of the hyperparameter search, which was conducted as a grid search for each of the datasets in a nested 5-fold cv procedure, as described in section a4. computation time and optimization. we took measures on the implementation level to address the high computational demands, especially gpu memory consumption, in order to make the large number of experiments feasible. we train the deeprc model with a small batch size of 4 samples and perform computation of inference and updates of the 1d cnn using 16-bit floating point values. the rest of the network is trained using 32-bit floating point values. 
the adam parameter for numerical stability was therefore increased from the default value of $\epsilon = 10^{-8}$ to $\epsilon = 10^{-4}$. training was performed on various gpu types, mainly nvidia rtx 2080 ti. computation times were highly dependent on the number of sequences in the repertoires and the number and sizes of cnn kernels. a single update on an nvidia rtx 2080 ti gpu took approximately 0.0129 to 0.0135 seconds, while requiring approximately 8 to 11 gb gpu memory. taking these optimizations and gpus with larger memory (≥ 16 gb) into account, it is already possible to train deeprc, possibly with multi-head attention and a larger network architecture, on larger datasets (see sec. a10). our network implementation is based on pytorch 1.3.1 (paszke et al., 2019). incorporation of additional inputs and metadata. additional metadata in the form of sequence-level or repertoire-level features could be incorporated into the input via concatenation with the feature vectors that result from taking the maximum of the 1d cnn outputs w.r.t. the sequence positions. this has the benefit that the attention mechanism and output network can utilize the sequence-level or repertoire-level features for their predictions. sparse metadata or metadata that is only available during training could be used as auxiliary targets to incorporate the information via gradients into the deeprc model. limitations. the current methods are mostly limited by computational complexity, since both hyperparameter and model selection are computationally demanding. for hyperparameter selection, a large number of hyperparameter settings have to be evaluated. for model selection, a single repertoire requires the propagation of many thousands of sequences through a neural network and keeping those quantities in gpu memory in order to perform the attention mechanism and weight update. thus, increased gpu memory would significantly boost our approach.
increased computational power would also allow for more advanced architectures and attention mechanisms, which may further improve predictive performance. another limiting factor is over-fitting of the model due to the currently relatively small number of samples (bags) in real-world immunosequencing datasets in comparison to the large number of instances per bag and features per instance.

we aimed at constructing immune repertoire classification scenarios with varying degrees of realism and difficulty in order to compare and analyze the suggested machine learning methods. to this end, we use either simulated or experimentally observed immune receptor sequences and implant signals, which are sequence motifs (weber et al., 2020), into sequences of repertoires of the positive class. it has been shown previously that the interaction of immune receptors with antigens occurs via short sequence stretches. thus, implantation of short motif sequences simulating an immune signal is biologically meaningful. our benchmarking study comprises four different categories of datasets: (a) simulated immunosequencing data with implanted signals (where the signal is defined as sets of motifs), (b) lstm-generated immunosequencing data with implanted signals, (c) real-world immunosequencing data with implanted signals, and (d) real-world immunosequencing data. each of the first three categories consists of multiple datasets with varying difficulty depending on the type of the implanted signal and the ratio of sequences with the implanted signal. the ratio of sequences with the implanted signal, where each sequence carries at most 1 implanted signal, corresponds to the witness rate (wr).

we consider binary classification tasks to simulate the immune status of healthy and diseased individuals. we randomly generate immune repertoires with varying numbers of sequences, where we implant sequence motifs in the repertoires of the diseased individuals, i.e. the positive class.
the sequences of a repertoire are also randomly generated by different procedures (detailed below). each sequence is composed of 20 different characters, corresponding to amino acids, and has an average length of 14.5 aas.

in the first category, we aim at investigating the impact of the signal frequency, i.e. the wr, and the signal complexity on the performance of the different methods. to this end, we created 18 datasets, where each dataset contains a large number of repertoires with a large number of random aa sequences per repertoire. we then implanted signals in the aa sequences of the positive class repertoires, where the 18 datasets differ in frequency and complexity of the implanted signals. in detail, the aas were sampled randomly, independent of their respective position in the sequence, while the frequencies of aas, the distribution of sequence lengths, and the distribution of the number of sequences per repertoire, i.e. the number of instances per bag, follow the respective distributions observed in the real-world cmv dataset (emerson et al., 2017). for this, we first sampled the number of sequences for a repertoire from a gaussian $\mathcal{N}(\mu = 316\mathrm{k}, \sigma = 132\mathrm{k})$ distribution and rounded to the nearest positive integer. we re-sampled if the size was below 5k. we then generated random sequences of aas with a length of $\mathcal{N}(\mu = 14.5, \sigma = 1.8)$, again rounded to the nearest positive integers. each simulated repertoire was then randomly assigned to either the positive or negative class, with 2,500 repertoires per class. in the repertoires assigned to the positive class, we implanted motifs with an average length of 4 aas, following the results of the experimental analysis of antigen-binding motifs in antibodies and t-cell receptor sequences. we varied the characteristics of the implanted motifs for each of the 18 datasets with respect to the following parameters: (a) ρ, the probability of a motif being implanted in a sequence of a positive repertoire, i.e.
the average ratio of sequences containing the motif, which is the witness rate. in this way, we generated 18 different datasets of variable difficulty containing in total roughly 28.7 billion sequences. see table a1 for an overview of the properties of the implanted motifs in the 18 datasets.

in the second dataset category, we investigate the impact of the signal frequency and complexity in combination with more plausible immune receptor sequences by taking into account the positional aa distributions and other sequence properties. to this end, we trained an lstm (hochreiter & schmidhuber, 1997) in a standard next-character prediction (graves, 2013) setting to create aa sequences with properties similar to experimentally observed immune receptor sequences. in the first step, the lstm model was trained on all immuno-sequences in the cmv dataset (emerson et al., 2017) that contain valid information about sequence abundance and have a known cmv label. such an lstm model is able to capture various properties of the sequences, including position-dependent probability distributions and combinations, relationships, and order of aas. we then used the trained lstm model to generate 1,000 repertoires in an autoregressive fashion, starting with a start sequence that was randomly sampled from the trained-on dataset. based on a visual inspection of the frequencies of 4-mers (see section a7), the similarity of lstm-generated sequences and real sequences was deemed sufficient for the purpose of generating the aa sequences for the datasets in this category. further details on lstm training and repertoire generation are given in section a7. after generation, each repertoire was assigned to either the positive or negative class, with 500 repertoires per class. we implanted motifs of length 4 with varying properties in the center of the sequences of the positive class to obtain 5 different datasets.
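the sampling steps described above for the simulated repertoires, together with the implantation of a motif in the center of a sequence, can be sketched as follows. this is a minimal illustration and not the authors' implementation: the function names and the `max_sequences` cap are hypothetical, aas are drawn uniformly instead of following the aa frequencies of the cmv dataset, and sequences shorter than the motif are left unchanged.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def sample_repertoire_size(rng, mu=316_000, sigma=132_000, min_size=5_000):
    """Draw a repertoire size from N(mu, sigma); re-sample if it is below min_size."""
    while True:
        size = round(rng.gauss(mu, sigma))
        if size >= min_size:
            return size


def sample_sequence(rng, len_mu=14.5, len_sigma=1.8):
    """Draw a sequence length from N(len_mu, len_sigma) and fill it with random AAs."""
    length = max(1, round(rng.gauss(len_mu, len_sigma)))  # nearest positive integer
    return "".join(rng.choice(AMINO_ACIDS) for _ in range(length))


def simulate_repertoire(rng, max_sequences=None):
    """Generate one repertoire; max_sequences is a hypothetical cap for small demos."""
    size = sample_repertoire_size(rng)
    if max_sequences is not None:
        size = min(size, max_sequences)
    return [sample_sequence(rng) for _ in range(size)]


def implant_motif(rng, sequences, motif, rho):
    """Overwrite the center of each sequence with the motif with probability rho.

    Returns the modified sequences and the number of sequences that received the motif.
    """
    implanted, n_witnesses = [], 0
    for seq in sequences:
        if len(seq) >= len(motif) and rng.random() < rho:
            start = (len(seq) - len(motif)) // 2
            seq = seq[:start] + motif + seq[start + len(motif):]
            n_witnesses += 1
        implanted.append(seq)
    return implanted, n_witnesses


rng = random.Random(0)
repertoire = simulate_repertoire(rng, max_sequences=1_000)
positive, n_witnesses = implant_motif(rng, repertoire, "LDRK", rho=0.01)
print(len(positive), n_witnesses)
```

dividing `n_witnesses` by the repertoire size gives the realized witness rate of one repertoire, which fluctuates around ρ.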
each sequence in the positive repertoires has a probability ρ to carry the motif, which was varied throughout the 5 datasets and corresponds to the wr (see table a1). each position in the motif has a probability of 0.9 to be implanted and consequently a probability of 0.1 that the original aa in the sequence remains, which can be seen as noise on the motif.

table a1: properties of simulated repertoires, variations of motifs, and motif frequencies, i.e. the witness rate, for the datasets in categories "simulated immunosequencing data", "lstm-generated data", and "real-world data with implanted signals". noise types for * are explained in paragraph "real-world data with implanted signals".

in the third category, we implanted signals into experimentally obtained immuno-sequences, where we considered 4 dataset variations. each dataset consists of 750 repertoires for each of the two classes, where each repertoire consists of 10k sequences. in this way, we aim to simulate datasets with a low sequencing coverage, which means that only relatively few sequences per repertoire are available. the sequences were randomly sampled from healthy (cmv negative) individuals from the cmv dataset (see the paragraph below for an explanation). two signal types were considered: (a) one signal with one motif. the aa motif ldr was implanted in a certain fraction of sequences. the pattern is randomly altered at one of the three positions with probabilities 0.2, 0.6, and 0.2, respectively. (b) one signal with multiple motifs. one of the three possible motifs ldr, cas, and gl-n was implanted with equal probability. again, the motifs were randomly altered before implantation. the aa motif ldr was altered as described above. the aa motif cas was altered at the second position with probability 0.6 and with probability 0.3 at the first position.
the pattern gl-n, where - denotes a gap location, is randomly altered at the first position with probability 0.6 and the gap has a length of 0, 1, or 2 aas with equal probability. additionally, the datasets differ in the values for ρ, the average ratio of sequences carrying a signal, which were chosen as 1% or 0.1%. the motifs were implanted at positions 107, 109, and 114 according to the imgt numbering scheme for immune receptor sequences (lefranc et al., 2003) with probabilities 0.3, 0.35 and 0.2, respectively. with the remaining 0.15 chance, the motif is implanted at any other sequence position. this means that the motif occurrence in the simulated sequences is biased towards the middle of the sequence.

we used a real-world dataset of 785 repertoires, each containing between 4,371 and 973,081 (avg. 299,319) tcr sequences with a length of 1 to 27 (avg. 14.5) aas, originally collected and provided by emerson et al. (2017). 340 out of 785 repertoires were labelled as positive for cytomegalovirus (cmv) serostatus, which we consider as the positive class, 420 repertoires were labelled with negative cmv serostatus, which we consider as the negative class, and 25 repertoires had unknown status. we changed the number of sequence counts per repertoire from −1 to 1 for 3 sequences. furthermore, we exclude a total of 99 repertoires with unknown cmv status or unknown information about the sequence abundance within a repertoire, reducing the dataset for our analysis to 686 repertoires, 312 of which have positive and 374 negative cmv status.

we give a non-exhaustive overview of previously considered mil datasets and problems in table a2. to our knowledge, the datasets considered in this work pose the most challenging mil problems with respect to the number of instances per bag (column 5).

table a2: mil datasets with their numbers of bags and numbers of instances. "total number of instances" refers to the total number of instances in the dataset.
the simulated and real-world immunosequencing datasets considered in this work contain a number of instances per bag that is orders of magnitude larger than in mil datasets that were considered by machine learning methods up to now.

we evaluate and compare the performance of deeprc against a set of machine learning methods that serve as baselines, were suggested, or can readily be adapted to immune repertoire classification. in this section, we describe these compared methods.

the known motif method serves as an estimate for the achievable classification performance using prior knowledge about which motif was implanted. note that this does not necessarily lead to perfect predictive performance since motifs are implanted with a certain amount of noise and could also be present in the negative class by chance. the known motif method counts how often the known implanted motif occurs per sequence for each repertoire and uses this count to rank the repertoires. from this ranking, the area under the receiver operating characteristic curve (auc) is computed as performance measure. probabilistic aa changes in the known motif are not considered for this count, with the exception of gap positions. we consider two versions of this method: (a) known motif binary: counts the occurrence of the known motif in a sequence and (b) known motif continuous: counts the maximum number of overlapping aas between the known motif and all sequence positions, which corresponds to a convolution operation with a binary kernel followed by max-pooling. since the implanted signal is not known in the experimentally obtained cmv dataset, this method cannot be applied to this dataset.

the support vector machine (svm) approach uses a fixed mapping from a bag of sequences to the corresponding k-mer counts. the function $h_{\text{kmer}}$ maps each sequence $s_i$ to a vector representing the occurrence of k-mers in the sequence.
to avoid confusion with the sequence representation obtained from the cnn layers of deeprc, we denote $u_i = h_{\text{kmer}}(s_i)$, which is analogous to $z_i$. specifically, $u_{im} = \#\{p_m \in s_i\}$, where $\#\{p_m \in s_i\}$ denotes how often the k-mer pattern $p_m$ occurs in sequence $s_i$. afterwards, average-pooling is applied to obtain $u = \frac{1}{N} \sum_{i=1}^{N} u_i$, the k-mer representation of the input object $X$. for two input objects $X^{(n)}$ and $X^{(l)}$ with representations $u^{(n)}$ and $u^{(l)}$, respectively, we implement the minmax kernel (ralaivola et al., 2005) as follows: $$k_{\text{minmax}}(u^{(n)}, u^{(l)}) = \frac{\sum_{m} \min(u^{(n)}_m, u^{(l)}_m)}{\sum_{m} \max(u^{(n)}_m, u^{(l)}_m)},$$ where $u^{(n)}_m$ is the m-th element of the vector $u^{(n)}$. the jaccard kernel (levandowsky & winter, 1971) is identical to the minmax kernel except that it operates on binary $u^{(n)}$. we used a standard c-svm, as introduced by cortes & vapnik (1995). the corresponding hyperparameter c is optimized by random search. the settings of the full hyperparameter search as well as the respective value ranges are given in table a4a.

the same k-mer representation of a repertoire, as introduced above for the svm baseline, is used for the k-nearest neighbor (knn) approach. as this method classifies samples according to distances between them, the previous kernel definitions cannot be applied directly. it is therefore necessary to transform the minmax as well as the jaccard kernel from similarities to distances by constructing the following (levandowsky & winter, 1971): $$d_{\text{minmax}}(u^{(n)}, u^{(l)}) = 1 - k_{\text{minmax}}(u^{(n)}, u^{(l)}), \quad \text{(a1)}$$ $$d_{\text{jaccard}}(u^{(n)}, u^{(l)}) = 1 - k_{\text{jaccard}}(u^{(n)}, u^{(l)}). \quad \text{(a2)}$$ the number of neighbors is treated as the hyperparameter and optimized by an exhaustive grid search. the settings of the full hyperparameter search as well as the respective value ranges are given in table a5.

we implemented logistic regression on the k-mer representation u of an immune repertoire. the model is trained by gradient descent using the adam optimizer (kingma & ba, 2014). the learning rate is treated as the hyperparameter and optimized by grid search.
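the k-mer representation, the minmax and jaccard kernels, and the derived distances used by the svm and knn baselines above can be sketched as follows. this is a minimal illustration under stated assumptions: the function names are hypothetical, the default k = 4 is chosen only for the example, the representation is stored as a sparse dictionary, and the minmax kernel is written in its usual form (sum of element-wise minima divided by sum of element-wise maxima).

```python
from collections import Counter


def kmer_counts(sequence, k=4):
    """Count how often each k-mer occurs in one sequence (u_i)."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def repertoire_representation(sequences, k=4):
    """Average-pool the per-sequence k-mer counts: u = 1/N * sum_i u_i."""
    total = Counter()
    for seq in sequences:
        total.update(kmer_counts(seq, k))
    n = len(sequences)
    return {kmer: count / n for kmer, count in total.items()}


def minmax_kernel(u, v):
    """MinMax similarity: sum of element-wise minima over sum of element-wise maxima."""
    keys = set(u) | set(v)
    numerator = sum(min(u.get(m, 0.0), v.get(m, 0.0)) for m in keys)
    denominator = sum(max(u.get(m, 0.0), v.get(m, 0.0)) for m in keys)
    return numerator / denominator if denominator > 0 else 0.0


def jaccard_kernel(u, v):
    """MinMax kernel applied to binarized (presence/absence) representations."""
    bu = {m: 1.0 for m, c in u.items() if c > 0}
    bv = {m: 1.0 for m, c in v.items() if c > 0}
    return minmax_kernel(bu, bv)


def kernel_to_distance(similarity):
    """Turn a similarity in [0, 1] into a distance, d = 1 - k."""
    return 1.0 - similarity


a = repertoire_representation(["CASSL", "CASSF"])
b = repertoire_representation(["CASSL"])
print(minmax_kernel(a, b), jaccard_kernel(a, b))
print(kernel_to_distance(minmax_kernel(a, b)))
```

a kernel matrix built from `minmax_kernel` could be passed to an svm that accepts precomputed kernels, and `kernel_to_distance` yields the pairwise distances a knn classifier needs.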
furthermore, we explored two regularization settings using combinations of l1 and l2 weight decay. the settings of the full hyperparameter search as well as the respective value ranges are given in table a6 . we implemented a burden test (emerson et al., 2017; li & leal, 2008; wu et al., 2011) in a machine learning setting. the burden test first identifies sequences or k-mers that are associated with the individual's class, i.e., immune status, and then calculates a burden score per individual. concretely, for each k-mer or sequence, the phi coefficient of the contingency table for absence or presence and positive or negative immune status is calculated. then, j k-mers or sequences with the highest phi coefficients are selected as the set of associated k-mers or sequences. j is a hyperparameter that is selected on a validation set. additionally, we consider the type of input features, sequences or k-mers, as a hyperparameter. for inference, a burden score per individual is calculated as the sum of associated k-mers or sequences it carries. this score is used as raw prediction and to rank the individuals. hence, we have extended the burden test by emerson et al. (2017) to k-mers and to adaptive thresholds that are adjusted on a validation set. the logistic multiple instance learning (mil) approach for immune repertoire classification (ostmeyer et al., 2019) applies a logistic regression model to each k-mer representation in a bag. the resulting scores are then summarized by max-pooling to obtain a prediction for the bag. each amino acid of each k-mer is represented by 5 features, the so-called atchley factors (atchley et al., 2005) . as k-mers of length 4 are used, this gives a total of 4 × 5 = 20 features. one additional feature per 4-mer is added, which represents the relative frequency of this 4-mer with respect to its containing bag, resulting in 21 features per 4-mer. 
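the burden test described above can be sketched as follows. this is a minimal illustration and not the authors' implementation: the function names are hypothetical, each repertoire is represented as a set of features (k-mers or sequences), and ties between equal phi coefficients are broken alphabetically, which is an arbitrary choice made for the example.

```python
import math


def phi_coefficient(presence, labels):
    """Phi coefficient of the 2x2 table (feature present/absent) x (label 1/0)."""
    n11 = sum(1 for p, y in zip(presence, labels) if p and y == 1)
    n10 = sum(1 for p, y in zip(presence, labels) if p and y == 0)
    n01 = sum(1 for p, y in zip(presence, labels) if not p and y == 1)
    n00 = sum(1 for p, y in zip(presence, labels) if not p and y == 0)
    denominator = math.sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    return (n11 * n00 - n10 * n01) / denominator if denominator > 0 else 0.0


def fit_burden_set(repertoires, labels, j):
    """Select the j features with the highest phi coefficients on the training data."""
    features = set().union(*repertoires)
    phi = {f: phi_coefficient([f in r for r in repertoires], labels) for f in features}
    ranked = sorted(phi, key=lambda f: (-phi[f], f))
    return set(ranked[:j])


def burden_score(repertoire, burden_set):
    """Number of associated features that the repertoire carries."""
    return len(burden_set & repertoire)


train = [{"CASS", "ASSL"}, {"CASS", "QQQQ"}, {"ASSL"}, {"QQQQ", "ASSL"}]
labels = [1, 1, 0, 0]
burden_set = fit_burden_set(train, labels, j=1)
print(burden_set, [burden_score(r, burden_set) for r in train])
```

in the setting described above, j and the feature type would be chosen on a validation set, and the scores would be used to rank the individuals.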
two options for the relative frequency feature exist, which are (a) whether the frequency of the 4-mer ("4mer") or (b) the frequency of the sequence in which the 4-mer appeared ("tcrβ") is used. we optimized the learning rate, batch size, and early stopping parameter on the validation set. the settings of the full hyperparameter search as well as the respective value ranges are given in table a8 . for all competing methods a hyperparameter search was performed, for which we split each of the 5 training sets into an inner training set and inner validation set. the models were trained on the inner training set and evaluated on the inner validation set. the model with the highest auc score on the inner validation set is then used to calculate the score on the respective test set. here we report the hyperparameter sets and search strategy that is used for all methods. deeprc. the set of hyperparameters of deeprc is shown in table a3 . these hyperparameter combinations are adjusted via a grid search procedure. table a3 : deeprc hyperparameter search space. every 5 · 10 3 updates, the current model was evaluated against the validation fold. the early stopping hyperparameter was determined by selecting the model with the best loss on the validation fold after 10 5 updates. * : experiments for {64; 128; 256} kernels were omitted for datasets with motif implantation probabilities ≥ 1% in the category "simulated immunosequencing data". known motif. this method does not have hyperparameters and has been applied to all datasets except for the cmv dataset. the corresponding hyperparameter c of the svm is optimized by randomly drawing 10 3 values in the range of [−6; 6] according to a uniform distribution. these values act as the exponents of a power of 10 and are applied for each of the two kernel types (see table a4a ). knn. 
the number of neighbors is treated as the hyperparameter and optimized by grid search operating in the discrete range of $[1; \min\{N, 10^3\}]$ with a step size of 1. the corresponding tight upper bound is automatically defined by the total number of samples $N \in \mathbb{N}_{>0}$ in the training set, capped at $10^3$ (see table a5).

number of neighbors: $[1; \min\{N, 10^3\}]$
type of kernel: {minmax; jaccard}
table a5: settings used in the hyperparameter search of the knn baseline approach. the number of trials (per type of kernel) is automatically defined by the total number of samples $N \in \mathbb{N}_{>0}$ in the training set, capped at $10^3$.

logistic regression. the hyperparameter optimization strategy was a grid search across the hyperparameters given in table a6.

learning rate: $10^{-\{2; 3; 4\}}$
batch size: 4
max. updates: $10^5$
coefficient $\beta_1$ (adam): 0.9
coefficient $\beta_2$ (adam): 0.999
weight decay weightings: {(l1 = $10^{-7}$, l2 = $10^{-3}$); (l1 = $10^{-7}$, l2 = $10^{-5}$)}
table a6: settings used in the hyperparameter search of the logistic regression baseline approach.

burden test. the burden test selects two hyperparameters: the number of features in the burden set and the type of features, see table a7.

number of features in burden set: {50, 100, 150, 250}
type of features: {4mer; sequence}
table a7: settings used in the hyperparameter search of the burden test approach.

logistic mil. for this method, we adjusted the learning rate as well as the batch size as hyperparameters by randomly drawing 25 different hyperparameter combinations from a uniform distribution. the corresponding range of the learning rate is [−4.5; −1.5], which acts as the exponent of a power of 10. the batch size lies within the range of [1; 32]. for each hyperparameter combination, a model is optimized by gradient descent using adam, where the early stopping parameter is adjusted according to the corresponding validation set (see table a8).
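the sampling schemes described above for the svm, knn, and logistic mil baselines can be sketched as follows. this is a minimal illustration with hypothetical function names; the upper bound of the knn grid is taken as min(n, 10^3), following the description "capped at 10^3", and the batch size is drawn as a uniformly distributed integer.

```python
import random


def sample_svm_c(rng, n_trials=1000):
    """Draw exponents uniformly from [-6, 6] and use them as powers of 10 for C."""
    return [10 ** rng.uniform(-6.0, 6.0) for _ in range(n_trials)]


def knn_neighbor_grid(n_train, cap=1000):
    """All neighbor counts from 1 up to the training set size, capped at `cap`."""
    return list(range(1, min(n_train, cap) + 1))


def sample_logistic_mil_trials(rng, n_trials=25):
    """Draw random (learning rate, batch size) combinations for the logistic MIL search."""
    return [
        {
            "learning_rate": 10 ** rng.uniform(-4.5, -1.5),  # exponent drawn uniformly
            "batch_size": rng.randint(1, 32),  # both bounds inclusive
        }
        for _ in range(n_trials)
    ]


rng = random.Random(0)
print(len(sample_svm_c(rng)), len(knn_neighbor_grid(548)))
print(sample_logistic_mil_trials(rng)[0])
```

drawing the exponent rather than the value itself spreads the trials evenly across orders of magnitude, which is the usual motivation for this kind of search over learning rates and regularization strengths.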
learning rate: $10^{\{-4.5; -1.5\}}$
batch size: {1; 32}
relative abundance term: {4mer; tcrβ}
number of trials: 25
max. epochs: $10^2$
coefficient $\beta_1$ (adam): 0.9
coefficient $\beta_2$ (adam): 0.999
table a8: settings used in the hyperparameter search of the logistic mil baseline approach. the number of trials (per type of relative abundance) defines the number of combinations of random values of the learning rate and the batch size.

in this section, we report the detailed results on all four categories of datasets: (a) simulated immunosequencing data (table a9), (b) lstm-generated data (table a10), (c) real-world data with implanted signals (table a11), and (d) real-world data on the cmv dataset (table a12), as discussed in the main paper.

± 0.000 ± 0.000 ± 0.271 ± 0.000 ± 0.000 ± 0.218 ± 0.000 ± 0.000 ± 0.029 ± 0.000 ± 0.001 ± 0.017 ± 0.001 ± 0.002 ± 0.023 ± 0.001 ± 0.048 ± 0.013 ± 0.223
svm (minmax) 1.000 1.000 0.764 1.000 1.000 0.603 1.000 0.998 0.539 1.000 0.994 0.529 1.000 0.741 0.513 1.000 0.706 0.503 0.827
± 0.000 ± 0.000 ± 0.016 ± 0.000 ± 0.000 ± 0.021 ± 0.000 ± 0.002 ± 0.024 ± 0.000 ± 0.004 ± 0.016 ± 0.000 ± 0.024 ± 0.006 ± 0.000 ± 0.013 ± 0.013 ± 0.013 ± 0.013 ± 0.014 ± 0.011 ± 0.009 ± 0.007 ± 0.008 ± 0.011 ± 0.012 ± 0.012 ± 0.007 ± 0.014 ± 0.017 ± 0.010 ± 0.020 ± 0.012 ± 0.016 ± 0.016 ± 0.074
known motif b. 1.000 1.000 0.973 1.000 1.000 0.865 1.000 1.000 0.700 1.000 0.989 0.609 1.000 0.946 0.570 1.000 0.834 0.532 0.890
± 0.000 ± 0.000 ± 0.004 ± 0.000 ± 0.000 ± 0.004 ± 0.000 ± 0.000 ± 0.020 ± 0.000 ± 0.002 ± 0.017 ± 0.000 ± 0.010 ± 0.024 ± 0.000 ± 0.016 ± 0.020 ± 0.001 ± 0.014 ± 0.020 ± 0.001 ± 0.013 ± 0.017 ± 0.001 ± 0.012 ± 0.012 ± 0.001 ± 0.018 ± 0.018 ± 0.002 ± 0.010 ± 0.009 ± 0.002 ± 0.012 ± 0.013 ± 0.202
table a9: auc estimates based on 5-fold cv for all 18 datasets in category "simulated immunosequencing data".
the reported errors are standard deviations across the 5 cross-validation folds except for the last column "avg.", in which they show standard deviations across datasets. wildcard characters in motifs are indicated by z, characters with 50% probability of being removed by d . table a10 : auc estimates based on 5-fold cv for all 5 datasets in category "lstm-generated data". the reported errors are standard deviations across the 5 cross-validation folds except for the last column "avg.", in which they show standard deviations across datasets. characters affected by noise, as described in a3, paragraph "lstm-generated data", are indicated by r . table a12 : results on the cmv dataset (real-world data) in terms of auc, f1 score, balanced accuracy, and accuracy. for f1 score, balanced accuracy, and accuracy, all methods use their default thresholds. each entry shows mean and standard deviation across 5 cross-validation folds. we trained a conventional next-character lstm model (graves, 2013) based on the implementation in https://github.com/spro/practical-pytorch (access date 1st of may, 2020) using pytorch 1.3.1 (paszke et al., 2019) . for this, we applied an lstm model with 100 lstm blocks in 2 layers, which was trained for 5, 000 epochs using the adam optimizer (kingma & ba, 2014) with learning rate 0.01, an input batch size of 100 character chunks, and a character chunk length of 200. as input we used the immuno-sequences in the cdr3 column of the cmv dataset, where we repeated sequences according to their counts in the repertoires, as specified in the templates column of the cmv dataset. we excluded repertoires with unknown cmv status and unknown sequence abundance from training. after training, we generated 1, 000 repertoires using a temperature value of 0.8. the number of sequences per repertoire was sampled from a gaussian n (µ = 285k, σ = 156k) distribution, where the whole repertoire was generated by the lstm at once. 
that is, the lstm can base the generation of the individual aa sequences in a repertoire, including the aas and the lengths of the sequences, on the generated repertoire. a random immuno-sequence from the trained-on repertoires was used as initialization for the generation process. this immuno-sequence was not included in the generated repertoire. finally, we randomly assigned 500 of the generated repertoires to the positive (diseased) and 500 to the negative (healthy) class. we then implanted motifs in the positive class repertoires as described in section a3.2. as illustrated in the comparison of histograms given in fig. a2 , the generated immuno-sequences exhibit a very similar distribution of 4-mers and aas compared to the original cmv dataset. real-world data deeprc allows for two forms of interpretability methods. (a) due to its attention-based design, a trained model can be used to compute the attention weights of a sequence, which directly indicates its importance. (b) deeprc furthermore allows for the usage of contribution analysis methods, such as integrated gradients (ig) (sundararajan et al., 2017) or layer-wise relevance propagation (montavon et al., 2018; arras et al., 2019; montavon et al., 2019; preuer et al., 2019) . we apply ig to identify the input patterns that are relevant for the classification. to identify aa patterns with high contributions in the input sequences, we apply ig to the aas in the input sequences. additionally, we apply ig to the kernels of the 1d cnn, which allows us to identify aa motifs with high contributions. in detail, we compute the ig contributions for the aas and positional features in the kernels for every repertoire in the validation and test set, so as to exclude potential artifacts caused by over-fitting. averaging the ig values over these repertoires then results in concise aa motifs. we include qualitative visual analyses of the ig method on different datasets below. 
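the integrated gradients computation referred to above can be sketched as follows. this is a minimal illustration on a toy logistic model with an analytic gradient, which stands in for the trained network; the function names, the toy weights, the all-zero baseline, and the default number of steps are assumptions made for the example, and the path integral is approximated by a riemann sum along the straight line from the baseline to the input.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def model(x, w, b):
    """Toy stand-in for a trained network: logistic regression on the input features."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def model_gradient(x, w, b):
    """Analytic gradient of the toy model with respect to its input."""
    p = model(x, w, b)
    return [p * (1.0 - p) * wi for wi in w]


def integrated_gradients(x, baseline, w, b, steps=50):
    """Riemann-sum approximation of integrated gradients for each input feature."""
    avg_grad = [0.0] * len(x)
    for k in range(1, steps + 1):
        alpha = k / steps
        point = [x0 + alpha * (xi - x0) for xi, x0 in zip(x, baseline)]
        for i, g in enumerate(model_gradient(point, w, b)):
            avg_grad[i] += g / steps
    # Scale the averaged gradient by the distance between input and baseline.
    return [(xi - x0) * g for xi, x0, g in zip(x, baseline, avg_grad)]


w, b = [2.0, -1.0, 0.5], 0.0
x, baseline = [1.0, 1.0, 1.0], [0.0, 0.0, 0.0]
attributions = integrated_gradients(x, baseline, w, b)
print(attributions)
print(sum(attributions), model(x, w, b) - model(baseline, w, b))
```

the two values in the last line should nearly coincide, since the attributions of integrated gradients sum approximately to the difference between the model output at the input and at the baseline; this is a convenient check that enough steps were used.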
here, we provide examples for the interpretation of trained deeprc models using integrated gradients (ig) (sundararajan et al., 2017) as contribution analysis method. the following illustrations were created using 50 ig steps, which we found sufficient to achieve stable ig results. a visual analysis of deeprc models on the simulated datasets, as illustrated in tab. a13 and fig. a3, shows that the implanted motifs can be successfully extracted from the trained model and are straightforward to interpret. in the real-world cmv dataset, deeprc finds complex patterns with high variability in the center regions of the immuno-sequences, as illustrated in figure a4.

real-world data with implanted signals
extracted motif: [images]
implanted motif(s): g^r s^r a^r f^r | l^r d^r r^r | {l^r d^r r^r; c^r a^r s; g^r l-n}
motif freq. ρ: 0.05% | 0.1% | 0.1%
table a13: visualization of motifs extracted from trained deeprc models for datasets from categories "simulated immunosequencing data", "lstm-generated data", and "real-world data with implanted signals". motif extraction was performed using integrated gradients on the 1d cnn kernels over the validation set and test set repertoires of one cv fold. wildcard characters are indicated by z, random noise on characters by ^r, characters with 50% probability of being removed by ^d, and gap locations of random lengths of {0; 1; 2} by -. larger characters in the extracted motifs indicate higher contribution, with blue indicating positive contribution and red indicating negative contribution towards the prediction of the diseased class. contributions to positional encoding are indicated by < (beginning of sequence), ∧ (center of sequence), and > (end of sequence). only kernels with relatively high contributions are shown, i.e. with contributions roughly greater than the average contribution of all kernels.

figure a3: integrated gradients applied to input sequences of positive class repertoires.
three sequences with the highest contributions to the prediction of their respective repertoires are shown. a) input sequence taken from "simulated immunosequencing data" with implanted motif sz d z d n and motif implantation probability 0.1%. the deeprc model reacts to the s and n at the 5 th and 8 th sequence position, thereby identifying the implanted motif in this sequence. b) and c) input sequence taken from "real-world data with implanted signals" with implanted motifs {l r d r r r ; c r a r s; g r l-n} and motif implantation probability 0.1%. the deeprc model reacts to the fully implanted motif cas (b) and to the partly implanted motif aas c and a at the 5 th and 7 th sequence position (c), thereby identifying the implanted motif in the sequences. wildcard characters in implanted motifs are indicated by z, characters with 50% probability of being removed by d , and gap locations of random lengths of {0; 1; 2} by -. larger characters in the sequences indicate higher contribution, with blue indicating positive contribution and red indicating negative contribution towards the prediction of the diseased class. figure a4 : visualization of the contributions of characters within a sequence via ig. each sequence was selected from a different repertoire and showed the highest contribution in its repertoire. the model was trained on cmv dataset, using a kernel size of 9, 32 kernels and 137 repertoires for early stopping. larger characters in the extracted motifs indicate higher contribution, with blue indicating positive contribution and red indicating negative contribution towards the prediction of the disease class. table a14 : tcrβ sequences that had been discovered by emerson et al. (2017) with their associated attention values by deeprc. these sequences have significantly (p-value 1.3e-93) higher attention values than other sequences. 
the column "quantile" provides the quantile values of the empirical distribution of attention values across all sequences in the dataset.

in this section we investigate the impact of different variations of deeprc on the performance on the cmv dataset. we consider both a cnn-based sequence embedding, as used in the main paper, and an lstm-based sequence embedding. in both cases we vary the number of attention heads and the β parameter for the softmax function of the attention mechanism (see eq. 2 in main paper). for the cnn-based sequence embedding we also vary the number of cnn kernels and the kernel sizes used in the 1d cnn. for the lstm-based sequence embedding we use one unidirectional lstm layer, of which the output values at the last sequence position (without padding) are taken as embedding of the sequence. here we vary the number of lstm blocks in the lstm layer. to counter over-fitting due to the increased complexity of these deeprc variations, we added an l2 weight penalty to the training loss. the factor with which the l2 weight penalty contributes to the training loss is varied over 3 orders of magnitude, where suitable value ranges were manually determined on one of the training folds beforehand. to reduce the computational effort, we do not consider all numbers of kernels that were considered in the main paper. furthermore, we only compute the auc scores on 3 of the 5 cross-validation folds. the hyperparameters, which were used in a grid search procedure, are listed in tab. a15 for the cnn-based sequence embedding and tab. a16 for the lstm-based sequence embedding.

results. we show performance in terms of auc score with single hyperparameters set to fixed values so as to investigate their influence in tab. a18 for the cnn-based sequence embedding and tab. a17 for the lstm-based sequence embedding.
we note that due to restricted computational resources this study was conducted with fewer different numbers of cnn kernels, with the auc estimated from only 3 of the 5 cross-validation folds, which leads to a slight decrease of performance in comparison to the full hyperparameter search and cross-validation procedure used in the main paper. as can be seen in tab. a18 and a17, the lstm-based sequence embedding generalizes slightly better than the cnn-based sequence embedding. table a17 : impact of hyperparameters on deeprc with lstm for sequence encoding. mean ("mean") and standard deviation ("std") for the area under the roc curve over the first 3 folds of a 5-fold nested cross-validation for different sub-sets of hyperparameters ("sub-set") are shown. the following sub-sets were considered: "full": full grid search over hyperparameters; "beta=*": grid search over hyperparameters with reduction to specific value * of beta value of attention softmax; "heads=*": grid search over hyperparameters with reduction to specific number * of attention heads; "lstms=*": grid search over hyperparameters with reduction to specific number * of lstm blocks for sequence embedding. table a18 : impact of hyperparameters on deeprc with 1d cnn for sequence encoding. mean ("mean") and standard deviation ("std") for the area under the roc curve over the first 3 folds of a 5-fold nested cross-validation for different sub-sets of hyperparameters ("sub-set") are shown. 
the following sub-sets were considered: "full": full grid search over hyperparameters; "beta=*": grid search over hyperparameters with reduction to specific value * of beta value of attention softmax; "heads=*": grid search over hyperparameters with reduction to specific number * of attention heads; "ksize=*": grid search over hyperparameters with reduction to specific kernel size * of 1d cnn kernels for sequence embedding; "kernels=*": grid search over hyperparameters with reduction to specific number * of 1d cnn kernels for sequence embedding.
the ellis unit linz, the lit ai lab and the institute for machine learning are supported by the land oberösterreich, lit grants deeptoxgen ( in the following, the appendix to the paper
"modern hopfield networks and attention for immune key: cord-306725-0vam15pt authors: li, hao; zhang, bin; yue, hua; tang, cheng title: first detection and genomic characteristics of bovine torovirus in dairy calves in china date: 2020-05-09 journal: arch virol doi: 10.1007/s00705-020-04657-9 sha: doc_id: 306725 cord_uid: 0vam15pt bovine torovirus (btov) is a diarrhea-causing pathogen. in this study, 92 diarrheic fecal samples from five farms in four provinces in china were collected and tested for btov using a rt-pcr assay, and 21.73% samples were found to be btov positive. moreover, two complete btov genome sequences (mn073058 and mn073059) were obtained from the clinical samples, which were 28,297 and 28,301 nucleotides in length, respectively. sequence analysis showed that the two isolates shared 10 identical amino acid mutations in the s protein compared to the complete s sequences of btov available in the genbank database. in addition, seven consecutive amino acid mutations were found from aa 1,486 to 1,492 in the s protein of isolate mn073058. moreover, the two isolates shared one identical amino acid mutation in the receptor binding sites of the he protein. to the best of our knowledge, this is the first report on the epidemic and genomic characterization of btov in china, which is helpful for further understanding the genetic evolution of btov. electronic supplementary material: the online version of this article (10.1007/s00705-020-04657-9) contains supplementary material, which is available to authorized users. toroviruses are members of the family coronaviridae, order nidovirales, and include bovine torovirus (btov) [1, 2] , berne virus (etov) [3] , porcine torovirus (ptov) [4] , and human torovirus (htov) [5] . btov mainly causes diarrhea in calves and adult cattle. the virus can not only be detected in feces but also in the respiratory tract, indicating that the virus has dual tissue tropism [6] [7] [8] [9] . 
btov has been detected in 16 countries with a wide geographical distribution [2, 7, 10-19] . in addition, a sequence reported to be from goat torovirus is present in the genbank database, and in 2012, researchers detected the presence of ptov in pig herds by rt-pcr in china [20] . presently, three btov genomic sequences (ay427798, lc088094, lc088095) can be obtained from the genbank database. the btov genome encodes four structural proteins: spike glycoprotein (s), membrane glycoprotein (m), hemagglutinin esterase (he), and nucleocapsid protein (n) [21] . the s protein is involved in viral infectivity and induces the production of neutralizing antibodies [22] . the m protein plays a role in btov assembly and nucleocapsid recognition [21] . the he has a putative f-g-d-s motif and has acetylesterase activity specific for n-acetyl-9-o-acetylneuraminic acid [1, 21] . he contains three functional domains: a lectin domain (r), an esterase domain (e) and a membrane-proximal domain (mp) [23] . the n protein is the only viral rna-binding polypeptide found in infected cells. it protects the genome and ensures its timely replication and reliable transmission, as well as playing a role in virus transcription and translation [24, 25] . diarrhea is a common disease in dairy cows in china, which leads to serious economic losses. bovine coronavirus (bcov), bovine group a rotavirus (brva), bovine viral diarrhea virus (bvdv), bovine norovirus (bnov), and nebovirus (nev) have been identified as common diarrhea-causing viruses in bovines in china [26] [27] [28] [29] [30] . however, there is currently no information regarding btov in china. the goal of this study was to detect btov in calves in china and to characterize its genome. from september to december 2018, 92 diarrheic fecal samples were collected from calves (aged < 3 months) from five farms in four provinces of china. 
these included 23 samples from sichuan province (one farm), 39 from liaoning province (two farms), 20 from shanxi province (one farm), and 10 from henan province (one farm). all samples were stored at -80 °c in sterile 50-ml centrifuge tubes until further testing. total rna from 300 μl of the fecal suspension was extracted using a qiaamp viral rna mini kit (qiagen, hilden, germany) according to the manufacturer's instructions. complementary dna (cdna) was synthesized using a primescript™ reverse transcription kit (takara, dalian, china) and stored at -20 °c. btovs were detected using a specific rt-pcr assay established in our laboratory. the specificity and reproducibility of the rt-pcr assay have been validated, and the detection limit is 1.908 × 10^-1 pg/µl. briefly, a pair of primers to investigate coinfection with brva, bcov, bvdv, nev and bnov, all btov-positive samples were subjected to previously described specific rt-pcr assays for these viruses [27-29, 31] . brva, bcov, and nev were detected as described in our previous reports, bvdv was detected as described in a previous report [], and bnov was detected as described in a previous report []. to investigate the genomic characteristics of btov isolates from china, 29 primer pairs were designed to amplify their genomes (table s1 ). the 3' and 5' ends of the viral genome were sequenced by rapid amplification of cdna ends using a smart race cdna amplification kit (takara). the pcr products were cloned into the pmd19-t simple vector (takara) for further sequencing (sangon biotech). sequences were assembled using seqman software (version 7.0; dnastar, inc., wi, usa). putative orfs in the linear genomes were identified using the orf finder tool (http://www.ncbi.nlm.nih.gov/gorf/gorf.html). nucleotide and deduced amino acid sequences were compared using the megalign program of lasergene software, version 7.1 (dnastar, madison, wi, usa). 
a multiple sequence alignment was performed, and a neighbor-joining phylogenetic tree was built using the interior branch test method with the aid of mega7 software. possible recombination events were identified using simplot software (version 3.5.1) and the recombination detection program (rdp 4.95) by the rdp, chimaera, bootscan, 3seq, geneconv, max-chi, and siscan methods. among the 92 diarrheic samples, 21.73% (20/92) tested btov-positive by rt-pcr. the detection rate was 69.56% a portion of the m gene of 20 positive samples was sequenced, and all partial sequences of the m gene (genbank accession no. mn073058-mn073079) were submitted to the genbank database. the 20 nucleotide sequences were 98.8% to 100% identical to one another and 96.3% to 99.3% identical to those of previously reported btov strains. a phylogenetic tree was constructed using the 406-bp m fragment sequence, including the 20 btov sequences from calves as well as 10 other btov sequences from the genbank database. the 20 btov sequences from this study were divided into two different groups (fig. s1). one of these groups included btov isolates from three different provinces, which formed a unique large branch, but the one remaining strain clustered with the turkish strains ht1 and ht2. to obtain more precise information about the evolutionary relationships of the btov isolates from calves, six representative btov isolates were chosen from four positive farms, and two btov isolates were chosen from each province. however, we were only able to successfully assemble two complete genomic sequences (sc-1 sichuan / 2018 and sc-2 sichuan / 2018), both of which were from the same farm in sichuan province. these sequences were submitted to the genbank database with accession numbers mn073058 and mn073059. the linear genomes were 28,297 and 28,301 nt in length, respectively, and the g + c content was 36.36%. 
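the two summary statistics used throughout these results, g + c content and pairwise percent identity, can be sketched as follows. this is a minimal illustration under stated assumptions: the sequences passed to `percent_identity` are already aligned to equal length, and columns containing a gap in either sequence are skipped. megalign and mega may treat gaps and ambiguous bases differently, so the function names and the gap handling here are hypothetical choices and not the authors' procedure.

```python
def gc_content(seq):
    """Percentage of G and C among the unambiguous A/C/G/T positions."""
    bases = [b for b in seq.upper() if b in "ACGT"]
    if not bases:
        raise ValueError("sequence contains no unambiguous bases")
    gc = sum(1 for b in bases if b in "GC")
    return 100.0 * gc / len(bases)


def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity over aligned columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    matches = 0
    compared = 0
    for x, y in zip(aligned_a.upper(), aligned_b.upper()):
        if x == gap or y == gap:
            continue  # assumption: gapped columns are ignored
        compared += 1
        if x == y:
            matches += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return 100.0 * matches / compared


if __name__ == "__main__":
    print(gc_content("ATGC"))
    print(percent_identity("ATGC-A", "ATGGTA"))
```

the same `percent_identity` function applies to deduced amino acid sequences, since it only compares aligned characters.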
the complete genome of the strains shared 42.9-99.2% nt sequence identity with the 10 genome sequences of mammalian tovs in the genbank database. analysis of the complete genome sequences revealed that the two chinese btovs shared 99.1% nucleotide sequence identity with each other, 96.89% to 97.07% identity with the two genome sequences from japan, and 81.9% nt identity with the prototype strain breda 1 (fig. 1). the complete s genes of strains sc-1 and sc-2 were both 4,755 bp in length and encoded a protein of 1,586 amino acids. the complete s genes shared 99.7% nt sequence identity and 99.3% aa sequence identity with each other and shared 92.2-97.6% nt sequence identity (91.3-98.4% aa sequence identity) with the 12 btov s sequences available in the genbank database. a phylogenetic analysis based on the complete amino acid sequence of the s protein showed that the btovs could be separated into four groups (fig. 2), designated tentatively as group 1 to group 4. the two btovs clustered with the btov strains b145 and ht2 belonging to group 3 and were independently clustered in a small branch. further analysis showed that, compared to the 12 available btov s sequences, strains sc-1 and sc-2 had 10 amino acid mutations, of which two were located in the signal peptide, six were located in the putative s1 region, and two were located in the putative s2 region. in addition, sc-1 had a continuous 7-amino-acid mutation at the end of the putative s2 region, and sc-2 had a unique amino acid mutation in the putative s2 region (fig. 3).
fig. 1 phylogenetic tree based on the complete genomic nucleotide sequence of all mammalian toroviruses. sequence alignments were performed using clustalw in mega 7.0 software. the tree was constructed by the maximum-likelihood method with bootstrap values calculated for 1,000 replicates. the scale bar indicates amino acid substitutions per site. bovine torovirus was found in calves and may be related to members of the genus coronavirus. the bovine torovirus strains btov/sc-1/china and btov/sc-2/china investigated in this study are indicated by black triangles
fig. 2 phylogenetic tree based on the deduced 1586-aa sequence of the complete s gene. sequence alignments were performed using clustalw in mega 7.0 software. the tree was constructed by the maximum-likelihood method with bootstrap values calculated for 1,000 replicates. the bovine torovirus strains investigated in this study are indicated by black triangles
the complete he genes of strains sc-1 and sc-2 are 1,260 bp in length and encode a protein of 419 amino acids. f-g-d-s, the putative esterase active site in all he proteins, was located at aa positions 34-37. the complete he genes share 77.1-98.7% nt sequence identity (76.2-98.1% aa sequence identity) with all 16 available btov he gene sequences. phylogenetic trees constructed using full-length he gene sequences showed that the two he sequences clustered into the same subgroup and belonged to genotype ii (fig. 4). furthermore, there was a unique amino acid variation (l202f) in strains sc-1 and sc-2 compared to the 16 available btov he sequences, and it is interesting that this mutation is located in the neutralizing epitope. in addition, the sc-1 strain had a unique amino acid variation (i289y), and the sc-2 strain had three unique amino acid variations (k137s, s238q, w290g) compared to the other available btov he sequences (fig. 5). btov sc-1 and sc-2 were predicted to be recombinant strains using the recombination detection program (rdp 4.0) software, and standard similarity plot analysis was performed using simplot 3.5.1 (fig. 6). notably, both strains sc-1 and sc-2 had the same predicted breakpoints in the crossover region, at nucleotide positions 20,810 and 28,047 in the 3' end of orf1b and the 5' end of the n coding region, respectively. in this study, 21.73% (20/92) of the diarrheic stool samples collected from dairy cows were found to be btov positive. 
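the similarity plot analysis mentioned in the results can be sketched as a sliding-window comparison of a query genome against one reference. the default window of 500 nt and step of 20 nt follow the settings reported in the fig. 6 caption. everything else is an assumption for illustration: the function name `sliding_window_similarity` is hypothetical, the inputs must already be aligned to equal length, and the score is plain percent identity, whereas simplot may apply distance corrections that are not reproduced here.

```python
def sliding_window_similarity(query, reference, window=500, step=20, gap="-"):
    """Return (window midpoint, percent identity) pairs along an alignment."""
    if len(query) != len(reference):
        raise ValueError("aligned sequences must have the same length")
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    points = []
    for start in range(0, len(query) - window + 1, step):
        matches = 0
        compared = 0
        q_win = query[start:start + window].upper()
        r_win = reference[start:start + window].upper()
        for x, y in zip(q_win, r_win):
            if x == gap or y == gap:
                continue  # assumption: gapped columns are ignored
            compared += 1
            if x == y:
                matches += 1
        if compared:
            points.append((start + window // 2, 100.0 * matches / compared))
    return points


if __name__ == "__main__":
    # toy alignment: the query matches the reference only in its first half
    query = "A" * 10 + "C" * 10
    reference = "A" * 20
    for midpoint, identity in sliding_window_similarity(query, reference, 10, 5):
        print(midpoint, identity)
```

plotting such curves for several references against the same query shows where the closest reference changes along the genome, which is the kind of pattern a recombination breakpoint would produce.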
these samples were from four farms in three provinces, and the two most distant farms were separated by more than 1500 kilometers. the results indicate that btov has circulated in chinese cows with wide geographical (table 1) , which is similar to the previous report from the usa [32] . this may lead to an increase in clinical severity and increased difficulty in diagnosing and controlling calf diarrhea. in this research, we obtained the complete genome sequences of two btov isolates from the same farm in sichuan province, increasing the number of btov genome sequences in the genbank database to five, thus contributing to a better understanding of the genome structure and genetic evolution of btov. phylogenetic analysis indicated that these two btov isolates had a close genetic relationship to strains from japan. strains sc-1 and sc-2 were predicted to be recombinants derived from btov strains breda1 and ptov, in agreement with a previous report [8] . moreover, the two chinese strains shared identical unique amino acid changes in the s and he genes when compared to the other strains with sequences available in the genbank database, indicating unique evolution of chinese btov strains. interestingly, although these two isolates came from the same farm, their s and he genes had their own unique amino acid variations. these results will be helpful for understanding the genetic evolution of btov. the s protein is involved in the induction of neutralizing antibody production during the infection process and plays an important role in the pathogenesis of btov, which is similar to that of coronaviruses [22, 24, 33] . another member of the family coronaviridae, bcov, has a genomic structure similar to that of btov, and mutations in the s protein are associated with viral antigenicity, pathogenicity, tissue tropism, and host range change [34, 35] . 
it has been reported that cleavage most likely occurs between amino acids 1,002 and 1,003 of the mature s protein in btov, resulting in the formation of the s1 and s2 proteins [24] . the two chinese strains shared two identical amino acid mutations in the signal peptide that might affect its function. it is worth noting that these two strains have six identical amino acid mutations in the putative s1 region. since the s1 protein is responsible for receptor recognition and neutralizing antibody induction, mutations at amino acids 60 and 96 (t60i, l96p) might affect virus replication [36] . therefore, the effects of these unique amino acid variations on the function of the s proteins of strains sc-1 and sc-2 are worth further study. based on the he gene sequences, btov can be divided into three genotypes [18] . in this study, a phylogenetic tree constructed using all 18 complete he gene sequences available in the genbank database showed that the two he sequences belonged to a genotype that is prevalent throughout the world [10, 12, 25, 37] . the torovirus he protein contains three domains: mp, e and r, among which the r domain is involved in receptor recognition and plays an important role in the process of btov infection [38] . the two chinese strains shared an identical unique amino acid change in the e region. in addition, strain sc-2 had another unique amino acid change in the e region. therefore, the ability of these mutations to affect viral receptor binding requires further investigation.
fig. 6 similarity plot analysis of the complete genome sequences of ptov shi (blue line), btov breda (red line), and ptov npl (pink line), with btov sc-1 and sc-2 as query sequences, using a sliding window of 500 nt and a moving step size of 20 nt
in conclusion, this study is the first to confirm the existence and prevalence of btov in chinese calves with diarrhea, contributing to the diagnosis and control of calf diarrhea in china. 
moreover, two complete btov genome sequences were obtained from the clinical samples, and these two btov isolates had unique amino acid changes in the s and he proteins. these findings will enhance our understanding of the epidemic and the genetic evolution of btov. nidovirales: a new order comprising coronaviridae and arteriviridae detection of bovine torovirus in fecal specimens of calves with diarrhea from ontario farms antibodies to berne virus in horses and other animals identification and characterization of a porcine torovirus du pasquier p (1984) an enveloped virus in stools of children and adults with gastroenteritis that resembles the breda virus of calves the complete sequence of the bovine torovirus genome enteric and nasal shedding of bovine torovirus (breda virus) in feedlot cattle detection and characterization of bovine torovirus from the respiratory tract in japanese cattle studies with an unclassified virus isolated from diarrheic calves detection of bovine torovirus in fecal specimens from calves with diarrhea in turkey detection of bovine torovirus in neonatal calf diarrhoea in lower austria and styria (austria) detection of bovine torovirus in fecal specimens of calves with diarrhea in japan the significance of bredavirus as a diarrhea agent in calf herds in lower saxony detection and molecular characterisation of bovine corona and toroviruses from croatian cattle first detection and molecular diversity of brazilian bovine torovirus (btov) strains from young and adult cattle molecular epidemiology of bovine toroviruses circulating in south korea infectious agents associated with diarrhoea of calves in the canton of tilarán phylogenetic and evolutionary relationships among torovirus field variants: evidence for multiple intertypic recombination events breda virus-like particles in calves in south africa molecular characterization and phylogenetic analysis of the genome of porcine torovirus bovine torovirus (breda virus) revisited structure, 
function, and evolution of coronavirus spike proteins structural basis for ligand and substrate recognition by torovirus hemagglutinin esterases bovine torovirus: sequencing of the structural genes and expression of the nucleocapsid protein of breda virus an interaction between the nucleocapsid protein and a component of the replicase-transcriptase complex is crucial for the infectivity of coronavirus genomic rna critical role of cellular cholesterol in bovine rotavirus infection molecular investigation of bovine viral diarrhea virus infection in yaks (bos gruniens) from qinghai, china detection and molecular characteristics of neboviruses in dairy cows in china prevalence of a novel bovine coronavirus strain with a recombinant hemagglutinin/esterase gene in dairy calves in china prevalence and complete genome of bovine norovirus with novel vp1 genotype in calves in china reverse transcription-pcr assays for detection of bovine enteric caliciviruses (bec) and analysis of the genetic relationships among bec and human caliciviruses case-control study of microbiological etiology associated with calf diarrhea a single amino acid change within antigenic domain ii of the spike protein of bovine coronavirus confers resistance to virus neutralization coronavirus spike proteins in viral entry and pathogenesis crystal structure of bovine coronavirus spike protein lectin domain genetic and antigenic characterization of newly isolated bovine toroviruses from japanese cattle characterization of epidemic diarrhea outbreaks associated with bovine torovirus in adult cows hemagglutinin-esterase, a novel structural protein of torovirus acknowledgements this study was supported by funding from the key: cord-300796-rmjv56ia authors: nan title: the signal sequence of the p62 protein of semliki forest virus is involved in initiation but not in completing chain translocation date: 1990-09-01 journal: j cell biol doi: nan sha: doc_id: 300796 cord_uid: rmjv56ia so far it has been 
demonstrated that the signal sequence of proteins which are made at the er functions both at the level of protein targeting to the er and in initiation of chain translocation across the er membrane. however, its possible role in completing the process of chain transfer (see singer, s. j., p. a. maher, and m. p. yaffe. proc. natl. acad. sci. usa. 1987. 84:1015-1019) has remained elusive. in this work we show that the p62 protein of semliki forest virus contains an uncleaved signal sequence at its nh2-terminus and that this becomes glycosylated early during synthesis and translocation of the p62 polypeptide. as the glycosylation of the signal sequence most likely occurs after its release from the er membrane our results suggest that this region has no role in completing the transfer process. biosynthesis of proteins at the er can be subdivided into several steps. these are (a) targeting of translation complexes to the er membrane; (b) synthesis and transfer (translocation) of the polypeptide chain across the lipid bilayer; and (c) protein maturation in the lumen of er (chain folding, disulphide bridge formation, glycosylation, and oligomerization). the mechanisms for these processes have been studied extensively during recent years (kornfeld and kornfeld, 1985; wickner and lodish, 1985; rapoport, 1986; lodish, 1988; rothman, 1989) . a most important finding has been that all proteins made at the er carry a signal sequence (also called signal peptide), a hydrophobic peptide which is usually located at the nh2-terminal region of the polypeptide chain. one function of the signal peptide is to achieve targeting of the polysome to the er membrane (rapoport, 1986) . when the signal sequence emerges from the ribosome it binds to the signal recognition particle, which mediates binding of the polysome to the docking protein in the er. 
after this another function of the signal sequence is expressed, that is to interact with some components of the er membrane and thereby initiate translocation of the polypeptide chain into the lumen of the er (gilmore and blobel, 1985; robinson et al., 1987; wiedmann et al., 1987) . further synthesis of the polypeptide then continues with concomitant chain translocation. an important but as yet unresolved question is whether the signal sequence has any role in the translocation process per se or whether its functions are limited to the targeting and translocation-initiation steps. for instance, singer and co-workers (1987a) have suggested a translocator protein model in which the signal sequence helps to keep the machinery open for chain transfer.
danny huylebroeck's present address is innogenetics, industriepark 7, box 4, b-9710, ghent, belgium.
it is specifically this last question we have addressed in the present work. we describe the characteristics and behavior of the uncleaved signal sequence of the p62 protein of semliki forest virus (sfv) upon translocation across the er membrane in vitro. the p62 protein is one subunit of the heterodimeric spike protein of the sfv membrane (reviewed in garoff et al., 1982) . it is made as a precursor protein together with the other structural proteins of sfv, i.e., the nucleocapsid protein, c, and the other spike subunit, e1. the three proteins are synthesized from a 4.1-kb long mrna in the order c, p62, and e1, and separated by cleavage of the growing precursor chain. during synthesis of the p62 polypeptide at the er all but a 31 residue cooh-terminal portion and the membrane anchor is translocated across the membrane. the p62 signal sequence has so far been only roughly localized to the nh2-terminal third of the polypeptide chain (garoff et al., 1978; bonatti et al., 1984) . we show here that the signal sequence of p62 consists of a 16 residue peptide at its nh2-terminal region. 
this region includes one out of four glycosylation sites (asn~3) for n-linked oligosaccharide on the p62 chain. we also demonstrate that the glycosylation of the p62 signal sequence occurs early during chain translocation. as this modification of the signal region most likely correlates with its release into the lumen of er it follows that the signal sequence of p62 is probably only needed for an initial step in chain translocation and not to
small scale plasmid dna preparations were done using the alkali-sds method essentially as described by birnboim and doly (1979) . large quantities of plasmids to be used for in vitro transcription were prepared by lysozyme-triton lysis of the bacteria, followed by cscl-etbr banding (kahn et al., 1979) . etbr was removed by several extractions with isopropanol and, after fivefold dilution, the dna was precipitated twice with ethanol and further purified over a biorad a-50m column. restriction endonucleases and dna-modifying enzymes were used according to the suppliers' instructions. removal of the 3' sticky end from the sac i site in pgem2-alphag (zerial et al., 1986) with t4 dna polymerase was done at 15°c (2 h), dntps were added (end concentration 100 μm each), and the dna was subsequently filled in at 15°c for 1 h. all ligations were done at 24°c for 4 h except for linker ligations (4°c, 16 h). all other molecular biological manipulations were done using slightly modified standard protocols (maniatis et al., 1982) . in vitro transcription (0.3 μg supercoiled template dna per 10 μl vol) in the presence of sp6 rna polymerase (6-8 u) and the cap structure was carried out as previously described (zerial et al., 1986) . in vitro translation reactions using a rabbit reticulocyte lysate were performed at 30°c essentially as described . 1.5 μl of the in vitro synthesized rna was translated in a total volume of 15 μl. potassium, magnesium, and spermidine concentrations were 100, 1.2, and 0.375 mm, respectively. 
when indicated, 1 µl of er membranes was included. in some translocations the membranes were pretreated with 200 µM peptide for 5 min on ice. the final peptide concentration in the total translation mixture was, after addition of the pretreated membranes, adjusted to 100 µM. to obtain partial synchronization of translation, ata was added after a preincubation of 1.5-3.0 min (borgese et al., 1974). a final ata concentration of 0.075 mM was found to be sufficient for inhibiting initiation of chain synthesis (see control in fig. 6, lane 1). higher concentrations of ata inhibited first translocation and then also chain elongation. for protease protection experiments, proteinase k was added to a final concentration of 0.1 mg/ml and the samples were incubated at 0°c for 30 min in the presence or absence of 1% triton x-100. proteolysis was stopped by the addition of pmsf (final concentration 2 mg/ml) and samples were kept at 0°c for 5 min before further processing for electrophoresis (cutler and garoff, 1986). bands containing labeled protein were visualized by fluorography. quantitation of proteins was done by cutting the bands out of the dried gel, solubilizing them with protosol (from dupont de nemours, nen) according to the instructions of the manufacturer, and finally counting the 35s radioactivity in a liquid scintillator (wallac lkb, turku, finland). the localization of the bands on the dried gel was done with the aid of the fluorograph in transillumination. 15-µl translation mixtures were adjusted to ph 11-11.5 by adding an appropriate volume (pretitrated) of 0.1 n naoh. after a 10-min incubation on ice the samples were separated into a pellet fraction and a supernatant fraction by centrifugation through a 100-µl alkaline sucrose cushion (gilmore and blobel, 1985) for 10 min at 30 psi in an airfuge (beckman instruments, inc., palo alto, ca) using the a-100/30 rotor and cellulose propionate tubes precoated with bsa (1% solution).
the entire supernatant was removed, neutralized with 1 n hcl, diluted 2.5 times with water, and then precipitated by adding 3.5 vol of acetone. these precipitated proteins and pelleted membranes (obtained from the airfuge tube) were taken up in 4% sds by incubating at 56°c for 15 min and then processed for immunoprecipitation reactions as described below. total translation mixtures were adjusted to 4% sds, then boiled for 4 min and diluted 1:2 with water. 4 vol of immunoprecipitation buffer (2.5% triton x-100, 190 mM nacl, 60 mM tris-hcl, ph 7.4, 6 mM edta, and 20 µg pmsf/ml) and 2 µl of antibody were added for 16 h at 4°c. the mixture was briefly centrifuged (2-3 min in an eppendorf minifuge) and to the supernatant one fifth volume of a 1:1 slurry of protein a beads was added and incubated at 24°c for 2 h under constant agitation. the beads were collected and washed four times with 1 ml ripa buffer (gielkens et al., 1976) by centrifugation, followed by a single wash with a buffer containing 150 mM nacl, 10 mM tris-hcl ph 7.4, and 20 µg pmsf/ml. the beads were then taken up in excess gel loading buffer (cutler and garoff, 1986), heated at 95°c for 5 min, and cleared by centrifugation before loading the immunoprecipitate on the gel.

constructions of pgem2alphagx and pgem2dhfrx. for the construction of the final fusion protein-coding plasmids used in this study we first had to make plasmids pgem2alphagx, which is derived from pgem2alphag, and pgem2dhfrx, which is derived from pgem2dhfr (zerial et al., 1986). plasmid pgem2alphag contains a 548 bp-long nco i-pst i fragment encompassing the entire chimpanzee alpha-globin coding region between the hinc ii and pst i sites of the polylinker of the plasmid pgem1 (promega biotech). the nco i site contains the translation initiation codon from alpha-globin (zerial et al., 1986). an xho i site, allowing subsequent in-frame ligations of sfv sequences, had to be introduced in pgem2alphag.
therefore, this plasmid was cut (upstream of the nco i site) with sac i, the 3' sticky ends removed with t4 dna polymerase, an xho i octamer linker introduced and, after cutting with xho i, the plasmid was religated at low dna concentration (1 µg/ml). plasmid pgem2alphagx then contains the 2,057 bp-long xho i-pvu i fragment needed for the construction of the fusion protein-coding plasmid pc62alphag. an intermediate construct, analogous to pgem2alphagx, and also containing a unique xho i site, was needed for the constructions of dhfr-containing plasmids. for this purpose we inserted the xho i linker into partially xmn i cut pgem2dhfr (zerial et al., 1986). after cutting the linkers, linear plasmid was purified on agarose gel and religated. since the second xmn i site in pgem2dhfr is located in the beta-lactamase coding region of the vector (sutcliffe, 1979) and insertion of an xho i site by an octamer linker will result in an ampicillin-sensitive e. coli phenotype after transformation, only the desired pgem2dhfrx construct was obtained. from this plasmid, an xho i-pvu i fragment of at least 2,012 bp (the precise length of the cdna insert, i.e., the length of the 3' untranslated region of dhfr, is not known in pgem2dhfr) was used for the construction of pc62dhfr.

construction of the fusion protein-coding plasmids pc62alphag and pc62dhfr. plasmid pgem1-sfv (also called pg-sfv-15/5; melancon and garoff, 1986) contains a reengineered cdna copy of the sfv 26s mrna sequences cloned as a bam hi fragment in the bam hi site of the polylinker downstream of the sp6 promoter in the plasmid pgem1 (promega biotech). from the sfv plasmid, a 2,381 bp-long pvu i-xho i fragment, containing the coding sequences for the capsid protein and the nh2-terminal region of the p62 protein, was isolated. the xho i-pvu i fragments from pgem1-sfv, pgem2alphagx and pgem2dhfrx were isolated and ligated at a 1:1 molar ratio to obtain pc62alphag and pc62dhfr, respectively.
plasmid dnas from ampicillin-resistant colonies were screened and compared to the starting vectors by restriction analysis. altogether, the sfv-alpha-globin cdna fusion results in a complete c region and 40 codons from the 5' end of the p62 region fused to the whole of the alpha-globin coding sequence (see fig. 1). eight new codons have been introduced at the point of cdna fusion. in the sfv-dhfr construction the c region and the 40 first codons of p62 are fused to the dhfr coding sequence such that one new codon is introduced and the first 31 codons of dhfr are lost.

construction of plasmid p62dhfr. for engineering of a p62 protein signal sequence-dhfr fusion protein which is not derived from a c protein-containing precursor we synthesized the whole p62 signal sequence region. two overlapping oligonucleotides were made (dna-synthesizer; applied biosystems, foster city, ca): (1) 5' atacacagaattcagcaccatgt-ccgccccgctgattac tgccatgtgtgtcctiv~caatc_~tacct-tcccgtc~ttccagcccccgtgtgtacc~, (2) 5' gttatcct-cgagcatccgtagtgtggcctctgcgttgttttcatagcagca-aggtacacacgggggc tggaagcac gggaaggtagcattgcjca-aggac. they correspond to both strands of the p62 signal sequence region of the sfv cdna. together they span the coding region of amino acid residues 1-40 of p62. oligo 1 (the coding strand) includes, in addition, the region coding for initiator methionine of the c protein plus its 5' flanking sequences (5' agcaccatg). at the extreme 5' end of this oligo we have added the recognition sequence for eco ri and its flanking sequences from the 5' end of the structural part of the sfv cdna (5' atacacagattc). oligo 2 ends at its 3' end with the xho i site which follows the signal sequence region on the p62 gene. the two oligonucleotides were hybridized (51 complementary bases), filled in using sequenase (united states biochemical co., cleveland, oh) and restricted with eco ri and xho i.
the resulting dna fragment was then purified and inserted into pcp62dhfr in place of the c and p62 sequences. for this purpose the pcp62dhfr plasmid was eco ri and xho i restricted and the plasmid part with the dhfr sequences isolated. the resulting plasmid p62dhfr thus contains the coding sequences for the initiator methionine of c and the first 40 residues of the p62 protein, including the signal sequence, in front of the dhfr gene (see fig. 1).

construction of pgem sfv d-4. this plasmid was constructed by ligating three fragments together. the first one was the major part of pgem1, cut just after the promoter region with hind iii and bam hi. the second fragment (hind iii-xho i) was isolated from the plasmid psvs-sfv. this fragment contains the sequences encoding the capsid and the nh2-terminal part of the p62 protein of sfv. the third fragment was obtained by cleaving plasmid pl1 sfv d-4 (see below) with xho i and bam hi and isolating the fragment containing the 3' part of the coding sequence for the p62 protein. however, it should be noted that in the d-4 version there is an exchange of 15 codons at the 3' end of the p62 gene for six aberrant ones. the corresponding p62 protein variant is called p62 d-4 (see fig. 1). it should also be mentioned that pl1 sfv d-4 has been derived from pl1 sfv d-9 (cutler and garoff, 1986) by exchanging the xho i-cla i region containing the 3' part of the p62 coding region with the similar fragment from psv2 sfv d-4. this latter plasmid is described in garoff et al. (1983).

to define the p62 signal sequence we have studied the translocation phenotype of two reporter molecules, the rabbit alpha-globin and the mouse dihydrofolate reductase (dhfr), both of which have been extended at their nh2-termini with an nh2-terminal 40 residue peptide from p62. the hybrid molecules were tested in a microsome-supplemented in vitro translation system.
the alpha-globin and the dhfr have earlier been shown to be translocation incompetent if not extended with a heterologous signal sequence at their nh2-termini (zerial et al., 1986). we first tested the expression of in vitro-made rna from the construction pcp62dhfr in an in vitro translation system. this would be expected to yield free c protein and p62-reporter hybrid (p62-dhfr) through c-catalyzed autoproteolytic cleavage of the nascent c-p62-reporter precursor (fig. 1) (aliperti and schlesinger, 1978; hahn et al., 1985; melancon and garoff, 1987). furthermore, the p62-reporter hybrid should be translocated across microsomal membranes and possibly glycosylated at asn13 of the p62 sequence if the 40 residues long nh2-terminal p62 peptide carries a signal sequence.

is shown to be linked to asn residue 13 in the p62 part (garoff et al., 1982). additional amino acids resulting from in frame translation of the multicloning region of pgem2 and the added xho i linker as well as the initiator met of p62dhfr are also indicated.

analysis showing the translocation activity of the p62-dhfr protein.

in the absence of membranes (lane 1) two major protein species were translated from the sp6-directed transcript. one of these had the expected size of c (33 kd) and the other one that of the p62-dhfr hybrid molecule (21 kd). the coding region has apparently been translated faithfully and the precursor protein cleaved efficiently. the identity of the p62-dhfr was directly proven by immunoprecipitation with a dhfr specific antiserum (see fig. 3). the two weaker bands migrating faster than the capsid in fig. 2, lane 1 were most likely derived from c coding sequences because they are found in all protein analyses of in vitro transcription/translation mixtures involving cdnas with c regions (compare fig. 6).

figure 3. immunological identification of the p62-dhfr hybrid protein and analysis of its association with membranes. rna transcribed from pc62dhfr was translated in vitro in the presence of membranes which in some cases had been treated with an acceptor (acc) or nonacceptor (non) peptide. the samples were treated, after translation, at ph 11-11.5, and the proteins then separated into a membrane-bound pellet fraction (p) and a supernatant fraction (s) by centrifugation. in all samples the p62-dhfr polypeptides were isolated using an anti-dhfr antibody. the proteins were then analyzed by sds-page (10%) and subsequent autoradiography. the slower migrating band corresponds to glycosylated and the faster one to nonglycosylated forms of p62-dhfr (compare fig. 3).

when microsomes were added to the c-p62-dhfr in vitro translation system a new band appeared which migrated somewhat slower than the p62-dhfr band seen in the analysis of the mixture lacking membranes (fig. 2, lane 2). it almost comigrated with one of the two weak c derived bands. the new band apparently corresponds to p62-dhfr hybrids that have been translocated into the lumen of the added microsomes and have become glycosylated. the immunoprecipitation analysis shown in fig. 3 confirmed the identity of this material. the protease digestions in the absence (fig. 2, lane 3) and presence of triton x-100 (lane 4) clearly demonstrated that the slower migrating p62-dhfr molecules were indeed translocated. about half of this material remains protected in the presence of intact microsomes whereas all is digested when the membranes are solubilized with detergent. in contrast, the other translated material did not show such a pronounced membrane-dependent protease resistance. note that protease treatment of all samples yielded a resistant protein of a small size. this most likely represents a protease-resistant c fragment.
the glycosylation of the translocated p62-hybrid and its effect on the apparent size of the protein was shown in an experiment where a short peptide (asn-leu-thr), which competes for n-linked glycosylation, was included during translation. apparently only unglycosylated, faster migrating p62-dhfr hybrids were formed under these conditions, although chain translocation took place, conferring protease resistance (fig. 2, lanes 5-7). additional analyses (lanes 8-10) illustrate that a control peptide (asn-leu-athr), which cannot serve as an acceptor site for n-linked glycosylation, had no effect on the glycosylation of the p62-reporter hybrids when tested in an analogous way. studies similar to those with pcp62dhfr were also performed with the pcp62globin coded proteins in vitro. the results (not shown) were analogous to those described above for the pcp62dhfr construct. c protein and p62-globin hybrid were synthesized in the absence of membranes. when membranes were added, a protease-protected form of the hybrid appeared. this hybrid was also glycosylated as deduced from an experiment involving the acceptor peptide for glycosylation. fig. 3 (lanes 1-6) shows the results of analyses in which we have tested whether the p62 signal sequence region confers stable membrane attachment to the p62-dhfr hybrid. microsome-supplemented translations were adjusted to ph 11-11.5 with naoh, incubated on ice for 10 min, and then separated into a membrane pellet and supernatant fraction by ultracentrifugation. in all samples the p62-dhfr polypeptides were isolated using an anti-dhfr antibody. sds-page shows that the hybrid protein segregates almost quantitatively into the supernatant fraction (compare lane 1 with lane 2). under similar conditions an integral membrane protein, the human transferrin receptor, was found to sediment with the membranes into the pellet fraction and a secretory protein, ig light chain, was only recovered in the supernatant (not shown).
if the acceptor peptide for glycosylation was included in the in vitro translation and the mixture then analyzed we found that the now unglycosylated but still translocated p62-dhfr hybrids were again mostly found in the supernatant fraction (lanes 3 and 4). lanes 5 and 6 show the analyses with the control peptide. to see whether the c protein exerts an influence on the translocation phenotype of the p62-dhfr protein the p62dhfr plasmid (see fig. 1), lacking the c gene, was tested. fig. 4 shows clearly that the p62-dhfr hybrid is translocated and glycosylated in the same way as when expressed from pcp62dhfr. thus, apart from providing a free nh2-terminal end to the p62-dhfr protein by autoproteolysis of the c-p62-dhfr precursor, the c protein has no role in the translocation process. we conclude that the 40 residue peptide from the p62 nh2-terminal region confers a translocation positive phenotype to the p62-globin and p62-dhfr polypeptides and therefore must contain a functional signal sequence. the translocated fusion proteins were also shown to be glycosylated. this must involve asn13 of the p62 peptide as it is part of the only potential glycosylation site on the hybrid polypeptides (garoff et al., 1980; references on dhfr sequence in legend to fig. 1). finally, we can also conclude that the p62 signal sequence does not provide a stable membrane anchor to the translocated chain. to define at what time point during p62-dhfr chain synthesis the asn13 becomes glycosylated we performed a time-course experiment essentially as described by rothman and lodish (1977) (fig. 5). in this experiment a 150-µl translation was initiated. after 1.5 min ata was added (0.075 mM) to block additional starting of chain synthesis. then, at 0.5-min intervals, two 7.5-µl aliquots were removed; one for mixing with 40 µl of hot page sample buffer (2% sds) and the other one for further incubation after mixing with 0.75 µl of 20% tx-100.
the first sample from each time point was used for the determination of the time needed for chain completion, which is a function of the translation rate, and the other one allowed determination of the time course of glycosylation of the translocated chain. triton x-100 solubilizes the microsomal membranes and thereby inactivates glycosylation (but not chain elongation). therefore, only those p62-dhfr chains that have presented asn13 to the glycosylation machinery before tx-100 addition have had the possibility to become glycosylated. in fig. 5, lanes 1-10, one can see that completed p62-dhfr chains (197 residues with initiator met) appear after a 3-min incubation from the time point of ata addition.

figure 5. time course of p62-dhfr glycosylation. a 150-µl translation was initiated. after a preincubation time of 1.5 min ata was added to inhibit further initiation of chain synthesis. then, at intervals of 0.5 min (indicated by 0.5', 1.0', 1.5', 2.0', 2.5', 3.0', 3.5', 4.0', 4.5', and 5.0') two 7.5-µl samples were removed, one for mixing with page sample buffer and another one for mixing with tx-100 (final concentration 1%) and further incubation at 30°c (for a total time of 20 min after ata addition as indicated by the lower row of time points in the figure). lanes 1-10 show the samples removed for mixing with the page sample buffer. from these results the approximate rate of translation can be derived. completed chains appear in the 3-min sample. lanes 11-20 show the samples in which the membranes have been solubilized with triton x-100 for inactivation of the glycosylation machinery. from these analyses it is possible to estimate when asn13 is modified during p62-dhfr synthesis. the first glycosylated forms are clearly visible in the 1.5-min sample, a time point where only about half of the p62-dhfr chain has been synthesized. the nature of the material in the two weak bands seen in lanes 1-5 is unclear. their transient appearance before the completion of the p62-dhfr chain suggests that they represent complexes of nascent p62-dhfr chains.

figure 6. time course of p62 d-4 glycosylation. seven translations in the presence of microsomal membranes were started in parallel. after a 3-min initial incubation at 30°c, ata was added in order to inhibit further initiation of chain synthesis. incubation was then continued for 35 min. at the indicated time points (5, 10, 15, 20, 25, 30, and 35 min) tx-100 (tx) was added to stop further chain glycosylation. at the same time one half of each sample was removed and put on ice in order to measure the extent of chain elongation at each time point. all samples were analyzed by sds-page (10%) and autoradiography. lanes 3-9 show the analysis of the samples incubated with tx-100 and lanes 10-16 the analysis of the portions put on ice at the different time points. the complete sequence of treatments for each sample is indicated by the labeling in each lane (upper row of time points indicate tx-100 addition and cooling on ice, respectively; lower row of time points indicate incubation in the presence of tx-100). lanes 1 and 2 represent controls. in the experiment shown in lane 1, ata was added before starting a 40-min membrane-supplemented translation. in the experiment shown in lane 2 a translation with membranes was allowed to proceed for 40 min. ata was added as in the time course samples but tx-100 was omitted. the c protein, the unglycosylated (p62) and the glycosylated (gp62) forms of p62 d-4 are labeled at right in the figure. arrowheads at left indicate (from above) the migration of the 53-kd igg heavy chain, the 46-kd ovalbumin, and the 30-kd carbonic anhydrase. note that somewhat different amounts of translation mixtures have been analyzed in the various lanes (compare intensities of c and c-derived bands).

if one assumes constant chain initiation during the 1.5-min preincubation without ata then the total time for chain synthesis is ~3.75 min (3 + 0.75 min).
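the arithmetic behind this estimate, and behind the rate and chain-length figures that the text derives from it, can be reproduced with a short calculation. the sketch below is only an illustration: the variable and function names are ours, the input values are those quoted in the text, and the 60-residue span of ribosome plus bilayer is the approximate figure cited in the text rather than a measured quantity.

```python
# Values quoted in the text for the p62-dhfr time course.
CHAIN_LENGTH = 197              # residues, including the initiator Met
COMPLETION_AFTER_ATA = 3.0      # min after ATA addition when full chains appear
GLYCOSYLATION_AFTER_ATA = 1.5   # min after ATA when glycosylated forms appear
PREINCUBATION = 1.5             # min of translation before ATA addition
SPAN_RIBOSOME_MEMBRANE = 60     # approx. residues spanning ribosome and bilayer


def mean_synthesis_time(time_after_ata, preincubation=PREINCUBATION):
    """Mean elongation time if chains initiate at a constant rate during the
    preincubation, so that the average chain started halfway through it."""
    return time_after_ata + preincubation / 2.0


# Total time needed to complete the chain (min).
total_time = mean_synthesis_time(COMPLETION_AFTER_ATA)

# Mean translation rate (residues per min; the text quotes peptide bonds per min).
rate = CHAIN_LENGTH / total_time

# Chain length reached when the first glycosylated forms are seen.
length_at_glycosylation = rate * mean_synthesis_time(GLYCOSYLATION_AFTER_ATA)

# Residues that can have emerged into the er lumen at that point.
lumenal_residues = length_at_glycosylation - SPAN_RIBOSOME_MEMBRANE

print(f"total synthesis time:       {total_time:.2f} min")
print(f"mean translation rate:      {rate:.1f} residues/min")
print(f"length at glycosylation:    {length_at_glycosylation:.0f} residues")
print(f"residues in the lumen:      {lumenal_residues:.0f}")
```

changing `PREINCUBATION` or the sampling times shows how sensitive the estimate is to the assumption of constant initiation during the preincubation.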
this corresponds to a mean translation rate of 52.5 peptide bonds per min. lanes 11-20 show that glycosylated chains appear in all those samples that have had the membranes intact for 1.5 min or more after ata addition. this means that p62-dhfr chains that have been elongated for ~2.25 min (1.5 min incubation after and 0.75 min before ata addition), to the length of ~118 residues, already carry a sugar unit at asn13. as ~60 residues of the nascent chain are required to span the ribosome and the lipid bilayer we conclude that glycosylation occurs when the first 50-60 residues of p62-dhfr appear within the lumen of the er (malkin and rich, 1967; blobel and sabatini, 1970; bergman and kuehl, 1977; smith et al., 1978; glabe et al., 1980; randall, 1983). we also studied the timing of the glycosylation of asn13 in its normal background, i.e., during p62 chain synthesis. for this experiment we used the pgem sfv d-4 construct. this encodes the c and the p62 membrane protein variant, p62 d-4, in which a few residues of the cytoplasmic protein domain have been exchanged as compared to the wild type sequence (see materials and methods and fig. 1). fig. 6, lane 2, shows that rna, which has been transcribed from this construct, directs the synthesis of c and p62 d-4 chains. the c protein has catalyzed correct c-p62 cleavage and the p62 signal sequence has catalyzed the insertion of about half of the p62 d-4 chains across the added microsomal membranes. these migrate as glycosylated 60-58 kd proteins in contrast to the noninserted molecules which have an apparent molecular mass of ~52 kd. the glycosylated and translocated nature of the 60-58 kd material was clearly demonstrated in experiments similar to those described above for the p62-dhfr hybrid molecule (not shown). altogether there are four glycosylation sites within the p62 d-4 sequence. these correspond to asn residues at positions 13, 60, 266, and 328 (see fig. 1). fig.
6 (lanes 3-16) shows the time course of the four glycosylation events during c-p62 d-4 translation. a slightly different protocol was followed in this experiment as compared to that with p62-dhfr. seven translations were initiated in parallel and after a 3-min incubation these were put on ice and ata was added. elongation of the already initiated chains was then continued for a total of 35 min, with triton x-100 added to individual samples at 5, 10, 15, 20, 25, 30, and 35 min. at these time points half of each sample was also removed and translation stopped by cooling on ice. lanes 3-9 show the sds-page of the samples that had received triton x-100 at different time points. we found the sequential appearance of p62 d-4 polypeptides with no carbohydrate (lane 3), with one and two units added (seen as two new bands with slower migration in lanes 4 and 5), with three units (lane 6), and all four sugar units (lanes 7, 8, and 9) attached to the protein backbone as the translation proceeded coordinately with time. note that the four glycosylation events result in different degrees of increase of the size of p62 d-4. the second event causes the largest increase and the third one the smallest. as the sugar unit added at each step should be the same we think that these differences reflect some conformational changes in the p62 folding which occur coordinately with glycosylation.

the journal of cell biology, volume 111, 1990

in lanes 10-16 we have analyzed the samples that were withdrawn at the different times but were kept on ice. as expected, we see a sequential appearance of first the capsid protein (in the 10-min sample) and then the p62 d-4 protein (barely visible in the 20-min sample). the p62 d-4 protein is partly present in its glycosylated and partly in its unglycosylated form.
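the time at which each of the four sites can first reach the er lumen can be roughly estimated from the length of the c-p62 d-4 chain (746 residues) and its approximate translation time (21.5 min), the two figures used in the text. the sketch below is a hedged illustration, not the authors' calculation: the names are ours, the length of the c protein (267 residues) is an assumption that is not stated in this section, and the 60-residue span of ribosome plus bilayer is carried over from the p62-dhfr experiment.

```python
# Figures taken from the text.
TOTAL_CHAIN_LENGTH = 746      # residues in the c-p62 d-4 chain
TRANSLATION_TIME = 21.5       # min, rough estimate of the translation time
PREINCUBATION = 3.0           # min of translation before ATA addition
SITES = (13, 60, 266, 328)    # Asn positions within p62 d-4

# ASSUMPTIONS, not given in this section of the text.
C_PROTEIN_LENGTH = 267        # residues of c preceding p62 in the precursor
SPAN_RIBOSOME_MEMBRANE = 60   # approx. residues spanning ribosome and bilayer

RATE = TOTAL_CHAIN_LENGTH / TRANSLATION_TIME  # residues per min


def earliest_sample_time(site, head_start=PREINCUBATION):
    """Earliest time after ATA addition (min) at which `site` could have
    emerged into the lumen, for chains initiated `head_start` min before
    ATA was added."""
    residues_needed = C_PROTEIN_LENGTH + site + SPAN_RIBOSOME_MEMBRANE
    return residues_needed / RATE - head_start


def available_sites(sample_time, head_start=PREINCUBATION):
    """Sites that could already be exposed in a sample taken at `sample_time`."""
    return [s for s in SITES if earliest_sample_time(s, head_start) <= sample_time]


for t in (5, 10, 15, 20):
    print(f"{t:>2}-min sample: sites {available_sites(t)}")
```

under these assumptions only asn13 and asn60 come out as available in the 10-min sample; passing a smaller `head_start` (for example 1.5, the mean for chains initiated during the preincubation) shifts all estimates later.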
using 21.5 min as a rough estimate for the translation time of the 746 residue long c-p62 d-4 chain (time point of p62 d-4 detection, 20 min, plus half of the 3-min preincubation time without ata) we have calculated the translation rate and derived the approximate earliest time points when the four glycosylation sites of p62 d-4 should be available for modification. according to these, asn13 and asn60 should be the only sites available for glycosylation in the 10-min sample, shown in lane 4, and the most abundant ones presented for modification in the 15-min sample, shown in lane 5. therefore, it appears reasonable to assume that those chains of these two samples which have obtained two sugar units carry these on the aforementioned two sites. thus, the peptide region with asn13 seems to be a target for rapid modification also when present in its normal background, that is, within the p62 protein.

the fact that the 40 residue fragment of the nh2-terminal region of the p62 protein is able to translocate two different reporter molecules into microsomes constitutes in our mind convincing evidence for signal sequence activity in this protein fragment. a more precise location of the p62 signal sequence within the 40 residue p62 fragment can be done with the aid of the known consensus features of a signal sequence. the most typical characteristic of a signal sequence is a stretch of 10-12 uncharged residues, mostly hydrophobic ones (von heijne, 1985). this part of the signal sequence probably forms an alpha helix in the er membrane (emr and silhavy, 1983; briggs et al., 1985, 1986; kendall et al., 1986; batenburg et al., 1988). the only possible candidate region within the 40 residue p62 fragment having these features is the 13 residue segment between pro 3 and pro 17 (see box in uppermost sequence in fig. 7).
the pro-rich region in the middle of the 40 residue fragment would not form an alpha-helix, and the cooh-terminal part of the p62 segment contains a high number of charged residues. as shown in fig. 7 these features are conserved in all those alphaviruses where the p62 protein has been sequenced. thus, we find the experimental results, together with the structural considerations discussed above, highly indicative that the 16 nh2-terminal residues of p62 constitute its signal sequence. eventually, the signal sequence of the p62 protein becomes translocated across the membrane of the er into its lumenal space. here it is found as a glycosylated peptide which is part of a 66 amino acid residues long "pro-piece" of the p62 protein. this pro-peptide, called e3, is cleaved at a late stage during virus assembly (de curtis and simons, 1988) and is then either released into the extracellular medium as a soluble protein (sindbis virus) or remains as a peripheral protein subunit on the virus spike (sfv) (garoff et al., 1982; mayne et al., 1984). our present tests of the p62-globin and p62-dhfr hybrids in the high ph wash assay of membrane supplemented in vitro mixtures also support the notion that the p62 signal sequence does not remain bound to the membrane where it has exerted its function as a translocation signal. in this work we use the glycosylation event at asn13 of the signal sequence to mark the time point when the latter becomes released into the lumen of the er. the crucial question then becomes whether it is reasonable to assume that the signal peptide has to be released from the er membrane before it can become glycosylated. to answer this question we have to consider what is known about the topology of glycosylation as well as the way by which the p62 signal might interact with the er membrane. today there is no exact information about how a signal sequence might be inserted into the er membrane when exerting its function in chain translocation.
however, the typical cytoplasmic orientation of the nh2-termini of membrane protein chains carrying a combined signal sequence-anchoring peptide suggests that signal sequences in general might direct their function in translocation through the insertion of their hydrophobic and uncharged stretch of amino acid residues into the membrane in such an orientation that the nh2-terminus of the signal remains on the outside of the er membrane (bos et al., 1984; lipp and dobberstein, 1986; spiess and lodish, 1986; zerial et al., 1986; see also shaw et al., 1988).

(garoff et al., 1980; rice and strauss, 1981; dalgarno et al., 1983; kinney et al., 1986; chang and trent, 1987). amino acid residues are given using the one letter code and they are numbered from the nh2- towards the cooh-terminus. the boxes indicate that region in each sequence which best fulfills the consensus features of a signal sequence (the uncharged and hydrophobic region). the * symbols represent attachment sites for oligosaccharide and the (+) and (-) the presence of a charged amino acid side chain. proline residues are labeled with a dot. the sequences are aligned according to maximum amino acid sequence homology.

in addition, it is known from physical studies using synthetic signal peptides and artificial lipid membranes that the signal peptides readily insert into the membrane and there obtain an alpha-helical conformation (briggs et al., 1985, 1986; batenburg et al., 1988; cornell et al., 1989). if the p62 signal sequence adopts such an orientation and conformation in the er membrane it would mean that the glycosylation site at asn13 would locate inside the membrane (von heijne, 1985). in this location the site can hardly be accessible for the glycosylation machinery.
(note that in the related ross river virus, the venezuelan equine encephalitis virus and eastern equine encephalitis virus the corresponding glycosylation site is even closer towards the nh2-terminus, that is at asn 11, see fig. 7.) according to several recent studies, glycosylation requires the exposure of the glycosylation site in the lumen of the er. firstly, it has been shown that the binding protein for the glycosylation site of n-linked oligosaccharides is a lumenal 57-kd protein of the er (geetha-habib et al., 1988). secondly, one study with the asialoglycoprotein receptor and another one with the coronavirus e1 membrane protein demonstrate that lumenally oriented glycosylation sites are not used on transmembrane polypeptides if they locate very close to the membrane-binding segments of the chains (mayer et al., 1988; wessels and spiess, 1988). in the case of the asialoglycoprotein receptor a site was not used if located 12 residues away from the membrane anchor; however, if moved 8 more residues away from the anchor it became glycosylated. in the case of the coronavirus protein a site just adjacent to the combined signal sequence-anchor peptide remained unglycosylated, whereas an engineered site 24 residues further away was used for glycosylation. such restrictions in glycosylation are most likely to be explained by steric problems in attaching the very spacious sugar unit (lee et al., 1984; see also wier and edidin, 1988) onto acceptor sites that are fixed in a position which is close to the membrane plane. therefore, we assume that the p62 signal sequence, with its glycosylation site at asn13, cannot become glycosylated before it has been released into the er lumen. as this glycosylation event was shown to occur at an early stage of chain translocation it follows that this signal sequence can only interact with the er membrane during the beginning of chain translocation.
in other words, the signal sequence of p62 can only function at the initiation stage of chain translocation and has no role in completing this transfer process. if it had such a role, we would have expected signal sequence glycosylation to occur only after all of the lumenal domain of the p62 d-4 chain had been translated and translocated. the importance of our results in this work lies in the fact that they rule out translocation models in which the signal sequence would have a role throughout the whole process of chain translocation. for instance, if the translocation site is represented by a multisubunit protein complex forming an aqueous channel across the membrane for chain transfer (see signal hypothesis, blobel and dobberstein, 1975; amphipathic tunnel hypothesis, rapoport, 1985; translocator protein hypothesis, singer et al., 1987a,b), then the signal sequence could be involved in its assembly or "opening" but apparently not in keeping it together or open until chain transfer is completed (as suggested in singer et al., 1987a). similarly, when considering models in which the chain transfer occurs directly through a lipid membrane (see the helical hairpin hypothesis, engelman and steitz, 1981; direct transfer model, von heijne and blomberg, 1979; phospholipid channel hypothesis, nesmayanova, 1982), the interaction of the signal sequence with the lipid bilayer could be of importance only at the stage of translocation initiation but not at the actual chain transfer step. we find it most unlikely that our results about p62 protein translocation are unique to the viral system and different from the general translocation process in the er. several results from this and earlier works suggest that the signal sequence of the p62 protein functions in much the same way as cleavable ones do.
firstly, studies with a temperature-sensitive mutant of sfv, ts3, have shown that the signal sequence of p62 requires a free nh2-terminal end for function (hashimoto et al., 1981). at the nonpermissive temperature the ts3 mutant is defective in cleavage between the c and the p62 protein region of the protein precursor because of a mutation that inactivates the autoproteolytic activity of c. this defect results in a translocation-negative phenotype for the p62 protein. secondly, the p62 signal sequence has been shown to be srp dependent. if the mrna for the structural proteins of sfv is translated in vitro in a wheat germ-derived system that is supplemented with salt-washed (and srp-deprived) membranes, then p62 translocation is observed only in the presence of exogenous srp (bonatti et al., 1984). if srp is supplemented without membranes, then p62 translation is arrested. thirdly, our time course study of p62 synthesis and glycosylation in this work clearly demonstrates that the p62 chain is translocated cotranslationally across the er membrane. this was also suggested by earlier studies in vitro (garoff et al., 1978; bonatti et al., 1984). in these studies it was shown that both microsomal membranes and srp have to be added to the synthesis mixture before extensive lengths (~100 amino acid residues) of the p62 chains have been translated. it is also possible to speculate on a mechanism in which the p62 signal sequence would be released from a putative translocation site by being replaced by another signal sequence-like structure in the p62 polypeptide. however, such a "rescue" mechanism appears improbable, as the p62 signal sequence was found to be glycosylated early during translation of both the p62 polypeptide and the signal sequence-dhfr hybrid chain.
evidence for an autoprotease activity of sindbis virus capsid protein characterization of the inteffacial behavior and structure of the signal sequence of escherichia coli outer membrane pore protein phoe addition of glucosamine and mannose to nascent immunoglobin heavy chains a rapid alkaline extraction procedure for screening recombinant plasmid dna transfer of proteins across membranes. i. presence of proteolytically processed and unprocessed nascent immunoglobin light chains on membrane-bound ribosomes of murine myeloma controlled proteolysis of nascent polypeptides in rat liver cell fractions. i. location of the polypeptides within ribosomes role of signal recognition particle in the membrane assembly of sindbis viral glycoproteins. fur ribosomemembrane interaction: in vitro binding of ribosomes to microsomal membranes nh2-terminal hydrophobic region of influenza virus neuraminidase provides the signal function in translocation in vivo function and membrane binding properties are correlated for escherichia coli lamb signal peptides conformations of signal peptides induced by lipids suggest initial steps in protein export phenotypic expression in e. coli ofa dna sequence coding for mouse dihydrofolate reductase nucleotide sequence of the genome region encoding the 26s mrna of eastern equine encephalomyelitis virus and the deduced amino acid sequence of the viral structural proteins conformations and orientations of a signal peptide interacting with phospholipid monolayers structure of amplified normal and variant dihydrofolate reductase genes in mouse sarcoma sis0 cells mutants of the membrane-binding region of semliki forest virus e2 protein. i. 
cell surface transport and fusogenic activity ross river virus 26 s rna: complete nucleotide sequence and decoded sequence of the encoded structural proteins dissection of semliki forest virus glycoprotein delivery from the trans-golgi network to the cell surface in permeabilized bhk cells importance of secondary structure in the signal sequence for protein secretion the spontaneous insertion of proteins into and across membranes: the helical hairpin hypothesis solid phase peptide synthesis assembly of the semliki forest virus membrane glycoproteins in the membrane of the endoplasmic reticulum in vitro nucleotide sequence of edna coding for semliki forest virus membrane glycoproteins structure and assembly of alphaviruses expression of semliki forest virus proteins from cloned complementary dna. ii. the membrane-spanning glycoprotein e2 is transported to the cell surface without its normal cytoplasmic domain glycosylation site binding protein, a component of oligosaccharyl transferase, is highly similar to three other 57 kd luminal proteins of the er synthesis of ranscher murine leukemia virus-specific polypeptides in vitro translocation of secretory proteins across the microsomal membrane occurs through an environment accessible to aqueous perturbants glycosylation of ovalbumin nascent chains: the spatial relationship between translation and glycosylation sequence analysis of three sindbis virus mutants temperature-sensitive in the capsid protein autoprotease evidence for a separate signal sequence for the carboxy-terminal envelope glycoprotein e1 of semliki forest virus preparation and use of nuclease-treated rabbit reticulocyte lysates for the translation of eucaryotic messenger rna dog pancreatic microsomalmembrane polypeptides analysed by two-dimensional gel electrophoresis plasmid cloning vehicles derived from plasmids cole1, f, r6k, and rk2 idealization of the hydrophobic segment of the alkaline phosphatase signal peptide nucleotide sequence of the 26 s mrna 
of the virulent trinidad donkey strain of venezuelan equine encephalitis virus and deduced sequence of the encoded structural proteins expression of semliki forest virus proteins from cloned complementary dna. i. the fusion activity of the spike glycoprotein assembly of asparagine-linked oligosaccharides binding of synthetic clustered ligands to the gal/galnac lectin on isolated rabbit hepatocytes structural and evolutionary analysis of the two chimpanzee alpha-globin mrnas signal recognition particle-dependent membrane insertion of mouse invariant chain: a membrane-spanning protein with a cytoplasmically exposed amino terminus transport of secretory and membrane glycoproteins form the rough endoplasmic reticulum to the golgi partial resistance of nascent polypeptide chains to proteolytic digestion due to ribosomal shielding molecular cloning: a laboratory manual membrane integration and intracellular transport of the coronavirus glycoprotein el, a class ill membrane glycoprotein biochemical studies of the maturation of the small sindbis virus glycoprotein e3 reinitiation of translocation in the semliki forest virus structural polyprotein: identification of the signal for the el glycoprotein processing of the semliki forest structural polyprotein: role of the capsid protease on the possible participation of acid phospholipids in the translocation of secreted proteins through the bacterial cytoplasmic membrane. febs (fed. fur structure and genomic organization of the mouse dihydrofolate reductase gene translocations of domains of nascent periplasmic proteins across the cytoplasmic membrane is independent of elongation extensions of the signal hypothesis-sequential insertion model versus amphipathic tunnel hypothesis. febs (fed. 
eur protein translocation across and integration into membranes improved plasmid vectors with a thermoinducible expression and temperature-regulated runaway replication nucleotide sequence of the 26s mrna of sindbis virus and deduced sequence of the encoded virus structural proteins identification of signal sequence binding proteins integrated into the rough endoplasmic reticulum membrane polypeptide chain binding proteins: catalysts of protein folding and related processes in cells synchronized transmembrane insertion and glycosylation of a nascent membrane protein evidence for the loop model of signal-sequence insertion into the endoplasmic reticulum on the translocation of proteins across membranes on the transfer of integral proteins into membranes nascent peptide as sole attachment of polysomes to membranes in bacteria an internal signal sequence: the asialoglycoprotein receptor membrane anchor complete nucleotide sequence of the escherichia coli plasmid pbr 322 subcellular location of enzymes involved in the n-glycosylation and processing of asparagine-linked oligosaccbarides in saccharomyces cerevisiae structural and thermodynamic aspects of the transfer of proteins into and across membranes trans-membrane translocation of proteins: the direct transfer model insertion of a multispanning membrane protein occurs sequentially and requires only one signal sequence multiple mechanisms of protein insertion into and across membranes a signal sequence receptor in the endoplasmic reticulum membrane constraint of the translational diffusion of a membrane glycoprotein by its external domains the transmembrahe segment of the human transferrin receptor functions as a signal peptide we thank ernst bause for constructive discussion; gunnar von heijne and michael baron for critical reading of the manuscript; johanna wahlberg for help with the figures; margareta berg, tuula marminen, and elisabeth servin for technical assistance; and ingrid sigurdson for typing. 
this work was supported by grants from the swedish medical research council (b88-12x-08272-01a), swedish natural science research council (b-bu 9353-300), and swedish national board for technical development (87-02750p). received for publication 7 march 1990 and in revised form 2 may 1990.

key: cord-263987-ff6kor0c authors: holmes, ian h. title: solving the master equation for indels date: 2017-05-12 journal: bmc bioinformatics doi: 10.1186/s12859-017-1665-1 sha: doc_id: 263987 cord_uid: ff6kor0c

background: despite the long-anticipated possibility of putting sequence alignment on the same footing as statistical phylogenetics, theorists have struggled to develop time-dependent evolutionary models for indels that are as tractable as the analogous models for substitution events. main text: this paper discusses progress in the area of insertion-deletion models, in view of recent work by ezawa (bmc bioinformatics 17:304, 2016); (bmc bioinformatics 17:397, 2016); (bmc bioinformatics 17:457, 2016) on the calculation of time-dependent gap length distributions in pairwise alignments, and current approaches for extending these approaches from ancestor-descendant pairs to phylogenetic trees. conclusions: while approximations that use finite-state machines (pair hmms and transducers) currently represent the most practical approach to problems such as sequence alignment and phylogeny, more rigorous approaches that work directly with the matrix exponential of the underlying continuous-time markov chain also show promise, especially in view of recent advances.

models of sequence evolution, formulated as continuous-time discrete-state markov chains, are central to statistical phylogenetics and bioinformatics. as descriptions of the process of nucleotide or amino acid substitution, their earliest uses were to estimate evolutionary distances [1], parameterize sequence alignment algorithms [2], and construct phylogenetic trees [3].
variations on these models, including extra latent variables, have been used to estimate spatial variation in evolutionary rates [4, 5]; these patterns of spatial variation have been used to predict exon structures of protein-coding genes [6, 7], foldback structure of non-coding rna genes [8, 9], regulatory elements [10], ultra-conserved elements [11], protein secondary structures [12], and transmembrane structures [13]. they are widely used to reconstruct ancestral sequences [14-22], a method that is finding increasing application in synthetic biology [16, 20-22]. trees built using substitution models are used to classify species [23], predict protein function [24], inform conservation efforts [25], or identify novel pathogens [26]. in the analysis of rapidly evolving pathogens, these methods are used to uncover population histories [27], analyze transmission dynamics [28], reconstruct key transmission events [29], and predict future evolutionary trends [30]. there are many other applications; the ones listed above were selected to give some idea of how influential these models have been.

correspondence: ihh@berkeley.edu, dept of bioengineering, university of california, 94720 berkeley, usa.

continuous-time markov chains describe evolution in a state space $\Omega$, for example the set of nucleotides $\Omega = \{a, c, g, t\}$. the stochastic process $\phi(t)$ at any given instant of time, $t$, takes one of the values in $\Omega$. let $p(t)$ be a vector describing the marginal probability distribution of the process at a single point in time: $p_i(t) = P(\phi(t) = i)$. the time-evolution of this vector is governed by a master equation

$$\frac{d}{dt} p(t) = p(t) R \qquad (1)$$

where, for $i, j \in \Omega$ and $i \neq j$, $R_{ij}$ is the instantaneous rate of mutation from state $i$ to state $j$. for probabilistic normalization of eq. 1, it is then required that

$$R_{ii} = -\sum_{j \neq i} R_{ij}.$$

the general solution to eq.
1 can be written $p(t) = p(0) M(t)$ where $M(t)$ is the matrix exponential

$$M(t) = \exp(Rt) \qquad (2)$$

entry $M_{ij}(t)$ of this matrix is the probability $P(\phi(t) = j \mid \phi(0) = i)$ that, conditional on starting in state $i$, the system will after time $t$ be in state $j$. it follows, by definition, that this matrix satisfies the chapman-kolmogorov forward equation:

$$M(t) M(u) = M(t + u) \qquad (3)$$

that is, if $M_{ij}(t)$ is the probability that state $i$ will, after a finite time interval $t$, have evolved into state $j$, and $M_{jk}(u)$ is the analogous probability that state $j$ will after time $u$ evolve into state $k$, then summing out $j$ has the expected result:

$$\sum_{j \in \Omega} M_{ij}(t) M_{jk}(u) = M_{ik}(t + u)$$

this is one way of stating the defining property of a markov chain: its lack of historical memory previous to its current state. equation 1 is just an instantaneous version of this equation, and eq. 3 is the same equation in matrix form.

the conditional likelihood for an ancestor-descendant pair can be converted into a phylogenetic likelihood for a set of extant taxon states $S$ related by a tree $T$, as follows. (i assume for convenience that $T$ is a binary tree, though relaxing this constraint is straightforward.) to compute the likelihood requires that one first computes, for every node $n$ in the tree, the probability $F_i(n)$ of all observed states at leaf nodes descended from node $n$, conditioned on node $n$ being in state $i$. this is given by felsenstein's pruning recursion:

$$F(n) = \begin{cases} \left( M(t_{nl}) F(l) \right) \circ \left( M(t_{nr}) F(r) \right) & \text{if } n \text{ is an internal node with children } l, r \\ \Delta(s_n) & \text{if } n \text{ is a leaf node in state } s_n \end{cases} \qquad (4)$$

where $t_{mn}$ denotes the length of the branch from tree node $m$ to tree node $n$. i have used the notation $\Delta(j)$ for the unit vector in dimension $j$, and the symbol $\circ$ to denote the hadamard product (also known as the pointwise product), defined such that for any two vectors $u, v$ of the same size: $(u \circ v)_i = u_i v_i$. supposing that node 1 is the root node of the tree, and that the distribution of states at this root node is given by $\rho$, the likelihood can be written as

$$P(S \mid T, R, \rho) = \rho \cdot F(1)$$

where $u \cdot v$ denotes the scalar product of $u$ and $v$. it is common to assume that the root node is at equilibrium, so that $\rho = \pi$.
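the matrix exponential, the chapman-kolmogorov property and the pruning recursion can be illustrated together on the smallest practical case. the following is a minimal sketch, assuming the jukes-cantor model (every off-diagonal rate equal to a hypothetical parameter `mu`), for which the matrix exponential has a well-known closed form; the tuple-based tree encoding and all function names are illustrative choices, not part of the source.

```python
import math

NUCS = "ACGT"

def jc69(t, mu=1.0):
    """Closed-form M(t) = exp(Rt) for Jukes-Cantor: R_ij = mu (i != j), R_ii = -3*mu."""
    e = math.exp(-4.0 * mu * t)
    same, diff = 0.25 + 0.75 * e, 0.25 - 0.25 * e
    return [[same if i == j else diff for j in range(4)] for i in range(4)]

def matmul(a, b):
    """Plain matrix product of two square matrices stored as nested lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def prune(node, mu=1.0):
    """Pruning recursion: return the vector F(n).

    A leaf is a one-letter string; an internal node is a tuple
    (left_child, left_branch_length, right_child, right_branch_length).
    """
    if isinstance(node, str):
        # Leaf: unit vector in the dimension of the observed state.
        return [1.0 if c == node else 0.0 for c in NUCS]
    left, t_left, right, t_right = node
    f_left, f_right = prune(left, mu), prune(right, mu)
    m_left, m_right = jc69(t_left, mu), jc69(t_right, mu)
    # Internal node: Hadamard product of the two child messages.
    return [
        sum(m_left[i][j] * f_left[j] for j in range(4))
        * sum(m_right[i][j] * f_right[j] for j in range(4))
        for i in range(4)
    ]

def likelihood(tree, root_dist=(0.25, 0.25, 0.25, 0.25), mu=1.0):
    """Scalar product of the root distribution with F(root)."""
    return sum(p * f for p, f in zip(root_dist, prune(tree, mu)))

if __name__ == "__main__":
    tree = (("A", 0.1, "C", 0.1), 0.2, "G", 0.3)
    print(likelihood(tree))
```

for a two-leaf tree with both leaves in state a and a uniform root distribution, the likelihood reduces to one quarter of the a-to-a entry of the transition matrix for the summed branch length, which is a convenient sanity check on both the recursion and eq. 3.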
as mentioned above, this mathematical approach is fundamental to statistical phylogenetics and many applications in bioinformatics. for small state spaces $\Omega$, such as (for example) the 20 amino acids or 61 sense codons, the matrix exponential $M(t)$ in eq. 2 can be solved exactly and practically by the technique of spectral decomposition (i.e. finding eigenvalues and eigenvectors). such an approach informs the dayhoff pam matrix. it was also solved for certain specific parametric forms of the rate matrix $R$ by jukes and cantor [1], kimura [31], felsenstein [3], and hasegawa et al. [32], among others. this approach is used by all likelihood-based phylogenetics tools, such as revbayes [33], beast [34], raxml [35], hyphy [36], paml [37], phylip [38], tree-puzzle [39], and xrate [40]. many more bioinformatics tools use the dayhoff pam matrix or other substitution matrix based on an underlying master equation of the form eq. 1. there exists a deep literature on markov chains, to which this brief survey cannot remotely do justice, but several concepts must be mentioned in order to survey progress in this area. a markov chain is time-homogeneous if the elements of the rate matrix $R$ in eq. 1 are themselves independent of time. if a markov chain is time-homogeneous and is known to be in equilibrium at a given time, for example $p(0) = \pi$, then (absent any other constraints) it will be in equilibrium at all times; such a chain is referred to as being stationary. the time-scaling of these models is somewhat arbitrary: if the time parameter $t$ is replaced by a scaled version $t/\kappa$, while the rate matrix $R$ is replaced by $R\kappa$, then the likelihood in eq. 2 is unchanged. for some models, the rate is allowed to vary between sites [4, 5]. a markov chain is reversible if it satisfies the instantaneous detailed balance condition $\pi_i R_{ij} = \pi_j R_{ji}$, or its finite-time equivalent $\pi_i M_{ij} = \pi_j M_{ji}$.
this amounts to a symmetry constraint on the parameter space of the chain (specifically, the matrix $S$ with elements $S_{ij} = \sqrt{\pi_i / \pi_j} \, R_{ij}$ is symmetric) which has several convenient advantages: it effectively halves the number of parameters that must be determined, it eases some of the matrix manipulations (symmetric matrices have real eigenvalues and the algorithms to find them are generally more stable), and it allows for some convenient manipulations, such as the so-called pulley principle allowing for arbitrary re-rooting of the tree [3]. from another angle, however, these supposed advantages may be viewed as drawbacks: reversibility is a simplification which ignores some irreversible aspects of real data, limits the expressiveness of the model, and makes the root node placement statistically unidentifiable. stationarity has similar advantages and drawbacks. if one assumes the process was started at equilibrium, that is one less set of parameters to worry about (since the equilibrium distribution is implied by the process itself), but it also renders the model less expressive and makes some kinds of inference impossible. the early literature on substitution models involved generalizing from rate matrices $R$ characterized only by a single rate parameter [1], to symmetry-breaking versions that allowed for different transition and transversion rates [31], non-uniform equilibrium distributions over nucleotides [3], and combinations of the above [32]. these models are all, however, reversible. a good deal of subsequent research has gone into the problem, in various guises, of generalizing results obtained for reversible, homogeneous and/or stationary models to the analogous irreversible, nonhomogeneous and nonstationary models. for examples, see [30, 41-43]. the question naturally arises: how to extend the model to describe the evolution of an entire sequence, not just individual sites?
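the reversibility condition discussed above is easy to test numerically. the sketch below is only an illustration: it builds an hky85-style rate matrix (off-diagonal rate proportional to the target frequency, multiplied by a hypothetical `kappa` for transitions), checks detailed balance, and confirms that rescaling each entry by the square root of the frequency ratio gives a symmetric matrix. a cyclic rate matrix is included as a simple irreversible counterexample; all names and parameter values are assumptions.

```python
import math

NUCS = "ACGT"
PURINES = {"A", "G"}

def hky_rate_matrix(pi, kappa):
    """HKY85-style rates: R_ij = pi_j * (kappa for transitions, 1 for transversions)."""
    n = len(NUCS)
    r = [[0.0] * n for _ in range(n)]
    for i, a in enumerate(NUCS):
        for j, b in enumerate(NUCS):
            if i != j:
                is_transition = (a in PURINES) == (b in PURINES)
                r[i][j] = pi[j] * (kappa if is_transition else 1.0)
        r[i][i] = -sum(r[i])  # each row of a rate matrix sums to zero
    return r

def cyclic_rate_matrix(n=4):
    """Irreversible example: state i can only move to state (i + 1) mod n."""
    r = [[0.0] * n for _ in range(n)]
    for i in range(n):
        r[i][(i + 1) % n] = 1.0
        r[i][i] = -1.0
    return r

def satisfies_detailed_balance(pi, r, tol=1e-12):
    """Check pi_i R_ij == pi_j R_ji for every pair of states."""
    n = len(pi)
    return all(abs(pi[i] * r[i][j] - pi[j] * r[j][i]) < tol
               for i in range(n) for j in range(n))

def symmetrized(pi, r):
    """S_ij = sqrt(pi_i / pi_j) * R_ij, symmetric exactly when the chain is reversible."""
    n = len(pi)
    return [[math.sqrt(pi[i] / pi[j]) * r[i][j] for j in range(n)] for i in range(n)]

def is_symmetric(s, tol=1e-12):
    n = len(s)
    return all(abs(s[i][j] - s[j][i]) < tol for i in range(n) for j in range(n))

if __name__ == "__main__":
    pi = [0.1, 0.2, 0.3, 0.4]
    r = hky_rate_matrix(pi, kappa=2.0)
    print(satisfies_detailed_balance(pi, r), is_symmetric(symmetrized(pi, r)))
```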
in such cases, when one talks about $M(t)_{ij}$ as "the likelihood of an ancestor-descendant pair $(i, j)$" (or, more precisely, the probability that, given the system starts in ancestral state $i$, it will after time $t$ have evolved into descendant state $j$) one must bear in mind that the states $i$ and $j$ now represent not just individual residues, but entire sequences. as long as the allowed mutations are restricted to point substitutions and their mutation rates are independent of flanking context, then the extension to whole sequences is trivially easy: one can simply multiply together probabilities of independent sites. however, many kinds of mutation violate this assumption of site independence; most notably context-dependent substitutions and indels, where the rates depend on neighboring sites. for these mutations the natural approach is to extend the state space to be the set of all possible sequences over a given alphabet (for example, the set of all dna or protein sequences). this state space is (countably) infinite; eqs. 1-4 can still be used on an infinite state space, but solution by brute-force enumeration of eigenvalues and eigenvectors is no longer feasible, except in special cases where there is explicit structure to the rate matrix that allows identification of the eigensystem by algebraic approaches [44, 45]. whole-sequence evolutionary models have proved quite challenging for theorists. there is extensive evidence suggesting that indels, in particular, can be profoundly informative to phylogenetic studies, and to applications of phylogenetics in sequence analysis [46-51]. the effort to unify alignment and phylogeny, and to build a theoretical framework for the evolutionary analysis of indels, has been dubbed statistical alignment by hein, one of its pioneers [52].
recent publications by ezawa [53-55] and rivas and eddy [56] have highlighted this problem once again, directly leading to the present review. in this paper i focus only on "local" mutations: mostly indel events (which may include local duplications), but also context-dependent substitutions. this is not because "nonlocal" events (such as rearrangements) are unimportant, but rather that they tend to defy phylogenetic reconstruction due to the rapid proliferation of possible histories after even a few such events [57]. the discussion here is separated into two parts. in the first part, i discuss the master equation (eq. 1) and exact solutions thereof (eq. 2), along with various approximations and their departure from the chapman-kolmogorov ideal (eq. 3). this is an area in which recent progress has been reported in this journal. in the second part, i review the extension from pairwise probability distributions to phylogenetic likelihoods of multiple sequences, using analogs of felsenstein's pruning recursion (eq. 4).

this section begins with various approaches to finding the time-dependent probability distribution of gap lengths in a pairwise alignment, under several evolutionary models. as an approach to models on strings of unbounded length, one can consider short motifs of $k$ residues. these can still be considered as finite state-space models; for example, a $k$-nucleotide model has $4^k$ possible states. several such models have been analyzed, including models on codons where $k = 3$ [47, 58], dinucleotides involved in rna basepairs where $k = 2$ [59-61], and models over sequences of arbitrary length $k$ [44, 62]. mostly, these models handle short sequences (motifs) and do not allow the sequence length to change over time (so they model only substitutions and not indels).
some of the later models do allow the sequence length to change via insertions or deletions [62], though these models have not yet been analyzed in a way that would allow the computation of alignment likelihoods for sequences of realistic lengths. it is a remarkable reflection on the extremely challenging nature of this problem that, to date, the only exactly solved indel model on strings is the tkf91 model, named after the authors' initials and date of publication of this seminal paper [63]. while there has been progress in developing approximate models in the 25 years since the publication of this paper, and in extending it from pairwise to multiple sequence alignment, it remains the only model for which
1. the state space is the set of all sequences (strings) over a finite alphabet,
2. the state space is ergodically explored by substitutions and indels (so there is a valid alignment and evolutionary trajectory between any two sequences $\phi(0)$ and $\phi(t)$),
3. equation 2 can be calculated exactly (specifically, as a sum over alignments, where the individual alignment likelihoods can be written in closed form).
the tkf91 model allows single-residue context-independent events only. these include (i) single-residue substitutions, (ii) single-residue insertions (with the inserted residue drawn from the equilibrium distribution of the substitution process), and (iii) single-residue deletions (whose rates are independent of the residue being deleted). the rates of all these mutation events are independent of the flanking sequence. this process is equivalent to a linear birth-death model with constant immigration [64]. thorne et al. showed that an ancestral sequence can be split into independently evolving zones, one for each ancestral residue (or "links", as they call them).
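the birth-death-immigration view of tkf91 can be made concrete by looking only at sequence length. the sketch below assumes the usual statement of the model, which is not spelled out in the text above: each residue is deleted at rate `mu`, each residue (plus one undeletable "immortal" link at the left end) spawns an insertion at rate `lam`, and `lam < mu`. under those assumptions the equilibrium length distribution is geometric with parameter `lam / mu`, which the code checks through detailed balance on the length chain. all names and values are hypothetical.

```python
def insertion_rate(n, lam):
    """Total insertion rate for a sequence of n residues (n mortal links + 1 immortal link)."""
    return (n + 1) * lam

def deletion_rate(n, mu):
    """Total deletion rate for a sequence of n residues."""
    return n * mu

def equilibrium_length_prob(n, lam, mu):
    """Geometric equilibrium distribution over sequence length n = 0, 1, 2, ..."""
    ratio = lam / mu
    return (1.0 - ratio) * ratio ** n

def length_chain_is_balanced(lam, mu, max_len=50, tol=1e-12):
    """Check P(n) * rate(n -> n+1) == P(n+1) * rate(n+1 -> n) for n < max_len."""
    for n in range(max_len):
        forward = equilibrium_length_prob(n, lam, mu) * insertion_rate(n, lam)
        backward = equilibrium_length_prob(n + 1, lam, mu) * deletion_rate(n + 1, mu)
        if abs(forward - backward) > tol:
            return False
    return True

if __name__ == "__main__":
    lam, mu = 0.02, 0.03
    print(length_chain_is_balanced(lam, mu))
    # Mean equilibrium length for a geometric distribution on {0, 1, 2, ...}.
    print((lam / mu) / (1.0 - lam / mu))
```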
this leads to the very appealing result that the length distribution for observed gaps is geometric, which conveniently allows the joint probability $P(\phi(0), \phi(t))$ to be expressed as a paired-sequence hidden markov model or "pair hmm" [65]. the conditional probability $P(\phi(t) \mid \phi(0))$ can similarly be expressed as a weighted finite-state transducer [66-68]. some interesting discussion of why the tkf91 model should be solvable at all can be found in [69] and in [42]. there are several variations on the tkf91 model. the case where there are no indels at all, only substitutions, can be viewed as a special case of tkf91, and can of course be solved exactly, as is well known. another variation on the tkf91 model constrains the total indel rate to be independent of sequence length [70]. in the following section i cover some variants that use different state spaces. it is difficult to extend tkf91 to more realistic models wherein indels (or substitutions) can affect multiple residues at once. in such models, the fate of adjacent residues is no longer independent, since a single event can span multiple sites. as a way around this difficulty, several researchers have developed evolutionary models where the state is not a pure dna or protein sequence, but includes some extra "hidden" information, such as boundaries, markers or other latent structure. in some of these models the sequence of residues is replaced by a sequence of indivisible fragments, each of which can contain more than one residue [56, 69, 71]. these include the tkf92 model [71] which is, essentially, tkf91 with residues replaced by fragments (so the alphabet itself is the countably infinite set of all sequences over some other, finite alphabet). other models approximate indels as a kind of substitution that temporarily hides a residue, by augmenting the dna or protein alphabet with an additional gap character [72-74].
these models can be used to calculate some form of likelihood for a pairwise alignment of two sequences, but since this likelihood is not derived from an underlying instantaneous model of indels, the equations do not, in general, satisfy the chapman-kolmogorov forward eq. (3). that is, the probability of evolving from i to k comes out differently depending on whether or not one conditions on an intermediate sequence j. clearly, something about this "seems wrong": the failure to obey eq. 3 illustrates the ad hoc nature of these approaches. ezawa [53] describes the chapman-kolmogorov property (eq. 3) as evolutionary consistency; it can also be regarded as being the defining property of any correct solution to a continuous-time markov chain. the abovementioned approaches may be evolutionarily consistent if the state space is allowed to include the extra information that is introduced to make the model tractable, such as fragment boundaries. lèbre and michel have criticized other aspects of the rivas-eddy 2005 and 2008 models [73, 74] ; in particular, incomplete separation of the indel and substitution processes [42] . models which allow for heterogeneity of indel and substitution rates along the sequence also fall into this category of latent variable models. the usual way of allowing for such spatial variation in substitution models is to assume a latent rate-scaling parameter associated with each site [4, 5] . for indel models, this latent information must be extended to include hidden site boundaries [56] . another variation on tkf91 is the tkf structure tree, which describes the evolutionary behavior of rna structures with stem and loop regions which are subject to insertion and deletion [75] . rather than describing the evolution of a sequence, this model essentially captures the time-evolution grammar of a tree-like graph whose individual edges are evolving according to the tkf91 model. 
other evolutionary models have made use of graph grammars, for example to model pseudoknots [76] or context-dependent indels [77]. in tackling indel models where the indel events can insert or delete multiple residues at once, several authors have used the approximation that indels never overlap, so that any observed gap corresponds to a single indel event. this approximation, which is justified if one is considering evolutionary timespans $t \ll 1/(\delta \ell)$ where $\delta$ is the indel rate per site and $\ell$ is the gap length, considerably simplifies the task of calculating gap probabilities [67, 78-83]. at longer timescales, it is necessary to consider multiple-event trajectories, but (as a simplifying approximation) one can still truncate the trajectory at a finite number of events. a problem with this approach is that many different trajectories will generally be consistent with an observed mutation. summing over all such trajectories, to compute the probability of observing a particular configuration after finite time (e.g. the observed gap length distribution), is a nontrivial problem. in analyzing the long indel model, a generalization of tkf91 with arbitrary length distributions for instantaneous indel events, miklós et al. [84] make the claim that the existence of a conserved residue implies the alignment probability is factorable at that point (since no indel has ever crossed the boundary). they use a numerical sum over indel trajectories to approximate the probability distribution of observed gap lengths. although they used a reversible model, their approach generalizes readily to irreversible models. this work builds on an earlier model which allows long insertions, but only single-residue deletions [85]. recent work by ezawa has put this finite-event approximation on a more solid footing by developing a rigorous algebraic definition of equivalence classes for event trajectories [53-55].
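a back-of-envelope calculation shows why the no-overlap approximation needs short timespans. the sketch below makes a simplifying assumption that is not taken from the cited models: indel events touching a window of `gap_len` sites arrive as a poisson process with rate `indel_rate * gap_len`, so the expected number of events in time `t` is their product. the probability of two or more events is then one minus the poisson probabilities of zero and one event.

```python
import math

def expected_events(indel_rate, gap_len, t):
    """Mean number of indel events touching the window (Poisson assumption)."""
    return indel_rate * gap_len * t

def prob_multiple_events(indel_rate, gap_len, t):
    """P(at least two events) = 1 - exp(-x) * (1 + x), with x the expected event count."""
    x = expected_events(indel_rate, gap_len, t)
    return 1.0 - math.exp(-x) * (1.0 + x)

if __name__ == "__main__":
    indel_rate, gap_len = 0.01, 10  # hypothetical values
    threshold = 1.0 / (indel_rate * gap_len)
    for t in (0.01 * threshold, 0.1 * threshold, threshold, 10 * threshold):
        print(t, prob_multiple_events(indel_rate, gap_len, t))
```

when the expected event count is small, the overlap probability is roughly half its square, so the single-event picture is accurate; once the timespan approaches the threshold the probability is no longer negligible and multiple-event trajectories must be considered.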
solutions obtained using finite-event approximations will not exactly satisfy eq. 3. there will be some error in the probability, and in general the error will be greater on longer branches, as the main assumption behind the approximation (that there are no overlapping indels in the time interval, or that there is a finite limit to the number of overlapping indels) starts to break down. however, since these are principled approximations, it should be possible to form some conclusions as to the severity of the error, and its dependence on model parameters. simulation studies have also been of some help in assessing the error of these approximations. for context-dependent substitution processes, such as models that include methylation-induced cpg-deamination, a clever approach was developed in [44]. rather than considering a finite-event trajectory, they develop an explicit taylor series for the matrix exponential (eq. 2) and then truncate this taylor series. specifically, the rate matrix for a finite-length sequence is constructed as a sum of rate matrices operating locally on the sequence, using the kronecker sum ⊕ and kronecker product ⊗ to concatenate rate matrices. these operators may be understood as follows, for an alphabet of $N$ symbols: suppose that $\mathcal{M}_m$ is the set of all matrices indexed by $m$-mers, so that if $A \in \mathcal{M}_m$, then $A$ is an $N^m \times N^m$ matrix. let $i, j$ be $m$-mers, $k, l$ be $n$-mers, and $ik, jl$ the concatenated $(m+n)$-mers. if $A \in \mathcal{M}_m$ and $B \in \mathcal{M}_n$ then $A \oplus B$ and $A \otimes B$ are both in $\mathcal{M}_{m+n}$ and are specified by

$$(A \oplus B)_{ik,jl} = A_{ij}\,\delta_{kl} + \delta_{ij}\,B_{kl}, \qquad (A \otimes B)_{ik,jl} = A_{ij}\,B_{kl},$$

where $\delta_{ij}$ is the kronecker delta (1 if $i = j$, 0 otherwise). commuting terms in the taylor series for exp(rt) can then be systematically rearranged into a quickly-converging dynamic programming recursion. this approach was first used by [44] and further developed, including model-fitting algorithms [86], application to phylogenetic trees [87], and discussion of the associated eigensystem [45, 62].
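the role of the kronecker sum can be made concrete with a short python sketch. it uses the standard definitions a ⊗ b (entry for row ik and column jl equal to a_ij b_kl) and a ⊕ b = a ⊗ i + i ⊗ b, and checks numerically that the exponential of a kronecker sum equals the kronecker product of the exponentials, which is what happens when two sites evolve independently. the two 2-state rate matrices, the time value and the function names are hypothetical, and the code is not the dynamic programming recursion of [44], only an illustration of the operators.

```python
def identity(n):
    return [[float(i == j) for j in range(n)] for i in range(n)]


def mat_mul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_exp(rate, t, terms=60):
    """Approximate exp(rate * t) by summing the first `terms` Taylor terms."""
    n = len(rate)
    step = [[rate[i][j] * t for j in range(n)] for i in range(n)]
    result = identity(n)
    term = identity(n)
    for k in range(1, terms):
        term = [[x / k for x in row] for row in mat_mul(term, step)]
        result = [[result[i][j] + term[i][j] for j in range(n)]
                  for i in range(n)]
    return result


def kron_product(a, b):
    """Kronecker product: entry for row (i, k), column (j, l) is a[i][j] * b[k][l]."""
    na, nb = len(a), len(b)
    return [[a[i][j] * b[k][l] for j in range(na) for l in range(nb)]
            for i in range(na) for k in range(nb)]


def kron_sum(a, b):
    """Kronecker sum: a (x) I + I (x) b."""
    left = kron_product(a, identity(len(b)))
    right = kron_product(identity(len(a)), b)
    n = len(left)
    return [[left[i][j] + right[i][j] for j in range(n)] for i in range(n)]


# Hypothetical single-site rate matrices (rows sum to zero).
A = [[-0.3, 0.3], [0.1, -0.1]]
B = [[-0.5, 0.5], [0.2, -0.2]]
T = 1.5

joint = mat_exp(kron_sum(A, B), T)
factored = kron_product(mat_exp(A, T), mat_exp(B, T))
max_diff = max(abs(joint[i][j] - factored[i][j])
               for i in range(4) for j in range(4))
print("largest difference:", max_diff)
```

the agreement holds because a ⊗ i and i ⊗ b commute; context-dependent models add terms that do not commute, which is where the rearranged taylor series becomes necessary.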
it remains to be seen to what extent such an approach offers a practical solution for general indel models, where the instantaneous transitions are between sequences of differing lengths. such is the difficulty of solving long indel models that several authors have performed simulations to investigate the empirical gap length distributions that are observed after finite time intervals for various given instantaneous indel-rate models. these observed gaps can arise from multiple overlapping indel events, in ways that have so far defied straightforward algebraic characterization. in recent work, rivas and eddy [56] have shown that if an underlying model has a simple geometric length distribution over instantaneous indel events, the observed gap length distribution at finite times (accounting for the possibility of multiple, overlapping indels) cannot be geometric. rivas and eddy report simulation studies supporting this result, and go on to propose several models incorporating hidden information (such as fragment boundaries, à la tkf92) which have the advantage of being good fits to hmms for their finite-time distributions. it has long been known that the lengths of empirically observed indels are more accurately described by a power-law distribution than a geometric distribution [46, 47, 88-91] and that alignment algorithms may benefit from using such length distributions, which lead to generalized convex gap penalties, rather than the computationally more convenient geometric distribution, which leads to an affine or linear gap penalty [92, 93]. for molecular evolution purposes in particular, it is known that overreliance on affine gap penalties leads to serious underestimates of the true lengths of natural indels [94].
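the link between length distributions and gap penalties mentioned above can be sketched in a few lines of python, taking the penalty for a gap of a given length to be the negative log of its probability. the parameter values (extension probability 0.5, power-law exponent 1.5, maximum length 50) and the function names are assumptions chosen for illustration and are not estimates from any dataset.

```python
import math


def geometric_lengths(extend_prob, max_len):
    """P(length = l) = (1 - p) * p**(l - 1) for l = 1..max_len."""
    return [(1.0 - extend_prob) * extend_prob ** (l - 1)
            for l in range(1, max_len + 1)]


def power_law_lengths(exponent, max_len):
    """P(length = l) proportional to l**(-exponent), normalized over 1..max_len."""
    weights = [l ** -exponent for l in range(1, max_len + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def gap_penalties(probabilities):
    """Score-space penalty for each length: the negative log probability."""
    return [-math.log(q) for q in probabilities]


def increments(penalties):
    """Extra cost of extending a gap by one residue, for each length."""
    return [b - a for a, b in zip(penalties, penalties[1:])]


MAX_LEN = 50
geo = geometric_lengths(0.5, MAX_LEN)
pl = power_law_lengths(1.5, MAX_LEN)

geo_steps = increments(gap_penalties(geo))
pl_steps = increments(gap_penalties(pl))

# Geometric lengths: every extension costs the same (an affine penalty).
print("geometric extension costs:", geo_steps[0], geo_steps[-1])
# Power-law lengths: extensions get cheaper as the gap grows.
print("power-law extension costs:", pl_steps[0], pl_steps[-1])
# The power law leaves far more probability on long gaps.
print("P(length = 30):", geo[29], "vs", pl[29])
```

under the geometric model each added residue costs -log(p), so long gaps are penalized heavily, which is consistent with the reported tendency of affine penalties to underestimate indel lengths.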
for almost as long, it has been known that using a mixture of geometric distributions, or (considered in score space rather than probability space) a piecewise linear gap penalty, mitigates some of these problems in sequence alignment [94] [95] [96] . taken together, these results suggest that simple hmm-like models, which are most efficient at modeling geometric length distributions, may be fundamentally limited in their ability to fully describe indels; that adding more states (yielding length distributions that arise from sums of geometric random variates, such as negative binomial distributions, or mixtures of geometric distributions) can lead to an improvement; and that generalized hmms, which can model arbitrary length distributions at the cost of some computational efficiency [97] , may be most appropriate. for example, the abovementioned "long indel" model of miklós et al. uses a generalized pair hmm [84] , as does the hmm of [98] . it is even conceivable that some molecular evolution studies in the future will abandon hmms altogether, although they remain very convenient for many applications. the recent work of ezawa has some parallels, but also differences, to the work of rivas and eddy [53] [54] [55] . ezawa criticizes over-reliance on hmm-like models, and insists on a systematic derivation from simple instantaneous models. he puts the intuition of miklós et al [84] on a more formal footing by introducing an explicit notation for indel trajectories and the concept of "local history set equivalence classes" for equivalent trajectories. ezawa uses this concept to prove that alignment likelihoods for long-indel and related models are indeed factorable, and investigates, by numerical computation and analysis (with confirmation by simulation), the relative contribution of multiple-event trajectories to gap length distributions. 
ezawa's results also show that the effects on the observed indel lengths due to overlapping indels become more significant as the indels get larger, making the problem particularly acute for genomic alignments where indels can be much larger than in proteins. a number of excellent, realistic sequence simulators are available including dawg [99] , indelible [100] , and indel-seq-gen [101] . consider now the extension of these results from pairwise alignments, such as tkf91 and the "long indel" model, to multiple alignments (with associated phylogenies). some of the approaches to this problem use markov chain monte carlo (mcmc); some of the approaches use finite-state automata; and there is also some overlap between these categories (i.e. mcmc approaches that use automata). mcmc is the most principled approach to integrating phylogeny with multiple alignment. in principle an mcmc algorithm for phylogenetic alignment can yield the posterior distribution of alignments, trees, and parameters for any model whose pairwise distribution can be computed. this includes long indel models and also, in principle, other effects such as context-dependent substitutions. of the mcmc methods reported in the literature, some just focus on alignment and ancestral sequence reconstruction [65] ; others on simultaneous alignment and phylogenetic reconstruction [79-81, 83, 102, 103] ; some also include estimation of evolutionary parameters such as dn/ds [104] ; and some (focused on rna sequences) attempt prediction of secondary structure [105, 106] . in practise these mostly use hmms, or dynamic programming of some form, in common with the methods of the following section. it is of course possible to use hmm-based or other mcmc approaches to propose candidate reconstructions, and then to accept or reject those proposals (in the manner of metropolis-hastings or importance sampling) using a more realistic formulation of the indel likelihood. 
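the accept/reject idea in the preceding sentence can be sketched as a metropolis-hastings independence sampler in python. candidates are drawn from a cheap proposal distribution (standing in for an hmm-based sampler) and accepted with a probability that corrects toward a more realistic target. the three candidate reconstructions, their labels and all probabilities below are hypothetical toy values, not output from any real alignment model.

```python
import random


def mh_acceptance(target_new, target_old, proposal_new, proposal_old):
    """Metropolis-Hastings acceptance probability for an independence sampler."""
    ratio = (target_new * proposal_old) / (target_old * proposal_new)
    return min(1.0, ratio)


def independence_sampler(target, proposal, n_steps, rng):
    """Sample states with frequencies approaching `target`, proposing from `proposal`.

    `target` may be unnormalized; `proposal` must be positive wherever
    `target` is positive.
    """
    states = list(target)
    weights = [proposal[s] for s in states]
    current = rng.choices(states, weights)[0]
    counts = {s: 0 for s in states}
    for _ in range(n_steps):
        candidate = rng.choices(states, weights)[0]
        accept = mh_acceptance(target[candidate], target[current],
                               proposal[candidate], proposal[current])
        if rng.random() < accept:
            current = candidate
        counts[current] += 1
    return {s: counts[s] / n_steps for s in states}


# Hypothetical candidate reconstructions: the "realistic" likelihood (target)
# and the coarse sampler that proposes them (proposal).
TARGET = {"reconstruction_a": 0.5, "reconstruction_b": 0.3, "reconstruction_c": 0.2}
PROPOSAL = {"reconstruction_a": 1 / 3, "reconstruction_b": 1 / 3, "reconstruction_c": 1 / 3}

frequencies = independence_sampler(TARGET, PROPOSAL, 20000, random.Random(1))
print(frequencies)
```

only ratios of the target enter the acceptance probability, so the realistic likelihood never needs to be normalized over all possible reconstructions.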
ezawa's methods, and others that build on them or are related to them, may be useful in this context. for example, ezawa's formulation was used to calculate the indel component of the probability of a fixed multiple sequence alignment (msa) resulting from sequence evolution along a fixed tree [53]. he also developed an algorithm to approximately calculate the indel component of the msa probability using all msa-compatible parsimonious indel histories [54], and applied it to some analyses of simulated msas [107]. using such realistic likelihood calculations as a post-processing "filter" for coarser, more rapid mcmc approaches that sample the space of possible reconstructions could be a promising approach. the dynamic programming recursion for pairwise alignment reported for the tkf91 model [63] can be exactly extended to alignment of multiple sequences given a tree [108, 109]. this works essentially because the tkf91 joint distribution over ancestor and descendant sequences can be represented as a pair hmm; the multiple-sequence version is a multi-sequence hmm [65]. this approach can be generalized, using finite-state transducer theory. transducers were originally developed as modular representations of sequence-transforming operations for use in speech recognition [110]. in bioinformatics, they offer (among other things) a systematic way of extending hmm-like pairwise alignment likelihoods to trees [67, 68, 111, 112]. other applications of transducer models in bioinformatics have included copy number variation in tumors [113], protein family classification [114], dna-protein alignment [115] and error-correcting codes for dna storage [116]. a finite-state transducer is a state machine that reads an input tape and writes to an output tape [117]. a probabilistically weighted finite-state transducer is the same, but its behavior is stochastic [110].
for the purposes of bioinformatics sequence analysis, a transducer can be thought of as being just like a pair hmm; except where a pair hmm's transition and emission probabilities have been normalized so as to describe joint probabilities, a transducer's probabilities are normalized so as to describe conditional probabilities like the entries of matrix m(t) (eq. 2). more specifically, if i and j are sequences, then one can define the matrix entry $A_{i,j}$ to be the forward score for those two sequences using transducer $\mathcal{A}$. thus, the transducer is a compact encoding for a square matrix of countably infinite rank, indexed by sequence states (rather than nucleotide or amino acid states). the utility of transducers arises since for many purposes they can be manipulated analogously to matrices, while being more compact than the corresponding matrix (as noted above, matrices describing evolution of arbitrary-length sequences are impractically, or even infinitely, large). if $\mathcal{A}$ and $\mathcal{B}$ are finite transducers encoding (potentially infinite) matrices $A$ and $B$, then there is a well-defined operation called transducer composition yielding a finite transducer $\mathcal{A}\mathcal{B}$ that represents the matrix product $AB$. there are other well-defined transducer operations corresponding to the various other linear algebra operations used in this paper: the hadamard product (•) corresponds to transducer intersection, the kronecker product (⊗) corresponds to transducer concatenation, and the scalar product (·) and the unit vector can also readily be constructed using transducers. consequently, eq. 4 can be interpreted directly in terms of transducers [67, 68, 82]. this has several benefits. one is theoretical unification: eq. 4, using the above linear algebra interpretation of transducer manipulations, turns out to be very similar to sankoff's algorithm for phylogenetic multiple alignment [118].
thus is a famous algorithm in bioinformatics unified with a famous algorithm in likelihood phylogenetics by using a tool from computational linguistics. (this excludes the rna structure-prediction component of sankoff's algorithm; that can, however, be included by extending the transducer framework to pushdown automata [119].) practically, the phylogenetic transducer can be used for alignment [79, 81], parameter estimation [104], and ancestral reconstruction [67], with promising results for improved accuracy in multiple sequence alignment [112]. more broadly, one can think of the transducer as being in a family of methods that combine phylogenetic trees (modeling the temporal structure of evolution) with automata theory, grammars, and dynamic programming on sequences (modeling the spatial structure of evolution). the tkf structure tree, mentioned above, is in this family too: it can be viewed as a context-free grammar, or as a transducer with a pushdown stack [75]. the hmm-like nature of tkf91, and the ubiquity of hmms and dynamic programming in sequence analysis, has motivated numerous approaches to indel analysis based on pair hmms [56, 71, 74, 78, 120], as well as many other applications of phylogenetic hmms [6, 7, 12, 121, 122] and phylogenetic grammars [8, 10, 40, 60, 123, 124]. in most of these models, an alignment is assumed fixed and the hmm or grammar used to partition it; however, in principle, one can combine the ability of hmms/grammars to model indels (and thus impute alignments) with the ability to partition sequences into differently evolving regions. the promise of using continuous-time markov chains to model indels has been partially realized by automata-theoretic approaches based on transducers and hmms. recent work by rivas and eddy [56] and by ezawa [53-55] may be interpreted as both good and bad news for automata-theoretic approaches.
it appears that closed-form solutions for observed gap length distributions at finite times, and in particular the geometric distributions that simple automata are good at modeling, are still out of reach for realistic indel models, and indeed (for simple models) have been proven impossible [56]. further, simulation results have demonstrated that geometric distributions are not a good fit to the observed gap length distributions when the underlying indel model has geometrically-distributed lengths for its instantaneous indel events [56]. if the lengths of the instantaneous indels follow biologically plausible power-law distributions, the evolutionary effects due to overlapping indels become larger as the gaps grow longer [54]. that is the bad news (at least for automata). the good news is that the simulation results also suggest that, for short branches and/or gaps (such that indels rarely overlap), the error may not be too bad to live with. approximate-fit approaches that are common in pair hmm modeling and pairwise sequence alignment, such as using a mixture of geometric distributions to approximate a gap length distribution (yielding a longer tail than can be modeled using a pure geometric distribution), may help bridge the accuracy gap [96]. given the power of automata-theoretic approaches, the best way forward (in the absence of a closed-form solution) may be to embrace such approximations and live with the ensuing error. interestingly, the authors of the two recent simulation studies that prompted this commentary come to different conclusions about the viability of automata-based dynamic programming approaches. ezawa [53, 54], arguing that realism is paramount, advocates deeper study of the gap length distributions obtained from simple instantaneous models, while acknowledging that such gap length distributions may be more difficult to use in practice than the simple geometric distributions offered by hmm-like models.
rivas and eddy [56], clearly targeting applications (particularly those such as profile hmms), work backward from hmm-like models toward evolutionary models with embedded hidden information. these models may be somewhat mathematically contrived, but are easier to tailor so as to model effects such as position-specific conservation, thus trading (in a certain sense) purism for expressiveness. whichever approach is used, these results are unambiguously good news for the theoretical study of indel processes. the potential benefits of modeling alignment as an aspect of statistical phylogenetics are significant. one can reasonably hope that the advance of theoretical work in this area will continue to inform advances in both bioinformatics and statistical phylogenetics. after all, and in spite of the cambrian explosion in bioinformatics subdisciplines, sequence alignment and phylogeny truly are closely related aspects of mathematical biology.

evolution of protein molecules a model of evolutionary change in proteins atlas of protein sequence and structure in evolutionary trees from dna sequences: a maximum likelihood approach maximum-likelihood estimation of phylogeny from dna sequences when substitution rates differ over sites maximum likelihood phylogenetic estimation from dna sequences with variable rates over sites: approximate methods gene finding with a hidden markov model of genome structure and evolution combining phylogenetic and hidden markov models in biosequence analysis identification and classification of conserved rna secondary structures in the human genome an rna gene expressed during cortical development evolved rapidly in humans a comparative method for finding and folding rna secondary structures within protein-coding regions evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes using evolutionary trees in protein secondary structure prediction and other comparative sequence analyses using protein structural information in
evolutionary inference: transmembrane proteins reconstructing large regions of an ancestral mammalian genome in silico evolution of coral pigments recreated ancestral sequence reconstruction crystal structure of an ancient protein: evolution by conformational epistasis palaeotemperature trend for precambrian life inferred from resurrected proteins fast m l: a web server for probabilistic reconstruction of ancestral sequences directed evolution of sulfotransferases and paraoxonases by ancestral libraries aav ancestral reconstruction library enables selection of broadly infectious viral variants enhancing the pharmaceutical properties of protein drugs by ancestral sequence reconstruction synthesis of phylogeny and taxonomy into a comprehensive tree of life protein molecular function prediction by bayesian phylogenomics phylogenetic diversity meets conservation policy: small areas are key to preserving eucalypt lineages identification of a novel coronavirus in patients with severe acute respiratory syndrome bayesian coalescent inference of past population dynamics from molecular sequences unifying the spatial epidemiology and molecular evolution of emerging epidemics patient 0' hiv-1 genomes illuminate early hiv/aids history in north america identifying predictors of time-inhomogeneous viral evolutionary processes a simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences dating the human-ape splitting by a molecular clock of mitochondrial dna revbayes: bayesian phylogenetic inference using graphical models and an interactive model-specification language beast: bayesian evolutionary analysis by sampling trees raxml-vi-hpc: maximum likelihood-based phylogenetic analyses with thousands of taxa and mixed models hyphy: hypothesis testing using phylogenies paml 4: phylogenetic analysis by maximum likelihood phylip -phylogeny inference package (version 3.2) tree-puzzle: maximum likelihood phylogenetic analysis 
using quartets and parallel computing developing and applying heterogeneous phylogenetic models with xrate estimation of evolutionary distances under stationary and nonstationary models of nucleotide substitution an evolution model for sequence length based on residue insertion-deletion independent of substitution: an application to the gc content in bacterial genomes a stochastic gene evolution model with time dependent mutations a nucleotide substitution model with nearest-neighbour interactions a generalization of substitution evolution models of nucleotides to genetic motifs empirical and structural models for insertions and deletions in the divergent evolution of proteins empirical analysis of protein insertions and deletions determining parameters for the correct placement of gaps in protein sequence alignments indel pdb: a database of structural insertions and deletions derived from sequence alignments of closely related proteins sequence context of indel mutations and their effect on protein evolution in a bacterial endosymbiont alignment of phylogenetically unambiguous indels in shewanella identification of transposable elements using multiple alignments of related genomes statistical alignment: computational properties, homology testing and goodness-of-fit general continuous-time markov model of sequence evolution via insertions/deletions: are alignment probabilities factorable? general continuous-time markov model of sequence evolution via insertions/deletions: local alignment probability computation erratum to: general continuous-time markov model of sequence evolution via insertions/deletions: are alignment probabilities factorable? 
parameterizing sequence alignment with an explicit evolutionary model multiple genome rearrangement and breakpoint phylogeny analytical expression of the purine/pyrimidine codon probability after and before random mutations analytical solutions of the dinucleotide probability after and before random mutations rna secondary structure prediction using stochastic context-free grammars and evolutionary history evolution probabilities and phylogenetic distance of dinucleotides genome evolution by transformation, expansion and contraction (getec) an evolutionary model for maximum likelihood alignment of dna sequences an introduction to probability theory and its applications evolutionary hmms: a bayesian approach to multiple alignment using guide trees to construct multiple-sequence evolutionary hmms accurate reconstruction of insertion-deletion histories by statistical phylogenetics a note on probabilistic models over strings: the linear algebra approach statistical alignment based on fragment insertion and deletion models evolutionary inference via the poisson indel process inching toward reality: an improved likelihood model of sequence evolution models of sequence evolution for dna sequences containing gaps evolutionary models for insertions and deletions in a probabilistic modeling framework probabilistic phylogenetic inference with insertions and deletions a probabilistic model for the evolution of rna structure pair stochastic tree adjoining grammars for aligning and predicting pseudoknot rna structures a probabilistic model for sequence alignment with context-sensitive indels sequence alignments and pair hidden markov models using evolutionary history joint bayesian estimation of alignment and phylogeny bali-phy: simultaneous bayesian inference of alignment and phylogeny incorporating indel information into phylogeny estimation for rapidly emerging pathogens phylogenetic automata, pruning, and multiple alignment hand align: bayesian multiple sequence alignment, 
phylogeny, and ancestral reconstruction a long indel model for evolutionary sequence alignment an improved model for statistical alignment chain monte carlo expectation maximization algorithm for statistical analysis of dna sequence evolution with neighbor-dependent substitution rates accurate estimation of substitution rates with neighbor-dependent models in a phylogenetic context patterns of insertion and deletion in mammalian genomes exhaustive matching of the entire protein sequence database pattern and rate of indel evolution inferred from whole chloroplast intergenic regions in sugarcane, maize and rice patterns of nucleotide substitution, insertion and deletion in the human genome inferred from pseudogenes the size distribution of insertions and deletions in human and rodent pseudogenes suggests the logarithmic gap penalty for sequence alignment problems and solutions for estimating indel rates and length distributions uncertainty in homology inferences: assessing and improving genomic sequence alignment sequence comparison with concave weighting functions probabilistic consistency-based multiple sequence alignment prediction of complete gene structures in human genomic dna indelign: a probabilistic framework for annotation of insertions and deletions in a multiple alignment dna assembly with gaps (dawg): simulating sequence evolution indelible: a flexible simulator of biological sequence evolution biological sequence simulation for testing complex evolutionary hypotheses: indel-seq-gen version 2.0 statalign: an extendable software package for joint bayesian estimation of alignments and evolutionary trees advances in neural information processing systems erasing errors due to alignment ambiguity when estimating positive selection statalign 2.0: combining statistical alignment with rna secondary structure prediction simulfold: simultaneously inferring rna structures including pseudoknots, alignments, and trees using a bayesian mcmc framework characterization 
of multiple sequence alignment errors using complete-likelihood score and position-shift map pacific symposium on biocomputing an efficient algorithm for statistical multiple alignment on arbitrary phylogenetic trees weighted finite-state transducers in speech recognition automata-theoretic models of mutation and alignment historian: accurate reconstruction of ancestral sequences and evolutionary rates phylogenetic quantification of intra-tumour heterogeneity protein family classification using sparse markov transducers modular non-repeating codes for dna storage a method for synthesizing sequential circuits simultaneous solution of the rna folding, alignment, and protosequence problems evolutionary triplet models of structured rna mcalign2: faster, accurate global pairwise alignment of non-coding dna sequences based on explicit models of indel evolution a hidden markov model approach to variation among sites in rate of evolution phylogenetic estimation of context-dependent substitution rates by maximum likelihood rna secondary structure prediction using stochastic context-free grammars xrate: a fast prototyping, training and annotation tool for phylo-grammars

the author thanks kiyoshi ezawa, elena rivas, sean eddy, jeff thorne, benjamin redelings, marc suchard, and one anonymous referee for productive conversations that have informed this review. this work was supported by nih/nhgri grant hg004483. authors' contributions: ih wrote the article. the author declares that they have no competing interests. received: 9 january 2017 accepted: 30 april 2017

key: cord-017932-vmtjc8ct authors: georgiev, vassil st.
title: genomic and postgenomic research date: 2009 journal: national institute of allergy and infectious diseases, nih doi: 10.1007/978-1-60327-297-1_25 sha: doc_id: 17932 cord_uid: vmtjc8ct

the word genomics was first coined by t. roderick from the jackson laboratories in 1986 as the name for the new field of science focused on the analysis and comparison of complete genome sequences of organisms and related high-throughput technologies. two basic computational methods are used for genome analysis: gene finding and whole genome comparison (2). gene finding. using a computational method that can scan the genome and analyze the statistical features of the sequence is a fast and remarkably accurate way to find the genes in the genomes of prokaryotic organisms (bacteria, archaea) and viruses, compared with the still difficult problem of finding genes in higher eukaryotes. by using modern bioinformatics software, finding the genes in a bacterial genome will result in a highly accurate, rich set of annotations that provide the basis for further research into the functions of those genes. the absence of introns (those portions of the dna that lie between two exons and are transcribed into an rna but do not appear in that rna after maturation, and therefore are not expressed as proteins during protein synthesis) removes one of the major barriers to computational analysis of the genome sequence, allowing gene finding to identify more than 99% of the genes of most genomes without any human intervention. next, these gene predictions can be further refined by searching for nearby regulatory sites such as the ribosome-binding sites, as well as by aligning protein sequences to other species. these steps can be automated using freely available software and databases (2). gene finding in single-cell eukaryotes is of intermediate difficulty, with some organisms, such as trypanosoma brucei, having so few introns that a bacterial gene finder is sufficient to find their genes.
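the statistical gene finders described above are too involved for a short example, but their simplest ingredient, scanning an intron-free sequence for open reading frames, can be sketched in python. this is a minimal illustration under stated assumptions: it searches only the forward strand, uses atg as the only start codon and the three standard stop codons, and applies a hypothetical minimum-length cutoff. the function name and the toy sequence are invented for the example, and real gene finders add statistical sequence models on top of this step.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(sequence, min_codons=3):
    """Return (start, end) index pairs of forward-strand open reading frames.

    Each ORF runs from an ATG to the first in-frame stop codon; `end` is the
    index just past the stop codon. `min_codons` counts codons from the ATG
    up to, but not including, the stop codon.
    """
    sequence = sequence.upper()
    orfs = []
    for frame in range(3):
        start = None
        for pos in range(frame, len(sequence) - 2, 3):
            codon = sequence[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos
            elif start is not None and codon in STOP_CODONS:
                if (pos - start) // 3 >= min_codons:
                    orfs.append((start, pos + 3))
                start = None
    return sorted(orfs)


# Toy sequence (hypothetical): one ORF of four codons followed by a stop.
toy = "CCATGAAACCCGGGTAGTT"
for begin, end in find_orfs(toy):
    print(begin, end, toy[begin:end])
```

because the scan assumes an uninterrupted coding region, it has no way to handle introns, which is why it is only adequate for genomes that have few or none of them.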
other eukaryotic organisms (e.g., plasmodium falciparum) have numerous introns and would require the use of a special-purpose gene finder, such as glimmerm (3, 4). whole genome comparison. this computational method refers to the problem of aligning the entire deoxyribonucleic acid (dna) sequence of one organism to that of another, with the goal of detecting all similarities as well as rearrangements, insertions, deletions, and polymorphisms (2). with the increasing availability of complete genome sequences from multiple, closely related species, such comparisons are providing a powerful tool for genomic analysis. using suffix trees (data structures that index all of the substrings of a particular sequence and can be built and searched in linear time), this computational task can be accomplished in minimal time and space. because the suffix tree algorithm is both time and space efficient, it is able to align large eukaryotic chromosomes with only slightly greater requirements than those for bacterial genomes (2). bacterial genome annotation. the major goal of bacterial genome annotation is to identify the functions of all genes in a genome as accurately and consistently as possible, initially by using automated annotation methods for preliminary assignment of functions to genes, followed by a second stage of manual curation by teams of scientists. the family enterobacteriaceae encompasses a diverse group of bacteria including many of the most important human pathogens (salmonella, yersinia, klebsiella, shigella), as well as one of the most enduring laboratory research organisms, the nonpathogenic escherichia coli k12. many of these pathogens have been subject to genome sequencing or are under study. genome comparisons among these organisms have revealed the presence of a core set of genes and functions along a generally collinear genomic backbone.
however, there are also many regions and points of difference, such as large insertions and deletions (including pathogenicity islands), integrated bacteriophages, small insertions and deletions, point mutations, and chromosomal rearrangements (5). the first genome sequence of escherichia coli k12 (reference strain mg1655) was completed and published in 1997 (6). later, the genomes of two other genotypes of e. coli, the enterohemorrhagic e. coli o157:h7 (ehec; strains edl933 and rimd 0509952-sakai) (7, 8) and the uropathogenic e. coli (upec; strain cft073) (9), were sequenced and published. currently, it is accepted that shigellae are part of the e. coli species complex, and information on the genome of shigella flexneri strain 2a has been published (10). a comparison of all three pathogenic e. coli with the archetypal nonpathogenic e. coli k12 revealed that the genomes were essentially collinear, displaying conservation in both sequence and gene order (5). the genes that were predicted to be encoded within the conserved sequence displayed more than 95% sequence identity and have been termed the core genes. similar observations were made for the shigella flexneri genome, which also shares 3.9 mb of common sequence with e. coli (10). a comparison of the three e. coli genomes revealed that genes shared by all genomes amounted to 2,996 (9), from totals of 4,288, about 5,400, and about 5,500 predicted protein-coding sequences for e. coli k12, ehec, and upec, respectively (5). the region encoding these core genes is known as the backbone sequence. it was also apparent from these comparisons that interspersed throughout this backbone sequence were large regions unique to the different genotypes. moreover, several studies had shown that some of these unique loci were present in clinical disease-causing isolates but were apparently absent from their comparatively benign relatives (11).
one such well-characterized region is the locus of enterocyte effacement (lee) in the enteropathogenic e. coli (epec). an epec infection results in effacement of the intestinal microvilli and the intimate adherence of bacterial cells to enterocytes. furthermore, epec also subverts the structural integrity of the cell and forces the polymerization of actin, which accumulates below the adhered epec cells, forming cup-like pedestals (12). this is called an attaching and effacing (ae) lesion. subsequently, lee was found in all bacteria known to be able to elicit an ae lesion (5). many regions inserted in the backbone sequence, similar to lee, have been characterized in both gram-negative and gram-positive bacteria (13). this led to the concept of pathogenicity islands (pais) and the formulation of a definition to describe their features (5). typically, pais are inserted adjacent to stable rna genes and have an atypical g+c content. in addition to virulence-related functions, the pathogenicity islands often carry genes encoding transposase or integrase-like proteins and are unstable and self-mobilizable (13, 14). it was also noted that pais possess a high proportion of gene fragments or disrupted genes when compared with the backbone regions (15). it is generally accepted that the pathogenic e. coli genotypes have evolved from a much smaller nonpathogenic relative by the acquisition of foreign dna. this laterally acquired dna has been credited with conferring on the different genotypes the ability to colonize alternative niches in the host and the ability to cause a range of different disease outcomes (5). although sharing some of the features of pais and considered to be parts of the pais, some genomic loci are unlikely to impinge on pathogenicity.
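the atypical g+c content noted above for pais is one signal that can be scanned for computationally. the python fragment below is a rough sketch of a sliding-window scan, not the procedure used in the cited studies; the function names, the window and step sizes, and the deviation threshold are all assumptions chosen for illustration, and real island prediction combines several lines of evidence (trna-adjacent insertion, mobility genes, codon usage).

```python
def gc_fraction(seq):
    """Fraction of G and C among the A/C/G/T bases of seq (0.0 if none)."""
    seq = seq.upper()
    gc = sum(1 for base in seq if base in "GC")
    acgt = sum(1 for base in seq if base in "ACGT")
    return gc / acgt if acgt else 0.0


def atypical_gc_windows(genome, window=1000, step=500, threshold=0.2):
    """Flag windows whose G+C fraction deviates from the genome-wide value.

    window, step, and threshold are illustrative assumptions, not
    published cut-offs. Returns (overall_gc, hits), where hits is a list
    of (start, end, window_gc) tuples in 0-based, end-exclusive coordinates.
    """
    overall = gc_fraction(genome)
    hits = []
    last_start = max(len(genome) - window + 1, 1)
    for start in range(0, last_start, step):
        frac = gc_fraction(genome[start:start + window])
        if abs(frac - overall) >= threshold:
            hits.append((start, start + window, round(frac, 3)))
    return overall, hits


# Synthetic example: a 50% G+C "backbone" with an A+T-only insert.
backbone = "ATGC" * 500          # 2,000 bp, G+C = 0.5
island = "AATT" * 250            # 1,000 bp, G+C = 0.0
genome = backbone + island + backbone
overall, hits = atypical_gc_windows(genome)
print(overall, hits)
```

with these synthetic sequences only the window that lies entirely within the inserted segment exceeds the threshold; windows that straddle the boundary fall below it, which shows how the choice of window size limits the resolution of such a scan.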
to take account of loci that share features of pais but are not known to affect pathogenicity, the concept of pais has been extended to include islands or strain-specific loops, which represent discrete genetic loci that are lineage-specific but are as yet not known to be involved in virulence (7, 8). currently, there are more than 2,300 salmonella serovars in two species, s. enterica and s. bongori. all salmonellae are closely related, sharing a median dna identity for the reciprocal best match of between 85% and 95% (16, 17). despite their homogeneity, there are still significant differences in the pathogenesis and host range of the different salmonella serovars. for example, whereas s. enterica subspecies enterica serovar typhi (s. typhi) is only pathogenic to humans, causing severe typhoid fever, s. typhimurium causes gastroenteritis in humans but also a systemic infection in mice and has a broad host range (16). like e. coli, the salmonellae are also known to possess pais, known as salmonella pathogenicity islands (spis). it is thought that spis have been acquired laterally. for example, the gene products encoded by spi-1 (18, 19) and spi-2 (20, 21) have been shown to play important roles in the different stages of the infection process. both of these islands possess type iii secretion systems and their associated secreted protein effectors. spi-1 is known to confer on all salmonellae the ability to invade epithelial cells. spi-2 is important in various aspects of the systemic infection, allowing salmonella to spread from the intestinal tissue into the blood and eventually to infect, and survive within, the macrophages of the liver and spleen (22). spi-3, like lee and pai-1 of upec, is inserted alongside the selc trna gene and carries the gene mgtc, which is required for intramacrophage survival and growth in the low-magnesium environment thought to be encountered in the phagosome (23).
other salmonella spis encode type iii-secreted effector proteins, chaperone-usher fimbrial operons, vi antigen biosynthetic gene, a type ivb pilus operon, and many other determinants associated with the salmonellae enteropathogenicity (15). although the mobile nature of pais is frequently discussed in the literature, there is little direct experimental evidence to support these observations. one possible explanation for this may be that on integration, the mobility genes of the pais subsequently become degraded, thereby fixing their position (5) . certainly, there is evidence to support this hypothesis, as many proposed pais carry integrase or transposase pseudogenes or remnants. one excellent example of this is the high-pathogenicity island (hpi) first characterized in yersinia (24) . the yersinia hpis can be split into two lineages based on the integrity of the phage integrase gene (int) carried in the island: (i) y. enterocolitica biotype 1b and (ii) y. pestis and y. pseudotuberculosis. the y. enterocolitica hpi int gene carries a point mutation, whereas the analogous gene is intact in the y. pestis and y. pseudotuberculosis hpis. the yersinia hpi is a 35-to 43-kb island that possesses genes for the production and uptake of the siderophore yersiniabactin, as well as genes, such as int, thought to be involved in the mobility of the island. hpi-like elements are widely distributed in enterobacteria, including e. coli, klebsiella, enterobacter, and citrobacter spp., and like many prophages, these hpis are found adjacent to asn-trna genes (8) . trna genes are common sites for bacteriophage integration into the genome (25) . integration at these sites typically involves site-specific recombination between short stretches of identical dna located on the phage (attp) and at the integration site on the bacterial genomes (attb). 
the trna genes represent common sites for the integration of many other pais and bacteriophages, with the selc trna locus being the most heavily used integration site in the enterics (9). integrated bacteriophages, also known as prophages, are also commonly found in bacterial genomes (5). for example, nearly 50% of the unique regions (s loops) of the e. coli o157:h7 strain edl933 (ehec) were phage related. in addition to the 18 prophage sequences detected in the genome of ehec strain sakai (8), the genomes of e. coli k12, upec, and s. flexneri have all been shown to carry multiple prophage or prophage-like elements (6, 7, 9, 10). moreover, comparison of the genome sequences of ehec o157:h7 strain edl933 and strain sakai revealed marked variations in the complement and integration sites of the prophages, as well as in internal regions within highly related phages (8, 26). in addition to genes essential for their own replication, phages often carry genes that, for example, prevent superinfection by other bacteriophages, such as old and tin (27, 28). however, other genes carried in prophages appear to be of nonphage origin and can encode determinants that enhance the virulence of the bacterial host by a process known as lysogenic conversion (29). in addition to the presence of the lee pai and the ability to elicit ae lesions, another defining characteristic of the enterohemorrhagic e. coli (ehec) is the production of shiga toxins (stx). the shiga toxins represent a family of potent cytotoxins that, on entry into the eukaryotic cell, act as glycosylases by cleaving the 28s ribosomal rna (rrna), thereby inactivating the ribosome and consequently preventing protein synthesis (30). other enteric pathogens such as s. typhi, s. typhimurium, and y. pestis are also known to possess significant numbers of prophages (15, 16, 31).
the principal virulence determinants of the salmonellae are the type iii secretion systems, carried by spi-1 and spi-2, and their associated protein effectors (32, 33). a significant number of these type iii secreted effector proteins are present in the genomes of prophages and have a dramatic influence on the ability of their bacterial hosts to cause disease (5).

small insertions and deletions. even though the large pais play a major role in defining the phenotypes of different strains of the enteric bacteria, there are many other differences resulting from small insertions and deletions, which must be taken into account when considering the overall genomic picture of enterobacteriaceae (5). thus, the comparisons between e. coli k12 and e. coli o157:h7 and between s. typhi and s. typhimurium have indicated the existence of many small differences aside from the large pathogenicity islands. for example, counting the separate insertion and deletion events shows 145 events of 10 genes or fewer compared with 12 events of 20 genes or more for the s. typhi and s. typhimurium comparison. furthermore, comparison between s. typhi and e. coli revealed 504 events of 10 genes or fewer compared with just 25 events of 20 genes or more. even taking into account that the larger islands contain many more genes per insertion or deletion event, it becomes clear that nearly equivalent numbers of species-specific genes are attributable to insertion or deletion events involving 10 genes or fewer as are due to events involving 20 genes or more. these data lend credence to the assertion that the acquisition and exchange of small islands is important in defining the overall phenotype of the organism (5). in the majority of cases studied to date, there is no evidence to suggest the presence of genes that may allow these small islands to be self-mobile.
it is far more likely that small islands of this type are exchanged between members of a species and constitute part of the species gene pool. once acquired by one member of the species, they can be easily exchanged by generalized transduction mechanisms, followed by homologous recombination between the near identical flanking genes to allow integration into the chromosome (5). this sort of mechanism of genetic exchange would also make possible nonorthologous gene replacement, involving the exchange of related genes at identical regions in the backbone. a specific example to illustrate such a possibility is the observed capsular switching of neisseria meningitidis (34) and streptococcus pneumoniae (35, 36), for which different sets of genes responsible for the biosynthesis of different capsular polysaccharides are found at identical regions in the chromosome and flanked by conserved genes. the implied mechanism for capsular switching involves replacement of the polysaccharide-specific gene sites by homologous recombination between the chromosome and exogenous dna in the flanking genes (5).

point mutations and pseudogenes. one of the most surprising observations to come from enterobacterial genome research has been the discovery of a large number of pseudogenes. the pseudogenes appeared to be untranslatable due to the presence of stop codons, frameshifts, internal deletions, or insertion sequence (is) element insertions. the presence of pseudogenes seems to run contrary to the general assumption that the bacterial genome is a highly "streamlined" system that does not carry "junk dna" (5). for example, salmonella typhi, the etiologic agent of typhoid fever, is host restricted and appears only capable of infecting a human host, whereas s. typhimurium, which causes a milder disease in humans, has a much broader host range. upon analysis, the genome of s.
typhi contained more than 200 pseudogenes (15) , whereas it was predicted that the number of pseudogenes in the genome of s. typhimurium would be around 39 (16) . from this observation, it becomes clear that the pseudogenes in s. typhi were not randomly spread throughout its genome-in fact, they were overrepresented in genes that were unique to s. typhi when compared with e. coli, and many of the pseudogenes in s. typhi have intact counterparts in s. typhimurium that have been shown to be involved in aspects of virulence and host interaction. given this distribution of pseudogenes, it has been suggested that the host specificity of s. typhi may be the result of the loss of its ability to interact with a broader range of hosts caused by functional inactivation of the necessary genes (15) . in contrast with other microorganisms containing multiple pseudogenes, such as mycobacterium leprae (37) , most of the pseudogenes in s. typhi were caused by a single mutation, suggesting that they have been inactivated relatively recently. taken together, these observations suggest an evolutionary scenario in which the recent ancestor of s. typhi had changed its niche in a human host, evolving from an ancestor (similar to s. typhimurium) limited to localized infection and invasion around the gut epithelium into one capable of invading the deeper tissues of the human hosts (5) . a similar evolutionary scenario has been suggested for another recently evolved enteric pathogen, yersinia pestis. this bacterium has also recently changed from a gut bacterium (y. pseudotuberculosis), transmitted via the fecal-oral route, to an organism capable of using a flea vector for systemic infection (38, 39) . again, this change in niche was accompanied by pseudogene formation, and genes involved in virulence and host interaction are overrepresented in the set of genes inactivated (31) . yet another example of such an evolutional scenario is shigella flexneri 2a, a member of the species e. 
coli (which is predicted to have more than 250 pseudogenes), and is again restricted to the human body (10). all of these organisms demonstrate that enterobacterial evolution has been a process that has involved both gene loss and gene gain, and that the remnants of the genes lost in the evolutionary process can be readily detected (5).

the focus in the postgenomic era is on functional genomics, in which proteomics plays an essential role. the living cell is a dynamic and complex system that cannot be predicted from the genome sequence. whereas a genome will disclose important information on the biological importance of the organism, it is still static and will not reveal information on the expression of a particular gene, on posttranslational modifications, or on how a protein is regulated in a specific biological situation (40). thus, whereas the complete genome sequence provides the basis for experimental identification of expressed proteins at the cellular level, very little has been accomplished to identify all expressed and potentially modified proteins. direct investigation of the total content of proteins in a cell is the task of proteomics. the proteome is defined as the complete set of posttranslationally modified and processed proteins in a well-defined biological environment under specific circumstances, such as growth conditions and time of investigation (40, 41). proteomics can be studied by following two separate steps: separation of the proteins in a sample, followed by identification of the proteins. the common methodology used for separating proteins is two-dimensional polyacrylamide gel electrophoresis (2d page). the principal method for large-scale identification is mass spectrometry (ms), but other identification methods, such as n-terminal sequencing, immunoblotting, overexpression, spot colocalization, and gene knockouts, can also be used.
because of its high-resolution power, 2d page is currently the best methodology to achieve global visualization of the proteins of a microorganism. in the first dimension, isoelectric focusing is carried out to separate the proteins in a ph gradient according to their isoelectric point (pi). in the second dimension, the proteins are separated according to their molecular weight by sds-page (sodium dodecyl sulfate-page). the resulting gel image presents itself as a pattern of spots in which pi and the relative molecular weight (m r) can be read as in a coordinate system (40). a critical step during the 2d page procedure is the sample preparation, as there is no single method that can be universally applied because different reagents are superior with respect to different samples. to this end, chaotropes such as urea, which act by changing the parameters of the solvent, are used in most 2d page procedures. major problems to overcome in 2d page sample preparation arise because of limited entry into the gel of high-molecular-weight proteins and the presence of highly hydrophobic and/or basic proteins (42, 43). for protein separation, the protein mixture is loaded onto an acrylamide gel strip in which a ph gradient is established. when a high voltage is applied over the strip, the proteins will focus at the ph at which they carry zero net charge. the ph gradient is established during the focusing using either carrier ampholytes in a slab gel (44) or a precast polyacrylamide gel with an immobilized ph gradient (ipg) (45). the latter method is advantageous because of improved reproducibility. samples can be applied to ipg dry strips, preferably by rehydration. rehydration of dried ipgs under application of a low voltage (10 to 50 v) has significantly improved the recovery, especially of high-molecular-weight proteins. mass spectrometry is the method of choice for identifying proteins in proteomics.
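the statement that proteins focus at the ph at which they carry zero net charge can be made concrete with a small calculation. the python fragment below is a minimal sketch that estimates the pi of a sequence from the henderson-hasselbalch relation and a bisection search; the pka values are approximate textbook-style values assumed for illustration (published pka tables differ), the function names are hypothetical, and effects such as posttranslational modification or local protein structure are ignored.

```python
# Approximate side-chain and terminal pKa values (assumed for illustration;
# different published tables give somewhat different numbers).
POSITIVE_PKA = {"nterm": 8.6, "K": 10.8, "R": 12.5, "H": 6.5}
NEGATIVE_PKA = {"cterm": 3.6, "D": 3.9, "E": 4.1, "C": 8.5, "Y": 10.1}


def net_charge(seq, ph):
    """Estimated net charge of a peptide at a given pH."""
    seq = seq.upper()
    # Free N-terminus (positive when protonated), free C-terminus (negative).
    charge = 1.0 / (1.0 + 10 ** (ph - POSITIVE_PKA["nterm"]))
    charge -= 1.0 / (1.0 + 10 ** (NEGATIVE_PKA["cterm"] - ph))
    for aa in "KRH":
        charge += seq.count(aa) / (1.0 + 10 ** (ph - POSITIVE_PKA[aa]))
    for aa in "DECY":
        charge -= seq.count(aa) / (1.0 + 10 ** (NEGATIVE_PKA[aa] - ph))
    return charge


def isoelectric_point(seq, tol=1e-4):
    """pH at which the estimated net charge is zero, found by bisection.

    Net charge decreases monotonically with pH, is positive near pH 0
    and negative near pH 14, so a single root lies in between.
    """
    lo, hi = 0.0, 14.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if net_charge(seq, mid) > 0:
            lo = mid   # still positively charged: pI is at higher pH
        else:
            hi = mid
    return (lo + hi) / 2.0


# Toy peptides: a lysine-rich (basic) and an aspartate-rich (acidic) one.
print(round(isoelectric_point("KKKKG"), 2))
print(round(isoelectric_point("DDDDG"), 2))
```

the basic toy peptide would focus toward the alkaline end of the ph gradient and the acidic one toward the acidic end; the second gel dimension (separation by molecular weight) is not modeled here.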
the proteins are converted into gas phase ions whose masses can be measured with an accuracy better than 50 ppm (40). two widely used techniques for ionization are matrix-assisted laser desorption ionization (maldi) (46) and electrospray ionization (47). maldi is usually coupled with a tof (time of flight) device for measuring the masses. the ionized peptides are accelerated by an applied electric field, and the time of flight they need to reach a detector is used to calculate their mass/charge ratio (40). in electrospray ionization, the peptides are sprayed into the spectrometer (47). ionization is achieved when the charged droplets evaporate. an alternative procedure for measuring masses is the ion trap (48), which selects ions with certain mass/charge ratios by keeping them in sinusoidal motion between two electrodes.

in 1995, the first microbe sequencing project, haemophilus influenzae (a bacterium causing upper respiratory infection), was completed with a speed that stunned scientists (http://www3.niaid.nih.gov/research/topics/pathogen/introduction.htm). encouraged by the success of that initial effort, researchers have continued to sequence an astonishing array of other medically important microorganisms. to this end, niaid has made significant investments in large-scale sequencing projects, including projects to sequence the complete genomes of many pathogens, such as the bacteria that cause tuberculosis, gonorrhea, chlamydia, and cholera, as well as organisms that are considered agents of bioterrorism. in addition, niaid is collaborating with other funding agencies to sequence larger genomes of protozoan pathogens such as the organism causing malaria. the availability of microbial and human dna sequences opens up new opportunities and allows scientists to perform functional analyses of genes and proteins in whole genomes and cells, as well as to study the host's immune response and an individual's genetic susceptibility to pathogens.
when scientists identify microbial genes that play a role in disease, drugs can be designed to block the activities controlled by those genes. because most genes contain the instructions for making proteins, drugs can be designed to inhibit specific proteins or to use those proteins as candidates for vaccine testing. genetic variations can also be used to study the spread of a virulent or drug-resistant form of a pathogen. niaid has launched initiatives to provide comprehensive genomic, proteomic, and bioinformatic resources. these resources, listed below, are available to scientists conducting basic and applied research on a broad array of pathogenic microorganisms (http://www3.niaid.nih.gov/research/topics/pathogen/initiatives.htm):

- niaid's microbial sequencing centers (nscs). the niaid's microbial sequencing centers are state-of-the-art high-throughput dna sequencing centers that can sequence genomes of microbes and invertebrate vectors of infectious diseases. genomes that can be sequenced include microorganisms considered agents of bioterrorism and those responsible for emerging and re-emerging infectious diseases.

- niaid's pathogen functional genomics resource center (pfgrc). the pathogen functional genomics resource center is a centralized facility that provides scientists, at no cost, with the resources and reagents necessary to conduct functional genomics research on human pathogens and invertebrate vectors. the pfgrc provides scientists with genomic resources and reagents such as microarrays, protein expression clones, genotyping, and bioinformatics services. the pfgrc supports the training of scientists in the latest techniques in functional genomics and emerging genomic technologies.

- niaid's proteomics centers. the primary goal of these centers is to characterize the pathogen and/or host cell proteome by identifying proteins associated with the biology of the microorganisms, mechanisms of microbial pathogenesis, innate and adaptive immune responses to infectious agents, and/or non-immune-mediated host responses that contribute to microbial pathogenesis. it is anticipated that the research programs will discover targets for potential candidates for the next generation of vaccines, therapeutics, and diagnostics. this will be accomplished by using existing proteomics technologies, augmenting existing technologies, and creating novel proteomics approaches as well as performing early-stage validation of these targets.

- administrative resource for biodefense proteomic centers (arbpcs). the arbpcs consolidate data generated by each proteomics research center and make it available to the scientific community through a publicly accessible web site. this database (www.proteomicsresource.org) serves as a central information source for reagents and validated protein targets and has recently been populated with the first data released.

- niaid's bioinformatics resource centers. the niaid's bioinformatics resource centers will design, develop, maintain, and continuously update multiorganism databases, especially those related to biodefense. organisms of particular interest are the niaid category a to c priority pathogens and those causing emerging and re-emerging diseases. the ultimate goal is to establish databases that will allow scientists to access a large amount of genomic and related data. this will facilitate the identification of potential targets for the development of vaccines, therapeutics, and diagnostics. each contract will include establishing and maintaining an analysis resource that will serve as a companion to the databases to provide, develop, and enhance standard and advanced analytical tools to help researchers access and analyze data.

- tb structural genomics consortium. a collaboration of scientists in six countries formed to determine and analyze the structures of about 400 proteins from mycobacterium tuberculosis. the group seeks to optimize the technical and management aspects of high-throughput structure determination and will develop a database of structures and functions. niaid, which is co-funding this project with nigms, anticipates that this information will also lead to the design of new and improved drugs and vaccines for tuberculosis.

- structural genomics of pathogenic protozoa consortium. this consortium is aiming to develop new ways to solve protein structures from organisms known as protozoans, many species of which cause deadly diseases such as sleeping sickness, malaria, and chagas' disease.

the national institute of allergy and infectious diseases is providing support to the microbial genome sequencing centers (mscs) at the j. craig venter institute [formerly, the institute for genomic research (tigr)], the broad institute at the massachusetts institute of technology (mit), and harvard university for a rapid and cost-efficient production of high-quality microbial genome sequences and primary annotations. niaid's mscs (http://www.niaid.nih.gov/dmid/genomes/mscs/) are responding to the scientific community and national and federal agencies' priorities for genome sequencing, filling in sequence gaps, and therefore providing genome sequencing data for multiple uses including understanding the biology of microorganisms, forensic strain identification, and identifying targets for drugs, vaccines, and diagnostics. in addition, the niaid's mscs have developed web sites that provide descriptive information about the sequencing projects and their progress (http://www.broad.mit.edu/seq/msc/ and http://msc.tigr.org/status.shtml).
genomes to be sequenced include microorganisms considered to be potential agents of bioterrorism (niaid category a, b, and c), related organisms, clinical isolates, closely related species, and invertebrate vectors of infectious diseases and microorganisms responsible for emerging and re-emerging infectious diseases. in addition, in response to a recommendation from a 2002 niaid-sponsored blue ribbon panel on bioterrorism and its implication for biomedical research to support genomic sequencing of microorganisms considered agents of bioterrorism and related organisms, the mscs will address the institute's need for additional sequencing of such microorganisms and invertebrate vectors of disease and/or those that are responsible for emerging and re-emerging diseases (http://www.niaid.nih.gov/dmid/ genomes/mscs/overview.htm). the panel's recommendation included careful selection of species, strains, and clinical isolates to generate genomic data for different uses such as identification of strains and targets for diagnostics, vaccines, antimicrobials, and other drug developments. the mscs have the capacity to rapidly and costeffectively sequence genomic dna and provide preliminary identification of open reading frames and annotation of gene function for a wide variety of microorganisms, including viruses, bacteria, protozoa, parasites, and fungi. sequencing projects will be considered for both complete, finished genome sequencing and other levels of sequence coverage. the choice and justification of complete versus draft sequence is likely to depend on the nature and scope of the proposed project. large-scale prepublication information on genome sequences is a unique research resource for the scientific community, and rapid and unrestricted sharing of microbial genome sequence data is essential for advancing research on infectious agents responsible for human disease. 
therefore, it is anticipated that prepublication data on genome sequences produced at the niaid microbial sequencing centers will be made freely and publicly available via an appropriate publicly searchable database as rapidly as possible. niaid-supported investigators have completed 131 genome sequencing projects for 105 bacteria, 8 fungi, 15 parasitic protozoa, 2 invertebrate vectors of infectious diseases, and one plant (http://www.niaid.nih.gov/dmid/genomes/mscs/req process.htm). in addition, niaid completed the sequence for 1,467 influenza genomes. in 2006, genome sequencing projects were completed for 22 pathogens as described in section 23.16.2. genome sequencing data is publicly available through web sites such as genbank, and data for the influenza genome sequences were published in 2006. furthermore, through the niaid's microbial sequencing centers, the niaid has funded the sequencing, assembly, and annotation of three invertebrate vectors of infectious diseases. in 2006, the final sequence, assembly, and annotation of aedes aegypti were released, as well as the preliminary sequence and assembly of the genomes for ixodes scapularis and culex pipiens; the final results for i. scapularis and c. pipiens will be released in 2007. in 2006, niaid supported nearly 40 large-scale genome sequencing projects for additional strains of viruses, bacteria, fungi, parasites, and invertebrate vectors. new projects included additional strains of borrelia, clostridium, escherichia coli, salmonella, streptococcus pneumoniae, ureaplasma, coccidioides, penicillium marneffei, talaromyces stipitatus, lacazia loboi, histoplasma capsulatum, blastomyces dermatitidis, cryptosporidium muris, and dengue viruses, as well as additional sequencing and annotation of aedes aegypti.
in 2004, niaid launched the influenza genome sequencing project (igsp) (http://www.niaid.nih.gov/dmid/genomes/ mscs/influenza.htm), which has provided the scientific community with complete genome sequence data for thousands of human and animal influenza viruses. the influenza sequence data has been rapidly placed in the public domain, through genbank, an international searchable database, and the niaid-funded bioinformatics resource center with accompanying data analysis tools. all of the information will enable scientists to further study how influenza viruses evolve, spread, and cause disease and may ultimately lead to improved methods of treatment and prevention. this sequence information is now providing a larger and more representative sample of influenza than was previously publicly available. the influenza genome sequencing project has the capacity to sequence more than 200 genomes per month and is a collaborative effort among niaid (including the niaid's division of intramural research), the national center for biotechnology niaid is continuing its support for the pathogen functional genomics resource center (pfgrc) (http://www. niaid.nih.gov/dmid/genomes/pfgrc/default.htm) at the institute for genomic research (tigr) (currently part of the j. craig venter institute). the pfgrc was established in 2001 to provide and distribute to the broader research community a wide range of genomic resources, reagents, data, and technologies for the functional analysis of microbial pathogens and invertebrate vectors of infectious diseases. in addition, the pfgrc was expanded to provide the research community with the resources and reagents needed to conduct both basic and applied research on microorganisms responsible for emerging and re-emerging infectious diseases and those considered agents of bioterrorism. 
one of the priorities for the pfgrc has been to provide the scientific community with access to the reagents and genomic and proteomic data that the pfgrc generated. a new software tool, called snp filtering tool, was developed for affymetrix resequencing arrays to analyze the single nucleotide polymorphism (snp) data. enhancements have been made to other tools for microarray data analysis, including a tool for analyzing slide images. a new layout for the tigr-pfgrc web site (http://pfgrc.tigr.org/) has been developed and launched to make it easier for the scientific community to access the pfgrc research and development projects, poster presentations, publications, reagents, and their descriptions and data. the number of organism-specific microarrays produced and distributed to the scientific community increased to 28. pfgrc has continued to collaborate with the national institute of dental and craniofacial research (nidcr/nih) in producing and distributing five organism-specific microarrays, including arrays for actinobacillus actinomycetemcomitans, fusobacterium nucleatum, porphyromonas gingivalis, streptococcus mutans, and treponema denticola. pfgrc has also developed the methods and pipeline for generating organism-specific clones for protein expression. seven complete clone sets are now available for human severe acute respiratory syndrome coronavirus (sars-cov), bacillus anthracis, yersinia pestis, francisella tularensis, streptococcus pneumoniae, staphylococcus aureus, and mycobacterium tuberculosis. in addition, individual custom clone sets are available for more than 20 organisms upon request. comparative genomics analysis using the available bacillus anthracis sequence data and the discovery of the snps were used to develop a new bacterial typing system for screening anthrax strains.
this system allowed niaid-funded scientists to define detailed phylogenetic lineages of bacillus anthracis and to identify three major lineages (a, b, c) with the ancestral root located between the a+b and c branches. in addition, a genotyping genechip, which has been developed and validated for bacillus anthracis, will be used to genotype about 300 different strains of bacillus anthracis. pfgrc has developed additional comparative genomic platforms, both for resequencing a bacterial genome on a chip to identify sequence variation among strains and for discovering novel genes. a pilot project has been completed with streptococcus pneumoniae for sequencing different strains using resequencing chip technology. in collaboration with the department of homeland security (dhs), a resequencing chip has been developed and is now being used to screen a number of francisella tularensis strains to identify snps and genetic polymorphisms. sixteen francisella tularensis strains are being genotyped by using the newly developed resequencing chip. additional collaboration with dhs led to the development of a gene discovery platform aimed at discovering novel genes among different strains of yersinia pestis. to this end, nine strains are being analyzed using this platform to discover novel gene sets. pfgrc is developing proteomics technologies for protein arrays and comparative profiling of microbial proteins. a protein expression platform is under development, and a pilot comparative protein profiling project using staphylococcus aureus has already been completed and published. a protein profiling project using yersinia pestis to compare proteomes in different strains is now under way, complementing ongoing proteomics projects supported by niaid; numerous proteins are currently being identified that are differentially abundant under different growth conditions. a new project was added in 2006 for comparative profiling of proteins on the proteomes of e.
coli and shigella dysenteriae to provide the scientific community with reference data on differential protein expression in animal models versus cultured systems infected with the pathogen. in 2006, niaid continued to support the population genetics analysis program: immunity to vaccines/infections. a joint project between niaid's division of allergy, immunity, and transplantation (dait) and the division of microbiology and infectious diseases (dmid), this program is aimed to identify associations between specific genetic variations or polymorphisms in immune response genes and the susceptibility to infection or response to vaccination, with a focus on one or more niaid category a to c pathogens and influenza. niaid awarded six centers to study the genetic basis for the variable human response to immunization (smallpox, typhoid fever, cholera, and anthrax) and susceptibility to disease (tuberculosis, influenza, encapsulated bacterial diseases, and west nile virus infection). the centers are comparing genetic variance in specific immune response genes as well as more generally associated genetic variance across the whole genome in affected and nonaffected individuals. the physiologic differences associated with these genome variations will also be studied. in 2006, these centers focused on recruiting the samples needed for genotyping. for example, more than 1,100 smallpox-vaccinated individuals and controls were recruited and blood and peripheral blood mononuclear cell (pbmc) samples were obtained for whole genome association studies, which were conducted in 2007. in another example, one of the centers used genome-wide linkage approaches to map, isolate, and validate human host genes that confer susceptibility to influenza infection. nearly 1,000 individuals with susceptibility to influenza and 2,000 control individuals were recruited using an iceland genealogy database. 
by late 2006, the center had recruited more than 600 individuals and had genotyped more than 500 in this subproject of the study. during 2006, niaid continued its support of the eight bioinformatics resource centers (brcs) (http://www.niaid.nih.gov/dmid/genomes/brc/default.htm) with the goal of providing the scientific community with a publicly accessible resource that allows easy access to genomic and related data for the niaid category a to c priority pathogens, invertebrate vectors of infectious diseases, and pathogens causing emerging and re-emerging infectious diseases. the brcs are supported by multidisciplinary teams of scientists to develop new and improved computational tools and interfaces that can facilitate the analysis and interpretation of the genomic-related data by the scientific community. in 2006, each publicly accessible brc web site continued to be developed, the user interfaces were improved, and a variety of genomics data types were integrated, including gene expression and proteomics information, host/pathogen interactions, and signaling/metabolic pathways data. a public portal of information, data, and open-source software tools generated by all the brcs is available at http://www.brccentral.org/. in 2006, many genomes of microbial species were sequenced by the niaid's microbial sequencing centers as well as by other national and international sequencing efforts, and the brcs provided either long-term maintenance of the genome sequence data and annotation or the initial annotation for a number of particular microbial genomes. for example, niaid's brc vectorbase collaborated with niaid's mscs to annotate the genome of aedes aegypti with the scientific community and will continue the curation of this genome. 
in 2006, niaid continued to support contracts for seven biodefense proteomics research centers (bprcs) to characterize the proteome of niaid category a to c bioweapon agents and to develop and enhance innovative proteomic technologies and apply them to the understanding of the pathogen and/or host cell proteome (http://www.niaid.nih.gov/dmid/genomes/prc/default.htm). these centers conducted a range of proteomics studies, including six category a pathogens, six category b pathogens, and one category c emerging disease organism. data, reagents, and protocols developed in the research centers are released to the niaid-funded administrative resource for biodefense proteomics research centers (www.proteomicsresource.org) web site within 2 months of validation. the administrative resource web site was created to integrate the diverse data generated by the bprcs. in 2005, more than 700 potential targets for vaccines, therapeutics, and diagnostics were generated. examples of progress include: in 2006, more than 2,400 potential new pathogen targets for vaccines, therapeutics, and diagnostics were identified, and more than 5,700 new corresponding host targets were generated. in addition: (i) two more sars-cov structures were solved. (ii) ninety-six percent of the orfs for b. anthracis were cloned with 47% sequence validated. (iii) a custom b. anthracis affymetrix genechip was developed. (iv) fifty-three polyclonal sera generated against novel toxoplasma gondii and cryptosporidium parvum proteins were characterized, and accurate time and mass tag databases were populated for salmonella typhi, monkeypox, and vaccinia virus. 
niaid staff are participating in two related nih-wide genomic initiatives that focus on examining and identifying genetic variations across the human genome (genes) that may be linked to or influence susceptibility or risk to a common human disease, such as asthma, autoimmunity, cancer, eye diseases, mental illness, and infectious diseases, or response to a treatment such as a vaccine. the approach is to conduct genome-wide association studies in which a dense set of snps across the human genome is genotyped in a large defined group of control and disease samples to identify genetic variations that may contribute to or have a role in the disease, with the hope of identifying an association between a genetic variant in a gene or group of genes and the disease. niaid has continued to participate in a coordinated federal effort in biodefense genomics and is a major participant in the national inter-agency genomics sciences coordinating committee (nigscc), which includes many federal agencies. this committee was formed in 2002 to address the most serious gaps in the comprehensive genomic analysis of microorganisms considered agents of bioterrorism. a comprehensive list of microorganisms considered agents of bioterrorism was developed that identifies species, strains, and clinical and environmental isolates that have been sequenced, that are currently being sequenced, and that should be sequenced. in 2003, the committee focused on category a agents and provided the cdc with new technological approaches for sequencing additional smallpox viral strains. affymetrix-based microarray technology for genome sequencing was established, as well as additional bioinformatics expertise for analyzing the genomic sequencing data. 
in 2004, as a result of this continuing coordination of federal agencies in genome sequencing efforts for biodefense, niaid developed a formal interagency agreement with the department of homeland security (dhs) to perform comparative genomics analysis to characterize biothreat agents at the genetic level and to examine polymorphisms for identifying genetic variations and relatedness within and between species. niaid continues to participate in the microbe project interagency working group (iwg), which has developed a coordinated, interagency, 5-year action plan on microbial genomics, including functional genomics and bioinformatics in 2001 (http://www.ostp.gov/html/microbial/start.htm). in 2003, the microbe project interagency working group developed guidelines for sharing prepublication genomic sequencing data that serve as guiding principles, so that federal agencies have consistent policies for sharing sequencing data with the scientific community and can then implement their own detailed version of the data release plan. in 2004, the microbe project iwg supported a workshop on "an experimental approach to genome annotation," which was coordinated by the american society for microbiology, and discussed issues faced in annotating microbial genome sequences that have been completed or will be completed in the next few years. in 2005, the microbe project iwg developed a strategic plan and implementation steps as an updated action plan for coordinating microbial genomics among federal agencies, and the plan was finalized in 2006. niaid continues to participate with other federal agencies in coordinating medical diagnostics for biodefense and influenza across the federal government and in facilitating the development of a set of contracts to support advanced development toward the approval of new or improved point-of-care diagnostic tests for the influenza virus and early manufacturing and commercialization. 
niaid continues to participate in the nih roadmap initiatives, including lead science officers for one of the national centers for biomedical computation and one of the national technology centers for networks and pathways. seven biomedical computing centers are developing a universal computing infrastructure and creating innovative software programs and other tools that would enable the biomedical community to integrate, analyze, model, simulate, and share data on human health and disease. five technology centers were created in 2004 and 2005 to cooperate in a u.s. national effort to develop new technologies for proteomics and the study of dynamic biological systems. supramolecular architecture of severe acute respiratory syndrome coronavirus (sars-cov). coronaviruses derive their name from their protruding oligomers of the spike glycoprotein (s), which forms a coronal ridge around the virion. the understanding of the virion and its organization has previously been limited to x-ray crystallography of homogenous symmetric virions, whereas coronaviruses are neither homogenous nor symmetric. in this study, a novel methodology of single-particle image analysis was applied to selected coronavirus features to obtain a detailed model of the oligomeric state and spatial relationships among viral structural proteins. the two-dimensional structures of s, m, and n structural proteins of sars-cov and two other coronaviruses were determined and refined to a resolution of approximately 4 nm. these results demonstrated a higher level of supramolecular organization than was previously known for coronaviruses and provided the first detailed view of the coronavirus ultrastructure. understanding the architecture of the virion is a necessary first step to defining the assembly pathway of sars-cov and may aid in developing new or improved therapeutics (49). large-scale sequence analysis of avian influenza isolates. 
avian influenza is a significant global human health threat because of its potential to infect humans and result in a global influenza pandemic. however, very little sequence information for avian influenza virus (aiv) has been in the public domain. a more comprehensive collection of publicly available sequence data for aiv is necessary for research on influenza to understand how flu evolves, spreads, and causes disease, to shed light on the emergence of influenza epidemics and pandemics, and to uncover new targets for drugs, vaccines, and diagnostics. in this study, the investigators released genomic data from the first large-scale sequencing of aiv isolates, doubling the amount of aiv sequence data in the public domain. these sequence data include 2,196 aiv genes and 169 complete genomes from a diverse sample of birds. the preliminary analysis of these sequences, along with other aiv data from the public domain, revealed new information about aiv, including the identification of a genome sequence that may be a determinant of virulence. this study provides valuable sequencing data to the scientific community and demonstrates how informative large-scale sequence analysis can be in identifying potential markers of disease (50) . genome sequencing project. the analysis of the first 209 full genome sequences from human influenza strains, deposited in genbank through the niaid influenza genome sequencing project, was published in 2006 (51) . influenza isolates were chosen in a relatively unbiased manner, allowing a comprehensive look at the influenza virus population circulating within the same geographic region over several seasons, which provided a real picture of the dynamics of influenza virus mutation and evolution. analysis demonstrated that the circulating strains of influenza included alternative minor lineages that could provide genetic variation for the dominant strain. 
this may allow a novel strain to emerge within a human host and would explain the unexpected emergence of the fujian influenza strain in 2003-2004 that resulted in a vaccine mismatch. these findings demonstrate the usefulness of full genomic sequences for providing new information on influenza viruses and lend further support for the need for large-scale influenza sequencing and the availability of sequence data in the public domain. within the influenza community, public availability of influenza sequence data and sharing of strains has been an important issue. the niaid has been instrumental in promoting the sharing of influenza sequence information, notably by sequencing more than 1,400 complete influenza genome sequences and depositing the sequences in the public domain through gen-bank as soon as sequencing has been completed. history of microbial genomics tools for gene finding and whole genome comparison interpolated markov models for eukaryotic gene finding computational gene finding in plants the genomes of pathogenic enterobacteria the complete genome sequence of escherichia coli k-12 genome sequence of enterohemorrhagic escherichia coli o157:h7 complete genome sequence of enterohemorrhagic escherichia coli o157:h7 and genomic comparison with a laboratory strain k-12 extensive mosaic structure revealed by the complete genome sequence of uropathogenic escherichia coli genome sequence of shigella flexneri 2a: insights into pathogenicity through comparison with genomes of escherichia coli k12 and o157 large, unstable inserts in the chromosome affect virulence properties of uropathogenic escherichia coli o6 strain 536 escherichia coli that cause diarrhea: enterotoxigenic, enteropathogenic, enteroinvasive, enterohemorrhagic, and enteroadherent pathogenicity islands of virulent bacteria: structure, function and impact on microbial evolution excision of large dna regions termed pathogenicity islands from trna-specific loci in the chromosome of an escherichia 
coli wild-type pathogen complete genome sequence of multiple drug resistant salmonella enterica serovar typhi ct18 complete genome sequence of salmonella enterica serovar typhimurium lt2 cloning and nucleotide sequence of the salmonella typhimurium lt2 gnd gene and its homology with the corresponding sequence of escherichia coli k12 a 40 kb chromosomal fragment encoding salmonella typhimurium invasion genes is absent from the corresponding region of the escherichia coli k-12 chromosome molecular genetic bases of salmonella entry into host cells identification of a virulence locus encoding a second type iii secretion system in salmonella typhimurium identification of a pathogenicity island required for salmonella survival in host cells pathogenicity islands and host adaptation of salmonella serovars the salmonella selc locus contains a pathogenicity island mediating intramacrophage survival the 102-kb unstable region of yersinia pestis comprises a high-pathogenicity island linked to a pigmentation segment which undergoes internal rearrangement transfer rna genes frequently serve as integration sites for prokaryotic genetic elements complete nucleotide sequence of the prophage vt2-sakai carrying the verotoxin 2 genes of the enterohemorrhagic escherichia coli o157:h7 derived from the sakai outbreak a novel mechanism of virus-virus interactions: bacteriophage p2 tin protein inhibits phage t4 dna synthesis by poisoning the t4 single-stranded dna binding protein, go32 the old exonuclease of bacteriophage p2 filamentous phages linked to virulence of vibrio cholerae shiga toxin: purification, structure, and function genome sequence of yersinia pestis, the causative agent of plague salmonella pathogenicity islands encoding type iii secretion systems the salmonella pathogenicity island-1 type iii secretion system capsule switching of neisseria meningitides capsules and cassettes: genetic organization of the capsule locus of streptococcus pneumoniae genetic and molecular 
characterization of capsular polysaccharide biosynthesis in streptococcus pneumoniae type 3 massive gene decay in the leprosy bacillus yersinia pestis -etiologic agent of plague yersinia pestis, the cause of plague, is a recently emerged clone of yersinia pseudotuberculosis microbial proteomics from proteins to proteomes: large scale protein identification by twodimensional electrophoresis and amino acid analysis membrane proteins and proteomics: un amour impossible? two-dimensional electrophoresis of membrane proteins: a current challenge for immobilized ph gradients new developments in isoelectric focusing isoelectric focusing in immobilized ph gradients: principle, methodology and some applications laser desorption ionization of proteins with molecular masses exceeding 10,000 daltons electrospray ionization for mass spectrometry of large biomolecules ion trap mass spectrometry supramolecular architecture of severe acute respiratory syndrome coronavirus revealed by electron cryomicroscopy large-scale sequence analysis of avian influenza isolates large-scale sequencing of human influenza reveals the dynamic nature of viral genome evolution key: cord-213136-euv6pqh5 authors: singh, kulveer; rabin, yitzhak title: sequence effects on internal structure of droplets of associative polymers date: 2020-05-17 journal: nan doi: nan sha: doc_id: 213136 cord_uid: euv6pqh5 we used langevin dynamics simulations of short associative polymers with two stickers placed symmetrically along their contour to study the effect of the primary sequence of these polymers on their organization inside condensed droplets. we observed that the shape, size and number of sticker clusters inside the condensed droplet change from a single cylindrical fiber to many compact clusters, as one varies the location of stickers along the chain contour. 
aging due to conversion of intramolecular to intermolecular associations was observed in droplets of telechelic polymers, but not for other sequences of associating polymers. the relevance of our results to condensates of intrinsically disordered proteins is discussed. membrane-less subcellular compartments such as p granules, nucleoli, cajal bodies, etc., perform specialized biochemical roles inside cells [1, 2]. the formation of these biomolecular condensates is governed by liquid-liquid phase separation where biopolymers such as proteins and nucleic acids condense into liquid droplets [3] [4] [5] [6]. this phase separation depends on various factors such as polymer-polymer, polymer-solvent and solvent-solvent interactions, concentration of polymers in solvent and environmental conditions such as temperature, ph, etc. a significant fraction of all proteins in a cell are flexible proteins which do not adopt a well-defined three-dimensional structure and are known as intrinsically disordered proteins (idps) [7]. studies have revealed that idps are important ingredients of most biomolecular condensates in cells [8] [9] [10]. a characteristic feature of idps is that their backbone contains short sequences of hydrophobic amino acids that are strung together by flexible linkers that consist of hydrophilic amino acids [11]. these hydrophobic segments facilitate phase separation and gelation of idps in solution and give rise to a variety of self-assembled structures such as micelles [12]. because of the presence of strongly associating sequences, idps can be considered as biological equivalents of associative polymers, which contain segments or blocks of monomers known as stickers that promote aggregation of these polymers in selective solvents [13, 14]. 
associative polymers undergo gelation (formation of system-spanning polymer networks) by forming physical crosslinks between stickers at sufficiently high concentration [15] [16] [17] [18] [19], and form flower-like micelles at low concentration [20]. one example of such associative polymers is telechelic polymers, which contain stickers at the two ends of the polymer chain. in aqueous solution, telechelic polymers with hydrophobic stickers form flower-like micelles which are connected (bridged) by other telechelic polymers with two ends in two different micelles [21, 22]. upon their formation, gels made of associative polymers crosslinked by clusters of stickers show aging behavior as they relax towards equilibrium [23] [24] [25], due to slow structural reorganization produced by interconversion of intermolecular and intramolecular associations between stickers [22]. if the average inter-polymer attraction exceeds solvent-solvent and polymer-solvent interactions (poor solvent conditions), a solution of these associative polymers/idps undergoes phase separation into a polymer-rich phase that coexists with a dilute polymer solution [14, 26]. if the average polymer concentration is sufficiently small, the process will take place via formation of droplets of the polymer-rich minority phase, which will grow by polymer exchange and by coalescence of droplets [27, 28]. while this process has much in common with phase separation of homogeneous (i.e., made of identical monomers) polymers, the presence of strong associations between the stickers raises interesting questions about the internal morphology of these droplets. in particular, one would like to characterize the size and the shape of clusters of stickers inside the droplets and to establish the connection between the internal morphology of the droplets and the sequence (primary structure) of the associative polymers/idps. 
one would also like to explore the kinetics of droplet formation and the temporal evolution of its internal structure. finally, one would like to understand what happens on the molecular level, i.e., whether and how the balance between interchain and intrachain associations changes with time following the onset of phase separation. in order to address these questions, in the model and methods section we introduce a simple model of associating polymers having two stickers symmetrically positioned along their contour. in the results section we use langevin dynamics to simulate the relaxation of a dilute associating polymer solution to equilibrium, following a fast quench (e.g., by change of temperature [23] or ph [29] or by rapid mixing in a microfluidic device [30]) to poor solvent conditions. we study the evolution of the internal structure of large droplets (morphology of clusters of stickers) and the kinetics of interconversion between intramolecular and intermolecular associations, for different sequences of our model polymers. in the discussion section we summarize our results on the polymer sequence dependence of the internal morphology and of the observed aging phenomena and discuss possible ramifications of our results for experiments on liquid idp droplets. we have performed implicit-solvent simulations of a solution of $M$ polymers of $N = 10$ monomers (beads). beads interact with each other via the lennard-jones (lj) potential given by $U_{ij}(r) = 4\epsilon_{ij}\left[(\sigma/r)^{12} - (\sigma/r)^{6}\right]$, which is truncated and shifted to zero at the cutoff distance $r^{cut}_{ij}$ such that $U_{ij}(r) = 0$ for $r \geq r^{cut}_{ij}$. each polymer contains two types of beads, designated as stickers and non-stickers respectively, such that $\epsilon_{ij} = \epsilon_{s}$ if the $i$th and $j$th beads are stickers and $\epsilon_{ij} = \epsilon_{ns}$ if at least one of those beads is a non-sticker. neighboring beads along the backbone of the chain interact via the finitely extensible nonlinear elastic (fene) potential given by $U_{FENE}(r) = -\frac{1}{2} k r_0^2 \ln\left[1 - (r/r_0)^2\right]$, where we take $k = 30.0$ and $r_0 = 1.5$. 
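as a rough illustration of the two interactions named above, the following python sketch evaluates a truncated-and-shifted lennard-jones energy and an attractive fene bond energy. it assumes the standard textbook forms of both potentials and reduced units (σ = 1); the function names `lj_truncated_shifted` and `fene` are hypothetical and are not taken from the authors' lammps input.

```python
import math


def lj_truncated_shifted(r, eps, sigma=1.0, r_cut=2.5):
    """Lennard-Jones pair energy, shifted to vanish at r_cut and zero beyond it.

    Assumes the standard 12-6 form; eps is eps_s or eps_ns depending on the pair.
    """
    if r >= r_cut:
        return 0.0

    def lj(x):
        sr6 = (sigma / x) ** 6
        return 4.0 * eps * (sr6 * sr6 - sr6)

    # Subtract the value at the cutoff so the energy is continuous there.
    return lj(r) - lj(r_cut)


def fene(r, k=30.0, r0=1.5):
    """Attractive FENE bond energy (assumed standard form); diverges as r -> r0."""
    if r >= r0:
        return math.inf
    return -0.5 * k * r0 ** 2 * math.log(1.0 - (r / r0) ** 2)
```

with the cutoff set to 2^(1/6) σ the shifted pair energy is purely repulsive, whereas a cutoff of 2.5σ retains the attractive tail.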
we use lammps [31] (large-scale atomic/molecular massively parallel simulator) to carry out langevin dynamics simulations in the nvt ensemble. the simulation is performed in a box of size 61 × 61 × 61 in units of σ, using periodic boundary conditions. the motion of each bead is given by the langevin equation, neglecting hydrodynamic interactions, $m\, d^2\mathbf{r}_i/dt^2 = -\nabla_i U - \zeta\, d\mathbf{r}_i/dt + \boldsymbol{\eta}_i$, where $U$ (the sum over all $U_{ij}$), $\zeta$ and $\boldsymbol{\eta}_i$ are the total potential energy, bead friction coefficient and random thermal force due to the implicit solvent, respectively. the rms amplitude of the random noise is proportional to $(\zeta k_B T/\Delta t)^{1/2}$, where $k_B$, $T$ and $\Delta t$ are boltzmann's constant, temperature and integration time step, respectively. in the following, all the time scales are expressed in lj time units $\tau_{LJ} = (m\sigma^2/\epsilon)^{1/2} = 1$ (mass $m$, particle diameter $\sigma$, interaction parameter $\epsilon$, and temperature $k_B T$ are all set to 1). we set the integration time step to $\Delta t = 0.005$ and the friction coefficient to $\zeta = 0.02$. we took two stickers per polymer, which were symmetrically placed along its contour (see fig. 1). to obtain an initially uniform polymer solution, we placed all the polymers in an array inside the simulation box and equilibrated the system under good solvent conditions. we then changed the interaction parameters to poor solvent conditions and continued to monitor the system through the processes of drop formation and aging. we begin each simulation with a dilute polymer solution in good solvent. the polymer volume fraction, φ = 0.011, is chosen to be below the overlap volume fraction $\phi^* \approx 0.27$, defined as the volume fraction of a single polymer in its pervaded volume. in order to ensure good solvent conditions in the state of preparation we take $\epsilon_s = \epsilon_{ns} = 0.8$ with cutoff distance $r^{cut}_{ij} = 2^{1/6}\sigma$, corresponding to purely repulsive interactions between all beads. simulations are then performed starting with a random initial configuration obtained by equilibrating the system under good solvent conditions. 
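to make the langevin update concrete, here is a minimal python sketch of a single integration step in reduced units. it is not the lammps integrator: it assumes a simple semi-implicit euler scheme and gaussian noise with variance 2ζk_bT/Δt per component (the usual fluctuation-dissipation choice, consistent with the stated (ζk_bT/Δt)^{1/2} scaling). the function name `langevin_step` and the `force` callback are hypothetical.

```python
import numpy as np


def langevin_step(x, v, force, dt=0.005, zeta=0.02, kT=1.0, mass=1.0, rng=None):
    """Advance positions x and velocities v by one time step (illustrative scheme).

    force: callable returning -grad U evaluated at the positions x.
    """
    rng = np.random.default_rng() if rng is None else rng
    # Random thermal force; its rms amplitude scales as sqrt(zeta * kT / dt).
    eta = np.sqrt(2.0 * zeta * kT / dt) * rng.standard_normal(np.shape(x))
    # Newton's equation with conservative force, friction and noise.
    accel = (force(x) - zeta * v + eta) / mass
    v_new = v + accel * dt
    x_new = x + v_new * dt
    return x_new, v_new
```

setting `kT=0.0` switches the noise off, which leaves pure frictional decay and gives a quick sanity check of the deterministic part of the update.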
we first looked into the formation of a polymer droplet under poor solvent conditions assuming the same lennard-jones interaction between all beads, $\epsilon_s = \epsilon_{ns} = 0.8$ and $r^{cut}_{ij} = 2.5\sigma$ (note that this cutoff corresponds to both short-range repulsion and long-range attraction between the beads). we verified that with this choice of parameters, phase separation between polymers and solvent occurs and a spherical polymer droplet condenses out of the solution (not shown). since our aim is to study the effect of the primary sequence of associative polymers on their organization inside condensed droplets, we model each polymer as a chain of eight weakly attractive beads and two strongly attractive stickers and vary the location of the stickers along its contour. the five different symmetric sequences of such a polymer shown in fig. 1 range from the s8s sequence, which has two stickers at the ends and eight non-sticker beads in between (a telechelic polymer), to the 4ss4 sequence, in which the two stickers at the center of the chain are flanked by four-bead-long tails (here s denotes a sticker and the number specifies the length of a sequence of non-sticker beads). the two possible states of the polymers are depicted in fig. 1, where states 1 and 2 represent open and closed loop chain configurations, respectively. as shown in this figure, four of these sequences (s8s, 1s6s1, 2s4s2 and 3s2s3) can form loops due to intramolecular bonds between the stickers, while the fifth one (4ss4) cannot. we proceed to examine droplet formation in these associating polymer systems. after preparing a random initial state in good solvent, we increased the interaction parameter between the two stickers by a factor of five to $\epsilon_s = 4.0$. the interaction parameter between the non-sticker beads (and that between stickers and non-stickers) remained $\epsilon_{ns} = 0.8$ but all the cutoff distances were increased to $r^{cut}_{ij} = 2.5\sigma$. 
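the sequence labels used above (s for a sticker, a number for a run of non-sticker beads) map directly onto bead-type lists. the python sketch below expands such a label and reports the number of beads between the two stickers; the helper names `parse_sequence` and `inner_segment_length` and the bead codes "S"/"N" are illustrative choices, not part of the original model description.

```python
def parse_sequence(label):
    """Expand a label such as '2s4s2' into a list of bead types ('S' or 'N')."""
    beads = []
    digits = ""
    for ch in label:
        if ch.isdigit():
            digits += ch  # accumulate a (possibly multi-digit) run length
        elif ch == "s":
            if digits:
                beads.extend("N" * int(digits))
                digits = ""
            beads.append("S")
        else:
            raise ValueError(f"unexpected character {ch!r} in {label!r}")
    if digits:
        beads.extend("N" * int(digits))
    return beads


def inner_segment_length(beads):
    """Number of non-sticker beads between the first and last sticker."""
    first = beads.index("S")
    last = len(beads) - 1 - beads[::-1].index("S")
    return last - first - 1


SEQUENCES = ["s8s", "1s6s1", "2s4s2", "3s2s3", "4ss4"]
```

for the five labels listed in `SEQUENCES`, every chain has ten beads and two stickers, and the inner segment shrinks from eight beads (s8s) to zero (4ss4).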
as we have shown before, this choice of interaction parameters guarantees phase separation via formation of polymer droplets. we monitored the evolution of the five systems corresponding to the different sequences shown in fig. 1. snapshots of one such system (the s8s sequence), from the state of preparation at t = 0 till t = 25,000 (in units of lj time $\tau_{LJ}$), are shown in fig. 2. growth occurs by coalescence of small droplets which are formed by aggregation of neighboring polymers immediately upon quenching the system to poor solvent conditions. this process continues until a single large droplet remains. all other sequences undergo a similar evolution process of droplet formation and growth through coalescence (not shown). how long does it take for the final droplet to form? in order to answer this question we monitored the time evolution of the radius of gyration $R_g = \left[\sum_{(i,j)} (\mathbf{r}_i - \mathbf{r}_j)^2 / 2N^2\right]^{1/2}$ of all the monomers in the s8s system (see fig. 3). initially, all the polymers are uniformly distributed in the entire simulation box, which gives a large value of $R_g$, but as time progresses $R_g$ decreases and eventually saturates at a plateau value which corresponds to the formation of a large droplet that contains all the polymers in the system. the decrease in the $R_g$ value is non-monotonic with time as the system evolves. this happens because of the presence of many droplets during intermediate times (see snapshots at t = 500 and t = 5,000 in fig. 2). the continuous random motion of these droplets leads to fluctuations of inter-droplet distances and to non-monotonic dependence of $R_g$ on time before it saturates, as shown in fig. 3. similar time evolution is observed in all other systems with different polymer sequences and in all cases the time it takes a single droplet to form is below 20,000. 
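the radius of gyration used to track droplet formation can be computed either from the pairwise sum quoted above or, equivalently, from distances to the center of mass. the python sketch below shows both; it assumes the sum runs over all ordered pairs of monomers and ignores periodic wrapping of coordinates, and the function names are hypothetical.

```python
import numpy as np


def radius_of_gyration(positions):
    """R_g from distances to the center of mass (O(N) in the number of beads)."""
    r = np.asarray(positions, dtype=float)
    centered = r - r.mean(axis=0)
    return float(np.sqrt((centered ** 2).sum(axis=1).mean()))


def radius_of_gyration_pairwise(positions):
    """R_g from the pairwise form sum_{i,j} (r_i - r_j)^2 / (2 N^2), O(N^2)."""
    r = np.asarray(positions, dtype=float)
    diff = r[:, None, :] - r[None, :, :]
    n = len(r)
    return float(np.sqrt((diff ** 2).sum() / (2.0 * n * n)))
```

the two forms agree for any configuration; the center-of-mass version is the practical choice when the system contains thousands of beads.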
having explored the dynamics of droplet formation we proceed to study its local structure in order to characterize the clustering of stickers and the competition between intra- and intermolecular associations of the polymers inside the droplets. to this end, droplets formed at some time ≤ 20,000 are further evolved till t = 300,000 (till t = 500,000 for the s8s system) to ensure equilibration. clusters with very different structures were observed for different sequences (see fig. 4). clusters of stickers were defined operationally as follows: a sticker is assumed to belong to a cluster if it is found within a range of 1.5σ from any other sticker that belongs to this cluster. inspection of the s8s droplet shows that almost all stickers belong to a single cluster that has the shape of a long cylindrical fiber which forms a spiral inside the droplet. a more thorough examination revealed the presence of another (smaller) compact cluster at the center of the droplet. a similar spiral fiber formed by the stickers is observed in the 4ss4 droplet. in the other three cases, corresponding to the 1s6s1, 2s4s2, and 3s2s3 sequences, many small elongated clusters whose size and number depend on the sequence are present in the equilibrium droplet (see fig. 4). for example, the 2s4s2 sequence has smaller and more numerous clusters compared to sequences 1s6s1 and 3s2s3. fig. 5 shows a histogram of the number of clusters in a droplet for the five different sequences, at time t = 300,000. in order to test the dependence of our results on polymer concentration, we performed simulations for polymer volume fraction φ = 0.006, for the s8s and 2s4s2 sequences, and did not observe any qualitative changes of size and shape of clusters compared to the φ = 0.011 case shown in fig. 4. 
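the operational cluster definition above (a sticker joins a cluster if it lies within 1.5σ of any sticker already in it) amounts to finding connected components of a distance graph. a minimal python sketch using a union-find structure follows; the optional minimum-image treatment of a cubic periodic box and the function name `sticker_clusters` are assumptions made for illustration.

```python
import numpy as np


def sticker_clusters(positions, cutoff=1.5, box=None):
    """Group stickers into clusters: two stickers closer than `cutoff` are linked.

    positions: (n, 3) sticker coordinates; box: edge length of a cubic periodic
    box, or None for open boundaries. Returns index lists, largest cluster first.
    """
    r = np.asarray(positions, dtype=float)
    n = len(r)
    parent = list(range(n))

    def find(i):
        # Find the root of i with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        d = r[i + 1:] - r[i]
        if box is not None:
            d -= box * np.round(d / box)  # minimum-image convention
        close = np.nonzero((d ** 2).sum(axis=1) < cutoff ** 2)[0]
        for j in close:
            root_i, root_j = find(i), find(i + 1 + int(j))
            if root_i != root_j:
                parent[root_j] = root_i

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values(), key=len, reverse=True)
```

the number of clusters is then `len(sticker_clusters(...))`, and the length of each returned list gives the cluster size. the double loop is quadratic in the number of stickers, which is acceptable for a sketch but would be replaced by a cell or neighbor list for large systems.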
next, we studied the aging of a droplet starting from its formation and following its evolution on timescales that are several orders of magnitude larger than the droplet formation time, till it settles into equilibrium. two types of structural rearrangements were monitored: on a mesoscopic scale, we followed the association and dissociation of clusters of stickers until a steady state was reached in which the average number and size of these clusters do not change. for all polymer sequences we found that the system attains equilibrium with respect to cluster number and size shortly after the formation of the droplet (not shown). on a molecular level we followed the change of the distribution of polymer states (open and bound states in fig. 1) as time progresses. to this end we computed the average (over all polymers in the droplet) distance between two stickers of the same polymer at different times. the distributions of the ensemble-averaged rms distance $r_{ss}$ between two stickers on a polymer, measured at different aging times, are shown in fig. 6, where each of the five panels represents a different sequence. at all aging times the distributions are bimodal for polymers with sequences s8s, 1s6s1 and 2s4s2 and unimodal for sequence 4ss4. the $r_{ss}$ distribution for the 3s2s3 sequence has a single peak followed by a broad shoulder and is intermediate between the bimodal and the unimodal cases. in the bimodal case the two peaks can be associated with open (large $r_{ss}$) and with bound (small $r_{ss}$) states of the corresponding polymer, with the latter peak located at $r_{ss} \approx 1$, in agreement with expectations. the position of the second peak, which corresponds to open chain configurations (state 1 in fig. 1), is sequence-dependent and increases with the length $n$ of the sequence of beads between the two stickers. 
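the molecular-level analysis described above reduces to measuring, for every polymer, the distance between its two stickers and then examining the distribution of these distances. the python sketch below computes the per-polymer distances and a crude bound-state fraction; the threshold of 1.5σ separating bound from open states is an illustrative assumption (the text only states that the bound peak sits near $r_{ss} \approx 1$), and the function names are hypothetical.

```python
import numpy as np


def sticker_distances(sticker_a, sticker_b, box=None):
    """Distance between the two stickers of each polymer.

    sticker_a, sticker_b: (M, 3) coordinates of the first and second sticker
    of M polymers; box: edge of a cubic periodic box, or None.
    """
    d = np.asarray(sticker_b, dtype=float) - np.asarray(sticker_a, dtype=float)
    if box is not None:
        d -= box * np.round(d / box)  # minimum-image convention
    return np.sqrt((d ** 2).sum(axis=1))


def bound_fraction(r_ss, threshold=1.5):
    """Fraction of polymers whose stickers are closer than `threshold`."""
    return float(np.mean(np.asarray(r_ss) < threshold))
```

a histogram of the returned distances (for example via `np.histogram`) gives a distribution that can be compared across aging times, and tracking `bound_fraction` over time is one simple way to quantify a shift from loops to bridges.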
in order to understand the origin of this sequence dependence, note that the spatial separation between the stickers, averaged over open chain conformations, increases with the length n of the sequence of beads between the two stickers (for long sequences and no interaction between the beads one expects the gaussian chain result, r ss ∝ n^{1/2}). therefore, the value of r ss should monotonically decrease from the s8s (n = 8) to the 1s6s1 (n = 6) to the 2s4s2 (n = 4) sequence, as observed in fig. 6. as expected, only open configurations are observed for 4ss4 polymers, which cannot form intramolecular loops, with a peak at r ss ≈ 0.5 (the smaller value of r ss is due to the attractive fene bond between neighboring stickers). a single smeared peak is observed for the 3s2s3 sequence because of the very small difference between the sticker-sticker distance in open (state 1) and bound (state 2) states. note that practically all the polymers are confined within the droplets and all their stickers are arranged in clusters. therefore, the presence of loops and open chain configurations that correspond to the two peaks (for sequences s8s, 1s6s1 and 2s4s2) means that some of the polymers form intramolecular loops that bind together into a cluster, while other polymers form intermolecular bridges between points on the same or on different clusters (see fig. 7). the numbers of intramolecular loops and intermolecular bridges depend on the sequence (compare the areas under the two peaks in the first three panels in fig. 6). we now proceed to examine aging effects. inspection of fig. 6 shows that for four of the sequences, no significant molecular rearrangement is observed in the time interval 20,000 − 300,000. the only exception is the s8s sequence, for which the number of intermolecular bridges increases monotonically with time at the expense of intramolecular loops, and eventually saturates around t = 400,000.
even though polymers with the 4ss4 sequence form a single cylindrical fiber that is quite similar to that of the s8s sequence, no aging is observed for the 4ss4 sequence, presumably because it cannot form intermolecular bridges. clearly, competition between intramolecular loops and intermolecular bridges is a necessary but not a sufficient condition for aging, since aging is not observed in 1s6s1 and 2s4s2 droplets where both open and closed polymer conformations (and multiple small clusters of stickers) are present. in order to get some intuition about the effects of intermolecular interactions on the conformation of associating polymers in the droplet, we proceed to compare the r ss distribution of polymers inside the droplet to that of an isolated associative polymer in poor and in good solvent. we first simulated an isolated associative polymer in poor solvent, keeping all parameters the same as in the droplet case discussed above. for four of the five sequences we observed a strong single peak at r ss = 1 (at r ss = 0.5 for the 4ss4 sequence), with an extended shoulder for larger r ss values (see figure s1 in si). this concurs with the expectation that open chain configurations are strongly suppressed and collapsed ones are enhanced for isolated polymers in poor solvent. next, we simulated an isolated associative polymer in good solvent, with ε s = 4.0 interaction between stickers and only repulsive interactions between the other beads (ε ns = 0.8 with a cutoff at r^cut_ij = 2^{1/6}σ). the results are shown in figure 8, where the r ss distribution of a polymer in a droplet is compared with that in good solvent, for s8s, 1s6s1 and 2s4s2 sequences. in all these cases a bimodal distribution is observed and the probability of bound state formation is higher in a droplet than in good solvent.
for the 1s6s1 and 2s4s2 sequences, the open chain peaks are quite similar in amplitude and position in the droplet and in good solvent, but for the s8s sequence the droplet peak is lower and is shifted to higher inter-sticker distances compared to good solvent. a tentative explanation is that in the 1s6s1 and 2s4s2 cases there are multiple small clusters of stickers inside the droplet and intermolecular bridges can form between clusters whose separation happens to coincide with the average end-to-end distance of a free polymer chain. in the s8s case the stickers form a long helical fiber that is folded inside the droplet and chains have to stretch in order to form intermolecular bridges between neighboring turns of the fiber. since at t = 0 we begin with a dilute solution of associating polymers in poor solvent in which most of the chains contain intramolecular bonds between their stickers, the observation of a second peak that corresponds to intermolecular bridges means that major molecular rearrangement takes place inside droplets formed by polymers with s8s, 1s6s1 and 2s4s2 sequences. the fact that significant aging after the formation of a large droplet is observed only for the s8s sequence implies that for the 1s6s1 and 2s4s2 sequences, the distance between the stickers is optimized by partial conversion from intramolecular to intermolecular associations already during the growth of smaller droplets, and that this conversion is completed by the time the large droplet is formed by their coalescence. inspection of figs. 4 and 5 shows remarkable similarity between internal droplet morphologies (size, shape and number of clusters of stickers) of sequences s8s and 4ss4 and of sequences 1s6s1 and 3s2s3. in order to understand the origin of this sequence-morphology correlation we split each of the 10-long polymer sequences in the middle to yield two identical sequences of length five, with one sticker each (see figure 9).
clearly, both s8s and 4ss4 consist of two identical s4 repeat units and both 1s6s1 and 3s2s3 contain two 1s3 repeat units. in order to check whether a n = 5 polymer with a single sticker produces similar internal droplet morphology to the corresponding n = 10 polymer with two stickers, we performed simulations of droplet formation and aging in dilute solutions of n = 5 polymers, keeping the volume fraction of polymers and other simulation parameters the same as in the n = 10 case. in accord with expectations, we obtained droplets with similar internal morphology to those of the corresponding n = 10 sequence (see figure s2 in si). we therefore conclude that the number, size and shape of clusters is controlled mainly by the sequence of the repeat unit ("monomer") of the associating polymer (e.g., s4) and not by the way these repeat units are joined together to form a polymer (i.e., s8s or 4ss4). note, however, that even though internal structure of droplets is similar for corresponding n = 5 and n = 10 sequences, there are profound differences between the two cases at the molecular level. while the formation of intramolecular loops and intermolecular bridges is strictly forbidden in droplets formed by single-sticker n = 5 polymers (e.g., s4 sequence), it can be either forbidden or allowed in clusters formed by the corresponding n = 10 chains, depending on the way they are joined together (forbidden for 4ss4 and allowed for s8s). in this work, we modeled intrinsically disordered proteins (idps) as short associative polymers with two stickers and studied the effect of the primary sequence of idps on their organization inside condensed droplets. we considered all 5 possible sequences of n = 10 long polymer with two stickers symmetrically positioned along its contour and observed that growth of condensed droplets occurs via coalescence until a single large spherical droplet is formed, in all the systems. 
we found a striking dependence of the morphology of clusters of stickers on the sequence: while a long helical fiber was observed for both s8s and 4ss4 sequences, many small compact (and somewhat elongated) clusters were seen for the other three sequences. we also found that the morphology of the clusters is determined mostly by the repeat unit of the associating polymer: both s8s and 4ss4 sequences are formed from the s4 repeat unit and have a similar internal morphology (a similar conclusion applies to the 1s6s1 and 3s2s3 sequences, which share the 1s3 repeat unit). for three of the sequences (s8s, 1s6s1 and 2s4s2) we found that the average spatial distance r ss between the two stickers of a polymer inside the condensed droplet has a bimodal distribution, such that one of the peaks corresponds to intramolecular bonds and the other to intermolecular bridges between clusters (or between different parts of a long fiber of stickers). only a single peak that corresponds to intramolecular associations between stickers was observed for the other two sequences, 3s2s3 and 4ss4, in accord with the expectation that intermolecular bridges can form only for sequences that have sufficiently long spacers between stickers. telechelic polymers (but no other sequences in our study) exhibited slow evolution of the distribution of inter-sticker distances inside the droplet, as the peak corresponding to small (large) values of r ss decreased (increased) with time. we identified this "aging" phenomenon with molecular rearrangement in which loops formed by intramolecular association open up to form intermolecular bridges between different folds of the helical fiber of stickers inside the droplet. the evolution from intramolecular to intermolecular associations in droplets of telechelic polymers in poor solvent should be contrasted with the opposite phenomenon that was observed in gels produced by self-assembly of telechelic polymers in good solvent [22].
finally, we would like to comment on the relevance of our results to biomolecular condensates of intrinsically disordered proteins. clearly, our simple model misses many of the molecular details of real idps, such as hydrogen bonding and electrostatic interactions between amino acids and the formation of partial secondary structure (such as α-helices and β-strands). still, it does capture some of the generic features of flexible proteins, such as their ability to change their conformation in order to associate with other macromolecules (in our model this is illustrated by the replacement of intramolecular by intermolecular associations between stickers). it also captures, albeit qualitatively, the phenomenon of aging of liquid-like idp droplets and in particular their tendency to form strong gels and fibers following a long incubation time. such phenomena have been observed in liquid droplets of tau proteins [32] and of nucleoporin domains [30, 33]. another similarity between our associative polymer model and idps is illustrated by the fact that, just like in our simple model, fibril formation by amyloid peptides [34], formation of nanostructures via self-assembly of short peptides [35] and nanoscale organization of nucleoporins in the nuclear pore complex [36] strongly depend on their sequence.
key: cord-193910-7p3f3znj authors: zhang, xiangxie; beinke, ben; kindhi, berlian al; wiering, marco title: comparing machine learning algorithms with or without feature extraction for dna classification date: 2020-11-01 journal: nan doi: nan sha: doc_id: 193910 cord_uid: 7p3f3znj the classification of dna sequences is a key research area in bioinformatics as it enables researchers to conduct genomic analysis and detect possible diseases. in this paper, three state-of-the-art algorithms, namely convolutional neural networks, deep neural networks, and n-gram probabilistic models, are used for the task of dna classification. 
furthermore, we introduce a novel feature extraction method based on the levenshtein distance and randomly generated dna sub-sequences to compute information-rich features from the dna sequences. we also use an existing feature extraction method based on 3-grams to represent amino acids and combine both feature extraction methods with a multitude of machine learning algorithms. four different data sets, each concerning viral diseases such as covid-19, aids, influenza, and hepatitis c, are used for evaluating the different approaches. the results of the experiments show that all methods obtain high accuracies on the different dna datasets. furthermore, the domain-specific 3-gram feature extraction method leads in general to the best results in the experiments, while the newly proposed technique outperforms all other methods on the smallest covid-19 dataset.
the first successful isolation of dna by friedrich miescher in 1869 was a groundbreaking step in biology as it laid the groundwork for understanding the blueprints of all organic life. dna, which is short for deoxyribonucleic acid, is a hereditary material that can be found in the cells of all humans and other living organisms. it carries the necessary information which decides the biological traits of our bodies and works as a genetic blueprint for an evolving organism. an isolated dna sequence can be represented by a character string, which consists of only a, c, g, or t. this format is named fasta. the analysis of dna is crucial, as it allows doctors to diagnose diseases, helps in analyzing the spread of new infections, and can also be used to solve crimes or conduct paternity tests. therefore, dna analysis has become a vital interest in computational biology [1]. in traditional biology, primers are essential tools for dna analysis. primers are short single-stranded nucleotide sequences important for the initiation phase of dna synthesis of all living organisms. 
in molecular biology, synthetic primers are utilized for different purposes such as the detection of viruses [3], bacteria [4] or parasites [5]. primers, which are often present in human dna sequences that are infected by a specific type of virus, are utilized for these purposes. with the help of the polymerase chain reaction (pcr) method, the dna fragment of the existing virus is amplified significantly, and researchers are able to detect the virus. primers are also utilized in various dna classification problems [6, 7] and bioinformatics [8]. for this study, it is important to note that they can be considered as comparison patterns that can be searched for to diagnose diseases. by calculating the edit distances between an isolated dna sequence and the primers of a specific virus, the level of the virus being expressed in the human dna sequence can be obtained, which can then be used to build up the feature vectors. machine learning algorithms can be trained on the feature vectors of the dna sequences. the resulting model can be used to detect viruses and diagnose viral diseases. contributions. synthesizing primers for a particular virus is often difficult and expensive. this paper proposes an alternative method that uses randomly generated dna sequences to replace the primers. the advantage is that the analysis and processing necessary for finding the primer patterns can be skipped. in the experiments, the performances of feature extraction using primers and random dna sequences will be compared to several other machine learning approaches. another feature extraction method, which will be referred to as the 3-gram method throughout this paper, is also developed. additionally, other state-of-the-art algorithms, namely convolutional neural networks [23, 24] (cnns) and deep neural networks (dnns), which extract the features directly from sample dna sequences, are evaluated. 
they are compared with the two feature extraction methods combined with machine learning algorithms such as adaboost [21] , support vector machines (svms) [17, 2] , and others. an additional algorithm, the n-gram probabilistic model [30, 31] , which is often used in natural language processing, is also implemented and compared to the other machine learning approaches. to provide accurate and convincing final results, we conducted each of the experiments and tested all methods on four different data sets. one data set is concerned with the detection of the hepatitis c virus in human dna, one with the classification of influenza virus and coronavirus, another with the classification of hiv into hiv type 1 and hiv type 2, and the last with a classification problem based on human dna samples infected with sars-cov-2. a detailed description of the data sets can be found in the subsequent section. paper outline. this paper is organized as follows. in section 2, each of the four data sets is described, and the decisions that motivated the choices of data are explained. section 3 explains the used feature extraction methods and machine learning algorithms. this is followed in section 4 by the description of the experimental setup. the results for each experiment are presented in section 5, while section 6 discusses the results and concludes the paper. the dna sequence classification methods are tested on four different data sets of various sizes. they include different types of viruses, and different datasets were used for different aims. one data set that is commonly used and might be considered the standard for dna analysis by some, the molecular biology (splice-junction gene sequences) data set, was not used. this decision was made because of the length of the samples this data set contains. while the samples of the tested data sets contained up to almost 30,000 characters, the samples of the splice data set are only sequences of 61 characters. 
this reason, in addition to the age of the data, the data set was created in 1992, led to the decision of using newer data sets with samples more resembling data encountered in actual applications. the hepatitis c virus (hcv) is a single-stranded rna virus that can infect rna sequences in the human body. rna is the messenger that contributes to the formation of dna. therefore, if the rna is infected, the dna is also modified. unlike the hepatitis b virus (hbv), an effective vaccine against hcv has not yet been developed [9] . hcv can cause severe diseases like hepatitis c and liver cancer. thus it is vital to detect potential infections with hcv as early as possible. this hcv dataset was obtained from the world gene bank and consists of 500 hcv positive dna sequences and 500 hcv negative dna sequences. the length of the dna sequences in this data set varies widely, with the longest sequences being 12,722 characters long and the shortest only 73 characters long. most sequences, however, fall in the range from 9,000 to 12,000. the coronavirus is an rna virus that can infect humans' respiratory tract and cause many different diseases. potential diseases could be mild like the common cold, but it could also be lethal like sars, mers, or covid-19. on the other hand, the influenza virus is responsible for seasonal flu and has caused many epidemics in history; for example, the spanish influenza in 1918 and the outbreak of h1n1 in 2009. influenza viruses and coronaviruses may cause similar symptoms to patients. however, different measures might need to be taken in order to support patients in their recoveries, depending on the type of virus they are infected with. therefore it is crucial to know which kind of infection a patient has before a decision about the treatment is made. the dataset was obtained from the national center for biotechnology information (ncbi) and consists of 7500 influenza virus positive dna sequences and 7500 coronavirus positive dna sequences. 
the dna sequences that this data set contains are all between 95 and 2995 characters long, with most sequences falling in the range of 1350 to 2500. the human immunodeficiency virus (hiv) can attack the human immune system and cause acquired immunodeficiency syndrome (aids). the estimated incubation period is around 8 to 9 years, during which there could be no symptoms. however, after this long period, the risk of getting opportunistic infections increases significantly and can cause many diseases. in addition to the immunosuppression, hiv can also directly affect the central nervous system and cause severe mental disorders [10]. there are two subtypes of hiv, namely hiv-1 and hiv-2. hiv-1 has relatively higher virulence and infectivity than hiv-2. an hiv dataset containing 1600 hiv-1 positive dna sequences and 1600 hiv-2 positive dna sequences was acquired from ncbi and used to evaluate the algorithms. the sequences in this data set are between 774 and 2961 characters long. in this section, the feature extraction methods and machine learning algorithms that are used are described. two feature extraction methods were compared. the first method is based on the edit distance between two dna strings. the second method relies on the 3-gram method [12], which will be described later in detail. six machine learning algorithms are combined with the two feature extraction methods. finally, three state-of-the-art methods, namely a convolutional neural network (cnn), a deep neural network (dnn), and an n-gram probabilistic model, which were fed the unprocessed dna sequences without prior feature extraction, were tested. the levenshtein distance, also known as edit distance, is used to measure the difference between two strings. the smaller the distance, the more similar the two strings are. there are three edit operations: inserting a character into a string, deleting a character from a string, or substituting a character in a string. 
the levenshtein distance between string a and b denotes the minimum number of edit operations that need to be performed on string a in order to transform string a into string b. the levenshtein distance between the first i characters of string a and the first j characters of string b is denoted by d ab (i, j), which can be calculated using equation (1). in this equation, a i and b j represent the i-th and j-th character in string a and string b. if either string a or string b has no characters, then the levenshtein distance equals the length of the other string, i.e. the maximum of the two lengths. this is easy to see: if one string is empty, then simply inserting all characters from the other string into the empty string is enough. if both strings are not empty, then the last character in both strings, namely a i and b j , should be examined. if they have the same terminal character, then both of them can be ignored, and only looking at the first (i − 1) characters in string a and the first (j − 1) characters in string b is enough. in this scenario, d ab (i, j) equals d ab (i − 1, j − 1). if the terminal characters are different, then the costs of three possible options have to be compared. the first option is deleting the terminal character of string a, so that d ab (i − 1, j), the levenshtein distance between the first (i − 1) characters in string a and the first j characters in string b, is calculated and 1 is added for the deletion. the second option is inserting one character, which should be the same as the terminal character of string b, at the end of string a. after the insertion, the terminal character of string b can be ignored, and we only need to calculate d ab (i, j − 1) + 1, where the insertion causes the addition of 1. the last option is substituting the terminal character of string a by the terminal character of string b. in this scenario, the terminal character in both strings can be ignored, and the cost is d ab (i − 1, j − 1) + 1. the distance d ab (i, j) is the minimum of these three costs. 
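the case analysis above translates directly into a short program. the sketch below is an illustrative implementation of that recurrence, not the authors' code: the function name `levenshtein` is an assumption, and the recursion is evaluated bottom-up, row by row, which returns the same value as the recursive definition while avoiding repeated subproblems.

```python
def levenshtein(a: str, b: str) -> int:
    """Levenshtein distance between strings a and b.

    Bottom-up evaluation of the recurrence described in the text:
      d(i, 0) = i and d(0, j) = j           (one string is empty)
      d(i, j) = d(i-1, j-1)                 (terminal characters match)
      d(i, j) = 1 + min(d(i-1, j),          (delete from a)
                        d(i, j-1),          (insert into a)
                        d(i-1, j-1))        (substitute)
    Only the previous row is kept, so memory is O(len(b)).
    """
    prev = list(range(len(b) + 1))  # row i = 0: d(0, j) = j
    for i, char_a in enumerate(a, start=1):
        curr = [i]  # d(i, 0) = i
        for j, char_b in enumerate(b, start=1):
            if char_a == char_b:
                curr.append(prev[j - 1])
            else:
                curr.append(1 + min(prev[j], curr[j - 1], prev[j - 1]))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(levenshtein("kitten", "sitting"))
    print(levenshtein("tttgactc", "ttgactcg"))
```

the running time is proportional to len(a) × len(b), which is one reason the comparison in this setting is restricted to short strings.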
formula (1) suggests a recursive procedure to calculate the levenshtein distance between two strings. in each recursion, the last character of one or both strings can be ignored. the recursion halts when one string is empty. according to this formula, the levenshtein distance between two strings a and b is given by d ab (|a|, |b|), where |a| and |b| represent the lengths of strings a and b. as mentioned in the introduction section, primers are short dna sequences that are used in medical research to detect possible viruses by conducting pcr on the dna sequence. in bioinformatics, primers can be used to extract features. in order to do so, the levenshtein distance between the primer and the dna sequence is calculated. if the distance is small, then the similarity between the dna sequence and the primer is considerable. in other words, a person is likely to test positive for the virus corresponding to the primer. however, in an infected dna sequence, the target virogenes are small fragments hidden somewhere in the dna sequence. therefore, directly calculating the levenshtein distance between the dna sequence and the primer is not accurate. in this research, the matching process is done between the primer and a short sub-string of the dna sequence. then, the window on the dna sequence slides by one character, and the levenshtein distance between the primer and the next sub-string of the dna sequence is calculated. this process is repeated until the whole dna sequence is traversed. finally, the minimum of all these calculated distances is taken as the final distance. for example, suppose the given dna sequence is "tttgactcgt" and the window size is 8. the levenshtein distance between the primer and the first sub-string "tttgactc" is calculated, then the distances to "ttgactcg" and "tgactcgt". afterward, the minimum of the three distances is taken. in reality, many viruses have existed in the world for a long time, and they have mutated severely. 
therefore, only calculating the levenshtein distance between the dna sequence and one primer is not enough to make the result accurate. however, for many viruses, multiple different primers exist. thus, the minimum levenshtein distance between the dna sequence and various primers can be calculated and combined into a single feature vector. these feature vectors can then be used to train and test different machine learning methods. obtaining a virus's primers can be expensive and takes time, especially for a newfound virus like sars-cov-2 in early 2020 or viruses that have a high mutation rate. our novel method to solve this problem is using randomly generated short dna sequences to replace the primers. the feature extraction is then achieved by calculating the minimum levenshtein distance between the randomly generated dna strings and the dna sequences needing classification. since nothing except for the dna strings used to calculate the minimum levenshtein distance (primers vs. randomly generated dna sequences) changed, the resulting feature vectors are of the same format. different machine learning algorithms will be trained and tested using each set of feature vectors in the experiments. in this method, before the feature vectors are fed into the machine learning algorithms, the usage of a normalization process is crucial. it is helpful to consider that the difference between smaller distances is more significant than the difference between larger distances. for example, the difference between distance 3 and 4 should be given more weight than between distance 30 and 40. therefore, finding a suitable normalization function is necessary for this method. the elements in the feature vectors were processed with the function shown by formula (2) , where x is the computed distance. extracting the features of a dna sequence using what will be referred to as the 3-gram method is based on knowledge from biology and medicine. 
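the levenshtein-based feature extraction described above (slide a window over the dna sequence, keep the minimum distance to each primer or random string, and collect the minima into one vector) can be sketched as follows. this is an illustrative sketch under stated assumptions, not the authors' implementation: the names `min_window_distance`, `random_probes` and `feature_vector`, the fixed random seed, and the handling of sequences shorter than the window are all hypothetical choices, and the normalization of formula (2) is left out because that formula is not reproduced in the text.

```python
import random


def levenshtein(a, b):
    """Row-by-row Levenshtein distance (see the recurrence in the text)."""
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]
        for j, char_b in enumerate(b, start=1):
            if char_a == char_b:
                curr.append(prev[j - 1])
            else:
                curr.append(1 + min(prev[j], curr[j - 1], prev[j - 1]))
        prev = curr
    return prev[-1]


def min_window_distance(sequence, probe, window):
    """Minimum distance between `probe` and every length-`window` substring.

    The window slides by one character. Assumption: a sequence shorter than
    the window is compared to the probe as a whole.
    """
    if len(sequence) <= window:
        return levenshtein(sequence, probe)
    return min(
        levenshtein(sequence[start:start + window], probe)
        for start in range(len(sequence) - window + 1)
    )


def random_probes(count, length, seed=0):
    """Generate `count` random strings over a, c, g, t (stand-ins for primers)."""
    rng = random.Random(seed)
    return ["".join(rng.choice("acgt") for _ in range(length)) for _ in range(count)]


def feature_vector(sequence, probes, window):
    """One minimum distance per probe; primers and random probes share this format."""
    return [min_window_distance(sequence, probe, window) for probe in probes]


if __name__ == "__main__":
    probes = random_probes(count=4, length=8)
    print(feature_vector("tttgactcgt", probes, window=8))
```

with the example sequence "tttgactcgt" and a window of 8, three sub-strings are compared against each probe, matching the walk-through in the text. the cost grows with sequence length × number of probes × window size × probe length, so short probes and windows keep the extraction tractable for sequences of several thousand characters.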
before diving into the algorithm, it is helpful to know how the human body builds the proteins that it needs. dna sequences store the necessary information for building proteins. in biological cells, a dna sequence is first reformed into an mrna sequence. this process is called transcription, and mrna works as a messenger. after this, the mrna sequence is translated into a series of amino acids, which build up the proteins. researchers have found that one amino acid is coded by a group of three nucleobases [13] . the groups which contain three nucleobases are called codons. the corresponding translation between codons and amino acids is illustrated in fig. 1 . there are in total 64 different codons, and 61 of them can be translated into amino acids. the other three, taa, tag, and tga are stop codons. they mark the halt point of translation. there are 20 possible amino acids, meaning that several different codons can be translated into the same amino acid. the 3-gram method simulates the process from dna sequences to amino acids. a window of size three is used to traverse the whole dna sequence with a sliding unit of 1 at each time. at each time, the group of three nucleobases is acquired from the dna sequence, and the corresponding amino acid is recorded. stop codons are neglected. after the whole dna sequence is traversed, all different types of amino acids are counted. then the proportion of each amino acid is calculated and put in a histogram. take the same dna sequence as an example again. there are eight codons in the dna sequence "tttgactcgt". they are "ttt", "ttg", "tga", "gac", "act", "ctc", "tcg" and "cgt". they can be translated into one phenylalanine, two leucines, one aspartic acid, one threonine, one serine, and one arginine. in a nutshell, each dna sequence is represented by a 20-d feature vector after using the 3-gram method. these feature vectors are then used as the input vectors for machine learning algorithms [12] . 
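the 3-gram procedure above can be sketched in a few lines. this is a minimal illustration rather than the authors' code: the function name `three_gram_features`, the one-letter amino acid codes, and the alphabetical ordering of the 20 output dimensions are assumptions, and the codon table is the standard genetic code written in its usual compact form. 3-grams containing characters other than a, c, g, t are skipped, which is also an assumption.

```python
from collections import Counter
from itertools import product

# Standard genetic code in the conventional T, C, A, G ordering;
# "*" marks the three stop codons (TAA, TAG, TGA).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino
    for codon, amino in zip(product(BASES, repeat=3), AMINO)
}
AMINO_ACIDS = sorted(set(AMINO) - {"*"})  # the 20 amino acids, one-letter codes


def three_gram_features(sequence):
    """Return the 20-d vector of amino acid proportions for a DNA string.

    A window of size three slides over the sequence one character at a time;
    each 3-gram is translated, stop codons are neglected, and the counts are
    divided by the number of translated 3-grams.
    """
    seq = sequence.upper()
    counts = Counter()
    for start in range(len(seq) - 2):
        amino = CODON_TABLE.get(seq[start:start + 3])
        if amino is not None and amino != "*":
            counts[amino] += 1
    total = sum(counts.values())
    return [counts[amino] / total if total else 0.0 for amino in AMINO_ACIDS]


if __name__ == "__main__":
    features = three_gram_features("tttgactcgt")
    for amino, value in zip(AMINO_ACIDS, features):
        if value:
            print(amino, round(value, 3))
```

for the example sequence "tttgactcgt", the eight 3-grams yield seven amino acids (the stop codon tga is dropped), so leucine (L) receives a proportion of 2/7 and each of phenylalanine (F), aspartic acid (D), threonine (T), serine (S) and arginine (R) receives 1/7.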
unlike the previously introduced method, which extracts features based on the levenshtein distance, the feature vectors extracted by the 3-gram method do not need to be normalized before feeding them into the machine learning algorithms. this is simply because the calculations of the proportion of each amino acid themselves already normalize the data. the comparison between the two previously introduced feature extraction methods was done by training and testing six machine learning models on the feature vectors acquired by using the two feature extraction methods. since all the experiments consisted of binary classification tasks, each processed vector was labeled with either 1 or 0, depending on its class. for example, whether it was associated with an infected dna sequence or an uninfected one. the labeled data was then used to train each of the six machine learning methods described in the following subsections. k-fold cross-validation was used to ensure accurate results. the processed samples were separated into k folds randomly. the training and testing process was run k times. at each time, one fold was used as the testing set, and the remaining k-1 folds together was seen as the training set. by doing such, all folds were used as a part of the training set for k-1 times and as the testing set once. therefore, one model was trained and tested k times, and the average test accuracy was used to evaluate the model. the multi-layer perceptron (mlp) is a supervised learning model which is often used in classification problems. after trying out different architectures, we observed an mlp with three hidden layers to perform best with the number of neurons being 500, 250, and 125, respectively. the activation function of the neurons in the hidden layers was the rectified linear unit (relu). the sigmoid function was used as the only neuron's activation function in the output layer. 
in the training phase, the binary cross-entropy function was used as the loss function. this loss function calculates the error between the actual output and the target label, on which the training and update of the weights are based [14] . the adaptive moment estimation optimizer (adam) was used in this research. the learning rate of the optimization decides how fast the model learns. it is critical to the model and should be set carefully. a large learning rate might make the model never find the optimal solution, while a small learning rate causes inefficiency. mini-batch learning was used during training. this means that several examples were fed into the mlp together, and then the weights were updated. minibatch learning makes sure that the learning process goes on the right track. in each epoch, all data in the training set were used. when the model was trained on all examples, the next epoch started. in preliminary experiments, it was found that the model overfitted severely. therefore, the dropout technique was used to prevent overfitting. the key idea of dropout is that during the training phase, some units and their connections are randomly dropped, enabling the model to generalize well [15] . logistic regression is a linear model that is used to carry out binary classification. similar to the mlp model, the sigmoid function was used for the output unit. additionally, l2 regularization was used to prevent possible overfitting. this method adds a regularization term at the end of the loss function, which is illustrated by formula (3). the additional term is also known as l2 regularization. the hyperparameter c in this term controls to what degree the l2 regularization should be executed. smaller values specify stronger regularization. usually, it is assumed that the independent variables in the input vector x have a multivariate normal distribution. however, most of the time, this assumption is not satisfied. 
in such situations, logistic regression is a good alternative model [16]. the support vector machine (svm) is another linear model for the classification task [17, 2]. an svm is capable of handling a small amount of data and is less sensitive to noise in a dataset, and therefore it has an excellent generalization ability [18]. the svm aims to find the hyperplane which maximizes the margin separating the two classes. the solution to this can be found by using the lagrange multiplier method. powerful non-linear svm models can be trained if kernel functions are appropriately used [17]. kernel functions implicitly create new feature vectors that usually have more dimensions than the original input. the svm finds a new hyperplane, which is linear in the new feature space. however, in the original feature space, the separation will be non-linear if a non-linear kernel is used. in the training process of an svm, the inner product of two samples x_i and x_j needs to be calculated. the kernel method provides a solution that allows the model to get the inner product in the higher-dimensional feature space directly. this idea is illustrated by formula (4), where k is the kernel function and φ(x_i) and φ(x_j) are the new feature vectors in the higher-dimensional feature space. with this approach, explicitly transforming each individual sample from the lower-dimensional space to the higher-dimensional space is unnecessary, which saves a lot of memory and computational resources. in our implementation, the radial basis function kernel (rbf kernel) was used. the rbf kernel function is shown in formula (5). the hyperparameter γ decides the distribution of the feature vectors in the higher-dimensional feature space. the other hyperparameter in the svm model is c. similar to the usage of c in logistic regression, the hyperparameter c here decides the regularization, or in other words, the degree of the penalty.
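formulas (4) and (5) are not reproduced in this text, so the following sketch assumes the standard definitions: the kernel value equals the inner product of the mapped samples, k(x_i, x_j) = φ(x_i)·φ(x_j), and the rbf kernel is k(x_i, x_j) = exp(-γ‖x_i - x_j‖²). the function names are hypothetical and the code is not taken from the authors' implementation.

```python
import math


def rbf_kernel(x_i, x_j, gamma):
    """Standard RBF kernel: exp(-gamma * ||x_i - x_j||^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x_i, x_j))
    return math.exp(-gamma * squared_distance)


def gram_matrix(samples, gamma):
    """Kernel value for every pair of samples; no explicit mapping phi is computed."""
    return [[rbf_kernel(a, b, gamma) for b in samples] for a in samples]
```

with this definition, a larger γ makes the kernel value fall off faster as two samples move apart, so each training sample influences a smaller neighbourhood.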
before talking about the random forest algorithm, it is essential to know how a decision tree classifier [19] works. the decision tree is a tree-like model that simulates how humans make decisions. there is a judgment at each node, and the data are classified into different child nodes. the leaf nodes show the final results. the impurity drop is used to evaluate the decision, and a good query should maximize the impurity drop. the decision tree is also a supervised learning model in which the training set is used by the model to learn how to make queries and split the data until a specific criterion or threshold is reached. the decision tree is the fundamental model of the random forest algorithm [20] . as its name implies, a forest is built up from many trees. the random forest model trains multiple different decision trees on different data, and the average output of these trees is taken as the final output. the bootstrap aggregating method is used to create different training datasets. this method samples some data with replacement from the original training data set, and they are used to train one individual tree. the random forest is a simple, easy-to-understand algorithm which is capable of handling complex non-linear classification task. therefore, it is often used in the machine learning field. two hyperparameters needed to be tuned in our implementation. one of them is the number of estimators. it controls how many trees should be created during the experiment. the other one is the maximum depth of each tree. this value should neither be too large nor too small. larger numbers may cause overfitting, while lower depths could lead to underfitting. adaboost [21] , which is short for adaptive boosting, is another decision-tree-based algorithm similar to the random forest. the basic idea of adaboost is training multiple weak classifiers, and their combination is a more robust classifier. 
the similarity between random forest and adaboost is that they both train multiple classifiers, while the difference lies in the data used to train them. unlike random forest, which uses part of the data each time, adaboost uses all data in the original training set to train a single classifier. those samples which were classified incorrectly are given more weight, and the updated data set is used to train the next classifier. similar to the random forest algorithm, two hyperparameters needed to be tuned in our implementation. they are the number of estimators and the maximum depth of the trees. xgboost [22] is another ensemble learning algorithm like random forest and adaboost. the difference between these three algorithms lies in how they train one individual decision tree classifier. random forest uses different sampling data, while adaboost manipulates the weights of data. different from both of them, xgboost is built on the idea of the gradient boosting decision tree (gbdt) and developed from that. gbdt trains an individual decision tree to fit the residual from the previous decision tree [22]. this procedure is done by considering the whole decision tree as a function f(x), and calculating the gradient of the loss function with respect to the function f(x). xgboost introduces a regularization term to the loss function to prevent overfitting. additionally, each leaf node is given a score, such that the loss function can be computed more efficiently. xgboost enables researchers to solve large-scale problems in the real world by using a relatively small amount of resources [22]. in our implementation, three hyperparameters needed to be tuned. they are the number of trees that should be trained, the maximum depth of each tree, and the λ value which controls the degree of regularization.
in this section, three more complex and computationally expensive state-of-the-art algorithms are described that will be compared to the simpler machine learning algorithms using the two feature extraction methods. for these more complex algorithms, the dna sequences are used as feature vectors without the in-between step of feature extraction via one of the previously described methods. with the development of computational capacity, the convolutional neural network (cnn) [23, 24] has been widely used in many fields such as computer vision and has achieved great success. a cnn is a neural network in which some matrix multiplication operations between layers are replaced by convolutions [25] . the cnn is able to learn to extract features and train a classifier at the same time. recently, several researchers have applied the cnn model to bioinformatics, especially the task of classifying dna sequences. the cnn model has been found to be capable of handling the classification of the nucleotides of dna sequences with a, g, c, and t [26] . another study proved the feasibility of using cnns to classify non-coding rna sequences, and accuracies higher than 95% were achieved on multiple datasets [27] . based on these previous studies, we developed a cnn model and compared it to the other algorithms. before training a cnn model, an individual dna sequence should be transformed into a 2-dimensional matrix by using one-hot encoding. in the resulting matrix, each column represents a character in the original dna sequence. the number 1 appears at the place which stands for the corresponding character, while the other places in the same column are filled with 0. an example of one-hot encoding is illustrated in fig. 2 . the first to the last position in the column represents a, c, g, t, respectively. by using the one-hot encoding, each character in the original dna sequence is represented by 4 channels. the channels are shown below each other in the same column in fig. 2 . 
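the one-hot encoding described above can be sketched as follows. the row order a, c, g, t follows the text; the function name and the treatment of symbols outside this alphabet (left as an all-zero column) are assumptions made for the illustration.

```python
NUCLEOTIDES = "acgt"  # row order a, c, g, t, as described in the text


def one_hot_encode(sequence):
    """Return a 4 x len(sequence) matrix (list of rows) for a DNA string."""
    sequence = sequence.lower()
    matrix = [[0] * len(sequence) for _ in NUCLEOTIDES]
    for column, character in enumerate(sequence):
        row = NUCLEOTIDES.find(character)
        if row >= 0:  # assumption: unknown symbols such as 'n' stay all-zero
            matrix[row][column] = 1
    return matrix
```

for example, `one_hot_encode("acgt")` yields a 4 x 4 identity pattern, with each column holding exactly one 1 in the row of its nucleotide.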
since the dna sequences have different lengths, all of them are padded with all-zero columns to the same length. in the convolution layer, a neuron uses a kernel (filter) and performs a convolution operation to compute a single output in the resulting feature map. afterward, the filter slides to the next region and repeats the convolution operation. in this way, the features from different parts of the input can be extracted. in order to decrease the size of the feature maps, a max-pooling layer is added after the convolution layer. it reduces the size by only keeping the maximum value from several neighboring values in a feature map. after experimenting with different cnn architectures, nine convolution layers were finally used for the sars-cov-2 dataset and seven for the others. each convolution layer was followed by a max-pooling layer. one hundred filters were used in each convolution layer. each filter in all layers received all four channels as input and had a window length of 3. therefore, a filter did not take only one character at a time but integrated three neighboring characters. the stride in all convolution layers was set to 1, while in all pooling layers the stride was set to 3. all the outputs resulting from the last max-pooling layer were fully connected to a dense layer with five hundred neurons. the final output layer with a single output followed the dense layer. similar to the mlp model that we implemented, relu was used as the activation function of the neurons in the convolution layers and the dense layer. the activation function used in the output layer was the sigmoid function. the loss function of this model was the binary cross-entropy function. similar to the mlp, three hyperparameters were coarsely tuned. they were the number of epochs, the batch size, and the learning rate of the gradient descent optimization. these hyperparameters also apply to the dnn algorithm that will be described next.
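to make the padding, convolution, and pooling steps concrete, the sketch below applies a single filter with window length 3 and stride 1 to a four-channel input and then max-pools with stride 3. it is a simplified illustration, not the authors' network: the function names are hypothetical, and the pooling window of 3, the absence of padding inside the convolution, and the single filter are assumptions, since the text only states the strides and the filter width.

```python
def pad_columns(matrix, length):
    """Append all-zero columns until every row has `length` entries."""
    return [row + [0] * (length - len(row)) for row in matrix]


def conv1d_relu(matrix, kernel, bias=0.0, stride=1):
    """Slide one filter over a channels x length matrix and apply ReLU.

    `kernel` has one row of weights per channel; its width is the window
    length. No padding is applied here (an assumption of this sketch).
    """
    width = len(kernel[0])
    length = len(matrix[0])
    outputs = []
    for start in range(0, length - width + 1, stride):
        total = bias
        for channel_values, channel_weights in zip(matrix, kernel):
            for offset in range(width):
                total += channel_values[start + offset] * channel_weights[offset]
        outputs.append(max(0.0, total))  # ReLU activation
    return outputs


def max_pool1d(feature_map, window=3, stride=3):
    """Keep only the maximum of each window of neighbouring values."""
    return [
        max(feature_map[start:start + window])
        for start in range(0, len(feature_map) - window + 1, stride)
    ]
```

with these settings each pooling step shortens the feature map to roughly a third of its length, which is one plausible reason why several convolution and pooling stages can be stacked even for long sequences.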
another approach is to use a multi-layer perceptron with multiple hidden layers, also referred to as a deep neural network or dnn. dnns have been used increasingly and successfully in bioinformatics. an earlier study demonstrated that a dnn could perform at a state-of-the-art level on the task of predicting species-of-origin and species classification of short dna sequences [28]. the training phase of the dnn uses each of the dna sequences in its entirety and does not use the features extracted by either taking minimum levenshtein distances or the 3-gram method. similar to the cnns, the dna sequences were transformed into a 2-dimensional matrix using one-hot encoding. after testing multiple different configurations, a model consisting of one dense layer with 40 neurons, followed by another dense layer with 20 neurons, appeared to provide the most accurate results. for this model, the activation function of the neurons in the hidden layers was again relu, and the one for the output layer was the sigmoid function. the loss function used was the binary cross-entropy function as well. a third state-of-the-art method that uses the entire dna sequences as the features it is trained on is an n-gram probabilistic model. this method is commonly used in natural language processing (nlp) [29]. it has also been successfully applied to dna classification problems [30, 31] with resulting classification accuracies up to 99.6%. an n-gram is a sequence of n items. these items are, e.g., words in nlp or the letters a, c, g, or t, representing nucleobases in dna sequences in fasta format. the n-gram probabilistic model can be used to predict the probability of the next item x in a sequence given the history h of the n-1 previous items: p(x|h). it can for example be used to predict the probability of the next item being the letter a in a dna sequence, given that the previous four letters were "acgt".
this probability would be calculated as $p(a \mid g, t)$ in an n-gram model with an n-value of 3 and as $p(a \mid a, c, g, t)$ for one with an n-value of 5. for our experiments, the n-gram probabilistic model was used as a classifier, so the probabilities of the next item were calculated using the previous n-1 items for each class separately, e.g. $p(x_t \mid x_{t-1}, x_{t-2}, \text{class})$ for n=3. eventually, the prior probability $p(\text{class})$ can be used in combination with the n-gram probabilistic model(s) to compute the probability of a sequence belonging to a certain class using bayes's rule, up to a normalizing constant that is the same for every class: $p(\text{class} \mid x_1, x_2, \dots, x_T) \propto p(\text{class}) \, p(x_2 \mid x_1, \text{class}) \, p(x_3 \mid x_2, x_1, \text{class}) \cdots p(x_t \mid x_{t-1}, x_{t-2}, \text{class}) \cdots p(x_T \mid x_{T-1}, x_{T-2}, \text{class})$. in order to prevent underflow when working with dna sequences containing more than 30,000 items, in this experiment, the probabilities were not multiplied, but the logarithmic values of these probabilities were added. the number of occurrences of each n-gram is counted for each class. these counts are then used to calculate the probabilities. when testing on a novel dna sequence, the computed probabilities are used to calculate the log probability of the sequence belonging to each of the two classes. the dna sequence is then assigned to the class with the highest probability. we tested different values for n, and the best performing value of n, which was used in all experiments, was 6. each machine learning algorithm was trained and tested on the processed feature vectors obtained by using the two different feature extraction methods. reliable primers of hcv could be acquired. therefore, when testing the random dna sequences method on the hcv dataset, the distance-based feature extraction using primers was also carried out in order to make comparisons with the random dna sequences method. since 37 primers of hcv were acquired, we generated three groups of random dna sequences, each containing 37 dna sequences.
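the n-gram classifier described earlier in this section (per-class n-gram counts, class priors, and summed log probabilities) can be sketched as follows. this is a minimal illustration rather than the authors' implementation: the class name is hypothetical, add-one smoothing is an assumption added so that unseen n-grams do not produce log(0), and the first n-1 items of a sequence, which have a shorter history, are skipped for brevity.

```python
import math
from collections import defaultdict


class NGramClassifier:
    """Per-class n-gram model combined with class priors via Bayes's rule."""

    def __init__(self, n=3, alphabet="acgt"):
        self.n = n
        self.alphabet = alphabet
        self.ngram_counts = {}    # class -> counts of (history + next item)
        self.history_counts = {}  # class -> counts of the n-1 item history
        self.class_counts = defaultdict(int)

    def fit(self, sequences, labels):
        for sequence, label in zip(sequences, labels):
            self.class_counts[label] += 1
            ngrams = self.ngram_counts.setdefault(label, defaultdict(int))
            histories = self.history_counts.setdefault(label, defaultdict(int))
            for end in range(self.n - 1, len(sequence)):
                history = sequence[end - self.n + 1:end]
                ngrams[history + sequence[end]] += 1
                histories[history] += 1
        return self

    def log_score(self, sequence, label):
        """log p(class) plus the summed log p(x_t | history, class)."""
        total = sum(self.class_counts.values())
        score = math.log(self.class_counts[label] / total)
        ngrams = self.ngram_counts[label]
        histories = self.history_counts[label]
        for end in range(self.n - 1, len(sequence)):
            history = sequence[end - self.n + 1:end]
            # Add-one smoothing (an assumption) avoids log(0) for unseen n-grams.
            numerator = ngrams.get(history + sequence[end], 0) + 1
            denominator = histories.get(history, 0) + len(self.alphabet)
            score += math.log(numerator / denominator)
        return score

    def predict(self, sequence):
        # The class with the highest log score is the most probable one.
        return max(self.class_counts, key=lambda label: self.log_score(sequence, label))
```

adding logarithms instead of multiplying probabilities keeps the score in a representable range even for sequences with tens of thousands of items, which is the underflow concern raised in the text.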
the lengths of the dna sequences of the three groups were 25, 100, and 200, respectively. additionally, a fourth group was generated, in which there were 100 dna sequences of length 200. for the other datasets, since the primers could not be obtained, the random dna sequences method was tested using 50 random dna sequences with a length of 25, 50, and 100. furthermore, each algorithm was trained and tested on each of the four datasets using the dna sequences in their entirety as features. for every experiment, the accuracy of the binary classification was tested across ten folds of cross-validation. as all datasets consisted of two kinds of dna sequences, the training and testing procedure was the same for each dataset. for each experiment, the feature vectors were assigned labels according to their class. during the training phase, the classifiers were trained on each feature vector of the training set and its corresponding label. during the testing phase, each feature vector of the testing set was classified as either 'positive' or 'negative' for the hcv dataset, 'hiv-1' or 'hiv-2' for the hiv dataset, 'influenza virus' or 'coronavirus' for the influenza/corona dataset, or as either 'originating from the usa' or 'not originating from the usa' for the sars-cov-2 dataset. the result of each classification was recorded and compared to the correct label of each feature vector in the testing set to calculate the accuracy. for each trial, the accuracy was recorded to compute the mean accuracy and standard deviation across the ten folds of the cross-validation. the whole experiment was divided into two parts. the first part was a preliminary experiment, which was used to tune the hyperparameters and decide the best set of hyperparameters of each algorithm. this was done by repeating the training and testing procedure using different sets of hyperparameters. the set which gave the highest accuracy was selected.
after the optimal hyperparameters were found, the second experiment was conducted using those hyperparameters. the comparison across different algorithms was based on the results of the second experiment. all the hyperparameters that needed to be tuned have already been discussed in the method section, and tables (1) to (7) show the best values found for the hyperparameters. all results presented in this section are the mean accuracy and standard deviation over ten folds of cross-validation. the comparison of using primers and random dna sequences was only made on the hcv dataset. the results of the experiment on the hcv dataset are shown in figure (3). for each data set, the results of all six machine learning algorithms using the random dna sequence feature extraction method are presented in table (8), containing the mean accuracy and standard deviation over the ten folds of the cross-validation. for each machine learning algorithm, multiple lengths and amounts of the random dna sequences were considered. however, only the ones showing the best results are displayed in the table. for the displayed results of the random sequence feature extraction method, 50 randomly generated dna sequences of length 25 were used on the hiv and influenza/corona data sets. for the sars-cov-2 data set, 50 randomly generated dna sequences of length 50 were used. for the 3-gram feature extraction method, the results of the six machine learning algorithms for each of the four data sets are displayed in table (9). an overview of all results, in which the feature extraction methods are compared with state-of-the-art algorithms, is provided in table (10). for this table, only the best results of the 3-gram and random dna sequence feature extraction methods were considered. the best results stem from different machine learning algorithms for different data sets. however, the exact results for each algorithm on each data set are displayed in table (8) and table (9).
figure (3) suggests that using primers has the highest accuracy when the classifier is trained by using the mlp, adaboost, or xgboost algorithm. if the primers are replaced by random dna sequences, the highest accuracy is obtained using an svm classifier. although using primers leads to better results, the results indicate that using primers (m=99.9, sd=0.32) does not have a significantly higher accuracy (t(18)=1.90, p=0.07) than using the random dna sequences (m=99.3, sd=0.95). it can be concluded that the levenshtein distance feature extraction yields the best and most consistent results across the six different machine learning algorithms when the distance between a primer and a dna sequence is taken. however, the random dna sequences can be used to replace primers when they are not available. furthermore, it can be observed that even though the svm produces the highest accuracy for three of the four data sets, there is not one machine learning method that consistently yields the best results across all the different lengths of the randomly generated strings nor across each of the various data sets (see table (8)). also, there is no clear indication about the best length of the random dna strings (for simplicity, we do not show all these results). for the 3-gram feature extraction method, the results show a similar pattern. even though here the svm is among the machine learning algorithms yielding the best results for three out of the four data sets, adaboost provides the highest mean accuracy for the hiv data set again. a notable difference from the random dna string feature extraction is that the differences between the machine learning algorithms become much smaller. different algorithms show identical results for two of the four data sets using the 3-gram feature extraction method.
table 4: the best hyperparameters of random forest on the four datasets, using the two feature extraction methods.
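the t statistic reported above for primers versus random dna sequences on the hcv dataset can be checked from the summary values. the sketch below assumes an independent two-sample student's t-test with ten accuracies per group (one per fold), which is consistent with the reported 18 degrees of freedom; the text does not state the test explicitly, and the function name is hypothetical.

```python
import math


def two_sample_t(mean_a, sd_a, mean_b, sd_b, n):
    """Student's t statistic for two independent groups of equal size n."""
    # With equal group sizes the pooled standard error simplifies to this form.
    standard_error = math.sqrt((sd_a ** 2 + sd_b ** 2) / n)
    t_value = (mean_a - mean_b) / standard_error
    degrees_of_freedom = 2 * n - 2
    return t_value, degrees_of_freedom


t_value, df = two_sample_t(99.9, 0.32, 99.3, 0.95, n=10)
print(f"t({df}) = {t_value:.2f}")  # t(18) = 1.89
```

the small gap between this value and the reported 1.90 is what one would expect from the means and standard deviations having been rounded before publication.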
table 10 shows that overall the 3-gram feature extraction method combined with either an svm (for hcv and inf./cor.) or adaboost (for hiv) obtains the highest mean accuracy of all tested methods in 3 out of 4 data sets. for these datasets the other methods also perform very well, especially the cnn, and the differences are quite small. the dnn seems to perform a bit worse on most datasets. for the sars-cov-2 data set, the levenshtein distance with the random dna string feature extraction method obtains significantly higher accuracies than the other methods. for this small dataset, it outperforms the second best method (the 3-gram) by around 4.3%. the above results suggest that the 3-gram method obtains better performances on larger datasets, while the random dna sequences method might be better at handling relatively smaller datasets. if large amounts of data are not readily available, the results of the random dna sequence method are promising. it obtains an accuracy as high as 97% with as few as 292 samples to train on.
table 8: mean accuracy ± standard deviation for all methods using the random dna-sequence feature extraction across the four data sets.
table 10: mean accuracy ± standard deviation for all methods across the four data sets.
this paper aimed to provide an extensive comparison of different methods for dna sequence classification. five different methods were compared across four different data sets of various sizes. examining the proposed method, which uses random dna sequences to extract features based on distance, is one of the main novelties in this paper. we wanted to test whether it is good enough to replace primers. the results showed that modern state-of-the-art methods from fields like computer vision and natural language processing, such as cnns or n-gram probabilistic models, can achieve very high accuracies above 99% on dna sequence classification problems, provided that enough sample data is available.
although the dnn has a slightly worse performance in some of the experiments, its results are acceptable. therefore, we can conclude that these algorithms can be successfully applied to different dna classification problems. the results also showed that the use of feature extraction methods is useful to obtain the best results. the 3-gram method is quite simple but very effective in handling different datasets. the novel feature extraction method based on random dna sequences led to the best result on the smallest dataset, the sars-cov-2 dataset, and can therefore be promising for dna classification problems when little data is available. the potential applications of the proposed methods are plenty. a potential field in which the methods could be deployed is the diagnosis of diseases. the 3-gram feature extraction method in particular seems promising for diagnosing viral infections such as hcv or hiv. for future studies, it would be interesting to investigate some further applications of the different methods. for example, the field of ancestral research using genetic samples or the detection of genetic predispositions are possible applications. whether the same techniques perform similarly well for problems of this kind remains to be determined. our results also indicate that the sars-cov-2 viruses spreading in the usa seem to be different from those in other countries. therefore, it would be interesting for biologists to further investigate the origin of sars-cov-2 with the help of machine learning.
[1] prediction of function in dna sequence analysis
[2] the nature of statistical learning theory
[3] importance of primer selection for the detection of hepatitis c virus rna with the polymerase chain reaction assay
[4] specific amplification of bacterial dna by optimized so-called universal bacterial primers in samples rich of plant dna
[5] primers targeting mitochondrial genes of avian haemosporidians: pcr detection and differential dna amplification of parasites belonging to different genera
[6] classification of plant trypanosomatids (phytomonas spp.): parity between random-primer dna typing and multilocus enzyme electrophoresis
[7] factors that affect large subunit ribosomal dna amplicon sequencing studies of fungal communities: classification method, primer choice, and error
[8] bioinformatics for geneticists: a bioinformatics primer for the analysis of genetic data
[9] hepatitis c virus: risk factors and disease progression
[10] organic mental disorders caused by hiv
[11] clinical characteristics of coronavirus disease 2019 in china
[12] research of the dna sequence classification algorithm based on machine learning. dissertation
[13] general nature of the genetic code for proteins
[14] a novel mlp network implementation in cmol technology. engineering science and technology
[15] dropout: a simple way to prevent neural networks from overfitting
[16] detection of high gs risk group prostate tumors by diffusion tensor imaging and logistic regression modelling. magnetic resonance imaging
[17] a training algorithm for optimal margin classifiers
[18] a comparative study of support vector machines and artificial neural networks for predicting precipitation in iran
[19] induction of decision trees
[20] random forest
[21] a decision-theoretic generalization of on-line learning and an application to boosting
[22] xgboost: a scalable tree boosting system
[23] generalization and network design strategies
[24] gradient-based learning applied to document recognition
[25] deep learning
[26] dna sequence classification by convolutional neural network
[27] convolutional neural networks for classification of alignments of non-coding rna sequences
[28] a deep learning approach to pattern recognition for short dna sequences
[29] the effects of n-gram probabilistic measures on the recognition and production of four-word sequences
[30] the effects of n-gram probabilistic measures on the recognition and production of four-word sequences
[31] a private dna motif finding algorithm
key: cord-010273-0c56x9f5 authors: simmonds, peter title: virology of hepatitis c virus date: 2001-10-10 journal: clin ther doi: 10.1016/s0149-2918(96)80193-7 sha: doc_id: 10273 cord_uid: 0c56x9f5 hepatitis c virus (hcv) has been identified as the main causative agent of post-transfusion non-a, non-b hepatitis. through recently developed diagnostic assays, routine serologic screening of blood donors has prevented most cases of post-transfusion hepatitis. the purpose of this paper is to comprehensively review current information regarding the virology of hcv. recent findings on the genome organization, its relationship to other viruses, the replication of hcv ribonucleic acid, hcv translation, and hcv polyprotein expression and processing are discussed. also reviewed are virus assembly and release, the variability of hcv and its classification into genotypes, the geographic distribution of hcv genotypes, and the biologic differences between hcv genotypes.
the assays used in hcv genotyping are discussed in terms of reliability and consistency of results, and the molecular epidemiology of hcv infection is reviewed. these approaches to hcv epidemiology will prove valuable in documenting the spread of hcv in different risk groups, evaluating alternative (nonparenteral) routes of transmission, and in understanding more about the origins and evolution of hcv. hepatitis c virus (hcv) has been identified as the main causative agent of posttransfusion non-a, non-b hepatitis. 1,2 the identification of hcv led to the development of diagnostic assays for infection, based either on detection of antibody to recombinant polypeptides expressed from cloned hcv sequences or direct detection of virus ribonucleic acid (rna) sequences by polymerase chain reaction (pcr) using primers complementary to the hcv genome. routine serologic screening of blood donors now prevents most or all cases of posttransfusion hepatitis. assays for antibody also are important diagnostic tools and have been used to investigate the prevalence of hcv in different risk groups, such as intravenous drug users, patients with hemophilia, and other recipients of blood products, and to conduct epidemiologic studies of hcv transmission. the complete genomic sequence of hcv has been determined for several isolates, revealing both its overall genome organization and its relationship to other rna viruses. deducing possible methods of replication by analogy with related viruses is possible, although such studies currently are hampered by the absence of a satisfactory in vitro culture method for hcv. as a consequence, most conventional virologic studies are difficult and artificial. hcv contains a positive-sense rna genome approximately 9400 bases in length.
in overall genome organization and presumed method of replication, it is most similar to members of the family flaviviridae, particularly in coding for a single polyprotein that is then cleaved into a series of presumed structural and nonstructural proteins (figure 1). 3 the roles for these different proteins have been inferred by comparison with related viruses and by in vitro expression of cloned hcv sequences in prokaryotic and eukaryotic systems. these artificial systems allowed the investigation of protein expression, cleavage, and posttranslational modifications. there are numerous positive-stranded rna virus families whose coding capacity is contained within a single open reading frame (orf) as is found in hcv, and with which it may be usefully compared (table i). among human viruses, these include both picornaviridae (eg, poliovirus, coxsackievirus a and b, and hepatitis a virus) and flaviviridae (eg, dengue fever and yellow fever virus). the genomes of those viruses have a similar organization with structural proteins at the 5' end and nonstructural proteins at the 3' end. however, virus families differ in genome size, the number of proteins produced, the mechanism by which the polyprotein is cleaved, and the detailed mechanism of genome replication. for example, the genome of the picornaviridae is shorter than that of hcv (approximately 7200 to 8400 bases), contains four nucleocapsid proteins (compared with the single protein of hcv), is nonenveloped (and therefore contains no homologues of the two hcv-encoded glycoproteins e1 and e2), and uses exclusively virus-encoded proteases to cleave its polyprotein. this is different from both hcv and the flaviviridae, in which cleavage of the structural proteins is thought to be carried out by the host cell-derived signalase. members of the flaviviridae have many features in common with hcv.
they have a similar genome size (yellow fever virus has 10,862 bases 4 compared with 9379 for hcv 3) and package a viral-encoded glycoprotein into the virus envelope (e1). the homologue of e2 in flaviviruses (a membrane-bound glycoprotein called ns1; "ns" stands for nonstructural) is expressed only on the infected cell surface. as in hcv, the polyprotein is cleaved by a combination of viral and host cell proteases. although there is no close sequence similarity between hcv and other known viruses, at least two regions with conserved amino acid residues provide [...] another fundamental aspect of genome organization that differs between the flavivirus and picornavirus families is the structure of the 5' and 3' untranslated regions (utrs). these parts of the genome are involved in hcv replication and initiation of translation by cellular ribosomes of the virus-encoded polyprotein. pestiviruses and hcv show evidence for a highly structured 5'utr and 3'utr, in which internal base-pairing produces a complex set of stem-loop structures that are thought to interact with various host cell and virus proteins during replication. 7,8 in particular, studies have shown that for the picornaviridae and, more recently, for hcv and pestiviruses, 7,9,10 such structures are involved in internal initiation of translation, in which binding to the host cell ribosome directs translation to an internal methionine (aug) codon. this contrasts strongly with translation of flavivirus genomes, which act much like cellular messenger rna in which ribosomal binding initially occurs to the capped 5' end of the rna, followed by scanning of the sequence in the 5' to 3' direction with translation commencing from the first aug codon.
structurally, hcv is also more similar to the pestiviruses than the flaviviruses, with an exceptionally low buoyant density in sucrose (1.08 to 1.11 g/cm3), 11 similar to that reported for pestiviruses and attributable in both cases to heavily glycosylated external membrane glycoproteins in the virus envelope. by contrast, flavivirus envelope glycoproteins contain few sites for n-linked glycosylation, and the virion itself is relatively dense (1.2 g/cm3). the arrangement and number of cleavage sites of the hcv polyprotein are more similar to pestiviruses, particularly in the further cleavage of both ns4 and ns5 proteins into two subunits, in both cases with ns5b corresponding to the rna polymerase. recently, two distinct rna viruses have been discovered in new world primate tamarins (saguinus). this monkey species had previously been shown to harbor an infectious agent causing chronic hepatitis originally derived from inoculation with plasma from a surgeon (gb) in whom chronic hepatitis of unknown etiology had developed. 12 parts of the genome of the two viruses (provisionally called gbv-a and gbv-b) show measurable sequence similarity to certain regions of hcv. for example, a 200-amino-acid sequence of part of ns3 of gbv-a and gbv-b shows 47% and 55% sequence similarity with the homologous region in hcv (positions 1298 to 1497 in the hcv polyprotein) 3 and 43.5% sequence similarity to each other. similarly, in ns5, the region around the active site of the rna polymerase (including the gdd motif and positions 2662 to 2761 in hcv) shows 36% and 41% sequence similarities and 43% between gbv-a and gbv-b. 12 in these nonstructural regions, these similarity values are greater than those between hcv and pestiviruses or flaviviruses, although little homology can be found on comparison of the regions of the genome encoding structural proteins (ie, the core and envelope), nor with the normally highly conserved 5'utr.
the degree of relatedness between hcv and other positive-stranded rna viruses can be more formally analyzed by phylogenetic analysis of highly conserved parts of the genome, such as the ns5 region (and homologues in other viruses) encoding the rna-dependent rna polymerase, which invariably contains the canonical gdd motif necessary for the enzymatic activity of the protein. comparisons of a 100-amino-acid sequence surrounding this motif indicate a close relationship between hcv and gbv-a and gbv-b, an intermediate degree of relatedness with the pestivirus bovine viral diarrhea virus, and a much more distant relatedness to flaviviruses (figure 2). 6,13 remarkably, a series of plant viruses that are structurally distinct from each of the mammalian virus groups, and with different genome organizations, have rna-dependent rna polymerase amino acid sequences that are perhaps more similar to those of hcv than are those of the flaviviruses.
hcv replication has been studied using a variety of experimental techniques. however, little progress has been made toward the development of a practical hcv culture. hcv does not produce obvious cytopathology, and the amount of hcv released from cells infected in vitro often is low. 14-18 this might be because the cells used for culture are not representative of those infected in vivo, or because productive infection requires a combination of cytokines and growth factors that might be present in the liver but which cannot be recreated in cell culture. the observation that low levels of hcv replication might be detected in lymphocyte 14,18 and hepatocyte cell lines 16,17 indicates that either the tropism of hcv for different cell types may be greater than first imagined or that the virus replication detected so far does not represent the full replicative cycle of hcv that occurs in vivo.
transfection of full-length dna sequences of the hcv genome might be expected to initiate the full replicative cycle of hcv, as it does when similar experiments are done with picornavirus sequences. however, only a low level of expression of virus proteins was observed when a complete hcv sequence was transfected into a transformed hepatocyte (hepatoma) cell line (huh7). 19 despite this, there was evidence of replication of the hcv genome and the production of low concentrations of progeny virus particles. such models provide an important experimental system for future investigations of hcv replication.
in common with other positive-strand rna viruses, hcv is presumed to replicate its rna genome through the production of a replication intermediate (ie, a minus-strand rna copy of the complete genome) that is synthesized by a virally encoded rna-dependent rna polymerase. the minus-strand copy would then be used to generate positive-stranded copies. because templates can be reused, several minus-strand copies can be synthesized from the infecting positive strand, and each of these transcripts can be used several times to produce positive-strand progeny sequences. in this way, a single input sequence may be amplified several thousandfold. although initiation of transcription is well understood for some positive-strand rna viruses (such as the picornaviridae), no information currently is available on how rna synthesis of hcv or other flaviviruses is primed.
figure 2 (caption, partial): ... koonin 6 for sources of non-hcv sequences. gbv-a, gbv-b, and hcv (genotypes 1a, 1b, 2a, 2b, and 3a shown) were aligned using the program clustal, and phylogenetic analysis was done using the programs protdist (pam matrix), neighbor, and drawtree in the phylip package. 13
hcv lacks homopolymeric tracts (such as poly[u] in the picornaviruses) at the 5' end of the genome, whereas the 3' end is variable, containing either poly(u) or poly(a) tracts, or possibly neither, as now appears to be the case with the related pestiviruses. furthermore, there appears to be no homologue of the vpg protein of picornaviridae. for these reasons, it is likely that the mechanism of transcription initiation for hcv is different. using a strand-specific pcr method, antisense hcv rna sequences have been detected in the liver of hcv-infected patients, confirming the presumed method of replication of hcv via a replication intermediate. 20 such assays provide a valuable technique for detecting hcv replication, both as a sensitive method of monitoring hcv replication in virus culture experiments and as a way of investigating the range of cell types and distribution of hcv infection in hcv-infected patients. in particular, the possibility of replication at extrahepatic sites has been proposed on the basis of such assays; these studies have been reviewed by lau et al. 21
the 5'utr is thought to play a significant role in initiating and regulating translation of the large orf of hcv. this region is approximately 341 to 344 bases long, and a combination of computer analysis, nuclease mapping experiments, and studies of covariance has led to a proposed secondary structure model for this part of the genome (figure 3). 22 using the same methods, researchers have predicted a remarkably similar structure for pestiviruses, 8 despite the virtual absence of nucleotide sequence similarities with hcv, indicating the importance of the overall structure of this region in interactions with viral and cellular proteins or other rna sequences. direct evidence for internal initiation of translation has been obtained from in vitro translation of reporter genes downstream from the 5'utr sequence placed in mono- or dicistronic vectors.
9,23,24 the nonpaired tip of stem-loop structure 3 is partially complementary to the 18s subunit of ribosomal rna and may, therefore, be the site of binding during internal initiation. 25 the internal ribosomal entry site activity of the 5'utr is consistent with the hypothesis that translation is initiated from the aug methionine codon at position 341. 22 there is no evidence for translation from any of the variable number of aug triplets upstream from position 341, although production of the small proteins from these upstream potential orfs may play some role in regulating expression of the large orf. 26
in the absence of a cell culture system for hcv, most information available on the expression and processing of hcv proteins has been obtained from transfection experiments with cloned dna sequences corresponding to the different proteins, and more recently by direct observations of the cellular distributions and properties of hcv proteins detected in liver or plasma in vivo. transfection of prokaryotic or eukaryotic cells with dna copies of different parts of the hcv genome under the control of artificial promoters allows expression of the encoded proteins, and provides a useful technique for studying their synthesis, biochemical properties, and table ii ). expression of this part of the genome in cells 27-29 or reticulocyte-lysate-containing microsomal membranes 30,31 leads to the synthesis of a polyprotein and its cleavage into a series of proteins. the protein identified as the capsid protein on the basis of comparisons with related viruses is expressed as a protein of approximate size 21 to 22 kd. the assignment of this protein as the nucleocapsid protein is supported by the presence of regions within the protein containing numerous basic (positively charged) amino acids that may have rna-binding properties associated with the encapsidation of hcv rna during virus assembly. binding of core protein to ribosomal rna has recently been reported.
31 using similar techniques, expression of the putative envelope proteins of hcv (e1 and e2) leads to the synthesis in mammalian cells of two heterogeneous proteins with sizes ranging from 31 to 35 kd and 68 to 72 kd, respectively. 27-31 cleavage between the capsid protein and e1, e1 and e2, and e2 and ns2 depends on the addition of microsomal membranes, implying that the host cell signalase has a role in these processing steps. the sizes of e1 and e2 are greater than could be explained by their amino acid sequences alone and support biochemical evidence for extensive glycosylation of both proteins after translation. both e1 and e2 have a large number of potential n-linked glycosylation sites, although the details of which sites are used, the extent to which the glycoprotein moieties are modified, and whether there is also o-linked glycosylation await further biochemical analysis. two cleavage sites between e2 and ns2 (both microsome dependent) have recently been identified, 32 leading to the production of e2 proteins differing in size by 80 amino acid residues. 33
evidence for intermolecular associations between e1 and e2 has been obtained through immunoprecipitation experiments, in which antibody to e1 or e2 could precipitate both proteins under nondenaturing conditions. 28,29,34 the nature or significance of this association is unclear, although current evidence suggests that the association is predominantly noncovalent 34 and does not occur simply through hydrophobic interactions between the membrane anchors of the two proteins. 33,35 recently, monoclonal antibodies to either e1 or e2 were shown to coprecipitate ns2 and ns3, 35 and there also is evidence for associations between e2, ns2, and ns4b. 33
in vitro translation of the rest of the genome leads to the production of proteins of sizes 23, 70 to 72, 4, 27, 56 to 58, and 66 kd, corresponding to ns2, ns3, ns4a, ns4b, ns5a, and ns5b, respectively (figure 1; table ii).
proteolytic cleavage pathways that generate the nonstructural proteins are mediated by ns2 and ns3 36 and have been extensively studied by several groups, as they represent possible targets of antiviral treatment. ns3 is a serine protease that catalyzes cleavage reactions between ns3/ns4a, ns4a/ns4b, ns4b/ns5a, and ns5a/ns5b. 36-40 ns2 is a metalloproteinase that cleaves the ns2/ns3 junction. 36,41 the ns3/ns4a cleavage reaction mediated by ns3 and the ns2/ns3 cleavage mediated by ns2 occur in cis, whereas other reactions can occur through intermolecular associations between ns3 and the rest of the polyprotein. accounts of the complex sequence of events and the interactions between nonstructural proteins involved in cleavage reactions differ in detail depending on the experimental methods used. however, cleavage may be a sequential process modulated by the activities of other proteins, such as ns4a. ns2 protease activity is zinc dependent, and its active site depends on residues in ns3. therefore, after the cis cleavage of the ns2/ns3 junction, the protease is inactivated and will not act in trans on other substrates. this cleavage reaction has been shown to be essential for activating ns3 protease, and natural variation in the efficiency of the reaction may modulate the pathogenicity of hcv in vivo. 42 when released, ns3 cleaves other sites with varying efficiencies. the active site of ns3 has been mapped by deletion experiments to lie at the amino terminus of the protein (residues 1409 to 1215). 43 the substrate specificity of the serine protease activity has been defined by sequence comparisons and mutagenesis experiments 44-47 and generally conforms to the consensus sequence d/e----c/t↓s/a in the target protein, where the arrow marks the cleaved bond. there is some evidence for a less stringent requirement for specific amino acids around the cis cleavage site (ns3/ns4a) than for those cleaved in trans.
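the consensus quoted above (an acidic residue, four unconstrained positions, cysteine or threonine, then serine or alanine immediately after the cleaved bond) can be expressed as a regular expression. the sketch below is only an illustration: the protein string is invented, the function name `candidate_cleavage_sites` is hypothetical, and a motif match is at most a candidate site, since the text notes that real sites only "generally" conform to the consensus.

```python
import re
from typing import List, Tuple

# [DE] at P6, any four residues (P5-P2), [CT] at P1, then [SA] at P1'.
# The outer lookahead lets overlapping candidates be reported as well.
NS3_CONSENSUS = re.compile(r"(?=([DE].{4}[CT])([SA]))")


def candidate_cleavage_sites(protein: str) -> List[Tuple[int, str]]:
    """Return (p1_position, matched_window) for each consensus match.

    p1_position is the 1-based position of the residue immediately
    before the bond that would be cleaved.
    """
    sites = []
    for match in NS3_CONSENSUS.finditer(protein.upper()):
        p1_position = match.start() + 6
        sites.append((p1_position, match.group(1) + match.group(2)))
    return sites


# Toy polypeptide (an assumption for illustration; not an hcv sequence).
toy_protein = "MKDAAAACSLLEGGGGTAPP"
for position, window in candidate_cleavage_sites(toy_protein):
    print(position, window)
```

a scan of this kind would over-predict on a real polyprotein, so it is better read as a filter for sites worth testing than as a predictor of cleavage.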
46 several investigators have described the requirement for other protein cofactors for the activity of ns3. in particular, it appears that binding of ns4a to ns3 48,49 is necessary at least for the cleavage of ns4b/ns5a and may modulate the activity of ns3 in other ways.
although there is now some information on the proteolytic cleavage steps used to process the hcv polyprotein, the difficulty associated with in vitro culture of hcv and production of infectious molecular clones of hcv so far has prevented a more detailed understanding of the sites of hcv replication in cells and the processes of virus assembly and release from the cell. future research should reveal the nature of the interaction between the capsid protein and virus rna and how this is packaged into the assembled provirion, the posttranslation modifications to the envelope proteins and where these occur in the cell, and the sites of budding of hcv through cellular membranes. to understand replication more fully, we must also identify the mechanism of priming of rna synthesis from the ends of the genome, the nature of the primers, and whether circularization is necessary for transcription.
because a cell culture system to investigate differences in neutralization and cytopathic properties of hcv is not available, nucleotide sequence comparisons and typing assays developed from sequence data have become the principal techniques for characterizing different variants of hcv. this type of analysis is fairly easy to perform, especially since virus sequences can be amplified by pcr directly from clinical specimens. in common with other rna viruses, variants of hcv show considerable sequence variability, many differing considerably from the prototype hcv (hcv-pt).
3 differences of up to 29% have been found between the complete genomic sequences of the most extremely divergent variants analyzed to date, 50 comparable to those observed between serotypes of other human positive-strand rna viruses such as poliovirus, coxsackievirus, and coronaviruses. sequence variability is evenly distributed throughout all virus genes (table iii) 9 -57 apart from the highly conserved nucleotide (and amino acid) sequence of the core (nucleocapsid) protein and 5'utr and the greater variability of the envelope gene (table iii). nucleotide sequence comparison of complete genomes or subgenomic fragments between variants has shown that variants of hcv obtained from japan are substantially different from the hcv-pt variant obtained in the united states. 3 comparison of the complete genome sequence of hcv-j 53 and hcv-bk 51 from japan showed 92% sequence similarity to each other but only 79% with hcv-pt. at that time, the former variants were classified as the "japanese" type (or type ii), while those from the united states (hcv-pt and hcv-h) were classified as type i. comparisons of subgenomic regions of hcv, such as e1, 58 core, 59,60 and ns5, 61 provide evidence of at least six major groupings of hcv sequences, each of which contains a series of more closely related clusters of sequences (figure 4). 62 the current widely used nomenclature for hcv variants reflects this hierarchy of sequence relationships between different isolates.
table iii (note). sources of sequences: 1a = hcv-h 52; 1b = hcv-j 53; 1c = hc-j9 54; 2a = hc-j6 55; 2b = hc-j8 50; 3a = nzl1 56; 3b = tr 57. in all comparisons, the 5'ncr is the most conserved subgenomic region (maximum 9% nucleotide sequence divergence), whereas highly variable regions are found in parts of the genome encoding e1 and ns2 (35% to 44% nucleotide sequence and 34% to 45% amino acid sequence differences).
based on previous suggestions, 63,64 the major branches in the phylogenetic tree are referred to as "types," while "subtypes" correspond to the more closely related sequences within most of the major groups (figure 4). although ns5 sequences are analyzed in figure 4, equivalent sequence relationships exist in other parts of the genome. the types have been numbered 1 to 6 and the subtypes a, b, and c, in both cases in order of discovery. therefore, the sequence cloned by chiron 3 is assigned type 1a, hcv-j and hcv-bk are type 1b, hc-j6 is type 2a, and hc-j8 is type 2b. this nomenclature closely follows the schemes originally described by enomoto (type 3a) on the basis of phylogenetic analysis of sequences in the ns5, ns3, core, and 5'utr noncoding regions. this approach avoids the inconsistencies of earlier systems and should be easier to extend when new genotypes are discovered.
some genotypes of hcv (types 1a, 2a, and 2b) show a broad worldwide distribution, whereas others, such as types 5a and 6a, are found only in specific geographic regions. blood donors and patients with chronic hepatitis from countries in western europe and the united states frequently are infected with genotypes 1a, 1b, 2a, 2b, and 3a, although the relative frequencies of each may vary. 58,61,65-76 there is a trend for more frequent infection with type 1b in southern and eastern europe. in many european countries, genotype distributions vary with the age of the patients, reflecting rapid changes in genotype distribution with time within a single geographic area. a striking geographic change in genotype distribution is apparent between southeast europe and turkey (both mainly type 1b) and several countries in the middle east and parts of north and central africa where other genotypes predominate. for example, a high frequency of hcv infection is found in egypt (20% to 30%), 77-79 of which almost all corresponds to type 4a.
80,81 hcv type 4 also is the principal genotype in countries such as yemen, kuwait, iraq, and saudi arabia in the middle east 60 and in zaire, burundi, and gabon in central africa. 58,69,71 hcv genotype 5a is frequently found among patients with non-a, non-b hepatitis and blood donors in south africa 58,61,70,82 but is found only rarely in europe and elsewhere. 58,72
in japan, taiwan, and some parts of china, genotypes 1b, 2a, and 2b are the most frequently found. 63,83-90 infection with type 1a in japan appears to be confined to patients with hemophilia who received commercial (us-produced) blood products, such as factor viii and xi clotting concentrates. 63,91 the geographic distribution of type 3 varies; it is only rarely found in japan 92 and is also infrequent in taiwan, hong kong, and macau. 93 however, this genotype is found with increasing frequency in countries to the west, frequently occurring in singapore and accounting for most hepatitis infections in thailand. 93,94 in a small sample, it was the only genotype found in bangladesh and eastern india. 60 as with type 4 in africa, there is now evidence of considerable sequence diversity within the type 3 genotype, with at least 11 different subtypes of type 3 identified in nepal, 95 india, and bangladesh. 60
a genotype with a highly restricted geographic range is type 6a. this type was originally found in hong kong 69,80 and was shown to be a new major genotype by sequence comparisons in the ns5 and e1 regions. 58,61 approximately one third of anti-hcv-positive blood donors in hong kong are infected with this genotype, as are an equivalent proportion in neighboring macau 81 and vietnam. 96 a series of novel genotypes has been found in vietnam 96 and thailand 60; these genotypes are distinct from types 1 to 6 classified to date but are more closely related to type 6 than to other genotypes, 96 consistent with their overlapping geographic range with type 6 in southeast asia.
numerous investigations are being conducted into possible differences in the course of disease associated with different hcv genotypes, such as the rate of development of cirrhosis and hepatocellular carcinoma, and whether certain genotypes are more or less likely to respond to interferon treatment. a large number of clinical investigations have documented severe and progressive liver disease in patients infected with each of the well-characterized genotypes (types 1a, 1b, 2a, 2b, 3a, and 4a), so there is little evidence thus far of variants of hcv that are completely nonpathogenic. however, possible variation in the rate of disease progression, differences between genotypes in routes and frequency of person-to-person transmission, or differences in the probability of achieving a sustained response to antiviral treatment would indicate the potential usefulness of identifying the infecting genotype in certain clinical situations.
several clinical studies have catalogued a variety of factors (including genotype) that correlate with the severity of liver disease and show predictive value for response to antiviral treatment. factors that frequently have been shown to influence response to interferon treatment include age and duration of infection, presence of cirrhosis before treatment, genotype, and pretreatment level of circulating viral rna in plasma. 97 a consistent finding reported by several different groups that used a variety of typing assays has been the greatly increased rate of long-term response found when treating patients infected with genotypes 2a, 2b, and 3a compared with type 1b. 74,85,98-106 for example, chemello et al 102 found that long-term (>12 months) normalization of alanine aminotransferase levels was achieved in only 29% of patients infected with type 1 variants, compared with 52% of those infected with type 2 and 74% of those infected with type 3.
in a study by tsubota et al, 106 infection with type 1b, the presence of cirrhosis, and a high pretreatment virus load were each independently associated with a reduced chance of response (relative risks of 16, 5, and 4, respectively). the mechanism by which different genotypes differ in response to treatment remains obscure. for treatments such as interferon, we do not know whether the effect of the drug is directly antiviral or whether the inhibition of virus replication is secondary to increased expression of major histocompatibility complex class i antigens on the surface of hepatocytes and greater cytotoxic t-cell activity against virus-infected cells. elucidating the mechanism of action of interferon and whether there are virologic differences between genotypes in sensitivity to antiviral agents awaits a cell culture model for hcv infection.
although determination of the nucleotide sequences is the most reliable method of identifying different genotypes of hcv, this method is not practical for large clinical studies. many of the published methods for "genotyping" are based on amplification of viral sequences in clinical specimens, either by using type-specific primers that selectively amplify different genotypes, by analyzing the pcr product by hybridization with genotype-specific probes, or by using restriction fragment length polymorphisms (rflp). the assays have different strengths and weaknesses. for example, methods based on amplification and analysis of 5'ncr sequences have advantages of sensitivity, because this region is highly conserved and can be more frequently amplified from hcv-infected patients than other parts of the genome. however, few nucleotide differences are found between different genotypes. although reliably differentiating six major genotypes by using rflp or by type-specific probes is possible, it is not always possible to reliably identify virus subtypes.
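the principle behind rflp typing can be sketched in a few lines of python: a single nucleotide difference that creates or destroys a recognition site changes the pattern of fragment lengths after digestion. everything specific in this sketch is an assumption made for illustration: the two amplicons are invented, the recognition pattern `CCNGG` and its cut offset are example parameters rather than a statement about any enzyme named in the text, and `fragment_lengths` is a hypothetical helper.

```python
import re
from typing import List

# Minimal IUPAC table: only the codes needed for simple recognition sites.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T",
         "N": "[ACGT]", "R": "[AG]", "Y": "[CT]", "W": "[AT]", "S": "[CG]"}


def fragment_lengths(seq: str, site: str, cut_offset: int) -> List[int]:
    """Lengths of the fragments produced by cutting `seq` at every
    occurrence of `site`, `cut_offset` bases into the recognition site."""
    pattern = "".join(IUPAC[base] for base in site.upper())
    finder = re.compile("(?=(" + pattern + "))")  # lookahead finds overlaps
    cuts = [m.start() + cut_offset for m in finder.finditer(seq.upper())]
    bounds = [0] + cuts + [len(seq)]
    return [end - start for start, end in zip(bounds, bounds[1:])]


# Two toy amplicons differing at a single base (assumptions, not hcv data).
variant_x = "AAAACCTGGAAAAAAAAAAA"   # contains one CCNGG site
variant_y = "AAAACTTGGAAAAAAAAAAA"   # site destroyed by a C-to-T change

print(fragment_lengths(variant_x, "CCNGG", 2))
print(fragment_lengths(variant_y, "CCNGG", 2))
```

the two variants give different fragment patterns from one base change, which is also why variants that happen to share the same sequence at the site cannot be told apart by this approach.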
types 2a and 2b consistently differ at position -124, allowing them to be differentiated by the restriction enzyme scrfi 67 or by probes 10 to 13 in the inno-lipa (innogenetics, zwijnaarde, belgium). 71 however, sequences of type 2c are indistinguishable from some of those of type 2a. similarly, some of the novel subtypes of type 1 often show sequences identical to those of type 1a or 1b, 54,60 and a small proportion of type 1a variants are identical to type 1b and vice versa. 22
typing methods based on coding regions, such as core and ns5, can reliably identify subtypes as well as major genotypes because the degree of sequence divergence is much greater (table iii). however, amplifying sequences in coding regions of the genome generally is difficult because sequence variability in the primer-binding sites may reduce the effectiveness of sequence amplification by pcr. nevertheless, the variation is exploited in a genotyping assay that uses type-specific primers complementary to variable regions in the core gene. currently, this assay can identify and differentiate types 1a, 1b, 2a, 2b, and 3a, 107,108 although the method is technically complicated to perform reliably 109 and may be difficult to extend to the great range of hcv genotypes now described.
serologic typing methods have advantages over pcr-based methods in terms of the speed and simplicity of sample preparation and the use of simple equipment found in any diagnostic virology laboratory. by careful optimization of reagents, such assays may show high sensitivity and reproducibility. for example, type-specific antibody to ns4 peptides can be detected in approximately 95% of patients with non-a, non-b hepatitis. 110 furthermore, the assays can be readily extended to detect new genotypes.
one ns4-based assay can reliably identify type-specific antibody to six major genotypes, 110 although the antigenic similarity between subtypes currently precludes the separate identification of types 1a and 1b and 2a and 2b using the ns4 peptides alone.
in contrast to the highly restricted sequence diversity of the 5'ncr and adjacent core region, the two putative envelope genes are highly divergent between different variants of hcv (table iii) 111-114 and show a three-to-four-times higher rate of sequence change with time in persistently infected patients. 115 because these proteins are likely to lie on the outside of the virus, they would be the principal targets of the humoral immune response to hcv elicited on infection. changes in the e1 and e2 genes may alter the antigenicity of the virus to allow "immune escape" from neutralizing antibodies, 112 therefore accounting for both the high degree of envelope sequence variability and the observed persistent nature of hcv infection. supporting this model is the observation that much of the variability in the e1 and e2 genes is concentrated in discrete "hypervariable" regions, 112-114 possibly reflecting the pressures on the virus to evade immune recognition at specific sites where hcv may be neutralized. experimental evidence supporting this theory includes the observations that variants of hcv with changes in the e1 and e2 genes are antigenically distinct, and, in many cases, the in vivo appearance of variants with different sequences in the hypervariable region is followed by development of antibodies that specifically recognize the new variants. 112,116-118 in one report, 119 persistent hcv infection developed in a patient with deficient antibody responses (agammaglobulinemia), but without the development of sequence variability in e2, consistent with a role of antibody in driving variation in immunocompetent persons.
on the other hand, envelope sequences obtained sequentially from persistently infected patients sometimes show no significant change, 120-122 whereas in others, variants coexist with antibodies that recognize the corresponding hypervariable region peptides. 116 cytotoxic t-cell responses also may play a protective role in hcv infection, 123,124 as they do in other virus infections for which they are more important in virus clearance than antibody response to infection. although circumstantial evidence supports the theory of immune escape, additional studies are needed to confirm this as a plausible model of virus persistence. many of the current uncertainties may be resolved when a satisfactory in vitro neutralization assay is developed for hcv that enables the effect of amino acid changes in the envelope gene to be directly investigated. additional information also is needed on the relative importance of humoral and cell-mediated immunity to hcv, to determine which is more important in virus clearance and protection from reinfection.
persistent infection with hcv entails continuous replication of hcv over years or decades of infection in hcv carriers. the large number of replication cycles, combined with the relatively error-prone rna-dependent rna polymerase, leads to measurable sequence drift of hcv over time. for example, over an 8-year interval of persistent infection in a chimpanzee, the rate of sequence change for the genome as a whole was 0.144% per site per year, 115 similar to the rate calculated for sequence change in the 5' half of the genome over 3 years observed in a human carrier (0.192%) 125 and in a cross-sectional study. 126 using this "molecular clock," it is possible in principle to calculate times of divergence between hcv variants and therefore to establish their degree of epidemiologic relatedness.
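as a rough worked example of the molecular clock idea, the sketch below converts a pairwise sequence distance into an approximate time since two variants shared a common ancestor. it rests on simplifying assumptions that are not taken from the text: a constant rate along both lineages, so that the distance between two variants accumulates at twice the per-lineage rate, and no correction for repeated changes at the same site. the 2% distance is an invented input, and `divergence_time_years` is a hypothetical helper; only the rate of 0.144% per site per year comes from the text.

```python
def divergence_time_years(pairwise_distance: float,
                          rate_per_site_per_year: float) -> float:
    """Approximate years since two sequences diverged.

    pairwise_distance: proportion of sites that differ (e.g. 0.02 for 2%).
    rate_per_site_per_year: substitutions per site per year on one lineage.
    Both lineages are assumed to change at the same rate, so the
    distance between them grows at twice that rate.
    """
    if rate_per_site_per_year <= 0:
        raise ValueError("rate must be positive")
    if pairwise_distance < 0:
        raise ValueError("distance cannot be negative")
    return pairwise_distance / (2.0 * rate_per_site_per_year)


# Rate quoted in the text for the whole genome: 0.144% per site per year.
genome_rate = 0.144 / 100.0
# Hypothetical distance between two variants: 2% of sites differ.
print(round(divergence_time_years(0.02, genome_rate), 1))
```

because rates differ between regions of the genome, the rate supplied should be one estimated for the same region as the sequences being compared.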
for example, the finding of relatively few sequence differences between variants infecting two individuals would provide evidence of recent hcv transmission between them. sequence comparisons in variable regions, such as e2 and ns5, of the hcv genome have been used to document transmission between persons, either from mother to child, 127 within families, 128 by iatrogenic routes, 129-132 or by sexual contact. 133-135 in these studies, the possibility of transmission by different risk behaviors was assessed by measuring the degree of relatedness of hcv recovered from implicated persons. phylogenetic analysis of nucleotide sequences provides a more formal method of investigating relationships between sequences. phylogenetic trees produced by such methods indicate the degree of relatedness between sequences, while the branching order of the different lineages shows the most likely evolutionary history of the sampled population. for example, clustering of hcv sequences into a single phylogenetic group among recipients of an hcv-contaminated blood product (anti-d immunoglobulin) was still apparent 17 years after infection (figure 5). these approaches to hcv epidemiology will prove valuable in documenting the spread of hcv in different risk groups, evaluating alternative (nonparenteral) routes of transmission, and understanding more about the origins and evolution of hcv.
figure 5. phylogenetic relationships between sequences from the ns5 region of patients exposed to an implicated batch of anti-d immunoglobulin (ig) in 1977 (o) and those of epidemiologically unrelated type 1b variants from japan (j), the united states (u), and europe (e). b250 = ns5 sequence of hepatitis c virus recovered from batch b250 of anti-d ig; donor = sequence of variant infecting suspected donor to plasma pool used to manufacture batch b250. phylogenetic analysis was done on a segment (222 base pairs; positions 7975 to 8196) of the ns5 gene that was amplified, sequenced, and analyzed as previously described. 61 sequence distances were calculated using the program dnaml in a data set containing the prototype hepatitis c virus (type 1a) as an outgroup. sequences were obtained from published sources. 61,76
this paper attempts to review a rapidly expanding area of research. it is hoped that a combination of basic science and clinical studies may eventually lead to a greater understanding of the ways in which hcv infection may be prevented or cured by the use of antiviral vaccines. the information provided here will clearly form the basis of many of these developments.
of the hepatitis c virus core protein. j virol. 1994; 68:3631-3641. 32 isolation of a cdna derived from a bloodborne non-a, non-b hepatitis genome. science an assay for circulating antibodies to a major etiologic virus of human non-a, non-b hepatitis genetic organization and diversity of the hepatitis c virus nucleotide sequence of yellow fever virus: implications for flavivirus gene expression and evolution hepatitis c virus shares amino acid sequence similarity with pestiviruses and flaviviruses as well as members of two plant virus supergroups the phylogeny of rna-dependent rna polymerases of positivestrand rna viruses internal ribosome entry site within hepatitis c virus rna secondary structure of the 5' nontranslated region of hepatitis c virus and pestivirus genomic rnas a conserved helical element is essential for internal initiation of translation of hepatitis c virus rna pestivirus translation initiation occurs by internal ribosome entry extraordinarily low density of hepatitis c virus estimated by sucrose density gradient centrifugation and the polymerase chain reaction identification of two flavivirus-like genomes in the gb hepatitis agent phylip inference package version 3.5. 
seattle, wash: department of genetics evidence for in vitro replication of hepatitis c virus genome in a human t-cell line correlation between the infectivity of hepatitis c virus in vivo and its infectivity in vitro susceptibility of human liver cell cultures to hepatitis c virus infection multicycle infection of hepatitis c virus in cell culture and inhibition by alpha and beta interferons susceptibility of human t-lymphotropic virus type i infected cell line mt-2 to hepatitis c virus infection transfection of a differentiated human hepatoma cell line (huh7) with in vitro-transcribed hepatitis c virus (hcv) rna and establishment of a long-term culture persistently infected with hcv demonstration of in vitro infection of chimpanzee hepatocytes with hepatitis c virus using strand-specific rt/pcr. virology in situ detection of hepatitis c virus: a critical appraisal variation of the hepatitis c virus 5'-non coding region: implications for secondary structure, virus detection and typing translation of human hepatitis c virus rna in cultured cells is mediated by an internal ribosome-binding mechanism complete 5' noncoding region is necessary for the efficient internal initiation of hepatitis c virus rna unusual folding regions and ribosome landing pad within hepatitis c virus and pestivirus rnas end-dependent translation initiation of hepatitis c viral rna and the presence of putative positive and negative translational control elements within the 5' untranslated region expression, identification and subcellular localization of the proteins encoded by the hepatitis c viral genome characterization of hepatitis c virus envelope glycoprotein complexes expressed by recombinant vaccinia viruses expression and identification of hepatitis c virus polyprotein cleavage products gene mapping of the putative structural region of the hepatitis c virus genome by in vitro processing analysis a second hepatitis c virus-encoded proteinase hepatitis c virus ns3 serine proteinase: 
transcleavage requirements and processing kinetics identification of the domain required for trans-cleavage activity of hepatitis c viral serine proteinase substrate requirements of hepatitis c virus serine proteinase for intermolecular polypeptide cleavage in escherichia coli specificity of the hepatitis c virus ns3 serine protease: effects of substitutions at the 3/4a, 4a/4b, 4b/5a, and 5a/5b cleavage sites on polyprotein processing substrate determinants for cleavage in cis and in trans by the hepatitis c virus ns3 proteinase nucleotide sequence of hepatitis c virus (type 3b) isolated from a japanese patient with chronic hepatitis c at least 12 genotypes of hepatitis c virus predicted by sequence analysis of the putative e1 gene of isolates collected worldwide sequence analysis of the core gene of 14 hepatitis c virus genotypes investigation of the pattern of hepatitis c virus sequence diversity in different geographical regions: implications for virus classification classification of hepatitis c virus into six major genotypes and a series of subtypes by phylogenetic analysis of the ns-5 region a proposed system for the nomenclature of hepatitis c viral genotypes there are two major types of hepatitis c virus in japan analysis of a new hepatitis c virus type and its phylogenetic relationship to existing variants serological responses to infection with three different types of hepatitis c virus two french genotypes of hepatitis c virus: homology of the predominant genotype with the prototype american strain detection of three types of hepatitis c virus in blood donors: investigation of type-specific differences in serological reactivity and rate of alanine aminotransferase abnormalities identification of hepatitis c viruses with a nonconserved sequence of the 5' untranslated region sequence analysis of the 5' noncoding region of hepatitis c virus at least five related, but distinct, hepatitis c viral genotypes exist typing of hepatitis c virus isolates and new 
subtypes using a line probe assay sequence analysis of the 5' untranslated region in isolates of at least four genotypes of hepatitis c virus in the netherlands use of the 5' non-coding region for genotyping hepatitis c virus genotypes of hepatitis c virus in italian patients with chronic hepatitis c heterogeneity of hepatitis c virus genotypes in france genotypic analysis of hepatitis c virus in american patients hepatitis c virus infection in egyptian volunteer blood donors in riyadh risk factors associated with a high seroprevalence of hepatitis c virus infection in egyptian blood donors high hcv prevalence in egyptian blood donors sequence variability in the 5' non coding region of hepatitis c virus: identification of a new virus type and restrictions on sequence diversity geographical distribution of hepatitis c virus genotypes in blood donors: an international collaborative survey new genotype of hepatitis c virus in south-africa typing of hepatitis c virus (hcv) genomes by restriction fragment length polymorphisms distribution of plural hcv types in japan clinical backgrounds of the patients having different types of hepatitis c virus genomes genomic typing of hepatitis c viruses present in china hcv genotypes in china hcv genotypes in different countries differences in the hepatitis c virus genotypes in different countries prevalence, genotypes, and an isolate (hc-c2) of hepatitis c virus in chinese patients with liver disease imported hepatitis c virus genotypes in japanese hemophiliacs genotypic subtyping of hepatitis c virus survey of major genotypes and subtypes of hepatitis c virus using restriction fragment length polymorphism of sequences amplified from the 5' non-coding region a new type of hepatitis c virus in patients in thailand hepatitis c virus variants from nepal with novel genotypes and their classification into the third major group hepatitis c virus variants from vietnam are classifiable into the seventh, eighth, and ninth major genetic 
groups prediction of response to interferon treatment of chronic hepatitis c hcv genotypes in chronic hepatitis c and response to interferon detection of hepatitis c virus by polymerase chain reaction and response to interferon-alpha therapy: relationship to genotypes of hepatitis c virus factors useful in predicting the response to interferon therapy in chronic hepatitis c hepatitis c virus genotypes--an investigation of type-specific differences in geographic origin and disease simmonds p. hepatitis c serotype and response to interferon therapy prediction of interferon effect in chronic hepatitis c by both quantification and genotyping of hcv-rna genotypes and titers of hepatitis c virus for predicting response to interferon in patients with chronic hepatitis c antiviral effect of lymphoblastoid interferon-alpha on hepatitis c virus in patients with chronic hepatitis type c factors predictive of response to interferon-alpha therapy in hepatitis c virus infection typing hepatitis c virus by polymerase chain reaction with type-specific primers: application to clinical surveys and tracing infectious sources characterization of the genomic sequence of type v (or 3a) hepatitis c virus isolates and pcr primers for specific detection application of six hepatitis c virus genotyping systems to sera from chronic hepatitis c patients in the united states use of ns-4 peptides to identify typespecific antibody to hepatitis c virus genotypes 1, 2, 3, 4, 5 and 6 characterization of hypervariable regions in the putative envelope protein of hepatitis c virus evidence for immune selection of hepatitis c virus (hcv) putative envelope glycoprotein variants: potential role in chronic hcv infections marked sequence diversity in the putative envelope proteins of hepatitis c viruses hypervariable regions in the putative glycoprotein of hepatitis c virus genetic drift of hepatitis-c virus during an 8.2-year infection in a chimpanzee--variability and stability humoral immune response to 
hypervariable region-1 of the putative envelope glycoprotein (gp70) of hepatitis c virus hypervariable 5'-terminus of hepatitis c virus e2/ns1 encodes antigenically distinct variants a structurally flexible and antigenically variable n-terminal domain of the hepatitis c virus e2/ns1 protein--implication for an escape from antibody hypervariable region of hepatitis c virus envelope glycoprotein (e2 ns1) in an agammaglobulinemic patient the degree of variability in the amino terminal region of the e2/ns 1 protein of hepatitis c virus correlates with responsiveness to interferon therapy in viraemic patients sequence variation in the large envelope glycoprotein (e2/ns 1) of hepatitis c virus during chronic infection dynamics of genome change in the e2/ns1 region of hepatitis c virus in vivo intrahepatic cytotoxic t lymphocytes specific for hepatitis-c virus in persons with chronic hepatitis hepatitis c virus (hcv)-specific cytotoxic t lymphocytes recognize epitopes in the core and envelope proteins of hcv nucleotide sequence and mutation rate of the h strain of hepatitis c virus analysis of genomic variability of hepatitis c virus a unique, predominant hepatitis c virus variant found in an infant born to a mother with multiple variants risk of hepatitis c virus infections through household contact with chronic carriers--analysis of nucleotide sequences comparison of hepatitis c virus strains obtained from hemodialysis patients hepatitis c viral markers in patients who received blood that was positive for hepatitis c virus core antibody, with genetic evidence of hepatitis c virus transmission hepatitis c transmission in a hemodialysis unit: molecular evidence for spread of virus among patients not sharing equipment confl,~mation of hepatitis c virus transmission through needlestick accidents by molecular evolutionary analysis heterosexual transmission of hepatitis c virus analysis of nucleotide sequences of hepatitis c virus isolates from husband-wife pairs acute 
hepatitis c infection after sexual exposure key: cord-001974-wjf3c7a7 authors: friis-nielsen, jens; kjartansdóttir, kristín rós; mollerup, sarah; asplund, maria; mourier, tobias; jensen, randi holm; hansen, thomas arn; rey-iglesia, alba; richter, stine raith; nielsen, ida broman; alquezar-planas, david e.; olsen, pernille v. s.; vinner, lasse; fridholm, helena; nielsen, lars peter; willerslev, eske; sicheritz-pontén, thomas; lund, ole; hansen, anders johannes; izarzugaza, jose m. g.; brunak, søren title: identification of known and novel recurrent viral sequences in data from multiple patients and multiple cancers date: 2016-02-19 journal: viruses doi: 10.3390/v8020053 sha: doc_id: 1974 cord_uid: wjf3c7a7 virus discovery from high throughput sequencing data often follows a bottom-up approach where taxonomic annotation takes place prior to association to disease. albeit effective in some cases, the approach fails to detect novel pathogens and remote variants not present in reference databases. we have developed a species independent pipeline that utilises sequence clustering for the identification of nucleotide sequences that co-occur across multiple sequencing data instances. we applied the workflow to 686 sequencing libraries from 252 cancer samples of different cancer and tissue types, 32 non-template controls, and 24 test samples. recurrent sequences were statistically associated to biological, methodological or technical features with the aim to identify novel pathogens or plausible contaminants that may associate to a particular kit or method. we provide examples of identified inhabitants of the healthy tissue flora as well as experimental contaminants. unmapped sequences that co-occur with high statistical significance potentially represent the unknown sequence space where novel pathogens can be identified. the international agency for research on cancer (iarc) lists several biological species with carcinogenic potential in humans [1] . 
this list comprises a bacterium (species helicobacter pylori), three parasitic flukes (clonorchis sinensis, opisthorchis viverrini and schistosoma haematobium), and seven viruses: human papillomaviruses (hpv), human immunodeficiency virus-1 (hiv-1), epstein-barr virus (ebv), hepatitis b and c virus (hbv and hcv), kaposi's sarcoma-associated herpesvirus (kshv), and human t-cell lymphotropic virus type 1 (htlv-1). with the advent and spread of low-cost sequencing technologies, many viruses were discovered in the last decade [2-8]. one interesting discovery that fuelled the search for oncoviruses was merkel cell polyomavirus (mcpyv) found to be clonally integrated in merkel cell carcinomas [9, 10]. the computational biology community has promptly responded to the growing need for specialised algorithms and pipelines to analyse the wealth of data [9, 11-25]. table s1 summarises the main features of some of the common approaches. in spite of particularities in the implementation, these methodologies share key conceptual similarities: first, sequencing reads or assembled contigs that originate from the host are identified and discarded, a process termed computational subtraction [9, 13]. when the genomes or the concentrations of foreign species are small compared to host genomes, this step eliminates a substantial fraction of the total sequencing reads. second, the remaining non-host sequences are compared to a library of known reference sequences for taxonomic characterisation. the aforementioned methods identify species present across multiple samples, and the recurrence of a given viral entity may indicate an association to disease [10, 26]. albeit conceptually valid, this bottom-up approach is inherently limited to the pre-existence of the organism in the reference databases, whereas novel oncoviruses showing partial or no similarity to known sequences will be missed.
current efforts aiming at estimating and characterising metagenomic diversity are far from a complete mapping of the (viral) sequence-space [27]. in fact, it is common to observe that a small but significant amount of unknown sequences, the so-called dark matter [28], goes through the current analysis pipelines without proper characterisation and is discarded from further analyses [24, 29, 30]. here, we propose a method capable of identifying the recurrence of sequences across related samples independently of their existence in reference databases. our top-down approach compares samples and establishes recurrence prior to the taxonomic characterisation of the sequences, thus enabling the identification of both known and novel biological entities. our method has conceptual similarities to the work of andreatta et al. [31], where clustering of genes is used to find families that are predominantly found in pathogenic bacteria. according to koch's postulates as modified by fredericks and relman [32], sequences from biological entities with a causative or facilitator role would be present in diseased samples and absent in healthy controls. in addition, recent studies documented the presence of contaminating and/or artefactual sequences that originate from the laboratory kits and reagents used for sample processing and library preparation [14, 33-37]. if not properly addressed, these confounding observations may lead to erroneous conclusions [38, 39]. our method ascertains the statistical associations between recurrent sequences and a collection of features that describe the samples with respect to tissue, disease type, laboratory method, etc. additionally, the presence of other known technical problems, such as cluster invasion on the sequencing flow cells [40], might be detected. the study was conducted in accordance with the declaration of helsinki.
two ethical boards reviewed the protocol of this study: the regional committee on health research ethics (case no. h-2-2012-fsp2) and the national committee on health research ethics (case no. 1304226). because the study used only samples that were anonymised at collection both boards waived the need for informed consent in compliance with the national legislation in denmark. two hundred and fifty-two cancer samples of 17 different types were collected from various locations in denmark and hungary. cancer samples of malignant melanoma, acute myeloid leukaemia (aml), b-cell chronic lymphocytic leukaemia (b-cll), chronic myelogenous leukaemia (cml), and t-lineage acute lymphoblastic leukaemia (t-all; n = 9) were obtained from aarhus university hospital, denmark. b-cell precursor acute lymphoblastic leukaemia (bcp-all), oropharyngeal head and neck cancer, testicular cancer, and t-all (n = 2) were obtained from rigshospitalet, denmark (copenhagen university hospital). basal cell carcinoma, and mycosis fungoides (cutaneous t-cell lymphoma) were obtained from bispebjerg hospital (copenhagen university hospital). samples of bladder cancer, breast cancer, colon cancer, as well as ascites fluid of breast cancer, colon cancer, ovarian cancer, and pancreatic cancer were obtained from the danish cancer biobank, herlev hospital, denmark. b-cell lymphoma cell lines were obtained from aalborg university hospital, denmark. vulva cancer samples were obtained from the national institute of oncology, budapest, hungary. libraries were prepared at the center for geogenetics (cgg), university of copenhagen, denmark based on seven different methods for sample processing comprising five different enrichment methods and shotgun sequencing targeting total dna or rna (table s3 ). the enrichment methods used in the current work were circular genome amplification, sequence capture with retrovirus probes, virion enrichment (dna and rna), and mrna enrichment. 
further details on sample processing and library preparation have been published elsewhere [37, 41, 42], except for mrna enrichment, which was performed using dynabeads mrna direct extraction kit (thermo fisher scientific, waltham, ma, usa) followed by scriptseq v2 rna-seq library preparation kit as for total rna analysis [41]. ultimately, the data set consisted of 686 dna and rna libraries, for which 2 × 100 bp paired end sequencing was performed using the illumina hiseq 2000 platform at bgi-europe, copenhagen, denmark. the 686 sequencing libraries thus originated from 252 different cancer samples, 32 non-template controls, and 24 exogenous controls. the distribution of methods, libraries and controls for each sample type is provided in table s2. samples were preferably analysed with multiple methods, thus 165 out of 252 samples were analysed with more than one laboratory method (table s3). the datasets went through a sequential pipeline with modules (in order) of preprocessing, computational subtraction of host sequences, low-complexity sequence removal, sequence assembly, clustering, association to metadata features, and taxonomical annotation. figure 1 provides a schematic representation of the pipeline used to identify recurrent sequences across related samples. demultiplexing was performed using a local python script to partition the reads based on exact matches in the fastq header lines to the multiplexed indices provided. preprocessing of reads was performed for all datasets in parallel using adapterremoval [43] with the following parameters {-trimns, -trimqualities, -minquality 2, -minlength 30, -collapse, -outputcollapsed, -outputcollapsedtruncated, -singleton}. read ends were trimmed for low quality base calls. reads were discarded if the length after trimming fell below 30 bp. in these cases, the other read in a pair was kept as a singleton.
overlapping paired reads from short inserts were collapsed into a single read if the overlap was longer than 11 bp, according to the default behaviour of adapterremoval. preprocessed reads were filtered if they showed homology to the human reference genome, which included decoys and alternative sequences from version gca_000001405.15 (grch38) of the genome reference consortium (downloaded august 20, 2014). mapping to the human genome was done using bwa [44] version 0.7.10-r789 with the mem alignment algorithm and default parameters. all reads lacking the sequence alignment/map (sam) [45] flag 4 (read unmapped), i.e. reads that mapped to the human genome, were discarded. single-unmapped reads from read pairs were kept. human depleted reads were filtered for low-complexity regions using the ncbi-blast associated module dustmasker [46] and default parameters. reads containing low-complexity stretches of 25 bp or longer were discarded. assembly of the remaining (non-human, high complexity) reads was performed with idba-ud [47] and parameters {-precorrection}. contigs shorter than 200 bp were discarded. a total of 1,387,377 contigs, originating from the 686 data sets, went through the entire pipeline. contigs ranged from 200 bp to 418,807 bp with an overall n50 of 817 bp. contigs from all data sets were pooled and clustered based on pairwise sequence homology using cd-hit [48], in fast mode. we chose parameters for clustering that maximised grouping of similar sequences while minimising inclusion of unrelated sequences. we evaluated the parameter combinations listed in table s4, from which we chose the final settings {-c 0.90 -as 0.90 -g 1}. the datasets were described with a panel of 404 different binary metadata features, for example tissue or disease characteristics (table s5). each feature recorded, as true or false, whether it applied to a particular dataset. features describing fewer than five datasets were removed. additionally, features that correlated perfectly in terms of matthews correlation coefficient (mcc = ±1) were merged.
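the feature filtering and merging step can be sketched as follows. this is a minimal python illustration under stated assumptions, not the authors' code: the names `mcc` and `filter_and_merge` and the toy feature vectors are hypothetical, and treating both perfectly correlated and perfectly anti-correlated features (absolute mcc of 1) as mergeable is an assumed reading of the criterion.

```python
from math import isclose, sqrt


def mcc(a, b):
    """Matthews correlation coefficient of two equal-length 0/1 vectors."""
    tp = sum(1 for x, y in zip(a, b) if x and y)
    tn = sum(1 for x, y in zip(a, b) if not x and not y)
    fp = sum(1 for x, y in zip(a, b) if not x and y)
    fn = sum(1 for x, y in zip(a, b) if x and not y)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention the coefficient is 0 when a vector is constant.
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


def filter_and_merge(features, min_datasets=5):
    """features: dict mapping a feature name to one 0/1 flag per dataset.

    Drops features that describe fewer than `min_datasets` datasets, then
    groups the remaining features whose |MCC| with a group's first member
    is 1 (assumption: both +1 and -1 count as a perfect correlation).
    Returns a list of groups, each a list of feature names.
    """
    kept = {name: vec for name, vec in features.items()
            if sum(vec) >= min_datasets}
    groups = []
    for name, vec in kept.items():
        for group in groups:
            if isclose(abs(mcc(vec, kept[group[0]])), 1.0):
                group.append(name)
                break
        else:
            groups.append([name])
    return groups


# Hypothetical toy panel with eight datasets.
toy = {
    "f_a": [1, 1, 1, 1, 1, 0, 0, 0],
    "f_b": [1, 1, 1, 1, 1, 0, 0, 0],  # identical to f_a, so merged with it
    "f_c": [1, 0, 0, 0, 0, 0, 0, 0],  # describes one dataset, so removed
    "f_d": [1, 1, 1, 1, 1, 1, 0, 0],  # similar to f_a but not identical
}
groups = filter_and_merge(toy)
```

in this toy panel, `f_a` and `f_b` end up in one group, `f_c` is removed, and `f_d` stays separate.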
these filters resulted in 143 unique features (table s5). biological features (n = 25) defined sample type, for instance tissue or disease category. methodological features (n = 49) described specifics for sample preparation such as extraction kits, enrichment methods, polymerases, primers, buffer, filters used, or the laboratory where the work was performed, etc. technical features (n = 69) defined the flow cell lane identifiers and whether resequencing was performed. the distributions of datasets and samples across the features are provided in table s5. associations between the clustered contigs and the metadata features were evaluated with fisher's exact test using a one-tailed alternative hypothesis (greater) and calculated in r using the function fisher.test [49]. annotation of taxonomy was performed in two rounds. first, contigs were aligned with blastn [50] with parameters {-evalue 0.001} using default {-task megablast} to a frozen version of the ncbi nucleotide database nt (downloaded 3 february 2015). second, all unmapped contig stretches were aligned using blastx with parameters {-evalue 0.001} to a frozen version of the ncbi non-redundant protein database nr (downloaded 3 february 2015). the best hit by highest bit-score was kept for each contig. the taxonomy database (downloaded 3 february 2015) was used to translate all genbank identifiers from hits to taxonomy identifiers. the taxonomy identifiers were then used to obtain the complete taxonomical lineage and extract scientific names of species. the abundances of all species in each cluster were used to calculate the species evenness index as defined by mulder et al. [51]. clusters were annotated as the most abundant species in each cluster. the software to use after the assembly step has been uploaded at https://github.com/jensfriisnielsen/sequence_recurrence. sequence clusters that have been described in detail throughout the manuscript have been included as supplementary files.
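for readers who want to see what the one-tailed (greater) fisher's exact test computes for a single cluster and feature, a minimal python sketch follows. the study itself used r's fisher.test; the function name `fisher_greater`, the layout of the 2 x 2 table and the example counts below are assumptions made for illustration.

```python
from math import comb


def fisher_greater(a, b, c, d):
    """One-tailed (greater) Fisher's exact test on the 2 x 2 table

        [[a, b],
         [c, d]]

    a: datasets with the feature that contain the cluster
    b: datasets with the feature that lack the cluster
    c: datasets without the feature that contain the cluster
    d: datasets without the feature that lack the cluster

    Returns P(X >= a) under the hypergeometric null with fixed margins.
    """
    row1 = a + b          # datasets with the feature
    col1 = a + c          # datasets containing the cluster
    n = a + b + c + d     # all datasets
    upper = min(row1, col1)
    tail = sum(comb(row1, k) * comb(n - row1, col1 - k)
               for k in range(a, upper + 1))
    return tail / comb(n, col1)


# Hypothetical counts: the cluster occurs in 3 of 4 datasets with the
# feature and in 1 of 4 datasets without it.
p_value = fisher_greater(3, 1, 1, 3)
```

with the counts above, the tail sums the two tables at least as extreme as the observed one, giving 17/70.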
clustering performance depends on the adequate selection of parameters. we experimented with a variety of configurations described by the pattern c0xam0ygz, where x, m, y and z varied. the variables denote the minimum percentage of sequence identity x (c0x), the minimum percentage of alignment length y (am0y) measured relative to mode m, the shortest (as) or longest (al) contig in the alignment, and the local (g0) or global (g1) alignment mode z (gz). for example, a configuration encoded c090as090g1 would represent a clustering that requires global alignments with a 90% minimum sequence identity over 90% of the length of the shortest contig. the full list of investigated parameter combinations can be found in table s4. we chose the parameters based on the performance of the clustering of expected contaminant sequences from avian leukosis virus (accession id ay350569) [37] and other related avian retroviruses (ars) such as avian myeloblastosis virus [52]. ars are used in the manufacture of the reverse transcriptase failsafe pcr enzyme (epicentre, madison, wi, usa) included in the utilized scriptseq v2 rna-seq library preparation kit (illumina, san diego, ca, usa). this kit is commonly used for preparation of rna libraries [52]. we identified clusters containing contigs that aligned to species of the alpharetrovirus genus (ncbi taxa-id: 153057) according to blastn and blastx, hereafter referred to as ar clusters. all contigs in ar clusters were resolved with blastn and blastx, and two metrics were considered for ar clusters. as the first performance metric, we computed the odds ratios (ors) of the associations between the presence of ar in the clusters and the use of the scriptseq kit. we used a 2 × 2 contingency table defining the sets of libraries: ar positive and scriptseq positive (arpssp); ar positive and scriptseq negative (arpssn); ar negative and scriptseq positive (arnssp); ar negative and scriptseq negative (arnssn).
the or is then defined as the ratio (arpssp × arnssn) / (arpssn × arnssp) and describes the strength of the association between clusters and features. ors above 1 indicate association between the presence of the ar virus and the use of the scriptseq kit. ors for all ar clusters were inspected in different parameter settings (figure s1). the ors varied mostly block-wise with the parameters. the largest differences observed were between usages of the shortest or longest sequence in alignments with the alignment length filter. associations from the shortest mode tended to have higher dispersion in the range of ors. furthermore, one block of clustering results using global alignment mode, alignment length based on the shortest contig, and a minimum sequence identity of 90% (c090asyg1), had an overall high range of ors as well as the highest minimum values. this suggested that the clustering was able to reproduce the association between ar clusters and the scriptseq kit. in contrast, the clustering with parameter settings c080as030g0 had a very broad range of ors corresponding to a skewed clustering where some clusters had incorporated most sequences and left other clusters with only a few contigs. as a second performance metric we computed the species evenness [51] indices of the ar clusters represented in figure s1. the species evenness index is a score derived from shannon's diversity index [53] and compares the abundance of each species within a cluster. an index of zero is assigned to clusters that consist solely of contigs mapping to a single species. conversely, scores closer to 1 indicate that the cluster points to several species and that these are equally abundant. in our experiment, we favoured lower evenness indices as they indicate that clusters were able to single out species correctly.
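the two performance metrics can be illustrated with a short python sketch. it is written under assumptions rather than taken from the published software: the function names are hypothetical, the evenness is computed as shannon's diversity divided by the natural logarithm of the number of species (a common pielou-style definition that matches the 0-to-1 behaviour described above), and a cluster with a single species is assigned an evenness of 0 as stated in the text.

```python
from collections import Counter
from math import inf, isclose, log


def odds_ratio(arpssp, arpssn, arnssp, arnssn):
    """Odds ratio from the four library counts of the 2 x 2 table."""
    numerator = arpssp * arnssn
    denominator = arpssn * arnssp
    if denominator == 0:
        # Assumption: report infinity rather than adding a continuity
        # correction when an off-diagonal cell is empty.
        return inf if numerator > 0 else float("nan")
    return numerator / denominator


def species_evenness(species_per_contig):
    """Shannon diversity H divided by ln(S) for the S species in a cluster.

    species_per_contig: one species label per annotated contig.
    Returns 0.0 for clusters with at most one species.
    """
    counts = Counter(species_per_contig)
    n_species = len(counts)
    if n_species <= 1:
        return 0.0
    total = sum(counts.values())
    shannon = -sum((c / total) * log(c / total) for c in counts.values())
    return shannon / log(n_species)


# Hypothetical counts: 20 AR-positive ScriptSeq libraries, 2 AR-positive
# libraries made without the kit, 5 AR-negative ScriptSeq libraries and
# 40 AR-negative libraries made without the kit.
example_or = odds_ratio(20, 2, 5, 40)

pure_cluster = species_evenness(["avian leukosis virus"] * 10)
mixed_cluster = species_evenness(["species a", "species b"] * 5)
```

with these toy inputs the odds ratio is 800 / 10 = 80, the single-species cluster scores 0, and the cluster split equally between two species scores 1.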
for example, parameter settings c080al030g1 generally had a high level of species evenness (median 0.73) in clusters, suggesting an incorrect separation of species. in stark contrast, a block of parameters using global alignment mode, alignment length based on shortest sequence, 90% minimal sequence identity, and a minimum alignment length of 80/85/90/95/99% of the shortest sequence (c090asyg1) all had a median species evenness of 0. this group of parameter settings also showed desirable performance in terms of or, as mentioned before. generally it seemed that global mode (g1) had better ors than local mode (g0) when keeping other parameters fixed. additionally keeping 90% minimal sequence identity (c090) and varying minimal length of alignment in shortest mode (as) seemed stable in both ors and species evenness indices indicating that these close parameter settings were generally good. we chose to proceed with a clustering based on global alignments with a 90% minimum sequence identity over 90% of the length of the shortest contig (c090as090g1). this configuration resulted in a total of 681,858 clusters. of these, 23,205 clusters contained contigs from at least five different data sets and represented 546,735 different contigs. the full distribution of cluster sizes can be found in table s6 . the associations between the clusters and the binary metadata features were assessed using a fisher's one tailed exact test. there were 16,567 significant associations having p-value < 3.01e-10, corresponding to a 0.001 significance level when using bonferroni's correction for multiple testing [54] . the significant associations were arranged in 6165 unique clusters and with 73 unique features. the distribution of the significant associations showed that recurrent sequences originated from diverse sources and that individual clusters often associated to more than one feature (figure 2 ). furthermore it is evident that the clusters tend to group in their associations. 
likely, these groupings represent one or more organisms. we investigated the nature of the clusters accounting only for the associated feature with the smallest p-value; hereafter described as the strongest associations. there were 50 unique features involved in all the strongest associations. the distribution of p-values for each feature is represented in figure 3. the 6165 strongest associations comprised 602 biological, 5045 methodological and 518 technical associations. the 50 unique features comprised 3 biological, 24 methodological and 23 technical features. most p-values were above 1e-24 and associations with lower p-values were to a few methodological features annotated as extraction kits: qiaamp dna mini kit (f056) (qiagen, hilden, germany), dnase/rnase: promega dnase (f068) (promega, madison, wi, usa), and dnase/rnase: promega dnase stop solution (f069), purification kit: rneasy minelute, qiagen (f076), library build: nebnext, new england biolabs (f079) (new england biolabs, ipswich, ma, usa), and scriptseq v2 rna-seq, illumina (f084); the latter with a minimum p-value of 3.04e-89. the boxes span the first and third quartiles. the dark band inside each box represents the median. the whiskers of the boxes extend to the lowest and highest values within a distance of 1.5 times the interquartile range. as can be seen, most p-values were above 1e-24, but a few methodological features have associated clusters with very low p-values, such as f056, f068, f069, f076, f079, and f084. the library preparation kit scriptseq v2 rna-seq, illumina (f084) displays strongly associated clusters with p-values as low as 3.04e-89 that mapped as species avian myeloblastosis-associated virus. clusters that were annotated as ncbi species parvovirus nih/cqv were associated to laboratory-kit rneasy minelute, qiagen (f076) with minimal p-value 5.48e-38.
finally, a cluster annotated as acanthocystis turfacea chlorella virus mn0810.1 (atcv) was associated to dnase/rnase: promega dnase stop solution (f069) with p-value = 4.19e-12. using blast and the ncbi taxonomy database a taxonomic characterisation was attempted for the 546,735 contigs in the 6165 clusters. this resulted in a taxonomical annotation of 3553 clusters using blastn and an additional 1630 clusters when using blastx. for 982 clusters, neither blastn nor blastx found significant species in the database. these clusters remained uncharacterised (table 1) . we found that almost all clusters significantly associated to biological features could be annotated (598 of 602) in contrast to non-biologically associated clusters (4584 of 5563). a total of 1524 unique species were annotated corresponding to 5183 clusters. the human microbiome project (hmp) [55] defines a collection of reference genomes built from metagenomic samples and associates these to specific sites and tissues across human body sites. we used this data set of 1317 associations as a confirmation that our pipeline was able to correctly detect and taxonomically characterise recurrent biologically relevant sequences. hmp provides a list of commensal organisms commonly found in the three sites that relate to our samples: the gastrointestinal tract, oral cavity and urogenital tract. we observed the strongest, significant associations between the expected organisms and biopsies from colon cancer, oral cavity cancer, and vulva cancer. the taxonomical characterisation of these clusters is described in table 2 . seven clusters significantly associated to colon cancer biopsies describing four different organisms that inhabit the gastrointestinal tract according to hmp, and 342 clusters significantly associated to oral cavity cancer describing 13 different organisms present in the oral cavity in hmp. 
finally, we also discovered a cluster significantly associated with vulva cancer and annotated as species campylobacter ureolyticus (p-value = 1.03e-12), an inhabitant of the urogenital tract as described by hmp.
table 2 . taxonomical characterisation of certain biologically associated clusters. the clusters are significantly associated, with lowest p-values, with biological features, and the species annotations are described by hmp. in cases where several clusters shared the annotated species, the lowest p-value of the associations is given. #sig: number of significant clusters.
in the methodological associations, we correctly detected the strong known association (p-value: 3.04e-89) of avian myeloblastosis-associated virus (accession l10922.1), used in the manufacture of the scriptseq v2 rna-seq library preparation kit (f084). as the clustering parameters were evaluated with this known contaminant, this is expected. furthermore, we annotated 19 clusters as ncbi taxonomy species parvovirus nih-cqv (accession kc617868.1; ncbi taxa-id 1341019), an established contaminant [34, 35] . the feature with the lowest p-value for the parvovirus clusters suggested contamination from the rneasy minelute purification kit (f076) manufactured by qiagen (p-value: 5.48e-38). in addition, a single cluster was annotated as ncbi taxonomy species acanthocystis turfacea chlorella virus mn0810.1 (accession jx997174.1, taxa-id 1278272), with its lowest associated p-value (p-value = 4.19e-12) to the laboratory kit dnase/rnase: promega dnase stop solution (f069). atcv-1 was previously reported as a contaminant [36] . in addition to the sequences that were characterised in the previous step, we found 982 examples of uncharacterised clusters. the contigs in these clusters varied substantially in length, ranging from a minimum of 200 bp to a maximum of 33.6 kb (n50 = 617 bp). 
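the n50 quoted above summarises a contig length distribution: it is the length of the contig at which the running total, taken from the longest contig downwards, first reaches half of the total assembled length. a minimal sketch follows; the function name and the example lengths are hypothetical and are not the study's data.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths (in bp)."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first contig that brings the running sum to half the total
        # assembled length defines the N50.
        if running >= half_total:
            return length
    raise ValueError("no contig lengths given")


# Hypothetical contig lengths, chosen only to illustrate the calculation.
example_lengths = [2, 3, 4, 5, 6]
print(n50(example_lengths))
```

a single very long contig can dominate this statistic, which is why the minimum and maximum lengths are reported alongside it.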
our approach provides the capability to discover these recurrent novel sequences, but also permits the investigation of their plausible origin. most associations were methodological (table 1 ), probably originating from nucleotide sequences contained in various laboratory kits (figure 4) . for instance, out of the 868 methodologically associated clusters, 648 were associated with the laboratory reagent dnase/rnase: promega dnase stop solution (minimum p-value: 2.40e-36). additionally, 110 recurring sequences were attributed to technical issues of the flow cell lanes (minimum p-value: 1.85e-21 in feature 383). in total, 4 unmapped clusters were associated with a biological feature, namely oral cavity cancer, with the longest contig of each cluster at 1789, 3247, 4661, and 4720 bp and with respective p-values of 1.01e-10, 1.01e-10, 1.17e-14, and 1.01e-10. to further clarify the unresolved biologically associated sequences, we manually investigated the cluster representatives using the newest databases (december 2015) at the ncbi web-interfaces for blastn, blastx and cdd v. 3.14 (conserved domains) [56] with default parameters and an e-value <0.001 (table 3) . all cluster representatives could be explained as commensal bacteria related to the oral cavity as described by hmp. in order of increasing length, the best hits for the cluster representatives were prevotella veroralis, prevotella veroralis, prevotella fusca jcm 17724, and peptostreptococcus anaerobius, with percent sequence identities of 92%, 90%, 91%, and 72%, respectively. cluster representatives 3 and 4 contained both bacterial and phage-like conserved domains. the super family duf4280 is of unknown function but related to bacteria, and the nd2 super family is the nicotinamide adenine dinucleotide (nadh) dehydrogenase subunit 2 involved in electron transport. 
conversely, phage_base_v is related to the tail of phages and rve is an integrase domain that could also be explained as part of a transposon. likely these sequences derived from less well-described parts of the microbiome. usually, virus discovery in shotgun sequencing studies involves processing millions of reads in a viroinformatics pipeline. existing tools typically offer a comprehensive taxonomical description of a single sample that is compared to the taxonomy of other samples to determine their relevance. a downside of this bottom-up methodology is that novel sequences that cannot be sufficiently well characterised in the first round are often discarded in the process. another disadvantage is that potential contaminants will have to be controlled for in the post-processing of the data, an effort that is often omitted [38] . in the present study, we have presented a methodology to categorise recurring sequences according to experimental origin and metadata features. additionally, using this methodology we could replicate both biological and methodological sequence associations known from the literature as well as pinpoint new unannotated recurring sequences. in this study, we had no datasets and features of healthy biological controls. we included a comparison to published reference genomes from hmp to validate that biologically co-occurring sequences can be found with the presented methodology. in this case, we are most likely observing normal biological inhabitants of the tissue samples, something our metadata scheme does not account for. the disease association of many of these organisms is obviously not fully known, and some of them could be related to disease features outside the cancer domain, features that we did not include in the present study. optimising clustering parameters for one virus family might not result in the optimal separation of other families. 
here, we optimised clustering parameters to rediscover the association of sequences to a known laboratory kit. using these clustering parameters may result in a non-optimal separation of clusters that biologically belong together, or in the reverse problem: merged clusters that reflect different biological units. optimal separation is likely problem-specific. different taxonomic units would require different clustering parameters to separate. however, choosing taxonomy-specific parameters requires a working hypothesis of the most likely findings. here, we focused on the general problem of associating sequences with features, using a known association to guide the choice of clustering parameters. a combination of several features may be the true foundation of particular sequences, but this was not explored in this work. there may also be situations where a combination of clusters is the correct association to a particular feature. for instance, a virus that is present at a low titre may be sequenced sporadically, resulting in less than full coverage and several non-coherent contigs from different viral genome regions. each cluster may then include only part of the relevant data sets and thereby artificially show a weaker association. merged and viewed as one, the incomplete clusters will have the correct strength of association. a grouping based on taxonomy, or a more data-driven approach that clusters sequence groups based on the associated datasets as seen in figure 2 , could be included as another iteration to properly strengthen the statistical associations. furthermore, forming clusters only by internal sequence identity may also miss pathogenic scenarios such as an oncovirus and any necessary helper viruses that do not share homology with the oncovirus. in the present study, we used a majority vote to assign taxonomy. there could be other ways to assign taxonomy, for instance, using a lowest common ancestor (lca) strategy. 
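the difference between the two assignment strategies can be made concrete with a small sketch. this is an illustration only, not the pipeline used in the study: the function names are hypothetical, and the lineages are simplified, hand-written examples ordered from the most general rank to the species.

```python
from collections import Counter


def majority_vote(lineages):
    """Return the most frequent species (last rank) among the hits."""
    species = Counter(lineage[-1] for lineage in lineages)
    return species.most_common(1)[0][0]


def lowest_common_ancestor(lineages):
    """Return the deepest taxon shared by every lineage, or None."""
    shared = None
    for ranks in zip(*lineages):
        if len(set(ranks)) != 1:
            break
        shared = ranks[0]
    return shared


# Hypothetical best hits for the contigs of one cluster (general -> species).
hits = [
    ("Bacteria", "Prevotella", "Prevotella veroralis"),
    ("Bacteria", "Prevotella", "Prevotella veroralis"),
    ("Bacteria", "Prevotella", "Prevotella fusca"),
]

print(majority_vote(hits))
print(lowest_common_ancestor(hits))
```

here the majority vote commits to a species, while the lca falls back to the genus shared by all hits.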
a majority vote will likely introduce some false assignments if distant taxa are present in the sequence group in nearly equal fractions. an lca strategy can handle this but may reduce the taxonomic resolution to a level where there is no real gain of information. after determining which sequence groups co-occur significantly, more effort might resolve interesting unmapped contigs. for instance, use of more sensitive alignment algorithms, profile hidden markov models (hmms), gene predictors, artificial neural networks trained on specific signals such as viral capsid sequences [57] , or pcr extraction followed by sanger sequencing might provide the relevant clues. however, that was not within the scope of this study. the major advantage of the top-down approach is that it works without prior knowledge of the sequences. it is not dependent on reference sequence databases to single out the promising candidates for further analysis. the top-down method can determine the relevance of unknown sequences upfront while also systematically controlling for contamination by design. most of the annotated sequences found in this study were sequenced from cancer specimens. however, it is apparent from the association analysis that several viral sequences detected are possibly contaminants or technical artefacts. furthermore, the unmapped clusters are retained and easily arranged by relevance according to the nature of their associated features. having this information helps produce a prioritised list of sequence candidates. the quality of the associations will depend on the experimental design, sampling, available metadata, as well as the rigorousness and standardisation of both working routines and annotations. we stress the point that care must still be taken when formulating hypotheses and in the interpretation of associations. virus discovery using high-throughput sequencing and especially characterising clinical samples is a challenge. 
many viral discovery pipelines rely on similarity to reference databases as the most compelling argument for identifying putative sequences of medical or biological importance. although a necessary step in the analysis, it has the downside of not considering novel sequences not included in reference sets as well as not considering the origins of the discoveries. there are many examples of contamination and technical artefacts; therefore, potential discoveries should be accompanied by convincing evidence that the sequences are not instead associated with the methodology or technology in use. we suggest a different approach that has complementary advantages inherent in the design. we show that we can differentiate between biological and non-biological associations, replicate known associations and potentially add new associations of cancer-associated viruses. supplementary materials: the following are available online at http://www.mdpi.com/1999-4915/8/2/53, table s1 : virus discovery pipelines, table s2 : distribution of library types, table s3 : methods in cancer samples, table s4 : clustering parameters, table s5 : feature descriptions, table s6 : datasets in clusters, figure s1 : clustering performance. 
identification of a new human coronavirus
cloning of a human parvovirus by molecular screening of respiratory tract samples
new dna viruses identified in patients with acute viral infection syndrome
characterization and complete genome sequence of a novel coronavirus, coronavirus hku1, from patients with pneumonia
identification of a third human polyomavirus
identification of a novel polyomavirus from patients with acute respiratory tract infections
a cornucopia of human polyomaviruses
human transcriptome subtraction by using short sequence tags to search for tumor viruses in conjunctival carcinoma
clonal integration of a polyomavirus in human merkel cell carcinoma
identification of novel viruses using virushunter - an automated data analysis pipeline
capsid: a bioinformatics platform for computational pathogen sequence identification in human genomes and transcriptomes
software to identify or discover microbes by deep sequencing of human tissue
comprehensive human virus screening using high-throughput sequencing with a user-friendly representation of bioinformatics analysis: a pilot study
rapid identification of non-human sequences in high-throughput sequencing datasets
virusfinder: software for efficient and accurate detection of viruses and their integration sites in host genomes through next generation sequencing data
characterization of the viral microbiome in patients with severe lower respiratory tract infections, using metagenomic sequencing
crest maps somatic structural variation in cancer genomes with base-pair resolution
svdetect: a tool to identify genomic structural variations from paired-end and mate-pair sequencing data
a cloud-compatible bioinformatics pipeline for ultrarapid pathogen identification from next-generation sequencing of clinical samples
faster and more accurate sequence alignment with snap
rapsearch2: a fast and memory-efficient protein similarity search tool for next-generation sequencing data
mapping short dna sequencing reads and calling variants using mapping quality scores
full genome virus detection in fecal samples using sensitive nucleic acid preparation, deep sequencing, and a novel iterative sequence classification algorithm
integrative analysis of environmental sequences using megan4
a new arenavirus in a cluster of fatal transplant-associated diseases
metagenomics and future perspectives in virus discovery
a highly abundant bacteriophage discovered in the unknown sequences of human faecal metagenomes
complete viral rna genome sequencing of ultra-low copy samples by sequence-independent amplification
what's in your next-generation sequence data? an exploration of unmapped dna and rna sequence reads from the bovine reference individual
in silico prediction of human pathogenicity in the γ-proteobacteria
sequence-based identification of microbial pathogens: a reconsideration of koch's postulates
failure to confirm xmrv/mlvs in the blood of patients with chronic fatigue syndrome: a multi-laboratory study
the perils of pathogen discovery: origin of a novel parvovirus-like hybrid genome traced to nucleic acid extraction spin columns
novel hybrid parvovirus-like virus, nih-cqv/phv, contaminants in silica column-based nucleic acid extraction kits
traces of atcv-1 associated with laboratory component contamination
investigation of human cancers for retrovirus by low-stringency target enrichment and high-throughput sequencing
false-positive results in metagenomic virus discovery: a strong case for follow-up diagnosis
hybrid dna virus in chinese patients with seronegative hepatitis discovered by deep sequencing. proc. natl. acad. sci
high-throughput dna sequencing - concepts and limitations
target-dependent enrichment of virions determines the reduction of high-throughput sequencing in virus discovery
new type of papillomavirus and novel circular single stranded dna virus discovered in urban rattus norvegicus using circular dna enrichment and metagenomics
adapterremoval: easy cleaning of next generation sequencing reads
aligning sequence reads, clone sequences and assembly contigs with bwa-mem
the sequence alignment/map format and samtools
a fast and symmetric dust implementation to mask low-complexity dna sequences
idba-ud: a de novo assembler for single-cell and metagenomic sequencing data with highly uneven depth
cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences
r: a language and environment for statistical computing; r foundation for statistical computing
basic local alignment search tool
species evenness and productivity in experimental plant communities
avian myeloblastosis virus (amv): only one side of the coin
a mathematical theory of communication
how does multiple testing correction work?
a framework for human microbiome research
cdd: ncbi's conserved domain database
artificial neural networks trained to detect viral and phage structural proteins

the authors declare no conflict of interest. the founding sponsors had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, and in the decision to publish the results. 
key: cord-321386-u1imic5l authors: li, chun; zhao, jialing; wang, changzhong; yao, yuhua title: protein sequence comparison and dna-binding protein identification with generalized pseaac and graphical representation date: 2018-02-17 journal: comb chem high throughput screen doi: 10.2174/1386207321666180130100838 sha: doc_id: 321386 cord_uid: u1imic5l aim and objective: the rapid increase in the amount of protein sequence data available leads to an urgent need for novel computational algorithms to analyze and compare these sequences. this study is undertaken to develop an efficient computational approach for timely encoding protein sequences and extracting the hidden information. methods: based on two physicochemical properties of amino acids, a protein primary sequence was converted into a three-letter sequence, and then a graph without loops and multiple edges and its geometric line adjacency matrix were obtained. a generalized pseaac (pseudo amino acid composition) model was thus constructed to characterize a protein sequence numerically. results: by using the proposed mathematical descriptor of a protein sequence, similarity comparisons among β-globin proteins of 17 species and 72 spike proteins of coronaviruses were made, respectively. the resulting clusters agreed well with the established taxonomic groups. in addition, a generalized pseaac based svm (support vector machine) model was developed to identify dna-binding proteins. experiment results showed that our method performed better than dnabinder, dna-prot, idna-prot and endna-prot by 3.29-10.44% in terms of acc, 0.056-0.206 in terms of mcc, and 1.45-15.76% in terms of f1m. when the benchmark dataset was expanded with negative samples, the presented approach outperformed the four previous methods with improvement in the range of 2.49-19.12% in terms of acc, 0.05-0.32 in terms of mcc, and 3.82-33.85% in terms of f1m. 
conclusion: these results suggested that the generalized pseaac model was very efficient for comparison and analysis of protein sequences, and very competitive in identifying dna-binding proteins. dna-binding proteins (dna-bps) are very important functional proteins in a cell. these proteins play vital roles in various cellular processes, including dna replication, transcription, regulation of gene expression, packaging, and other activities associated with dna [1] [2] [3] [4] [5] . it is therefore important to distinguish dna-bps from non-dna-binding proteins (nbps). in the past, many experimental and computational techniques have been developed for identifying dna-bps. experimental techniques can provide a clear-cut answer for a query protein. however, the experimental methods are cost-intensive and time-consuming, and thus impractical for large datasets [3] [4] [5] [6] [7] . computational methods can be broadly divided into two categories: structure-based methods and sequence-based methods. the former can discriminate dna-binding and non-binding proteins with high accuracy, but they cannot be employed in high-throughput annotation, as they require the structure information of a query protein [1] . though tremendous progress has been achieved in the experimental determination of protein structures in the past five decades, it cannot keep pace with the explosive growth of sequence information resulting from modern sequencing technology [8] . yet as suggested by anfinsen [9] , proteins contain within their amino acid sequences enough information to determine their native conformation. therefore, it is more promising to use sequence-based methods to identify dna-bps. 
*address correspondence to this author at the school of mathematics and statistics, hainan normal university, haikou 571158, china; tel: +86-898-65883210; e-mail: lichwun@163.com
one of the core issues of the sequence-based methods is how to characterize protein sequences and extract the information hidden in them. the most typical approach is to use the amino acid composition (aac) to formulate a protein sequence. owing to its simplicity, the aac model was widely applied in a number of earlier statistics-based methods. however, as pointed out in ref [6] , if we denote by $n_1, n_2, \ldots, n_{20}$ the counts of the 20 standard amino acids in a protein sequence, then there are a total of $\frac{(n_1+n_2+\cdots+n_{20})!}{n_1!\,n_2!\cdots n_{20}!}$ different sequences/strings possessing the same aac. the reason is that the aac model neglects the order relation among the elements of a sequence. to overcome this drawback, the concept of pseudo amino acid composition (pseaac, or chou's pseaac) was proposed [10] [11] [12] [13] [14] [15] [16] [17] [18] . the essence of pseaac is that it not only covers aac, but also contains additional order-correlated factors along a protein sequence. another popular way of sequence analysis is to convert the protein primary sequence over 20 amino acids into a reduced one. the earliest and simplest reduction was the well-known hp model, in which the 20 standard amino acids are divided into two types, hydrophobic (h) (or non-polar) and polar (p) (or hydrophilic). on the basis of this classic model, a detailed hp model was introduced by dividing the polar class into three subclasses: positive polar, uncharged polar and negative polar [19] . in addition, a few five-group classifications of amino acids were presented for practical purposes [20] [21] [22] [23] . by considering property-based triples, li et al. [6] put forward a six-letter model of amino acids. also based on three physical-chemical properties of amino acids, yao et al. [24] mapped the 20 standard amino acids to the eight vertices of a cube centered at the origin, and thus an eight-group model of amino acids is obtained. 
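the count of sequences sharing one amino acid composition, mentioned above, is a multinomial coefficient. a minimal sketch of this arithmetic follows; the function name is hypothetical.

```python
from math import factorial


def sequences_with_same_aac(counts):
    """Number of distinct strings that have the given residue counts.

    This is the multinomial coefficient (n1 + ... + nm)! / (n1! ... nm!).
    """
    total = factorial(sum(counts))
    for n in counts:
        total //= factorial(n)  # each division is exact
    return total


# A peptide with two residues of one type and one of another ("aab")
# can be arranged as aab, aba or baa.
print(sequences_with_same_aac([2, 1]))
```

even for short proteins this number is very large, which is the sense in which aac alone discards the order information.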
motivated by the work mentioned above, we propose a generalized pseaac which is grounded on a three-letter model and a 2-d graphical representation of a protein sequence. the main work of this paper is organized as follows. in section 2, we briefly introduce the five datasets used in this study. in section 3, on the basis of two important physicochemical properties of amino acids, we cluster the 20 standard amino acids into three groups. by assigning to each group a representative symbol, we transform a protein sequence into a three-letter sequence. then a 2-d graph without loops and multiple edges and its geometric line adjacency matrix are obtained. a sequence-derived feature vector of dimension $(25+\lambda)$, where $\lambda$ is an adjustable integer parameter, is thus constructed to characterize a protein sequence. our scheme is similar to, but distinct from, that of pseaac. in section 4, we apply the presented feature vector to compare β-globin proteins of 17 species and 72 spike proteins of coronaviruses, respectively. we also develop an svm (support vector machine) model using the generalized pseaac to identify dna-binding and non-binding proteins on three datasets. experimental results show that the presented method outperforms existing methods including dnabinder [1] , dna-prot [2] , idna-prot [3] and endna-prot [4] . finally, conclusions are given in section 5. in this study, the following five datasets are used. for convenience, they are denoted by betaset, covset, dnaset, dnaeset and dnaiset, respectively. the dataset called betaset is composed of the β-globin proteins of 17 species: human (alu64020), gorilla (p02024), chimpanzee (p68873), cattle (caa25111), banteng (baj05126), goat (aaa30913), sheep (abc86525), european hare (caa68429), rabbit (caa24251), house mouse (add52660), western wild mouse (acy03394), spiny mouse (acy03377), norway rat (caa29887), opossum (aaa30976), guttata (ach46399), gallus (caa23700), muscovy duck (caa33756). 
this dataset is used to determine the adjustable parameters in a feature vector. the covset dataset consists of 72 spike proteins of coronaviruses (covs), 23 of which are mers-covs and 30 of which are sars-covs. covs can be divided into three groups according to serotypes. group alpha (formerly known as cov-1) and group beta (formerly cov-2) contain mammalian viruses, while group gamma (formerly cov-3) contains only avian viruses. the name, accession number, and abbreviation of the 72 sequences are listed in table 1 . according to the existing taxonomic groups, sequences 1-5 belong to the first group, sequences 6-8 belong to the third group, and the remaining sequences belong to the second group. dnaset is a benchmark dataset created in 2007 by kumar et al. [1] . it contains 396 sequences, 146 of which are dna-bps (positive samples) and 250 of which are nbps (negative samples). in both the positive and the negative sets, the sequence similarity between any two proteins is not more than 25%. dnaiset was also generated by kumar et al. [1] , based on the work of wang and brown [25] . it originally contained 92 dna-bps and 100 nbps. to avoid overestimating a given method, sequences having sequence similarity with dnaset were removed by xu et al. [4] , and the final dataset is composed of 82 dna-bps and 100 nbps. as an expanded benchmark dataset, dnaeset was constructed in 2014 by xu et al. [4] . using a sequence filter criterion identical to that of dnaset, they added a number of nbps to dnaset, bringing the total number of nbps to 2125. after removal of the sequences that have sequence identity with dnaiset, the current version of dnaeset has 146 dna-bps and 1710 nbps. isoelectric point (pi) and relative distance (rd) are two important physicochemical properties of the 20 standard amino acids [26] [27] [28] . their original numerical values are listed in table 2 . the two properties lie on different scales; rd (relative distance), for example, varies between 1469 and 3355. therefore, normalization of these values is needed. 
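a minimal sketch of such a normalization is given here, assuming ordinary min-max scaling onto [0,1]; the exact formula used by the authors is not recoverable from this text, so this choice is an assumption. the labelling rule ("+" for values strictly above the mean) is likewise an assumption about the direction of the comparison, and the function names and input values are hypothetical.

```python
def min_max_scale(values):
    """Scale values linearly so that min -> 0.0 and max -> 1.0 (assumed form)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("all values are equal; scaling is undefined")
    return [(v - lo) / (hi - lo) for v in values]


def sign_labels(scaled):
    """Label each value '+' if it lies above the mean, '-' otherwise (assumed rule)."""
    mean = sum(scaled) / len(scaled)
    return ["+" if v > mean else "-" for v in scaled]


# Hypothetical property values; only the end points 1469 and 3355 come
# from the range quoted in the text.
raw = [1469, 2412, 3355]
scaled = min_max_scale(raw)
print(scaled)
print(sign_labels(scaled))
```

applying the same two steps to each property yields one label pair per amino acid, which is the basis of the grouping described next.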
here, we scale them into the interval [0,1]; the corresponding values are listed in table 3. the last row in this table gives the average values. for the i-th amino acid $a_i$, if its normalized value for the first property (pi) is greater than the corresponding average, we label it "+"; otherwise we label it "-". similarly, when the second property (rd) is considered, the second label for amino acid $a_i$ is obtained. in this way, each of the 20 standard amino acids has a label pair. in table 3, the corresponding labels are also listed. amino acids with the same label pair are viewed as members of the same group. thus, the 20 standard amino acids are distributed into three groups. for each group, the first amino acid is used to stand for the group. thus the three groups have three representative letters: a, c and h. the value of a property for a group is defined as the average value of that property over all members of the group. in the left-hand side of table 4, we list the corresponding values of the three groups. each group can thus be viewed as a 2-d vector. to give the vectors of the three groups unit length, we further normalize them to unit vectors and list the normalized values in the right-hand side of table 4. in fig. (1), we show the 2-d map of the 20 standard amino acids according to the classification above. by substituting each amino acid with its representative letter, a protein primary sequence is reduced to a three-letter sequence. for example, the three-letter sequence of the sequence segment ekaavtgfwgkvkvdevgaea is ahaaahcccchahacaacaaa. to obtain the graphical representation of a reduced sequence, we start from the origin (0,0) and move in the xoy-plane in the direction dictated by fig. (1). mathematically, let $s = s_1 s_2 \cdots s_N$ be a given three-letter sequence. then one has a map $\phi$ which maps $s$ into a plot set. explicitly, $\phi(s) = \{P_0, P_1, \ldots, P_N\}$, where $P_0 = (0,0)^T$ and $P_i$ is given by

$$P_i = P_{i-1} + \big(u_1(s_i),\, u_2(s_i)\big)^T, \quad i = 1, 2, \ldots, N, \qquad (1)$$

where t represents the transpose of a matrix, and $u_j(s_i)$ (j=1,2) represents the j-th component of the unit vector corresponding to $s_i$ (cf. fig. 1 and table 4 ). connecting all points of the plot set in turn, a 2-d curve is drawn. in fig. (2) , we show the 2-d graphical representation of sequence ahaaahcccchahacaacaaa. it is not difficult to see that the 2-d graphical representation has no degeneracy and is thus a simple graph, that is, a graph without loops and multiple edges. in this section, we give a numerical characterization of a protein sequence that facilitates quantitative comparisons of protein sequences. once a graphical representation is given, it can be transformed into structural matrices such as the ed, gd, m/m, and l/l matrices [6, 24, 29-37] . here we employ the l/l matrix. l/l is a nonnegative symmetric matrix whose off-diagonal entries are defined as the quotient of the euclidean distance between two vertices of the graph and the sum of the geometrical lengths of the edges between the two vertices. by definition all diagonal elements are zero. the entries of an l/l matrix are therefore less than or equal to one. the higher-order matrix $^kL/^kL$ is the matrix whose (i,j)-entry is $([L/L]_{ij})^k$. as the exponent k approaches positive infinity, $^kL/^kL$ converges to a (0,1) matrix (denoted by $^bL/^bL$). with respect to the proposed 2-d graph, $[^bL/^bL]_{ij} = 1$ if and only if the two corresponding vertices lie on a straight line in the curve, including the cases of adjacency and non-adjacency. in this sense, we call such a matrix a geometric line adjacency matrix (glam), or simply a generalized adjacency matrix (gam), generated by a graph. the first zagreb index is a well-known vertex-degree-based molecular structure descriptor. this index was first considered by gutman and trinajstić about 45 years ago and has since been discussed and used in numerous studies (see [38] [39] [40] and the references cited therein). 
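the construction of the 2-d curve, its l/l matrix and the geometric line adjacency matrix described above can be sketched as follows. this is an illustration under stated assumptions, not the authors' implementation: the letter assignments are only those that can be read off the worked example in the text (the full grouping is not reproduced here), the unit vectors are hypothetical placeholders for the values of table 4, and all function names are invented for this sketch.

```python
import math

# Letter assignments read off the worked example only; other residues
# are deliberately left out because the full grouping is not given here.
REDUCE = {"a": "a", "e": "a", "v": "a",
          "g": "c", "f": "c", "w": "c", "d": "c",
          "k": "h", "t": "h"}

# HYPOTHETICAL unit vectors standing in for the entries of table 4.
UNIT = {"a": (1.0, 0.0), "c": (0.6, 0.8), "h": (0.0, 1.0)}


def reduce_sequence(protein):
    """Replace each residue by the representative letter of its group."""
    return "".join(REDUCE[r] for r in protein)


def curve_points(three_letter):
    """Walk from the origin, adding one unit vector per letter."""
    points = [(0.0, 0.0)]
    for letter in three_letter:
        x, y = points[-1]
        dx, dy = UNIT[letter]
        points.append((x + dx, y + dy))
    return points


def ll_matrix(points):
    """Euclidean distance divided by the path length along the curve."""
    n = len(points)
    prefix = [0.0]  # prefix[i] = curve length from the origin to vertex i
    for p, q in zip(points, points[1:]):
        prefix.append(prefix[-1] + math.dist(p, q))
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            ratio = math.dist(points[i], points[j]) / (prefix[j] - prefix[i])
            matrix[i][j] = matrix[j][i] = ratio
    return matrix


def glam(matrix, tol=1e-9):
    """Limit of the k-th entrywise power: 1 where the ratio equals one."""
    return [[1 if v > 1 - tol else 0 for v in row] for row in matrix]


print(reduce_sequence("ekaavtgfwgkvkvdevgaea"))
print(glam(ll_matrix(curve_points("aah"))))
```

an entry of the l/l matrix equals one exactly when the stretch of curve between the two vertices is straight, so thresholding at one gives the same (0,1) matrix as letting the exponent k grow without bound.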
the first zagreb index is defined as

$$Z_{g1} = \sum_{u \in V(G)} d_u^2, \qquad (2)$$

where $d_u$ denotes the degree (= number of first neighbors) of the vertex u in graph g. if g is a simple graph (i.e. without loops and multiple edges), $Z_{g1}$ can also be obtained directly from its adjacency matrix, since the row sums of this matrix equal the degrees of the corresponding vertices. it should be mentioned that the zagreb index gives greater weight to inner vertices and edges than to outer vertices and edges of a graph [38] . one way to amend this is to insert inverse values of the vertex degree into eq (2), which yields the modified zagreb index [38] :

$$^mZ_{g1} = \sum_{u \in V(G)} \frac{1}{d_u^2}.$$

clearly, $^mZ_{g1}$ gives greater weight to outer vertices/edges than to inner ones in a graph. at the same time, on the basis of our geometric line adjacency matrix, we can count the vertex pairs with a generalized adjacency relationship. it should be noted that, in our case, the 'neighbors' include not only the conventional neighbors, i.e. the first neighbors, but also the second neighbors, the third neighbors, and so on. we call the corresponding number for graph g the line-adjacency index, and denote it by la(g). from these quantities we then obtain a graph-based index. for a symmetric matrix, eigenvalue-based indices, such as the leading eigenvalue [29-33, 35] and the graph energy [17] , are often used as matrix invariants. moreover, in our previous paper [41] , an alternative invariant called the 'ale-index' was proposed. the ale-index is defined by eq (4), where l is the order of the matrix, and $\|\cdot\|_{m_1}$ and $\|\cdot\|_F$ are the m1- and f-norms of a matrix, respectively. in order to reduce variations caused by comparing matrices of different sizes, we consider a normalized ale-index instead of the ale-index itself, and we use it as the matrix-based index. 
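the degree-based quantities above can be computed directly from a (0,1) adjacency matrix, as in the following sketch. it assumes the modified index is the sum of inverse squared degrees and reads la(g) as the number of unordered vertex pairs marked 1 in the geometric line adjacency matrix; both readings and all function names are assumptions made for illustration.

```python
def degrees(adjacency):
    """Row sums of a (0,1) adjacency matrix give the vertex degrees."""
    return [sum(row) for row in adjacency]


def first_zagreb(adjacency):
    """Sum of squared vertex degrees."""
    return sum(d * d for d in degrees(adjacency))


def modified_first_zagreb(adjacency):
    """Sum of inverse squared degrees (isolated vertices are skipped)."""
    return sum(1.0 / (d * d) for d in degrees(adjacency) if d > 0)


def line_adjacency_count(matrix):
    """Number of unordered vertex pairs flagged 1 (upper triangle only)."""
    n = len(matrix)
    return sum(matrix[i][j] for i in range(n) for j in range(i + 1, n))


# Path graph on four vertices: 0 - 1 - 2 - 3
PATH4 = [[0, 1, 0, 0],
         [1, 0, 1, 0],
         [0, 1, 0, 1],
         [0, 0, 1, 0]]

print(first_zagreb(PATH4))
print(modified_first_zagreb(PATH4))
print(line_adjacency_count(PATH4))
```

in the path graph the two inner vertices dominate the first zagreb index, whereas the two end vertices dominate the modified one, which illustrates the change of emphasis noted in the text.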
in addition, with respect to three-letter sequence , we define a coupling mode function by , (n=1, 2) where p 1 and p 2 are values for properties of the corresponding representative letter (group), integer k represents the counted rank (or tier) of the coupling mode. then, following the similar procedures in [10, 11] , we can extract global sequence-order information of the three-letter sequence s by , , . where is called the k-th tier correlation factor. clearly, reflects the coupling mode between the most contiguous elements along three-letter sequence s, is the coupling mode between the second most contiguous, the third most contiguous, and so forth. furthermore, if the respective counts of the three representative letters (a, c and h) in sequence s are , respectively, then we can obtain a so-called group composition (gc): where, denotes the size of a group (set). consequently, elements are derived, which reflect the information about the reduced sequence and, particularly, the 2-d graphical representation. by combining these elements with the conventional amino acid composition (aac), a dimensional feature vector can be constructed to numerically characterize a protein sequence: , where (8) here, are frequencies of occurrence of the 20 standard amino acids in a protein sequence, and are weight factors. as will be described later in detail, the four adjustable parameters in eqs (7) and (8) can be determined by a set of known samples. roughly speaking, the vector contains the feature of aac, and the information beyond aac as well, which is similar to chou's pseaac in form. therefore, we call such a vector formulated by eqs (7) and (8) the generalized pseaac of a protein sequence. in this section, we will discuss the use of the generalized pseaac. as can be seen from eqs (7) and (8), the present mathematical descriptor contains four uncertain parameters: , w 1 , w 2 and w 3 . here represents the total number of correlation ranks counted (cf. 
eq(6)), which is an integer. generally speaking, the greater the value of , the more sequence-order effects will be incorporated. however, if the value is too large, it might cause the overfitting problem or 'high dimension disaster' [15] , therefore, we endeavour to limit the value of to a small integer. in this study, the five datasets (betaset, covset, dnaset, dnaeset and dnaiset) are arranged into two groups: one contains betaset, the other includes the rest. the first group is used for determining the four adjustable parameters, and the second group for testing purpose. according to the method mentioned above, we first associate each of 17 protein sequences in betaset with a dimensional vector (cf. eqs (7) and (8)), and then calculate the pair-wise euclidean distance between any two of the 17 protein sequences via their m-d vectors. thus a real symmetric matrix is obtained. on the basis of the achieved distance matrix , a upgma tree is constructed using mega4 package. the result will depend on values of the rank and the three weight factors. it is found that when , , and , the three non-mammals (muscovy duck, gallus and guttata) form a separate branch and stay outside of the mammals. moreover, in the subtree of mammals, primate species (human, chimpanzee, gorilla) are grouped closely. also, rodent species (norway rat, spiny mouse, house mouse, western wild mouse) and lagomorph species (rabbit, european hare) are situated at independent branches, respectively. while goat, sheep, cattle and banteng appear to cluster together (fig. 3) . this result is analogous to that reported in the literature [6, 29, 30, 35, 36] . accordingly, the four numerical values are respectively used for the four uncertain parameters, and a 31-d feature vector is thus obtained. fig. (3) . the relationship tree of 17 species. in order to evaluate the effectiveness of our method, we test it by phylogenetic analysis on the covset dataset. 
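the tree-building step described above (pairwise euclidean distances between feature vectors, followed by upgma clustering) can be sketched in a few lines. this is only an illustration of the technique: the paper uses the mega4 package, whose implementation is not reproduced here, and the four 2-d vectors and their labels below are made-up toy values rather than the 31-d vectors of the study.

```python
import math


def euclidean_matrix(vectors):
    """Symmetric matrix of pairwise Euclidean distances."""
    n = len(vectors)
    return [[math.dist(vectors[i], vectors[j]) for j in range(n)] for i in range(n)]


def upgma(dist, names):
    """Average-linkage (UPGMA) clustering; returns the topology as nested tuples."""
    # Each cluster is (subtree, list of leaf indices it contains).
    clusters = [(name, [i]) for i, name in enumerate(names)]
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                ma, mb = clusters[a][1], clusters[b][1]
                # UPGMA distance: mean over all leaf pairs across the two clusters.
                d = sum(dist[i][j] for i in ma for j in mb) / (len(ma) * len(mb))
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        merged = ((clusters[a][0], clusters[b][0]), clusters[a][1] + clusters[b][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return clusters[0][0]


# Toy feature vectors (hypothetical values, chosen so that no distances tie).
names = ["s1", "s2", "s3", "s4"]
vectors = [(0.0, 0.0), (1.0, 0.0), (4.0, 0.0), (10.0, 0.0)]

dist = euclidean_matrix(vectors)
tree = upgma(dist, names)
```

the sketch returns only the branching order; branch lengths, which a full upgma implementation also reports, are omitted to keep the example short.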
coronaviruses (covs) belong to the genus coronavirus of family coronaviridae [42] . the first coronavirus (hcov-229e) was isolated from humans in 1965. until 2003, coronaviruses attracted little interest beyond causing mild upper respiratory tract infections. however, this situation changed dramatically with the emergence of sars-cov and mers-cov. as of july 2017, 2040 laboratory-confirmed cases of mers-cov infection were reported in over 27 countries, and at least 710 individuals have died (crude cfr 34.8%) [43] . using the above-determined values for parameters , w 1 , w 2 , and w 3 , we calculate the 31-d feature vectors of 72 coronavirus spike proteins and their euclidean distance matrix; then the corresponding phylogenetic tree (fig. 4) is constructed. observing fig. (4) , we find that the 72 coronavirus spike proteins are clustered into three groups: one contains the five alpha coronaviruses (pedvc, pedv, tgevg, tgev, and hcov-229e), the second includes the three gamma coronaviruses (ibv, ibvbj, ibvc), and the third corresponds to the group beta. a closer look at the subtree of beta coronaviruses shows that mers-covs are clearly clustered together, as are sars-covs, while mhv, mhva, mhvm, mhvp, mhvjhm, bcov, bcove, bcovl, bcovm, bcovq and hcov-oc43 are situated at an independent branch. the resulting clusters agree well with the established taxonomic groups. to further assess the effectiveness of the proposed method, we conduct a series of experiments on the identification of dna-binding proteins using three datasets: dnaset, dnaeset and dnaiset. among them, dnaset and dnaeset serve as training datasets, while dnaiset serves as an independent testing dataset. support vector machine (svm) is employed as the classifier, and r package 'e1071' v1.6-8 [44] is used to implement svm. 
for a given set of binary-labeled training examples, svm maps the input space into a higher-dimensional space and seeks a hyperplane to separate the positive samples from the negative ones [25] . the optimal hyperplane maximizes the separation margin between the two classes of training data. the distance measurement between the data points in the high-dimensional space is defined by the kernel function. in this study, we use the radial basis function (rbf) kernel $k(x_i, x_j) = \exp(-\gamma \lVert x_i - x_j \rVert^2)$. this model involves two tunable parameters: the kernel width γ and the penalty parameter c. prediction performance can be assessed using some quality indices including accuracy (acc), sensitivity (se), specificity (sp), f-measure (f1m) and matthews correlation coefficient (mcc) [2, 4, 5, 25, 37, 45] : $\mathrm{acc} = \frac{tp + tn}{tp + tn + fp + fn}$, $\mathrm{se} = \frac{tp}{tp + fn}$, $\mathrm{sp} = \frac{tn}{tn + fp}$, $\mathrm{f1m} = \frac{2pr}{p + r}$, $\mathrm{mcc} = \frac{tp \cdot tn - fp \cdot fn}{\sqrt{(tp + fp)(tp + fn)(tn + fp)(tn + fn)}}$ (9), where tp, tn, fp, and fn are defined as the numbers of true positive, true negative, false positive, and false negative samples obtained from the prediction respectively, while p and r denote precision value and recall value, respectively. one can also use the alternative definitions adopted in a series of recently published studies [15, [46] [47] [48] . the higher the values of these measurements, the better the quality of prediction. this experiment is made on dnaset itself. to obtain a reliable result with few errors, the svm model on dnaset is established by 5-fold cross-validation (5cv) with 3 runs. here the 31-d feature vector of a protein sequence serves as the input for svm. in a 5cv, the positive and negative samples are randomly distributed into five subsets, the so-called folds, and the test is repeated five times. in each of the five iterations, one subset is used as the testing set, while the remaining four subsets are combined together and used to build a classifier (training). the predictions made for the test data instances in all the five iterations yield the final result. 
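the quality indices listed above are simple functions of the four confusion counts, so they can be rechecked by hand or with a short script. the sketch below assumes the standard definitions of acc, se, sp, f1m and mcc; the function name is hypothetical. the example counts are the ones the paper reports for the independent test on dnaiset (68 of 82 dna-bps and 92 of 100 nbps predicted correctly).

```python
import math


def classification_metrics(tp, tn, fp, fn):
    """Standard binary-classification indices from confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)  # recall is the same quantity as sensitivity
    return {
        "acc": (tp + tn) / (tp + tn + fp + fn),
        "se": recall,
        "sp": tn / (tn + fp),
        "f1m": 2 * precision * recall / (precision + recall),
        "mcc": (tp * tn - fp * fn)
        / math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),
    }


# Counts from the reported independent test: 68/82 positives and 92/100
# negatives correct, hence 14 false negatives and 8 false positives.
metrics = classification_metrics(tp=68, tn=92, fp=8, fn=14)
```

with these counts the script gives values that agree with the reported acc, sensitivity, specificity, mcc and f1m to within rounding in the last digit.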
fig. (4) . the relationship tree of 72 coronavirus spike proteins. the sensitivity, specificity, acc, mcc and f1m are calculated for each run, and the corresponding results and their average values are listed in table 5 . as can be seen from this table, we achieve an accuracy (acc) of 89.65%, with mcc of 0.776 and f1m of 84.91%. this result shows that our svm model performs well on the benchmark dataset dnaset. it is important to examine the performance of the newly developed method on an independent dataset. in this experiment, we establish the classifier with the benchmark dataset dnaset and then test it on the independent dataset dnaiset. to decide the parameter pair (γ, c), we utilize a systematic grid search over γ and c, indexed by integers i and j in the ranges [-3, 3] and [0, 3], respectively, and select the pair that is optimal for dnaset. with the best pair (γ, c), dnaiset is fed to the svm. as a result, our model correctly predicts 68 out of 82 dna-bps and 92 out of 100 nbps. the acc reaches 87.91%, with the mcc, sensitivity, specificity, and f1m of 0.756, 82.93%, 92.00% and 86.07%, respectively (see table 6 ). this demonstrates that our svm model performs equally well on the independent dataset. for convenience of comparison, results of some existing methods including dnabinder [1] , dna-prot [2] , idna-prot [3] and endna-prot [4] are also listed in table 6 . dnabinder developed by kumar et al. [1] can extract evolutionary information in the form of a position specific scoring matrix (pssm) from the corresponding protein sequence. pssm-21 and pssm-400 are two feature vectors generated by means of pssm, whose dimensions are 21 and 400, respectively. in [1] , the pssm-400 based svm model was mainly used for predicting dna-bps. 
dna-prot [2] is a random forest based method, in which the feature vector includes sequence information and structure information, such as the composition of 20 standard amino acids, composition of 10 amino acid groups, and secondary structure information predicted from a protein sequence. idna-prot [3] constructs the feature vector via the grey model, and random forest is also used as the operation engine. endna-prot [4] is a predictor which encodes a protein sequence into a feature vector with dimension of 188 and adopts an ensemble classifier constructed with four types of machine learning classifiers. all these methods are tested on the same datasets to make an unbiased comparison with our method. observing table 6 , we can see that the current approach outperforms other methods by 3.29-10.44% in terms of acc, 0.056-0.206 in terms of mcc, and 1. .76% in terms of f1m. this result indicates that our method achieves highly comparable performance. when the size of positive samples is comparable to that of negative samples, many machine learning algorithms should have better performance. however, in real life, the number of non-binding proteins is much greater than that of dna-bps, i.e., . in this case, the frequency of nbps is generally much greater than that of the binding ones in the predictions, that is, . eqs (10) and (11) lead to that the value of acc defined by eq (9) tends towards 1. to solve this problem, instead of using the definition of acc in eq (9), here we use the alternative definition [49, 50] : . (12) in order to analyze the influence of the number of negative samples in a benchmark dataset on the predictive performance of the current method, we construct a series of subsets of dnaeset and use them as training set in turn, while dnaiset is always used as the testing set. each subset contains all the 146 dna-bps and a part of nbps in dnaeset. 
in detail, if the set of nbps in is denoted by , k=1, 2, ..., then consists of 250 nbps randomly selected from dnaeset. and is obtained by adding 50 nbps to , until 1700 nbps are contained in it. for each subset , k=1, 2, ..., 30, we develop the svm model by 5cv with 3 runs. the results averaging over the three runs are given in fig. (5) . from fig. (5) we can see that the curves of acc and acc visibly split with each other when n, the size of , is larger. with increasing of n, acc increases rapidly, while acc tends to be steady. the value of acc seems higher and higher on the surface, but it cannot correctly reflect the performance because it is nothing but a false appearance. in order to show the advantage of their method, xu et al. [4] created a dataset called expanded benchmark dataset1100 with all the 146 positive samples and 1100 negative samples in dnaeset, which is employed as another training dataset to evaluate the predictive performance on the independent dataset dnaiset. for convenience of comparison, we also select the expanded benchmark dataset to establish the classifier and test it on dnaiset. repeating this procedure five times, the average results are given in table 7 (the first row). results obtained by the other four methods (dnabinder, dna-prot, idna-prot and endna-prot) trained on the expanded benchmark dataset with n=1100 are also listed in table 7 . from this table we see that the overall accuracy of our method is about 92%, with mcc of 0.84 and f1m of 91.24%, which outperforms other methods with improvement in the range of 2.49-19.12% in terms of acc, 0.05-0.32 in terms of mcc, and 3.82-33.85% in terms of f1m. this suggests that our method performs well on unbalanced datasets. based on two important physicochemical properties, 20 standard amino acids were distributed into three groups, and to each of which a representative symbol was assigned. 
by replacing each amino acid with its representative letter, a protein primary sequence was converted into a three-letter sequence, which can be viewed as a coarse-grained description of the protein primary sequence. on the basis of the three-letter sequence, a graph without loops and multiple edges was obtained. by taking the advantage of the 2-d graph, we constructed a geometric line adjacency matrix (glam) and then the corresponding ale-index, the lineadjacency index, the first zagreb index and its modification were calculated. in addition, order-correlated factors were extracted via the reduced sequence. by combining these elements with the frequencies of occurrence of 20 standard amino acids and their three representative letters, a generalized pseaac model of a protein sequence was constructed. on five popular datasets, the proposed method was tested by phylogenetic analysis and identification of dna-binding proteins. the results illustrated the better performance of our method. identification of dna-binding proteins using support vector machines and evolutionary profiles dna-prot: identification of dna binding proteins from protein sequence information using random forest idna-prot: identification of dna binding proteins using random forest with grey model endna-prot: identification of dna-binding proteins by applying ensemble learning gdna-prot: predict dna-binding proteins by employing support vector machine and a novel numerical characterization of protein sequence numerical characterization of protein sequences based on the generalized chou's pseudo amino acid composition light-directed synthesis of peptide nucleic acids (pnas) chips protein structure prediction from sequence variation principles that govern the folding of protein chains prediction of protein cellular attributes using pseudoamino acid composition using amphiphilic pseudo amino acid composition to predict enzyme subfamily classes pseudo amino acid composition and its applications in 
bioinformatics, proteomics and system biology inuc-stnc: a sequence-based predictor for identification of nucleosome positioning in genomes by extending the concept of saac and chou's pseaac pseknc-general: a cross-platform package for generating various modes of pseudo nucleotide compositions identify recombination spots with pseudo dinucleotide composition sequence-based identification of recombination spots using pseudo nucleic acid representation and recursive feature extraction by linear kernel svm protein sequence comparison based on physicochemical properties and the position-feature energy matrix a novel protein characterization based on pseudo amino acids composition and star-like graph topological indices chaos game representation of protein sequences based on the detailed hp model and their multifractal and correlation analyses a computational approach to simplifying the protein folding problem modeling study on the validity of a possibly simplified representation of proteins 2-d graphical representation of protein sequences and its application to coronavirus phylogeny clustering of the protein design alphabets by using hierarchical self-organizing map a novel descriptor of protein sequences and its application bindn: a web-based tool for efficient prediction of dna and rna binding sites in amino acid sequences amino acid difference formula to help explain protein correlation analysis of some physical chemistry properties among genetic codons and amino acids similarity analysis of protein sequences based on the normalized relative entropy on 3-d graphical representation of dna primary sequences and their numerical characterization analysis of similarity/dissimilarity of dna sequences based on novel 2-d graphical representation milestones in graphical bioinformatics graphical representation of proteins representation of proteins as walks in 20-d space phylogenetic analysis of dna sequences based on k-word and rough set theory on the characterization of 
dna primary sequences by triplet of nucleic acid bases dv-curve: a novel intuitive tool for visualizing and analyzing dna sequences a novel method for similarity analysis and protein sub-cellular localization prediction the zagreb indices 30 years after on vertex-degree-based molecular structure descriptors graphs with fixed number of pendent vertices and minimal zeroth-order general randic index new invariant of dna sequences genetic drift of human coronavirus oc43 spike gene during adaptive evolution who mers-cov global summary and risk assessment assessing the accuracy of prediction algorithms for classification: an overview ipro54-pseknc: a sequence-based predictor for identifying sigma-54 promoters in prokaryote with pseudo k-tuple nucleotide composition using deformation energy to analyze nucleosome positioning in genomes irna-pseu: identifying rna pseudouridine sites recognition of protein coding genes in the yeast genome at better than 95% accuracy based on the z curve using a euclid distance discriminant method to find protein coding genes in the yeast genome the authors' greatest gratitude goes to the anonymous referees for their insightful suggestions and generous support.the authors are also indebted to the previous programs: the natural science foundation of liaoning province (201602005), the program for liaoning innovative research team in university (lt2014024), and the national natural science foundation of china (61762035). not applicable. the authors declare no conflict of interest, financial or otherwise. key: cord-022494-d66rz6dc authors: webb, b.; eswar, n.; fan, h.; khuri, n.; pieper, u.; dong, g.q.; sali, a. 
title: comparative modeling of drug target proteins date: 2014-10-01 journal: reference module in chemistry, molecular sciences and chemical engineering doi: 10.1016/b978-0-12-409547-2.11133-3 sha: doc_id: 22494 cord_uid: d66rz6dc in this perspective, we begin by describing the comparative protein structure modeling technique and the accuracy of the corresponding models. we then discuss the significant role that comparative prediction plays in drug discovery. we focus on virtual ligand screening against comparative models and illustrate the state-of-the-art by a number of specific examples. structure-based or rational drug discovery has already resulted in a number of drugs on the market and many more in the development pipeline. [1] [2] [3] [4] structure-based methods are now routinely used in almost all stages of drug development, from target identification to lead optimization. [5] [6] [7] [8] central to all structure-based discovery approaches is the knowledge of the three-dimensional (3d) structure of the target protein or complex because the structure and dynamics of the target determine which ligands it binds. the 3d structures of the target proteins are best determined by experimental methods that yield solutions at atomic resolution, such as x-ray crystallography and nuclear magnetic resonance (nmr) spectroscopy. 9 while developments in the techniques of experimental structure determination have enhanced the applicability, accuracy, and speed of these structural studies, 10, 11 structural characterization of sequences remains an expensive and time-consuming task. the publicly available protein data bank (pdb) 12 currently contains ~92 000 structures and grows at a rate of approximately 40% every 2 years. on the other hand, the various genome-sequencing projects have resulted in over 40 million sequences, including the complete genetic blueprints of humans and hundreds of other organisms. 
13, 14 this achievement has resulted in a vast collection of sequence information about possible target proteins with little or no structural information. current statistics show that the structures available in the pdb account for less than 1% of the sequences in the uniprot database. 13 moreover, the rate of growth of the sequence information is more than twice that of the structures, and is expected to accelerate even more with the advent of readily available next-generation sequencing technologies. due to this wide sequence-structure gap, reliance on experimentally determined structures limits the number of proteins that can be targeted by structure-based drug discovery. fortunately, domains in protein sequences are gradually evolving entities that can be clustered into a relatively small number of families with similar sequences and structures. 15, 16 for instance, 75-80% of the sequences in the uniprot database have been grouped into fewer than 15 000 domain families. 17, 18 similarly, all the structures in the pdb have been classified into about 1000 distinct folds. 19, 20 computational protein structure prediction methods, such as threading 21 and comparative protein structure modeling, 22, 23 strive to bridge the sequence-structure gap by utilizing these evolutionary relationships. the speed, low cost, and relative accuracy of these computational methods have led to the use of predicted 3d structures in the drug discovery process. 24, 25 the other class of prediction methods, de novo or ab initio methods, attempts to predict the structure from sequence alone, without reliance on evolutionary relationships. however, despite progress in these methods, [26] [27] [28] especially for small proteins with fewer than 100 amino acid residues, comparative modeling remains the most reliable method of predicting the 3d structure of a protein, with an accuracy that can be comparable to a low-resolution, experimentally determined structure. 
9 the basis of comparative modeling the primary requirement for reliable comparative modeling is a detectable similarity between the sequence of interest (target sequence) and a known structure (template). as early as 1986, chothia and lesk 29 showed that there is a strong correlation between sequence and structural similarities. this correlation provides the basis of comparative modeling, allows a coarse assessment of model errors, and also highlights one of its major challenges: modeling the structural differences between the template and target structures 30 (figure 1 ). comparative modeling stands to benefit greatly from the structural genomics initiative. 31 structural genomics aims to achieve significant structural coverage of the sequence space with an efficient combination of experimental and prediction methods. 32 this goal is pursued by careful selection of target proteins for structure determination by x-ray crystallography and nmr spectroscopy, such that most other sequences are within 'modeling distance' (e.g., >30% sequence identity) of a known structure. 15, 16, 31, 33 the expectation is that the determination of these structures combined with comparative modeling will yield useful structural information for the largest possible fraction of sequences in the shortest possible timeframe. the impact of structural genomics is illustrated by comparative modeling based on the structures determined by the new york structural genomics research consortium. for each new structure without a close homolog in the pdb, on average, 3500 protein sequences without any prior structural characterization could be modeled at least at the level of the fold. 34 thus, the structures of most proteins will eventually be predicted by computation, not determined by experiment. in this review, we begin by describing the various steps involved in comparative modeling. 
next, we emphasize two aspects of model refinement, loop modeling and side-chain modeling, due to their relevance in ligand docking and rational drug discovery. we then discuss the errors in comparative models. finally, we describe the role of comparative modeling in drug discovery, focusing on ligand docking against comparative models. we compare successes of docking against models and x-ray structures, and illustrate the computational docking against models with a number of examples. we conclude with a summary of topics that will impact on the future utility of comparative modeling in drug discovery, including an automation and integration of resources required for comparative modeling and ligand docking. comparative modeling consists of four main steps 23 (figure 2 (a)): (1) fold assignment that identifies similarity between the target sequence of interest and at least one known protein structure (the template); (2) alignment of the target sequence and the template(s); (3) building a model based on the alignment with the chosen template(s); and (4) predicting model errors. although fold assignment and sequence-structure alignment are logically two distinct steps in the process of comparative modeling, in practice almost all fold assignment methods also provide sequence-structure alignments. in the past, fold assignment methods were optimized for better sensitivity in detecting remotely related homologs, often at the cost of alignment accuracy. however, recent methods simultaneously optimize both the sensitivity and alignment accuracy. therefore, in the following discussion, we will treat fold assignment and sequence-structure alignment as a single protocol, explaining the differences as needed. as mentioned earlier, the primary requirement for comparative modeling is the identification of one or more known template structures with detectable similarity to the target sequence. 
the identification of suitable templates is achieved by scanning structure databases, such as pdb, 12 scop, 19 dali, 36 and cath, 20 with the target sequence as the query. the detected similarity is usually quantified in terms of sequence identity or statistical measures, such as e-value or z-score, depending on the method used. figure 1 average model accuracy as a function of sequence identity. 30 as the sequence identity between the target sequence and the template structure decreases, the average structural similarity between the template and the target also decreases (dashed line, triangles). 29 structural overlap is defined as the fraction of equivalent cα atoms. for the comparison of the model with the actual structure (filled circles), two cα atoms were considered equivalent if they belonged to the same residue and were within 3.5 Å of each other after least-squares superposition. for comparisons between the template structure and the actual target structure (triangles), two cα atoms were considered equivalent if they were within 3.5 Å of each other after alignment and rigid-body superposition. the difference between the model and the actual target structure is a combination of the target-template differences (green area) and the alignment errors (red area). the figure was constructed by calculating 3993 comparative models based on a single template of varying similarity to the targets. all targets had known (experimentally determined) structures. 30 sequence-structure relationships are coarsely classified into three different regimes in the sequence similarity spectrum: (1) the easily detected relationships characterized by >30% sequence identity; (2) the 'twilight zone,' 37 corresponding to relationships with statistically significant sequence similarity in the 10-30% range; and (3) the 'midnight zone,' 37 corresponding to statistically insignificant sequence similarity. 
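the three regimes above can be made concrete with a small helper that computes percent identity from a pairwise alignment and assigns a regime. this is a rough sketch under stated assumptions: identity is computed over columns where neither sequence has a gap (other denominators are in use), statistical significance is passed in as a flag rather than computed, and any case not covered by the first two regimes is treated as the midnight zone. the function names and example sequences are hypothetical.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent identity over aligned columns in which neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(x, y) for x, y in zip(aligned_a, aligned_b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return 100.0 * sum(x == y for x, y in pairs) / len(pairs)


def similarity_regime(identity, significant):
    """Coarse regime label following the three-way classification in the text."""
    if identity > 30.0:
        return "easily detected"
    if significant and identity >= 10.0:
        return "twilight zone"
    # Assumption: everything else, including insignificant matches, falls here.
    return "midnight zone"


# Hypothetical aligned pair: 7 identical residues in 9 gap-free columns.
identity = percent_identity("ACDEFGHIKL", "ACDQFG-IKV")
regime = similarity_regime(identity, significant=True)
```

in practice the significance flag would come from the e-value or z-score reported by the search method, as described above.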
for closely related protein sequences with identities higher than 30-40%, the alignments produced by all methods are almost always largely correct. the quickest way to search for suitable templates in this regime is to use simple pairwise sequence alignment methods such as ssearch, 38 blast, 39 and fasta. 38 brenner et al. showed that these methods detect only ~18% of the homologous pairs at less than 40% sequence identity, while they identify more than 90% of the relationships when sequence identity is between 30% and 40%. 40 another benchmark, based on 200 reference structural alignments with 0-40% sequence identity, indicated that blast is able to correctly align only 26% of the residue positions. 41 the sensitivity of the search and the accuracy of the alignment deteriorate progressively as the relationships move into the twilight zone. 37,42 a significant improvement in this area was the introduction of profile methods by gribskov and co-workers. 43 the profile of a sequence is derived from a multiple sequence alignment and specifies residue-type occurrences for each alignment position. the information in a multiple sequence alignment is most often encoded as either a position-specific scoring matrix (pssm) 39, 44, 45 or as a hidden markov model (hmm). 46, 47 to identify suitable templates for comparative modeling, the profile of the target sequence is used to search against a database of template sequences. the profile-sequence methods are more sensitive in detecting related structures in the twilight zone than the pairwise sequence-based methods; they detect approximately twice the number of homologs under 40% sequence identity. 41, 48, 49 the resulting profile-sequence alignments correctly align approximately 43-48% of residues in the 0-40% sequence identity range; 41,50 this number is almost twice as large as that of the pairwise sequence methods. 
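the idea of a profile encoded as a pssm can be illustrated with a toy construction. the sketch below is not the scheme of psi-blast or any other program named here: it assumes a uniform background frequency, a flat pseudocount, and no sequence weighting, and the three-column alignment is invented for the example.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(msa, pseudocount=1.0):
    """Log-odds PSSM: one dict per alignment column, residue -> log2(p / background)."""
    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background
    pssm = []
    for col in range(len(msa[0])):
        # Residue-type occurrences at this position; gap characters are ignored.
        residues = [seq[col] for seq in msa if seq[col] in AMINO_ACIDS]
        total = len(residues) + pseudocount * len(AMINO_ACIDS)
        column = {}
        for aa in AMINO_ACIDS:
            p = (residues.count(aa) + pseudocount) / total
            column[aa] = math.log2(p / background)
        pssm.append(column)
    return pssm


def score(pssm, seq):
    """Ungapped score of a sequence of the same length as the profile."""
    return sum(column[aa] for column, aa in zip(pssm, seq))


# Hypothetical multiple sequence alignment with three columns.
msa = ["ACD", "ACE", "ACD", "GCD"]
pssm = build_pssm(msa)
family_like = score(pssm, "ACD")
unrelated = score(pssm, "WWW")
```

residues that are frequent in a column receive positive scores and unobserved residues negative ones, so a family-like sequence scores higher than an unrelated one, which is the property that makes profile searches more sensitive than single-sequence comparisons.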
frequently used programs for profile-sequence alignment are psi-blast, 39 sam, 51 hmmer, 46 hhsearch, 52 hhblits, 53 and build_profile. 54 as a natural extension, the profile-sequence alignment methods have led to profile-profile alignment methods that search for suitable template structures by scanning the profile of the target sequence against a database of template profiles, as opposed to a database of template sequences. these methods have proven to include the most sensitive and accurate fold assignment and alignment protocols to date. 50, [55] [56] [57] figure 2 comparative protein structure modeling. (a) a flowchart illustrating the steps in the construction of a comparative model. 23 (b) description of comparative modeling by extraction of spatial restraints as implemented in modeller. 35 by default, spatial restraints in modeller include: (1) homology-derived restraints from the aligned template structures; (2) statistical restraints derived from all known protein structures; and (3) stereochemical restraints from the charmm-22 molecular mechanics force field. these restraints are combined into an objective function that is then optimized to calculate the final 3d model of the target sequence. profile-profile methods detect ~28% more relationships at the superfamily level and improve the alignment accuracy by 15-20% compared to profile-sequence methods. 50, 58 there are a number of variants of profile-profile alignment methods that differ in the scoring functions they use. 50, 55, [58] [59] [60] [61] [62] [63] [64] however, several analyses have shown that the overall performances of these methods are comparable. 50, [55] [56] [57] some of the programs that can be used to detect suitable templates are ffas, 65 sp3, 58 salign, 50 hhblits, 53 hhsearch, 52 and ppscan. 
54 sequence-structure threading methods as the sequence identity drops below the threshold of the twilight zone, there is usually insufficient signal in the sequences or their profiles for the sequence-based methods discussed above to detect true relationships. 48 sequence-structure threading methods are most useful in this regime as they can sometimes recognize common folds, even in the absence of any statistically significant sequence similarity. 21 these methods achieve higher sensitivity by using structural information derived from the templates. the accuracy of a sequence-structure match is assessed by the score of a corresponding coarse model and not by sequence similarity, as in sequence comparison methods. 21 the scoring scheme used to evaluate the accuracy is either based on residue substitution tables dependent on structural features such as solvent exposure, secondary structure type, and hydrogen bonding properties, 58,66-68 or on statistical potentials for residue interactions implied by the alignment. [69] [70] [71] [72] [73] the use of structural data does not have to be restricted to the structure side of the aligned sequence-structure pair. for example, sam-t08 makes use of the predicted local structure for the target sequence to enhance homolog detection and alignment accuracy. 74 commonly used threading programs are genthreader, 66,75 3d-pssm, 76 fugue, 68 sp3, 58 sam-t08 multitrack hmm, 67, 74, 77 and muster. 78 iterative sequence-structure alignment yet another strategy is to optimize the alignment by iterating over the process of calculating alignments, building models, and evaluating models. such a protocol can sample alignments that are not statistically significant and identify the alignment that yields the best model. although this procedure can be time-consuming, it can significantly improve the accuracy of the resulting comparative models in difficult cases. 
79 regardless of the method used, searching in the twilight and midnight zones of the sequence-structure relationship often results in false negatives, false positives, or alignments that contain an increasingly large number of gaps and alignment errors. improving the performance and accuracy of methods in this regime remains one of the main tasks of comparative modeling today. 80 it is imperative to calculate an accurate alignment between the target-template pair. although some progress has been made recently, 81 comparative modeling can rarely recover from an alignment error. 82 after a list of all related protein structures and their alignments with the target sequence has been obtained, template structures are prioritized depending on the purpose of the comparative model. template structures may be chosen purely based on the target-template sequence identity or a combination of several other criteria, such as experimental accuracy of the structures (resolution of x-ray structures, number of restraints per residue for nmr structures), conservation of active-site residues, holo-structures that have bound ligands of interest, and prior biological information that pertains to the solvent, ph, and quaternary contacts. it is not necessary to select only one template. in fact, the use of several templates approximately equidistant from the target sequence generally increases the model accuracy. 83, 84

model building

once an initial target-template alignment is built, a variety of methods can be used to construct a 3d model for the target protein. 23, 82, [85] [86] [87] [88] the original and still widely used method is modeling by rigid-body assembly. 86, 87, 89 this method constructs the model from a few core regions, and from loops and side chains that are obtained by dissecting related structures. commonly used programs that implement this method are composer, 90-93 3d-jigsaw, 94 rosettacm, 81 and swiss-model.
95 another family of methods, modeling by segment matching, relies on the approximate positions of conserved atoms from the templates to calculate the coordinates of other atoms. [96] [97] [98] [99] [100] an instance of this approach is implemented in segmod. 99 the third group of methods, modeling by satisfaction of spatial restraints, uses either distance geometry or optimization techniques to satisfy spatial restraints obtained from the alignment of the target sequences with the template structures. 35, [101] [102] [103] [104] specifically, modeller, 35,105,106 our own program for comparative modeling, belongs to this group of methods. modeller implements comparative protein structure modeling by the satisfaction of spatial restraints that include: (1) homology-derived restraints on the distances and dihedral angles in the target sequence, extracted from its alignment with the template structures; 35 (2) stereochemical restraints such as bond length and bond angle preferences, obtained from the charmm-22 molecular mechanics force field; 107 (3) statistical preferences for dihedral angles and nonbonded interatomic distances, obtained from a representative set of known protein structures; 108 and (4) optional manually curated restraints, such as those from nmr spectroscopy, rules of secondary structure packing, cross-linking experiments, fluorescence spectroscopy, image reconstruction from electron microscopy, site-directed mutagenesis, and intuition (figure 2(b)). the spatial restraints, expressed as probability density functions, are combined into an objective function that is optimized by a combination of conjugate gradients and molecular dynamics with simulated annealing. this model-building procedure is similar to structure determination by nmr spectroscopy. accuracies of the various model-building methods are relatively similar when used optimally.
109, 110 other factors, such as template selection and alignment accuracy, usually have a larger impact on the model accuracy, especially for models based on less than 30% sequence identity to the templates. however, it is important that a modeling method allows a degree of flexibility and automation to obtain better models more easily and rapidly. for example, a method should allow for an easy recalculation of a model when a change is made in the alignment; it should be straightforward to calculate models based on several templates; and the method should provide tools for incorporation of prior knowledge about the target (e.g., cross-linking restraints and predicted secondary structure). protein sequences evolve through a series of amino acid residue substitutions, insertions, and deletions. while substitutions can occur throughout the length of the sequence, insertions and deletions mostly occur on the surface of proteins in segments that connect regular secondary structure segments (i.e., loops). while the template structures are helpful in the modeling of the aligned target backbone segments, they are generally less valuable for the modeling of side chains and irrelevant for the modeling of insertions such as loops. the loops and side chains of comparative models are especially important for ligand docking; thus, we discuss them in the following two sections. loop modeling is an especially important aspect of comparative modeling in the range from 30% to 50% sequence identity. in this range of overall similarity, loops among the homologs vary while the core regions are still relatively conserved and aligned accurately. loops often play an important role in defining the functional specificity of a given protein, forming the active and binding sites. loop modeling can be seen as a mini protein folding problem because the correct conformation of a given segment of a polypeptide chain has to be calculated mainly from the sequence of the segment itself. 
however, loops are generally too short to provide sufficient information about their local fold. even identical decapeptides in different proteins do not always have the same conformation. 111, 112 some additional restraints are provided by the core anchor regions that span the loop and by the structure of the rest of the protein that cradles the loop. although many loop-modeling methods have been described, it is still challenging to correctly and confidently model loops longer than approximately 10-12 residues. 105, 113, 114

two classes of methods

there are two main classes of loop-modeling methods: (1) database search approaches that scan a database of all known protein structures to find segments fitting the anchor core regions; 98, 115 and (2) conformational search approaches that rely on optimizing a scoring function. [116] [117] [118] there are also methods that combine these two approaches. 119, 120

database-based loop modeling

the database search approach to loop modeling is accurate and efficient when a database of specific loops is created to address the modeling of the same class of loops, such as β-hairpins, 121 or loops on a specific fold, such as the hypervariable regions in the immunoglobulin fold. 115, 122 there are attempts to classify loop conformations into more general categories, thus extending the applicability of the database search approach. [123] [124] [125] however, the database methods are limited because the number of possible conformations increases exponentially with the length of a loop, and until the late 1990s only loops up to 7 residues long could be modeled using the database of known protein structures. 126, 127 however, the growth of the pdb in recent years has largely eliminated this problem. 128

optimization-based methods

there are many optimization-based methods, exploiting different protein representations, objective functions, and optimization or enumeration algorithms.
the search algorithms include the minimum perturbation method, 129 dihedral angle search through a rotamer library, 114,130 molecular dynamics simulations, 119, 131 genetic algorithms, 132 monte carlo and simulated annealing, [133] [134] [135] multiple-copy simultaneous search, 136 self-consistent field optimization, 137 robotics-inspired kinematic closure, 138 and enumeration based on graph theory. 139 the accuracy of loop predictions can be further improved by clustering the sampled loop conformations and partially accounting for the entropic contribution to the free energy. 140 another way of improving the accuracy of loop predictions is to consider the solvent effects. improvements in implicit solvation models, such as the generalized born solvation model, motivated their use in loop modeling. the solvent contribution to the free energy can be added to the scoring function for optimization, or it can be used to rank the sampled loop conformations after they are generated with a scoring function that does not include the solvent terms. 105, [141] [142] [143]

side-chain modeling

two simplifications are frequently applied in the modeling of side-chain conformations. 144 first, amino acid residue replacements often leave the backbone structure almost unchanged, 145 allowing us to fix the backbone during the search for the best side-chain conformations. second, most side chains in high-resolution crystallographic structures can be represented by a limited number of conformers that comply with stereochemical and energetic constraints. 146 this observation motivated ponder and richards 147 to develop the first library of side-chain rotamers for the 17 types of residues with dihedral angle degrees of freedom in their side chains, based on 10 high-resolution protein structures determined by x-ray crystallography. subsequently, a number of additional libraries have been derived.
[148] [149] [150] [151] [152] [153] [154] [155]

rotamers

rotamers on a fixed backbone are often used when all the side chains need to be modeled on a given backbone. this approach reduces the combinatorial explosion associated with a full conformational search of all the side chains, and is applied by some comparative modeling 86 and protein design approaches. 156 however, ~15% of the side chains cannot be represented well by these libraries. 157 in addition, it has been shown that the accuracy of side-chain modeling on a fixed backbone decreases rapidly when the backbone errors are larger than 0.5 Å. 158 earlier methods for side-chain modeling often put less emphasis on the energy or scoring function. the function was usually greatly simplified, and consisted of the empirical rotamer preferences and simple repulsion terms for nonbonded contacts. 151 nevertheless, these approaches have been justified by their performance. for example, a method based on a rotamer library compared favorably with that based on a molecular mechanics force field, 159 and new methods continue to be based on the rotamer library approach. [160] [161] [162] the various optimization approaches include a monte carlo simulation, 163 simulated annealing, 164 a combination of monte carlo and simulated annealing, 165 the dead-end elimination theorem, 166, 167 genetic algorithms, 155 neural network with simulated annealing, 168 mean field optimization, 169 and combinatorial searches. 151, 170, 171 several studies focused on the testing of more sophisticated potential functions for conformational search 171, 172 and development of new scoring functions for side-chain modeling, 173 reporting higher accuracy than earlier studies. the major sources of error in comparative modeling are discussed in the relevant sections above. the following is a summary of these errors, dividing them into five categories (figure 3).
this error is a potential problem when distantly related proteins are used as templates (i.e., less than 30% sequence identity). distinguishing between a model based on an incorrect template and a model based on an incorrect alignment with a correct template is difficult. in both cases, the evaluation methods (below) will predict an unreliable model. the conservation of the key functional or structural residues in the target sequence increases the confidence in a given fold assignment. the single source of errors with the largest impact on comparative modeling is misalignments, especially when the target-template sequence identity decreases below 30%. alignment errors can be minimized in two ways. using the profile-based methods discussed above usually results in more accurate alignments than those from pairwise sequence alignment methods. another way of improving the alignment is to modify those regions in the alignment that correspond to predicted errors in the model. 83 segments of the target sequence that have no equivalent region in the template structure (i.e., insertions or loops) are among the most difficult regions to model. again, when the target and template are distantly related, errors in the alignment can lead to incorrect positions of the insertions. using alignment methods that incorporate structural information can often correct such errors. once a reliable alignment is obtained, various modeling protocols can predict the loop conformation for insertions of fewer than 8-10 residues. 105, 113, 119, 174

distortions and shifts in correctly aligned regions

as a consequence of sequence divergence, the main-chain conformation changes, even if the overall fold remains the same. therefore, it is possible that in some correctly aligned segments of a model, the template is locally different (<3 Å) from the target, resulting in errors in that region.
the structural differences are sometimes not due to differences in sequence, but are a consequence of artifacts in structure determination or structure determination in different environments (e.g., packing of subunits in a crystal). the simultaneous use of several templates can minimize this kind of error. 83, 84

figure 3 typical errors in comparative modeling. 23 shown are the typical sources of errors encountered in comparative models. two of the major sources of errors in comparative modeling are due to incorrect templates or incorrect alignments with the correct templates. the modeling procedure can rarely recover from such errors. the next significant source of errors arises from regions in the target with no corresponding region in the template, i.e., insertions or loops. other sources of errors, which occur even with an accurate alignment, are due to rigid-body shifts, distortions in the backbone, and errors in the packing of side chains.

as the sequences diverge, the packing of the atoms in the protein core changes. sometimes even the conformation of identical side chains is not conserved, a pitfall for many comparative modeling methods. side-chain errors are critical if they occur in regions that are involved in protein function, such as active sites and ligand-binding sites. the accuracy of the predicted model determines the information that can be extracted from it. thus, estimating the accuracy of a model in the absence of the known structure is essential for interpreting it. as discussed earlier, a model calculated using a template structure that shares more than 30% sequence identity with the target is generally accurate overall. however, when the sequence identity is lower, the first aspect of model evaluation is to confirm whether or not a correct template was used for modeling. it is often the case, when operating in this regime, that the fold assignment step produces only false positives.
a further complication is that at such low similarities the alignment generally contains many errors, making it difficult to distinguish between an incorrect template on one hand and an incorrect alignment with a correct template on the other hand. there are several methods that use 3d profiles and statistical potentials, 70, 175, 176 which assess the compatibility between the sequence and modeled structure by evaluating the environment of each residue in a model with respect to the expected environment, as found in native high-resolution experimental structures. these methods can be used to assess whether or not the correct template was used for the modeling. they include verify3d, 175 184 and tsvmod. 185 even when the model is based on alignments that have >30% sequence identity, other factors, including the environment, can strongly influence the accuracy of a model. for instance, some calcium-binding proteins undergo large conformational changes when bound to calcium. if a calcium-free template is used to model the calcium-bound state of the target, it is likely that the model will be incorrect, irrespective of the target-template similarity or accuracy of the template structure. 186 the model should also be subjected to evaluations of self-consistency to ensure that it satisfies the restraints used to calculate it. additionally, the stereochemistry of the model (e.g., bond lengths, bond angles, backbone torsion angles, and nonbonded contacts) may be evaluated using programs such as procheck 187 and whatcheck. 188 although errors in stereochemistry are rare and less informative than errors detected by statistical potentials, a cluster of stereochemical errors may indicate that there are larger errors (e.g., alignment errors) in that region. 
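the stereochemical quantities named above (bond lengths and backbone torsion angles) can be computed directly from atomic coordinates. the following is a minimal python sketch under stated assumptions, not the procedure used by procheck or whatcheck: the function names `bond_length`, `dihedral`, and `flag_bond_outliers`, the default tolerance, and the idea of passing ideal bond lengths in by hand are all hypothetical choices made for illustration.

```python
import math

def _sub(a, b):
    """Component-wise difference of two 3D points."""
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def bond_length(p, q):
    """Euclidean distance between two atoms (same units as the input)."""
    d = _sub(p, q)
    return math.sqrt(_dot(d, d))

def dihedral(p0, p1, p2, p3):
    """Signed torsion angle (degrees) defined by four consecutive atoms.

    For a backbone phi angle the atoms would be C(i-1), N(i), CA(i), C(i);
    for psi they would be N(i), CA(i), C(i), N(i+1).
    """
    b1 = _sub(p1, p0)
    b2 = _sub(p2, p1)
    b3 = _sub(p3, p2)
    n1 = _cross(b1, b2)          # normal of the plane through p0, p1, p2
    n2 = _cross(b2, b3)          # normal of the plane through p1, p2, p3
    x = _dot(n1, n2)
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, n2)
    return math.degrees(math.atan2(y, x))

def flag_bond_outliers(bonds, tolerance=0.05):
    """Return labels of bonds whose length deviates from the supplied ideal.

    `bonds` is a list of (label, atom_a, atom_b, ideal_length) tuples; the
    ideal lengths and the tolerance are caller-supplied assumptions.
    """
    return [label for label, a, b, ideal in bonds
            if abs(bond_length(a, b) - ideal) > tolerance]
```

in practice the coordinates would be read from the model file and the flagged positions inspected together; as noted above, a cluster of outliers in one region is more informative than any single deviation.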
when multiple models are calculated for the target based on a single template or when multiple loops are built for a single or multiple models, it is practical to select a subset of models or loops that are judged to be most suitable for subsequent docking calculations. if some known ligands or other information for the desired model is available, model selection should be guided by this known information. 189 if this extra information is not available, model selection should aim to select the most accurate model. while models or loops can be selected by the energy function used for guiding the building of comparative models or the sampling of loop configurations, using a separate statistical potential for selecting the most accurate models or loops is often more successful. 181, 182, 190, 191 it is crucial for method developers and users alike to assess the accuracy of their methods. an attempt to address this problem has been made by the critical assessment of techniques for protein structure prediction (casp) 192 and in the past by the critical assessment of fully automated structure prediction (cafasp) experiments, 193 which are now integrated into casp. however, casp assesses methods only over a limited number of target protein sequences, and is conducted only every 2 years. 109, 194 to overcome this limitation, the new cameo web server continuously evaluates the accuracy and reliability of a number of comparative protein structure prediction servers, in a fully automated manner. 195 every week, cameo provides each tested server with the prerelease sequences of structures that are to be shortly released by the pdb. each server then has 4 days to build and return a 3d model of these sequences. when the pdb releases the structures, cameo compares the models against the experimentally determined structures, and presents the results on its web site. this enables developers, non-expert users, and reviewers to determine the performance of the tested prediction servers.
cameo is similar in concept to two prior such continuous testing servers, livebench 194 and eva. 196, 197 there is a wide range of applications of protein structure models (figure 4). 1, [198] [199] [200] [201] [202] [203] [204] for example, high- and medium-accuracy comparative models are frequently helpful in refining functional predictions that have been based on a sequence match alone because ligand binding is more directly determined by the structure of the binding site than by its sequence. it is often possible to predict correctly features of the target protein that do not occur in the template structure. 205, 206 for example, the size of a ligand may be predicted from the volume of the binding site cleft and the location of a binding site for a charged ligand can be predicted from a cluster of charged residues on the protein. fortunately, errors in the functionally important regions in comparative models are often relatively low because the functional regions, such as active sites, tend to be more conserved in evolution than the rest of the fold. even low-accuracy comparative models may be useful, for example, for assigning the fold of a protein. fold assignment can be very helpful in drug discovery, because it can shortcut the search for leads by pointing to compounds that have been previously developed for other members of the same family. 207, 208

figure 4 accuracy and applications of protein structure models. 9 shown are the different ranges of applicability of comparative protein structure modeling, threading, and de novo structure prediction, their corresponding accuracies, and their sample applications.

the remainder of this review focuses on the use of comparative models for ligand docking. [209] [210] [211] comparative protein structure modeling extends the applicability of virtual screening beyond the atomic structures determined by x-ray crystallography or nmr spectroscopy.
in fact, comparative models have been used in virtual screening to detect novel ligands for many protein targets, 201 including the g-protein coupled receptors (gpcr), 210, [212] [213] [214] [215] [216] [217] [218] [219] [220] [221] [222] [223] protein kinases, [224] [225] [226] [227] nuclear hormone receptors, and many different enzymes. [228] [229] [230] [231] [232] [233] [234] [235] [236] [237] [238] [239] [240] [241] nevertheless, the relative utility of comparative models versus experimentally determined structures has only been relatively sparsely assessed. 212, 224, 225, [242] [243] [244] the utility of comparative models for molecular docking screens in ligand discovery has been documented 245 with the aid of 38 protein targets selected from the 'directory of useful decoys' (dud). 246 for each target sequence, templates for comparative modeling were obtained from the pdb, including at least one holo (ligand bound) and one apo (ligand free) template structure for each of the eight 10% sequence identity ranges from 20% to 100%. in total, 222 models were generated based on 222 templates for the 38 test proteins using modeller 9v2. 35 dud ligands and decoys (98 266 molecules) were screened against the holo x-ray structure, the apo x-ray structure, and the comparative models of each target using dock 3.5.54. 247 the accuracy of virtual screening was evaluated by the overall ligand enrichment that was calculated by integrating the area under the enrichment plot (logauc). a key result was that, for 63% and 79% of the targets, at least one comparative model yielded ligand enrichment better or comparable to that of the corresponding holo and apo x-ray structure. 245 this result indicates that comparative models can be useful docking targets when multiple templates are available. however, it was not possible to predict which model, out of all those used, led to the highest enrichment. 
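the enrichment measure described above can be illustrated with a short sketch. it assumes one common definition of logauc: the area under the curve of the fraction of known ligands recovered versus the base-10 logarithm of the fraction of the ranked database screened, integrated from a lower cutoff (0.1% here) to 100% and divided by the width of that logarithmic interval. the cited study may differ in details such as the cutoff, the interpolation, or subtraction of a random-expectation baseline; the function name `log_auc` and its parameters are hypothetical.

```python
import math

def log_auc(ranked_is_ligand, min_fraction=0.001):
    """Area under a semi-log enrichment curve, scaled to the range 0..1.

    ranked_is_ligand: sequence of booleans, one per docked molecule, ordered
        from best to worst docking score; True marks a known ligand and
        False marks a decoy.
    min_fraction: lower bound of the screened-database fraction (assumed
        default of 0.1%).
    """
    n_total = len(ranked_is_ligand)
    n_ligands = sum(1 for flag in ranked_is_ligand if flag)
    if n_total == 0 or n_ligands == 0:
        raise ValueError("need at least one molecule and one known ligand")
    if not 0.0 < min_fraction < 1.0:
        raise ValueError("min_fraction must lie strictly between 0 and 1")

    area = 0.0
    found = 0
    for rank, is_ligand in enumerate(ranked_is_ligand, start=1):
        if is_ligand:
            found += 1
        if rank == n_total:
            break  # the last point sits at 100% screened and adds no width
        # Treat the curve as a step function that holds the current recovery
        # until the next molecule, clipped to the integration range.
        lo = max(rank / n_total, min_fraction)
        hi = max((rank + 1) / n_total, min_fraction)
        area += (found / n_ligands) * (math.log10(hi) - math.log10(lo))

    # Divide by the width of the log10 interval [min_fraction, 1].
    return area / -math.log10(min_fraction)
```

the input list would typically come from sorting the screened library by docking score, best first. weighting the early part of the ranking on a logarithmic axis rewards models that place known ligands near the top of the list, which is the part of a screen that is actually tested experimentally.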
because the best-performing model could not be identified in advance, a 'consensus' enrichment score was computed by ranking each library compound by its best docking score against all comparative models and/or templates. for 47% and 70% of the targets, the consensus enrichment for multiple models was better or comparable to that of the holo and apo x-ray structures, respectively, suggesting that multiple comparative models can be useful for virtual screening. despite problems with comparative modeling and ligand docking, comparative models have been successfully used in practice in conjunction with virtual screening to identify novel inhibitors. we briefly review a few of these success stories to highlight the potential of the combined comparative modeling and ligand-docking approach to drug discovery. comparative models have been employed to aid rational drug design against parasites for more than 20 years. 132, 231, 232, 240 as early as 1993, ring et al. 132 used comparative models for computational docking studies that identified low micromolar nonpeptidic inhibitors of proteases in malarial and schistosome parasite lifecycles. li et al. 231 subsequently used similar methods to develop nanomolar inhibitors of falcipain that are active against chloroquine-resistant strains of malaria. in a study by selzer et al. 232 comparative models were used to predict new nonpeptide inhibitors of cathepsin l-like cysteine proteases in leishmania major. sixty-nine compounds were selected by dock 3.5 as strong binders to a comparative model of protein cpb, and of these, 21 had experimental ic50 values below 100 μmol l⁻¹. finally, in a study by que et al., 240 comparative models were used to rationalize ligand-binding affinities of cysteine proteases in entamoeba histolytica. specifically, this work provided an explanation for why proteases acp1 and acp2 had substrate specificity similar to that of cathepsin b, although their overall structure is more similar to that of cathepsin d. enyedy et al.
248 discovered 15 new inhibitors of matriptase by docking against its comparative model. the comparative model employed thrombin as the template, sharing only 34% sequence identity with the target sequence. moreover, some residues in the binding site are significantly different; a trio of charged asp residues in matriptase correspond to 1 tyr and 2 trp residues in thrombin. thrombin was chosen as the template, in part because it prefers substrates with positively charged residues at the p1 position, as does matriptase. the national cancer institute database was used for virtual screening that targeted the s1 site with the dock program. the 2000 best-scoring compounds were manually inspected to identify positively charged ligands (the s1 site is negatively charged), and 69 compounds were experimentally screened for inhibition, identifying the 15 inhibitors. one of them, hexamidine, was used as a lead to identify additional compounds selective for matriptase relative to thrombin. the wang group has also used similar methods to discover seven new, low-micromolar inhibitors of bcl-2, using a comparative model based on the nmr solution structure of bcl-xl. 233 schapira et al. 249 discovered a novel inhibitor of a retinoic acid receptor by virtual screening using a comparative model. in this case, the target (rar-α) and template (rar-γ) are very closely related; only three residues in the binding site are not conserved. the icm program was used for virtual screening of ligands from the available chemicals directory (acd). the 5364 high-scoring compounds identified in the first round were subsequently docked into a full atom representation of the receptor with flexible side chains to obtain a final set of 300 good-scoring hits. these compounds were then manually inspected to choose the final 30 for testing. two novel agonists were identified, with 50-nanomolar activity. zuccotto et al.
250 identified novel inhibitors of dihydrofolate reductase (dhfr) in trypanosoma cruzi (the parasite that causes chagas disease) by docking into a comparative model based on ~50% sequence identity to dhfr in l. major, a related parasite. the virtual screening procedure used dock for rigid docking of over 50 000 selected compounds from the cambridge structural database (csd). visual inspection of the top 100 hits was used to select 36 compounds for experimental testing. this work identified several novel scaffolds with micromolar ic50 values. the authors report attempting to use virtual screening results to identify compounds with greater affinity for t. cruzi dhfr than human dhfr, but it is not clear how successful they were. following the outbreak of the severe acute respiratory syndrome (sars) in 2003, anand et al. 251 used the experimentally determined structures of the main protease from human coronavirus (mpro) and an inhibitor complex of porcine coronavirus (transmissible gastroenteritis virus, tgev) mpro to calculate a comparative model of the sars coronavirus mpro. this model then provided a basis for the design of anti-sars drugs. in particular, a comparison of the active site residues in these and other related structures suggested that the ag7088 inhibitor of the human rhinovirus type 2 3c protease is a good starting point for design of anticoronaviral drugs. 252 comparative models of protein kinases combined with virtual screening have also been intensely used for drug discovery. 224, 225, [253] [254] [255] the >500 kinases in the human genome, the relatively small number of experimental structures available, and the high level of conservation around the important adenosine triphosphate-binding site make comparative modeling an attractive approach toward structure-based drug discovery. g protein-coupled receptors are another interesting class of proteins that in principle allow drug discovery through comparative modeling.
212, [256] [257] [258] [259] approximately 40% of current drug targets belong to this class of proteins. however, these proteins have been extremely difficult to crystallize and most comparative modeling has been based on the atomic resolution structure of the bovine rhodopsin. 260 despite this limitation, a rather extensive test of docking methods with rhodopsin-based comparative models shows encouraging results. the applicability of structure-based modeling and virtual screening has recently been expanded to membrane proteins that transport solutes, such as ions, metabolites, peptides, and drugs. in humans, these transporters contribute to the absorption, distribution, metabolism, and excretion of drugs, and often mediate drug-drug interactions. additionally, several transporters can be targeted directly by small molecules. for instance, methylphenidate (ritalin), which inhibits the norepinephrine transporter (net) and consequently the reuptake of norepinephrine, is used in the treatment of attention-deficit hyperactivity disorder (adhd). 261 schlessinger et al. 262 predicted 18 putative ligands of human net by docking 6436 drugs from the kyoto encyclopedia of genes and genomes (kegg drug) into a comparative model based on ~25% sequence identity to leucine transporter (leut) from aquifex aeolicus. of these 18 predicted ligands, ten were validated by cis-inhibition experiments; five of them were chemically novel. close examination of the predicted primary binding site helped rationalize interactions of net with its primary substrate, norepinephrine, as well as positive and negative pharmacological effects of other net ligands. subsequently, schlessinger et al. 263 modeled two different conformations of the human gaba transporter 2 (gat-2), using the leut structures in occluded-outward-facing and outward-facing conformations. enrichment calculations were used to assess the quality of the models in molecular dynamics simulations and side-chain refinements.
the key residue, glu48, interacting with the substrate was identified during the refinement of the models and validated by site-directed mutagenesis. docking against two conformations of the transporter enriches for different physicochemical properties of ligands. for example, top-scoring ligands found by docking against the outward-facing model were bulkier and more hydrophobic than those predicted using the occluded-outward-facing model. among twelve ligands validated in cis-inhibition assays, six were chemically novel (e.g., homotaurine). based on the validation experiments, gat-2 is likely to be a high-selectivity/low-affinity transporter. following these two studies, a combination of comparative modeling, ligand docking, and experimental validation was used to rationalize toxicity of an anti-cancer agent, acivicin. 264 the toxic side effects are thought to be facilitated by the active transport of acivicin through the blood-brain barrier (bbb) via the large-neutral amino acid transporter 1 (lat-1). in addition, four small-molecule ligands of lat-1 were identified by docking against a comparative model based on two templates, the structures of the outward-occluded arginine-bound arginine/agmatine transporter adic from e. coli 265 and the inward-apo conformation of the amino acid, polyamine, and organo-cation transporter apct from methanococcus jannaschii. 266 two of the four hits, acivicin and fenclonine, were confirmed as substrates by a trans-stimulation assay. these studies clearly illustrate the applicability of combined comparative modeling and virtual screening to ligand discovery for transporters. although reports of successful virtual screening against comparative models are encouraging, such efforts are not yet a routine part of rational drug design. even the successful efforts appear to rely strongly on visual inspection of the docking results.
much work remains to be done to improve the accuracy, efficiency, and robustness of docking against comparative models. despite assessments of relative successes of docking against comparative models and native x-ray structures, 225, 244 relatively little has been done to compare the accuracy achievable by different approaches to comparative modeling and to identify the specific structural reasons why comparative models generally produce less accurate virtual screening results than the holo structures. among the many issues that deserve consideration are the following:
• the inclusion of cofactors and bound water molecules in protein receptors is often critical for success of virtual screening; however, cofactors are not routinely included in comparative models
• most docking programs currently retain the protein receptor in a rigid conformation. while this approach is appropriate for 'lock-and-key' binding modes, it does not work when the ligand induces conformational changes in the receptor upon binding. a flexible receptor approach is necessary to address such induced-fit cases 267, 268
• the accuracy of comparative models is frequently judged by the cα root mean square error or other similar measures of backbone accuracy. for virtual screening, however, the precise positioning of side chains in the binding site is likely to be critical; measures of accuracy for binding sites are needed to help evaluate the suitability of comparative modeling algorithms for constructing models for docking
• knowledge of known inhibitors, either for the target protein or the template, should help to evaluate and improve virtual screening against comparative models. for example, comparative models constructed from holo template structures implicitly preserve some information about the ligand-bound receptor conformation
• improvement in the accuracy of models produced by comparative modeling will require methods that finely sample protein conformational space using a free energy or scoring function that has sufficient accuracy to distinguish the native structure from the nonnative conformations. despite many years of development of molecular simulation methods, attempts to refine models that are already relatively close to the native structure have met with relatively little success. this failure is likely to be due in part to inaccuracies in the scoring functions used in the simulations, particularly in the treatment of electrostatics and solvation effects. a combination of physics-based energy function with the statistical information extracted from known protein structures may provide a route to the development of improved scoring functions
• improvements in sampling strategies are also likely to be necessary, for both comparative modeling and flexible docking

given the increasing number of target sequences for which no experimentally determined structures are available, drug discovery stands to gain immensely from comparative modeling and other in silico methods. despite unsolved problems in virtually every step of comparative modeling and ligand docking, it is highly desirable to automate the whole process, starting with the target sequence and ending with a ranked list of its putative ligands. automation encourages development of better methods, improves their testing, allows application on a large scale, and makes the technology more accessible to both experts and non-specialists alike. through large-scale application, new questions, such as those about ligand-binding specificity, can in principle be addressed.
enabling a wider community to use the methods provides useful feedback and resources toward the development of the next generation of methods. there are a number of servers for automated comparative modeling (table 1) . however, in spite of automation, the process of calculating a model for a given sequence, refining its structure, as well as visualizing and analyzing its family members in the sequence and structure spaces can involve the use of scripts, local programs, and servers scattered across the internet and not necessarily interconnected. in addition, manual intervention is generally still needed to maximize the accuracy of the models in the difficult cases. the main repository for precomputed comparative models, the protein model portal, 195, 198, 279 begins to address these deficiencies by serving models from several modeling groups, including the swiss-model 95 and modbase 34 databases. it provides access to web-based comparative modeling tools, cross-links to other sequence and structure databases, and annotations of sequences and their models. a number of databases containing comparative models and web servers for computing comparative models are publicly available. the protein model portal (pmp) 195, 198, 279 centralizes access to these models created by different methodologies. the pmp is being developed as a module of the protein structure initiative knowledgebase (psi kb) 316 and functions as a meta server for comparative models from external databases, including swiss-model 95 and modbase, 34 additionally to being a repository for comparative models that are derived from structures determined by the psi centers. it provides quality estimations of the deposited models, access to web-based comparative modeling tools, cross-links to other sequence and structure databases, annotations of sequences and their models, and detailed tutorials on comparative modeling and the use of their tools. 
the pmp currently contains 19.5 million comparative models for 4.4 million uniprot sequences (august 2013). a schematic of our own attempt at integrating several useful tools for comparative modeling is shown in figure 5. 34, 291 modbase is a database that currently contains ~29 million predicted models for domains in approximately 4.7 million unique sequences from uniprot, ensembl, 269 genbank, 14 and private sequence datasets. the models were calculated using modpipe 30, 291 and modeller. 35 the web interface to the database allows flexible querying for fold assignments, sequence-structure alignments, models, and model assessments. an integrated sequence-structure viewer, chimera, 304 allows inspection and analysis of the query results. models can also be calculated using modweb, 291, 309 a web interface to modpipe, and stored in modbase, which also makes them accessible through the pmp. other resources associated with modbase include a comprehensive database of multiple protein structure alignments (dbali), 281 structurally defined ligand-binding sites, 319 structurally defined binary domain interfaces (pibase), 320, 321 predictions of ligand-binding sites, interactions between yeast proteins, and functional consequences of human nssnps (ls-snp). 199, 322, 323 a number of associated web services handle modeling of loops in protein structures (modloop), 324, 325 evaluation of models (modeval), fitting of models against small angle x-ray scattering (saxs) profiles (foxs), 326-328 modeling of ligand-induced protein dynamics such as allostery (allosmod), 329, 330 prediction of the ensemble of conformations that best fit a given saxs profile (allosmod-foxs), 331 prediction of cryptic binding sites, 332 scoring protein-ligand complexes based on a 335, 336
compared to protein structure prediction, the attempts at automation and integration of resources in the field of docking for virtual screening are still in their nascent stages.
one of the successful efforts in this direction is zinc, 317, 318 a publicly available database of commercially available drug-like compounds, developed in the laboratory of brian shoichet. zinc contains more than 21 million 'ready-to-dock' compounds organized in several subsets and allows the user to query the compounds by molecular properties and constitution. the shoichet group also provides a dockblaster service 337 that enables end-users to dock the zinc compounds against their target structures using dock. 247, 338 in the future, we will no doubt see efforts to improve the accuracy of comparative modeling and ligand docking. but perhaps as importantly, the two techniques will be integrated into a single protocol for more accurate and automated docking of ligands against sequences without known structures. as a result, the number and variety of applications of both comparative modeling and ligand docking will continue to increase.
cameo 195 http://cameo3d.org/
casp 315 http://predictioncenter.llnl.gov
figure 5 an integrated set of resources for comparative modeling. 34 various databases and programs required for comparative modeling and docking are usually scattered over the internet, and require manual intervention or a good deal of expertise to be useful. automation and integration of these resources are efficient ways to put these resources in the hands of experts and non-specialists alike. we have outlined a comprehensive interconnected set of resources for comparative modeling and hope to integrate it with a similar effort in the area of ligand docking made by the shoichet group. 317, 318
this article is partially based on papers by jacobson and sali, 201 fiser and sali, 339 and madhusudhan et al.
340 we also acknowledge the funds from sandler family supporting foundation, nih r01 gm54762, p01 gm71790, p01 a135707, and u54 gm62529, as well as sun, ibm, and intel for hardware gifts.

key: cord-018133-2otxft31 authors: altman, russ b.; mooney, sean d. title: bioinformatics date: 2006 journal: biomedical informatics doi: 10.1007/0-387-36278-9_22 sha: doc_id: 18133 cord_uid: 2otxft31

why is sequence, structure, and biological pathway information relevant to medicine? where on the internet should you look for a dna sequence, a protein sequence, or a protein structure? what are two problems encountered in analyzing biological sequence, structure, and function? how has the age of genomics changed the landscape of bioinformatics? what two changes should we anticipate in the medical record as a result of these new information sources? what are two computational challenges in bioinformatics for the future?

ular biology and genomics-have increased dramatically in the past decade. history has shown that scientific developments within the basic sciences tend to lag about a decade before their influence on clinical medicine is fully appreciated. the types of information being gathered by biologists today will drastically alter the types of information and technologies available to the health care workers of tomorrow. there are three sources of information that are revolutionizing our understanding of human biology and that are creating significant challenges for computational processing. the most dominant new type of information is the sequence information produced by the human genome project, an international undertaking intended to determine the complete sequence of human dna as it is encoded in each of the 23 chromosomes. 1 the first draft of the sequence was published in 2001 (lander et al., 2001) and a final version was announced in 2003 coincident with the 50th anniversary of the solving of the watson and crick structure of the dna double helix.
2 now efforts are under way to finish the sequence and to determine the variations that occur between the genomes of different individuals. 3 essentially, the entire set of events from conception through embryonic development, childhood, adulthood, and aging are encoded by the dna blueprints within most human cells. given a complete knowledge of these dna sequences, we are in a position to understand these processes at a fundamental level and to consider the possible use of dna sequences for diagnosing and treating disease. while we are studying the human genome, a second set of concurrent projects is studying the genomes of numerous other biological organisms, including important experimental animal systems (such as mouse, rat, and yeast) as well as important human pathogens (such as mycobacterium tuberculosis or haemophilus influenzae). many of these genomes have recently been completely determined by sequencing experiments. these allow two important types of analysis: the analysis of mechanisms of pathogenicity and the analysis of animal models for human disease. in both cases, the functions encoded by genomes can be studied, classified, and categorized, allowing us to decipher how genomes affect human health and disease. these ambitious scientific projects not only are proceeding at a furious pace, but also are accompanied in many cases by a new approach to biology, which produces a third new source of biomedical information: proteomics. in addition to small, relatively focused experimental studies aimed at particular molecules thought to be important for disease, large-scale experimental methodologies are used to collect data on thousands or millions of molecules simultaneously. scientists apply these methodologies longitudinally over time and across a wide variety of organisms or (within an organism) organs to watch the evolution of various physiological phenomena. 
new technologies give us the abilities to follow the production and degradation of molecules on dna arrays 4 (lashkari et al., 1997) , to study the expression of large numbers of proteins with one another (bai and elledge, 1997) , and to create multiple variations on a genetic theme to explore the implications of various mutations on biological function (spee et al., 1993) . all these technologies, along with the genome-sequencing projects, are conspiring to produce a volume of biological information that at once contains secrets to age-old questions about health and disease and threatens to overwhelm our current capabilities of data analysis. thus, bioinformatics is becoming critical for medicine in the twentyfirst century. the effects of this new biological information on clinical medicine and clinical informatics are difficult to predict precisely. it is already clear, however, that some major changes to medicine will have to be accommodated. with the first set of human genomes now available, it will soon become cost-effective to consider sequencing or genotyping at least sections of many other genomes. the sequence of a gene involved in disease may provide the critical information that we need to select appropriate treatments. for example, the set of genes that produces essential hypertension may be understood at a level sufficient to allow us to target antihypertensive medications based on the precise configuration of these genes. it is possible that clinical trials may use information about genetic sequence to define precisely the population of patients who would benefit from a new therapeutic agent. finally, clinicians may learn the sequences of infectious agents (such as of the escherichia coli strain that causes recurrent urinary tract infections) and store them in a patient's record to record the precise pathogenicity and drug susceptibility observed during an episode of illness. 
in any case, it is likely that genetic information will need to be included in the medical record and will introduce special problems. raw sequence information, whether from the patient or the pathogen, is meaningless without context and thus is not well suited to a printed medical record. like images, it can come in high information density and must be presented to the clinician in novel ways. as there are for laboratory tests, there may be a set of nondisease (or normal) values to use as comparisons, and there may be difficulties in interpreting abnormal values. fortunately, most of the human genome is shared and identical among individuals; less than 1 percent of the genome seems to be unique to individuals. nonetheless, the effects of sequence information on clinical databases will be significant. 2. new diagnostic and prognostic information sources. one of the main contributions of the genome-sequencing projects (and of the associated biological innovations) is that we are likely to have unprecedented access to new diagnostic and prognostic tools. single nucleotide polymorphisms (snps) and other genetic markers are used to identify how a patient's genome differs from the draft genome. diagnostically, the genetic markers from a patient with an autoimmune disease, or of an infectious pathogen within a patient, will be highly specific and sensitive indicators of the subtype of disease and of that subtype's probable responsiveness to different therapeutic agents. for example, the severe acute respiratory syndrome (sars) virus was determined to be a corona virus using a gene expression array containing the genetic information from several common pathogenic viruses. 5 in general, diagnostic tools based on the gene sequences within a patient are likely to increase greatly the number and variety of tests available to the physician. physicians will not be able to manage these tests without significant computational assistance. 
moreover, genetic information will be available to provide more accurate prognostic information to patients. what is the standard course for this disease? how does it respond to these medications? over time, we will be able to answer these questions with increasing precision, and will develop computational systems to manage this information. several genotype-based databases have been developed to identify markers that are associated with specific phenotypes and identify how genotype affects a patient's response to therapeutics. the human gene mutations database (hgmd) annotates mutations with disease phenotype. 6 this resource has become invaluable for genetic counselors, basic researchers, and clinicians. additionally, the pharmacogenomics knowledge base (pharmgkb) collects genetic information that is known to affect a patient's response to a drug. 7 as these data sets, and others like them, continue to improve, the first clinical benefits from the genome projects will be realized. 3. ethical considerations. one of the critical questions facing the genome-sequencing projects is "can genetic information be misused?" the answer is certainly yes. with knowledge of a complete genome for an individual, it may be possible in the future to predict the types of disease for which that individual is at risk years before the disease actually develops. if this information fell into the hands of unscrupulous employers or insurance companies, the individual might be denied employment or coverage due to the likelihood of future disease, however distant. there is even debate about whether such information should be released to a patient even if it could be kept confidential. should a patient be informed that he or she is likely to get a disease for which there is no treatment? this is a matter of intense debate, and such questions have significant implications for what information is collected and for how and to whom that information is disclosed (durfy, 1993; see chapter 10). 
a brief review of the biological basis of medicine will bring into focus the magnitude of the revolution in molecular biology and the tasks that are created for the discipline of bioinformatics. the genetic material that we inherit from our parents, that we use for the structures and processes of life, and that we pass to our children is contained in a sequence of chemicals known as deoxyribonucleic acid (dna). 8 the total collection of dna for a single person or organism is referred to as the genome. dna is a long polymer chemical made of four basic subunits. the sequence in which these subunits occur in the polymer distinguishes one dna molecule from another, and the sequence of dna subunits in turn directs a cell's production of proteins and all other basic cellular processes. genes are discrete units encoded in dna and they are transcribed into ribonucleic acid (rna), which has a composition very similar to dna. genes are transcribed into messenger rna (mrna) and a majority of mrna sequences are translated by ribosomes into protein. not all rnas are messengers for the translation of proteins. ribosomal rna, for example, is used in the construction of the ribosome, the huge molecular engine that translates mrna sequences into protein sequences. understanding the basic building blocks of life requires understanding the function of genomic sequences, genes, and proteins. when are genes turned on? once genes are transcribed and translated into proteins, into what cellular compartment are the proteins directed? how do the proteins function once there? equally important, how are the proteins turned off? experimentation and bioinformatics have divided the research into several areas, and the largest are: (1) genome and protein sequence analysis, (2) macromolecular structure-function analysis, (3) gene expression analysis, and (4) proteomics.
practitioners of bioinformatics have come from many backgrounds, including medicine, molecular biology, chemistry, physics, mathematics, engineering, and computer science. it is difficult to define precisely the ways in which this discipline emerged. there are, however, two main developments that have created opportunities for the use of information technologies in biology. the first is the progress in our understanding of how biological molecules are constructed and how they perform their functions. this dates back as far as the 1930s with the invention of electrophoresis, and then in the 1950s with the elucidation of the structure of dna and the subsequent sequence of discoveries in the relationships among dna, rna, and protein structure. the second development has been the parallel increase in the availability of computing power. starting with mainframe computer applications in the 1950s and moving to modern workstations, there have been hosts of biological problems addressed with computational methods. the human genome project was completed and a nearly finished sequence was published in 2003. 9 the benefit of the human genome sequence to medicine is both in the short and in the long term. the short-term benefits lie principally in diagnosis: the availability of sequences of normal and variant human genes will allow for the rapid identification of these genes in any patient (e.g., babior and matzner, 1997). the long-term benefits will include a greater understanding of the proteins produced from the genome: how the proteins interact with drugs; how they malfunction in disease states; and how they participate in the control of development, aging, and responses to disease. the effects of genomics on biology and medicine cannot be overstated. we now have the ability to measure the activity and function of genes within living cells. genomics data and experiments have changed the way biologists think about questions fundamental to life.
where in the past, reductionist experiments probed the detailed workings of specific genes, we can now assemble those data together to build an accurate understanding of how cells work. this has led to a change in thinking about the role of computers in biology. before, they were optional tools that could help provide insight to experienced and dedicated enthusiasts. today, they are required by most investigators, and experimental approaches rely on them as critical elements. twenty years ago, the use of computers was proving to be useful to the laboratory researcher. today, computers are an essential component of modern research. this is because advances in research methods such as microarray chips, drug screening robots, x-ray crystallography, nuclear magnetic resonance spectroscopy, and dna sequencing experiments have resulted in massive amounts of data. these data need to be properly stored, analyzed, and disseminated. the volume of data being produced by genomics projects is staggering. there are now more than 22.3 million sequences in genbank comprising more than 29 billion digits. 10 but these data do not stop with sequence data: pubmed contains over 15 million literature citations, the pdb contains three-dimensional structural data for over 40,000 protein sequences, and the stanford microarray database (smd) contains over 37,000 experiments (851 million data points). these data are of incredible importance to biology, and in the following sections we introduce and summarize the importance of sequences, structures, gene expression experiments, systems biology, and their computational components to medicine. sequence information (including dna sequences, rna sequences, and protein sequences) is critical in biology: dna, rna, and protein can be represented as a set of sequences of basic building blocks (bases for dna and rna, amino acids for proteins). 
computer systems within bioinformatics thus must be able to handle biological sequence information effectively and efficiently. one major difficulty within bioinformatics is that standard database models, such as relational database systems, are not well suited to sequence information. the basic problem is that sequences are important both as a set of elements grouped together and treated in a uniform manner and as individual elements, with their relative locations and functions. any given position in a sequence can be important because of its own identity, because it is part of a larger subsequence, or perhaps because it is part of a large set of overlapping subsequences, all of which have different significance. it is necessary to support queries such as, "what sequence motifs are present in this sequence?" it is often difficult to represent these multiple, nested relationships within standard relational database schema. in addition, the neighbors of a sequence element are also critical, and it is important to be able to perform queries such as, "what sequence elements are seen 20 elements to the left of this element?" for these reasons, researchers in bioinformatics are developing object-oriented databases (see chapter 6) in which a sequence can be queried in different ways, depending on the needs of the user (altman, 2003) . the sequence information mentioned in section 22.3.1 is rapidly becoming inexpensive to obtain and easy to store. on the other hand, the three-dimensional structure information about the proteins that are produced from the dna sequences is much more difficult and expensive to obtain, and presents a separate set of analysis challenges. currently, only about 30,000 three-dimensional structures of biological macromolecules are known. 11 these models are incredibly valuable resources, however, because an understanding of structure often yields detailed insights about biological function. 
as an example, the structure of the ribosome has been determined for several species and contains more atoms than any other to date. this structure, because of its size, took two decades to solve, and presents a formidable challenge for functional annotation (cech, 2000). yet, the functional information for a single structure is far outweighed by the potential for comparative genomics analysis between the structures from several organisms and from varied forms of the functional complex, since the ribosome is ubiquitously required for all forms of life. thus a wealth of information comes from relatively few structures. to address the problem of limited structure information, the publicly funded structural genomics initiative aims to identify all of the common structural scaffolds found in nature and grow the number of known structures considerably. in the end, it is the physical forces between molecules that determine what happens within a cell; thus the more complete the picture, the better the functional understanding. in particular, understanding the physical properties of therapeutic agents is the key to understanding how agents interact with their targets within the cell (or within an invading organism). these are the key questions for structural biology within bioinformatics:
1. how can we analyze the structures of molecules to learn their associated function? approaches range from detailed molecular simulations (levitt, 1983) to statistical analyses of the structural features that may be important for function (wei and altman, 1998).
11 for more information see http://www.rcsb.org/pdb/.
2. how can we extend the limited structural data by using information in the sequence databases about closely related proteins from different organisms (or within the same organism, but performing a slightly different function)? there are significant unanswered questions about how to extract maximal value from a relatively small set of examples.
3.
how should structures be grouped for the purposes of classification? the choices range from purely functional criteria ("these proteins all digest proteins") to purely structural criteria ("these proteins all have a toroidal shape"), with mixed criteria in between. one interesting resource available today is the structural classification of proteins (scop), 12 which classifies proteins based on shape and function. the development of dna microarrays has led to a wealth of data and unprecedented insight into the fundamental biological machine. the premise is relatively simple; up to 40,000 gene sequences derived from genomic data are fixed onto a glass slide or filter. an experiment is performed where two groups of cells are grown in different conditions, one group is a control group and the other is the experimental group. the control group is grown normally, while the experimental group is grown under experimental conditions. for example, a researcher may be trying to understand how a cell compensates for a lack of sugar. the experimental cells will be grown with limited amounts of sugar. as the sugar depletes, some of the cells are removed at specific intervals of time. when the cells are removed, all of the mrna from the cells is separated and converted back to dna, using special enzymes. this leaves a pool of dna molecules that are only from the genes that were turned on (expressed) in that group of cells. using a chemical reaction, the experimental dna sample is attached to a red fluorescent molecule and the control group is attached to a green fluorescent molecule. these two samples are mixed and then washed over the glass slide. the two samples contain only genes that were turned on in the cells, and they are labeled either red or green depending on whether they came from the experimental group or the control group. the labeled dna in the pool sticks or hybridizes to the same gene on the glass slide. 
this leaves the glass slide with up to 40,000 spots and genes that were turned on in the cells are now bound with a label to the appropriate spot on the slide. using a scanning confocal microscope and a laser to fluoresce the linkers, the amount of red and green fluorescence in each spot can be measured. the ratio of red to green determines whether that gene is being turned off (downregulated) in the experimental group or whether the gene is being turned on (upregulated). the experiment has now measured the activity of genes in an entire cell due to some experimental change. figure 22 .1 illustrates a typical gene expression experiment from smd. 13 computers are critical for analyzing these data, because it is impossible for a researcher to comprehend the significance of those red and green spots. currently scientists are using gene expression experiments to study how cells from different organisms compensate for environmental changes, how pathogens fight antibiotics, and how cells grow uncontrollably (as is found in cancer). a new challenge for biological computing is to develop methods to analyze these data, tools to store these data, and computer systems to collect the data automatically. with the completion of the human genome and the abundance of sequence, structural, and gene expression data, a new field of systems biology that tries to understand how proteins and genes interact at a cellular level is emerging. the basic algorithms for analyzing sequence and structure are now leading to opportunities for more integrated analysis of the pathways in which these molecules participate and ways in which molecules can be manipulated for the purpose of combating disease. a detailed understanding of the role of a particular molecule in the cell requires knowledge of the context-of the other molecules with which it interacts-and of the sequence of chemical transformations that take place in the cell. 
thus, major research areas in bioinformatics are elucidating the key pathways by which chemicals are transformed, defining the molecules that catalyze these transformations, identifying the input compounds and the output compounds, and linking these pathways into networks that we can then represent computationally and analyze to understand the significance of a particular molecule. the alliance for cell signaling is generating large amounts of data related to how signal molecules interact and affect the concentration of small molecules within the cell. there are a number of common computations that are performed in many contexts within bioinformatics. in general, these computations can be classified as sequence alignment, structure alignment, pattern analysis of sequence/structure, gene expression analysis, and pattern analysis of biochemical function. as it became clear that the information from dna and protein sequences would be voluminous and difficult to analyze manually, algorithms began to appear for automating the analysis of sequence information. the first requirement was to have a reliable way to align sequences so that their detailed similarities and distances could be examined directly. needleman and wunsch (1970) published an elegant method for using dynamic programming techniques to align sequences in time related to the cube of the number of elements in the sequences. smith and waterman (1981) published refinements of these algorithms that allowed for searching for both the best global alignment of two sequences (aligning all the elements of the two sequences) and the best local alignment (searching for areas in which there are segments of high similarity surrounded by regions of low similarity). 
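the difference between global and local alignment can be made concrete with a small sketch. the code below is a minimal illustration in the style of these dynamic programming methods, not a reproduction of the published algorithms: it assumes a simple linear gap penalty (under which the table is filled in time proportional to the product of the two sequence lengths), and the function names and the match, mismatch, and gap values are arbitrary choices made here for illustration.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best score for aligning ALL of a against ALL of b (global alignment)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # Aligning a prefix against the empty string costs one gap per element.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,       # gap in b
                score[i][j - 1] + gap,       # gap in a
            )
    return score[-1][-1]


def local_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best score over all pairs of substrings of a and b (local alignment)."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            # The floor of 0 lets an alignment restart anywhere, which is
            # what makes the search local rather than global.
            cur[j] = max(0, prev[j - 1] + pair, prev[j] + gap, cur[j - 1] + gap)
            best = max(best, cur[j])
        prev = cur
    return best
```

with these illustrative scores, a short sequence embedded in a longer one scores well locally but poorly globally, because the global alignment must pay a gap penalty for every unmatched flanking element.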
a key input for these algorithms is a matrix that encodes the similarity or substitutability of sequence elements: when there is an inexact match between two elements in an alignment of sequences, it specifies how much "partial credit" we should give the overall alignment based on the similarity of the elements, even though they may not be identical. looking at a set of evolutionarily related proteins, dayhoff et al. (1974) published one of the first matrices derived from a detailed analysis of which amino acids (elements) tend to substitute for others. within structural biology, the vast computational requirements of the experimental methods (such as x-ray crystallography and nuclear magnetic resonance) for determining the structure of biological molecules drove the development of powerful structural analysis tools. in addition to software for analyzing experimental data, graphical display algorithms allowed biologists to visualize these molecules in great detail and facilitated the manual analysis of structural principles (langridge, 1974; richardson, 1981) . at the same time, methods were developed for simulating the forces within these molecules as they rotate and vibrate (gibson and scheraga, 1967; karplus and weaver, 1976; levitt, 1983) . the most important development to support the emergence of bioinformatics, however, has been the creation of databases with biological information. in the 1970s, structural biologists, using the techniques of x-ray crystallography, set up the protein data bank (pdb) of the cartesian coordinates of the structures that they elucidated (as well as associated experimental details) and made pdb publicly available. the first release, in 1977, contained 77 structures. the growth of the database is chronicled on the web: 14 the pdb now has over 30,000 detailed atomic structures and is the primary source of information about the relationship between protein sequence and protein structure. 
similarly, as the ability to obtain the sequence of dna molecules became widespread, the need for a database of these sequences arose. in the mid-1980s, the genbank database was formed as a repository of sequence information. starting with 606 sequences and 680,000 bases in 1982, the genbank has grown by much more than 2 million sequences and 100 billion bases. the genbank database of dna sequence information supports the experimental reconstruction of genomes and acts as a focal point for experimental groups. 15 numerous other databases store the sequences of protein molecules 16 and information about human genetic diseases. 17 included among the databases that have accelerated the development of bioinformatics is the medline 18 database of the biomedical literature and its paper-based companion index medicus (see chapter 19). including articles as far back as 1953 and brought online free on the web in 1997, medline provides the glue that relates many high-level biomedical concepts to the low-level molecule, disease, and experimental methods. in fact, this "glue" role was the basis for creating the entrez and pubmed systems for integrating access to literature references and the associated databases. perhaps the most basic activity in computational biology is comparing two biological sequences to determine (1) whether they are similar and (2) how to align them. the problem of alignment is not trivial but is based on a simple idea. sequences that perform a similar function should, in general, be descendants of a common ancestral sequence, with mutations over time. these mutations can be replacements of one amino acid with another, deletions of amino acids, or insertions of amino acids. the goal of sequence alignment is to align two sequences so that the evolutionary relationship between the sequences becomes clear. 
if two sequences are descended from the same ancestor and have not mutated too much, then it is often possible to find corresponding locations in each sequence that play the same role in the evolved proteins. the problem of solving correct biological alignments is difficult because it requires knowledge about the evolution of the molecules that we typically do not have. there are now, however, well-established algorithms for finding the mathematically optimal alignment of two sequences. these algorithms require the two sequences and a scoring system based on (1) exact matches between amino acids that have not mutated in the two sequences and can be aligned perfectly; (2) partial matches between amino acids that have mutated in ways that have preserved their overall biophysical properties; and (3) gaps in the alignment signifying places where one sequence or the other has undergone a deletion or insertion of amino acids. the algorithms for determining optimal sequence alignments are based on a technique in computer science known as dynamic programming and are at the heart of many computational biology applications (gusfield, 1997) . figure 22.2 shows an example of a smith-waterman matrix. unfortunately, the dynamic programming algorithms are computationally expensive to apply, so a number of faster, more heuristic methods have been developed. the most popular algorithm is the basic local alignment search tool (blast) (altschul et al., 1990) . blast is based on the observations that sections of proteins are often conserved without gaps (so the gaps can be ignored-a critical simplification for speed) and that there are statistical analyses of the occurrence of small subsequences within larger sequences that can be used to prune the search for matching sequences in a large database. another tool that has found wide use in mining genome sequences is blat (kent, 2003) . blat is often used to search long genomic sequences with significant performance increases over blast. 
it achieves its 50-fold increase in speed over other tools by storing and indexing long sequences as nonoverlapping k-mers, allowing efficient storage, searching, and alignment on modest hardware. one of the primary challenges in bioinformatics is taking a newly determined dna sequence (as well as its translation into a protein sequence) and predicting the structure of the associated molecules, as well as their function. both problems are difficult, being fraught with all the dangers associated with making predictions without hard experimental data. nonetheless, the available sequence data are starting to be sufficient to allow good predictions in a few cases. for example, there is a web site devoted to the assessment of biological macromolecular structure prediction methods. 19 recent results suggest that when two protein molecules have a high degree (more than 40 percent) of sequence similarity and one of the structures is known, a reliable model of the other can be built by analogy. when sequence similarity is less than 25 percent, however, the performance of these methods is much less reliable. when scientists investigate biological structure, they commonly perform a task analogous to sequence alignment, called structural alignment. given two sets of three-dimensional coordinates for a set of atoms, what is the best way to superimpose them so that the similarities and differences between the two structures are clear? such computations are useful for determining whether two structures share a common ancestry and for understanding how the structures' functions have subsequently been refined during evolution. there are numerous published algorithms for finding good structural alignments. we can apply these algorithms in an automated fashion whenever a new structure is determined, thereby classifying the new structure into one of the protein families (such as those that scop maintains). one of these algorithms is minrms (jewett et al., 2003). 
20 minrms works by finding the minimal root-mean-squared-distance (rmsd) alignments of two protein structures as a function of matching residue pairs. minrms generates a family of alignments, each with a different number of residue position matches. this is useful for identifying local regions of similarity in a protein with multiple domains. minrms solves two problems. first, it determines which structural superpositions, or alignments, to evaluate. then, given this superposition, it determines which residues should be considered "aligned" or matched. computationally, this is a very difficult problem. minrms reduces the search space by limiting superpositions to be the best superposition between four atoms. it then exhaustively determines all potential four-atom-matched superpositions and evaluates the alignment. given this superposition, the number of aligned residues is determined, as any two residues with c-alpha carbons (the central atom in all amino acids) less than a certain threshold apart. the minimum average rmsd for all matched atoms is the overall score for the alignment. in figure 22.3, an example of such a comparison is shown. a related problem is that of using the structure of a large biomolecule and the structure of a small organic molecule (such as a drug or cofactor) to try to predict the ways in which the molecules will interact. an understanding of the structural interaction between a drug and its target molecule often provides critical insight into the drug's mechanism of action. the most reliable way to assess this interaction is to use experimental methods to solve the structure of a drug-target complex. once again, these experimental approaches are expensive, so computational methods play an important role. typically, we can assess the physical and chemical features of the drug molecule and can use them to find complementary regions of the target. 
for example, a highly electronegative drug molecule will be most likely to bind in a pocket of the target that has electropositive features. prediction of function often relies on use of sequential or structural similarity metrics and subsequent assignment of function based on similarities to molecules of known function. these methods can guess at general function for roughly 60 to 80 percent of all genes, but leave considerable uncertainty about the precise functional details even for those genes for which there are predictions, and have little to say about the remaining genes. analysis of gene expression data often begins by clustering the expression data. a typical experiment is represented as a large table, where the rows are the genes on each chip and the columns represent the different experiments, whether they be time points or different experimental conditions. within each cell is the red to green ratio of that gene's experimental results. each row is then a vector of values that represent the results of the experiment with respect to a specific gene. clustering can then be performed to determine which genes are being expressed similarly. genes that are associated with similar expression profiles are often functionally associated. for example, when a cell is subjected to starvation (fasting), ribosomal genes are often downregulated in anticipation of lower protein production by the cell. it has similarly been shown that genes associated with neoplastic progression could be identified relatively easily with this method, making gene expression experiments a powerful assay in cancer research (see guo, 2003, for review). in order to cluster expression data, a distance metric must be determined to compare a gene's profile with another gene's profile. if the vector data are a list of values, euclidean distance or correlation distances can be used. 
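the two distance metrics named above can be sketched in a few lines. this is a minimal illustration, assuming each gene's profile is a plain list of numbers (for example, log ratios across conditions); the function names are chosen here for illustration, and the correlation distance is taken to be one minus the pearson correlation, which is one common convention rather than the only one.

```python
import math


def euclidean_distance(x, y):
    """Straight-line distance between two expression profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def correlation_distance(x, y):
    """1 - Pearson correlation: 0 for identical shape, 2 for opposite shape."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return 1.0 - cov / (norm_x * norm_y)
```

the choice matters in practice: two genes whose profiles rise and fall together but differ in magnitude, such as `[1, 2, 3]` and `[2, 4, 6]`, are separated under euclidean distance yet have a correlation distance of essentially zero.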
if the data are more complicated, more sophisticated distance metrics may be employed. clustering methods fall into two categories: supervised and unsupervised. supervised learning methods require some preconceived knowledge of the data at hand. usually, the method begins by selecting profiles that represent the different groups of data, and then the clustering method associates each of the genes with the representative profile to which they are most similar. unsupervised methods are more commonly applied because these methods require no knowledge of the data, and can be performed automatically. two such unsupervised learning methods are the hierarchical and k-means clustering methods. hierarchical methods build a dendrogram, or a tree, of the genes based on their expression profiles. these methods are agglomerative and work by iteratively joining close neighbors into a cluster. the first step often involves connecting the closest profiles, building an average profile of the joined profiles, and repeating until the entire tree is built. k-means clustering builds k clusters or groups automatically. the algorithm begins by picking k representative profiles randomly. then each gene is associated with the representative to which it is closest, as defined by the distance metric being employed. then the center of mass of each cluster is determined using all of the member genes' profiles. depending on the implementation, either the center of mass or the nearest member to it becomes the new representative for that cluster. the algorithm then iterates until the new center of mass and the previous center of mass are within some threshold. the result is k groups of genes that are regulated similarly. one drawback of k-means is that one must choose the value for k. if k is too large, logical "true" clusters may be split into pieces and if k is too small, there will be clusters that are merged. 
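the k-means procedure described above can be sketched as follows. this is a minimal illustration under stated assumptions, not a production implementation: it uses squared euclidean distance, takes the center of mass itself as the new representative (one of the two options mentioned above), and the function names, the `seed` parameter, and the default threshold are choices made here for illustration.

```python
import random


def squared_distance(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))


def mean_profile(profiles):
    """Center of mass of a non-empty list of equal-length profiles."""
    n = len(profiles)
    return [sum(column) / n for column in zip(*profiles)]


def k_means(profiles, k, max_iter=100, tol=1e-9, seed=0):
    """Return (labels, centers) for a list of numeric profiles."""
    rng = random.Random(seed)
    # Start from k representative profiles picked at random.
    centers = [list(p) for p in rng.sample(profiles, k)]
    labels = [0] * len(profiles)
    for _ in range(max_iter):
        # Associate each gene with its closest representative.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in profiles
        ]
        # Recompute the center of mass of every cluster.
        new_centers = []
        for c in range(k):
            members = [p for p, lab in zip(profiles, labels) if lab == c]
            # An empty cluster keeps its old representative.
            new_centers.append(mean_profile(members) if members else centers[c])
        shift = max(squared_distance(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift <= tol:  # centers stopped moving: converged
            break
    return labels, centers
```

changing `seed` changes the random starting representatives, so calling `k_means` with several different seeds is a simple way to see how sensitive a clustering is to its initial conditions.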
one way to determine whether the chosen k is correct is to estimate the average distance from any member profile to the center of mass. by varying k, it is best to choose the lowest k where this average is minimized for each cluster. another drawback of k-means is that different initial conditions can give different results; therefore, it is often prudent to test the robustness of the results by running multiple runs with different starting configurations (figure 22.4). the future clinical usefulness of these algorithms cannot be overstated. van't veer et al. (2002) found that a gene expression profile could predict the clinical outcome of breast cancer. the global analysis of gene expression showed that some cancers were associated with different prognoses, not detectable using traditional means. another exciting advancement in this field is the potential use of microarray expression data to profile the molecular effects of known and potential therapeutic agents. this molecular understanding of a disease and its treatment will soon help clinicians make more informed and accurate treatment choices. biologists have embraced the web in a remarkable way and have made internet access to data a normal and expected mode for doing business. hundreds of databases curated by individual biologists create a valuable resource for the developers of computational methods who can use these data to test and refine their analysis algorithms. with standard internet search engines, most biological databases can be found and accessed within moments. the large number of databases has led to the development of meta-databases that combine information from individual databases to shield the user from the complex array that exists. there are various approaches to this task. 
the entrez system from the national center for biotechnology information (ncbi) gives integrated access to the biomedical literature, protein and nucleic acid sequences, macromolecular and small molecular structures, and genome project links (including both the human genome project and sequencing projects that are attempting to determine the genome sequences for organisms that are either human pathogens or important experimental model organisms) in a manner that takes advantage of either explicit or computed links between these data resources. 21 the sequence retrieval system (srs) from the european molecular biology laboratory allows queries from one database to another to be linked and sequenced, thus allowing relatively complicated queries to be evaluated. 22 newer technologies are being developed that will allow multiple heterogeneous databases to be accessed by search engines that can combine information automatically, thereby processing even more intricate queries requiring knowledge from numerous data sources. the main types of sequence information that must be stored are dna and protein. one of the largest dna sequence databases is genbank, which is managed by ncbi. 23 genbank is growing rapidly as genome-sequencing projects feed their data (often in an automated procedure) directly into the database. figure 22.5 shows the exponential growth of data in genbank since 1982. entrez gene curates some of the many genes within genbank and presents the data in a way that is easy for the researcher to use (figure 22.6). figure 22.5. the exponential growth of genbank (x-axis: year, 1982 to 2002; y-axis: total number of bases). this plot shows that since 1982 the number of bases in genbank has grown by five full orders of magnitude and continues to grow by a factor of 10 every 4 years. 
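the growth rate quoted in the caption can be restated as a simple formula. the following is a back-of-the-envelope sketch that assumes idealized exponential growth at exactly the stated rate; the symbols N(t), N_0 (the number of bases in 1982), and T_2 (the doubling time) are introduced here for illustration and do not appear in the original figure.

```latex
N(t) = N_0 \cdot 10^{(t - 1982)/4}, \qquad
\frac{N(2002)}{N(1982)} = 10^{20/4} = 10^{5}, \qquad
T_2 = 4 \log_{10} 2 \approx 1.2\ \text{years}
```

under this assumption, the factor of 10^5 over the 20 years from 1982 to 2002 agrees with the five orders of magnitude stated in the caption, and the database doubles in size a little more often than once every 15 months.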
in addition to genbank, there are numerous special-purpose dna databases for which the curators have taken special care to clean, validate, and annotate the data. the work required of such curators indicates the degree to which raw sequence data must be interpreted cautiously. genbank can be searched efficiently with a number of algorithms and is usually the first stop for a scientist with a new sequence who wonders "has a sequence like this ever been observed before? if one has, what is known about it?" there are increasing numbers of stories about scientists using genbank to discover unanticipated relationships between dna sequences, allowing their research programs to leap ahead while taking advantage of information collected on similar sequences. a database that has become very useful recently is the university of california santa cruz genome assembly browser 24 (figure 22.7) . this data set allows users to search for specific sequences in the ucsc version of the human genome. powered by the similarity search tool blat, users can quickly find annotations on the human genome that contain their sequence of interest. these annotations include known variations (mutations and snps), genes, comparative maps with other organisms, and many other important data. although sequence information is obtained relatively easily, structural information remains expensive on a per-entry basis. the experimental protocols used to determine precise molecular structural coordinates are expensive in time, materials, and human power. therefore, we have only a small number of structures for all the molecules characterized in the sequence databases. the two main sources of structural information are the cambridge structural database 25 for small molecules (usually less than 100 atoms) and the pdb 26 for macromolecules (see section 22.3.2), including proteins and nucleic acids, and combinations of these macromolecules with small molecules (such as drugs, cofactors, and vitamins). 
the pdb has approximately 20,000 high-resolution structures, but this number is misleading because many of them are small variants on the same structural architecture (figure 22.8). if an algorithm is applied to the database to filter out redundant structures, less than 2,000 structures remain. there are approximately 100,000 proteins in humans; therefore many structures remain unsolved (e.g., burley and bonanno, 2002; gerstein et al., 2003). figure 22.8. a stylized diagram of the structure of chymotrypsin, here shown with two identical subunits interacting. the red portion of the protein backbone shows α-helical regions, while the blue portion shows β-strands, and the white denotes connecting coils, while the molecular surface is overlaid in gray. the detailed rendering of all the atoms in chymotrypsin would make this view difficult to visualize because of the complexity of the spatial relationships between thousands of atoms. in the pdb, each structure is reported with its biological source, reference information, manual annotations of interesting features, and the cartesian coordinates of each atom within the molecule. given knowledge of the three-dimensional structure of molecules, the function sometimes becomes clear. for example, the ways in which the medication methotrexate interacts with its biological target have been studied in detail for two decades. methotrexate is used to treat cancer and rheumatologic diseases, and it is an inhibitor of the protein dihydrofolate reductase, an important molecule for cellular reproduction. the three-dimensional structure of dihydrofolate reductase has been known for many years and has thus allowed detailed studies of the ways in which small molecules, such as methotrexate, interact at an atomic level. as the pdb increases in size, it becomes important to have organizing principles for thinking about biological structure. scop 27 provides a classification based on the overall structural features of proteins. 
it is a useful method for accessing the entries of the pdb. the ecocyc project is an example of a computational resource that has comprehensive information about biochemical pathways. 28 ecocyc is a knowledge base of the metabolic capabilities of e. coli; it has a representation of all the enzymes in the e. coli genome and of the compounds on which they work. it also links these enzymes to their position on the genome to provide a useful interface into this information. the network of pathways within ecocyc provides an excellent substrate on which useful applications can be built. for example, they could provide: (1) the ability to guess the function of a new protein by assessing its similarity to e. coli genes with a similar sequence, (2) the ability to ask what the effect on an organism would be if a critical component of a pathway were removed (would other pathways be used to create the desired function, or would the organism lose a vital function and die?), and (3) the ability to provide a rich user interface to the literature on e. coli metabolism. similarly, the kyoto encyclopedia of genes and genomes (kegg) provides pathway datasets for organism genomes. 29 a postgenomic database bridges the gap between molecular biological databases and those of clinical importance. one excellent example of a postgenomic database is the online mendelian inheritance in man (omim) database, 30 which is a compilation of known human genes and genetic diseases, along with manual annotations describing the state of our understanding of individual genetic disorders. each entry contains links to special-purpose databases and thus provides links between clinical syndromes and basic molecular mechanisms (figure 22.9). the smd is another example of a postgenomic database that has proven extremely useful, but has also addressed some formidable challenges. as discussed previously in several sections, expression data are often represented as vectors of data values. 
in addition to the ratio values, the smd stores images of individual chips, complete with annotated gene spots (see figure 22 .1). further, the smd must store experimental conditions, the type and protocol of the experiment, and other data associated with the experiment. arbitrary analysis can be performed on different experiments stored in this unique resource. a critical technical challenge within bioinformatics is the interconnection of databases. as biological databases have proliferated, researchers have been increasingly interested in linking them to support more complicated requests for information. some of these links are natural because of the close connection of dna sequence to protein structure (a straightforward translation). other links are much more difficult because the semantics of the data items within the databases are fuzzy or because good methods for linking certain types of data simply do not exist. for example, in an ideal world, a protein sequence would be linked to a database containing information about that sequence's function. unfortunately, although there are databases about protein function, it is not always easy to assign a function to a protein based on sequence information alone, and so the databases are limited by gaps in our understanding of biology. some excellent recent work in the integration of diverse biological databases has been done in connection with the ncbi entrez/pubmed systems, 31 the srs resource, 32 discoverylink, 33 and the biokleisli project. 34 the human genome sequencing projects will be complete within a decade, and if the only raison d'etre for bioinformatics is to support these projects, then the discipline is not well founded. if, on the other hand, we can identify a set of challenges for the next generations of investigators, then we can more comfortably claim disciplinary status for the field. 
fortunately, there is a series of challenges for which the completion of the first human genome sequence is only the beginning. with the first human genome in hand, the possibilities for studying the role of genetics in human disease multiply. a new challenge immediately emerges, however: collecting individual sequence data from patients who have disease. researchers estimate that more than 99 percent of the dna sequences within humans are identical, but the remaining sequences are different and account for our variability in susceptibility to and development of disease states. it is not unreasonable to expect that for particular disease syndromes, the detailed genetic information for individual patients will provide valuable information that will allow us to tailor treatment protocols and perhaps let us make more accurate prognoses. there are significant problems associated with obtaining, organizing, analyzing, and using this information. there is currently a gap in our understanding of disease processes. although we have a good understanding of the principles by which small groups of molecules interact, we are not able to fully explain how thousands of molecules interact within a cell to create both normal and abnormal physiological states. as the databases continue to accumulate information ranging from patient-specific data to fundamental genetic information, a major challenge is creating the conceptual links between these databases to create an audit trail from molecular-level information to macroscopic phenomena, as manifested in disease. the availability of these links will facilitate the identification of important targets for future research and will provide a scaffold for biomedical knowledge, ensuring that important literature is not lost within the increasing volume of published data. an important opportunity within bioinformatics is the linkage of biological experimental data with the published papers that report them. 
electronic publication of the biological literature provides exciting opportunities for making data easily available to scientists. already, certain types of simple data that are produced in large volumes are expected to be included in manuscripts submitted for publication, including new sequences that are required to be deposited in genbank and new structure coordinates that are deposited in the pdb. however, there are many other experimental data sources that are currently difficult to provide in a standardized way, because the data either are more intricate than those stored in genbank or pdb or are not produced in a volume sufficient to fill a database devoted entirely to the relevant area. knowledge base technology can be used, however, to represent multiple types of highly interrelated data. knowledge bases can be defined in many ways (see chapter 20); for our purposes, we can think of them as databases in which (1) the ratio of the number of tables to the number of entries per table is high compared with usual databases, (2) the individual entries (or records) have unique names, and (3) the values of many fields for one record in the database are the names of other records, thus creating a highly interlinked network of concepts. the structure of knowledge bases often leads to unique strategies for storage and retrieval of their content. to build a knowledge base for storing information from biological experiments, there are some requirements. first, the set of experiments to be modeled must be defined. second, the key attributes of each experiment that should be recorded in the knowledge base must be specified. third, the set of legal values for each attribute must be specified, usually by creating a controlled terminology for basic data or by specifying the types of knowledge-based entries that can serve as values within the knowledge base. the development of such schemes necessitates the creation of terminology standards, just as in clinical informatics. 
the riboweb project is undertaking this task in the domain of rna biology (chen et al., 1997) . riboweb is a collaborative tool for ribosomal modeling that has at its center a knowledge base of the ribosomal structural literature. riboweb links standard bibliographic references to knowledge-based entries that summarize the key experimental findings reported in each paper. for each type of experiment that can be performed, the key attributes must be specified. thus, for example, a cross-linking experiment is one in which a small molecule with two highly reactive chemical groups is added to an ensemble of other molecules. the reactive groups attach themselves to two vulnerable parts of the ensemble. because the molecule is small, the two vulnerable areas cannot be any further from each other than the maximum stretched-out length of the small molecule. thus, an analysis of the resulting reaction gives information that one part of the ensemble is "close" to another part. this experiment can be summarized formally with a few features-for example, target of experiment, cross-linked parts, and cross-linking agent. the task of creating connections between published literature and basic data is a difficult one because of the need to create formal structures and then to create the necessary content for each published article. the most likely scenario is that biologists will write and submit their papers along with the entries that they propose to add to the knowledge base. thus, the knowledge base will become an ever-growing communal store of scientific knowledge. reviewers of the work will examine the knowledge-based elements, perhaps will run a set of automated consistency checks, and will allow the knowledge base to be modified if they deem the paper to be of sufficient scientific merit. riboweb in prototype form can be accessed on the web. 
one of the most exciting goals for computational biology and bioinformatics is the creation of a unified computational model of physiology. imagine a computer program that provides a comprehensive simulation of a human body. the simulation would be a complex mathematical model in which all the molecular details of each organ system would be represented in sufficient detail to allow complex "what if?" questions to be asked. for example, a new therapeutic agent could be introduced into the system, and its effects on each of the organ subsystems and on their cellular apparatus could be assessed. the side-effect profile, possible toxicities, and perhaps even the efficacy of the agent could be assessed computationally before trials are begun on laboratory animals or human subjects. the model could be linked to visualizations to allow the teaching of medicine at all grade levels to benefit from our detailed understanding of physiological processes; visualizations would be both anatomic (where things are) and functional (what things do). finally, the model would provide an interface to human genetic and biological knowledge. what more natural user interface could there be for exploring physiology, anatomy, genetics, and biochemistry than the universally recognizable structure of a human that could be browsed at both macroscopic and microscopic levels of detail? as components of interest were found, they could be selected, and the available literature could be made available to the user. the complete computational model of a human is not close to completion. first, all the participants in the system (the molecules and the ways in which they associate to form higher-level aggregates) must be identified. second, the quantitative equations and symbolic relationships that summarize how the systems interact have not been elucidated fully. third, the computational representations and computer power to run such a simulation are not in place.
researchers are, however, working in each of these areas. the genome projects will soon define all the molecules that constitute each organism. research in simulation and the new experimental technologies being developed will give us an understanding of how these molecules associate and perform their functions. finally, research in both clinical informatics and bioinformatics will provide the computational infrastructure required to deliver such technologies. bioinformatics is closely allied to clinical informatics. it differs in its emphasis on a reductionist view of biological systems, starting with sequence information and moving to structural and functional information. the emergence of the genome sequencing projects and the new technologies for measuring metabolic processes within cells is beginning to allow bioinformaticians to construct a more synthetic view of biological processes, which will complement the whole-organism, top-down approach of clinical informatics. more importantly, there are technologies that can be shared between bioinformatics and clinical informatics because they both focus on representing, storing, and analyzing biological data. these technologies include the creation and management of standard terminologies and data representations, the integration of heterogeneous databases, the organization and searching of the biomedical literature, the use of machine learning techniques to extract new knowledge, the simulation of biological processes, and the creation of knowledge-based systems to support advanced practitioners in the two fields. the proceedings of one of the principal meetings in bioinformatics, this is an excellent source for up-to-date research reports. 
other important meetings include those sponsored by the this introduction to the field of bioinformatics focuses on the use of statistical and artificial intelligence techniques in machine learning introduces the different microarray technologies and how they are analyzed dna and protein sequence analysis-a practical approach this book provides an introduction to sequence analysis for the interested biologist with limited computing experience this edited volume provides an excellent introduction to the use of probabilistic representations of sequences for the purposes of alignment, multiple alignment this primer provides a good introduction to the basic algorithms used in sequence analysis, including dynamic programming for sequence alignment algorithms on strings, trees and sequences: computer science and computational biology gusfield's text provides an excellent introduction to the algorithmics of sequence and string analysis, with special attention paid to biological sequence analysis problems artificial intelligence and molecular biology this volume shows a variety of ways in which artificial intelligence techniques have been used to solve problems in biology genotype to phenotype this volume offers a useful collection of recent work in bioinformatics another introduction to bioinformatics, this text was written for computer scientists the textbook by stryer is well written, and is illustrated and updated on a regular basis. it provides an excellent introduction to basic molecular biology and biochemistry what ways will bioinformatics and medical informatics interact in the future? will the research agendas of the two fields merge will the introduction of dna and protein sequence information change the way that medical records are managed in the future? 
which types of systems will be most affected (laboratory, radiology, admission and discharge, financial it has been postulated that clinical informatics and bioinformatics are working on the same problems, but in some areas one field has made more progress than the other why should an awareness of bioinformatics be expected of clinical informatics professionals? should a chapter on bioinformatics appear in a clinical informatics textbook? explain your answers one major problem with introducing computers into clinical medicine is the extreme time and resource pressure placed on physicians and other health care workers. will the same problems arise in basic biomedical research? why have biologists and bioinformaticians embraced the web as a vehicle for disseminating data so quickly, whereas clinicians and clinical informaticians have been more hesitant to put their primary data online? key: cord-275258-azpg5yrh authors: mead, dylan j.t.; lunagomez, simón; gatherer, derek title: visualization of protein sequence space with force-directed graphs, and their application to the choice of target-template pairs for homology modelling date: 2019-07-26 journal: j mol graph model doi: 10.1016/j.jmgm.2019.07.014 sha: doc_id: 275258 cord_uid: azpg5yrh the protein sequence-structure gap results from the contrast between rapid, low-cost deep sequencing, and slow, expensive experimental structure determination techniques. comparative homology modelling may have the potential to close this gap by predicting protein structure in target sequences using existing experimentally solved structures as templates. this paper presents the first use of force-directed graphs for the visualization of sequence space in two dimensions, and applies them to the choice of suitable rna-dependent rna polymerase (rdrp) target-template pairs within human-infective rna virus genera. 
measures of centrality in protein sequence space for each genus were also derived and used to identify centroid nearest-neighbour sequences (cnns) potentially useful for production of homology models most representative of their genera. homology modelling was then carried out for target-template pairs in different species, different genera and different families, and model quality assessed using several metrics. reconstructed ancestral rdrp sequences for individual genera were also used as templates for the production of ancestral rdrp homology models. high quality ancestral rdrp models were consistently produced, as were good quality models for target-template pairs in the same genus. homology modelling between genera in the same family produced mixed results and inter-family modelling was unreliable. we present a protocol for the production of optimal rdrp homology models for use in further experiments, e.g. docking to discover novel anti-viral compounds.
since high-throughput sequencing technologies entered mainstream use towards the end of the first decade of the 21st century, there has been an explosion in available protein sequences. by contrast, there has been no corresponding high-throughput revolution in structural biology. obtaining solved structures of proteins at adequate resolution remains a painstaking task. x-ray crystallography is still the gold standard for structure determination more than 60 years after its first use in determining myoglobin structure [1]. the result of this discrepancy between the rate of protein sequence determination and the rate of protein structure determination is the protein sequence-structure gap [2]. homology modelling is a rapid computational technique for prediction of a protein's structure from (a) the protein's sequence, and (b) a solved structure of a related protein, referred to as the target and the template, respectively.
since structural similarity often exists even where sequence similarity is low [2, 3], homology modelling has the potential to reduce massively the size of the protein sequence-structure gap, provided the models produced can be considered reliable enough for use in further research. the rna-dependent rna polymerase (rdrp) of rna viruses presents an opportunity to test and expand this approach. rdrps are the best conserved proteins throughout the rna viruses, being essential for their replication [4]. conservation is particularly high in structural regions that are involved in the replication process, for instance the indispensable rna-binding pocket [5]. rdrps are also of immense medical importance as the principal targets for antiviral drugs. evolution of resistance against anti-viral drugs is a major concern for the future, and the design of novel anti-viral compounds is a highly active research area. solved structures of rdrps are of great assistance to these efforts, as they enable the use of docking protocols against large libraries of pharmaceutical candidate compounds [e.g. refs. 6, 7]. although some human-infective rna viruses have solved rdrp structures, there are still large areas within the virus taxonomy that lack any. this paper will first identify where the protein sequence-structure gap is at its widest in rdrps. because of this gap, it is impossible in many genera to perform docking protocols against solved structures of rdrp for discovery of novel anti-viral compounds. under these circumstances, replacement of real solved structures with homology models for docking experiments requires that the homology models used should be both high quality and optimally representative of their respective genera. our second task is to present several similarity metrics in sequence space that assist in the identification of the virus species having the rdrp sequence that is most representative of its genus as a whole.
we then present the first use of force-directed graphs to produce an intuitive visualization of sequence space, and select target rdrps without solved structures for homology modelling. these are then used to perform homology modelling using template-target pairs within the same genus, between sister genera and between sister families, monitoring the quality of the models produced as the template becomes progressively more genetically distant from the target sequence being modelled. finally, we produce homology models for reconstructed common ancestral rdrp sequences. in the light of our results, we comment on the strengths and weaknesses of homology modelling as a means of reducing the size of the protein sequence-structure gap for rdrps, and produce a flowchart of recommendations for docking experiments on rdrp proteins lacking a solved structure. we chose rdrps from human-infective viruses based on the list provided by woolhouse & brierley [8]. given the global medical importance of aids, we also included lentivirus reverse transcriptases (rts) for analysis. solved structures for these proteins, where available, were downloaded from the rcsb protein data bank (pdb) [9]. table 1 presents our criteria for selecting suitable homology modelling candidates. rdrp and rt amino acid sequences for all virus species satisfying the criteria of table 1 were downloaded from genbank [10]. alignment of sequence sets for each genus was performed using mafft [11]. alignments were refined in mega [12] using muscle [13] where necessary, and the best substitution model determined. alignment of target sequences onto their solved structure templates for homology modelling was carried out using the molecular operating environment (moe v.2016.08, chemical computing group, montreal h3a 2r7, canada). we define sequence space as a theoretical multi-dimensional space within which protein sequences may be represented by points.
for an alignment of n related proteins, the necessary dimensionality of this sequence space is n-1, with the hyperspatial co-ordinates in each dimension for any protein determined by its genetic distance to the n-1 other proteins. for n = 5, direct visualization of all dimensions of sequence space is impractical at best, since a 4-dimensional space must be simulated in three dimensions, and is effectively impossible for n ≥ 6. the following methods were used to reduce sequence space to two and three dimensions for ease of visualization. to simplify calculations, we allow an extra dimension defined by the distance from each sequence to itself. the value of the co-ordinate in that dimension is always zero and our sequence space has n dimensions rather than n-1. the pairwise distance matrix ($m_d$) for each genus, calculated from the sequence alignment in mega, consists of entries $m_d(i,j)$ giving the genetic distance between each pair of sequences i and j, where $i, j \in \{1, 2, \ldots, n\}$ and $i \neq j$, for a set of n sequences. in our data set, n varies between genera (see supplementary table). the distance matrix was converted into a similarity matrix, which was then used as input for the r package qgraph [14]. the "spring" layout option was chosen, which uses the fruchterman-reingold algorithm to produce a two-dimensional undirected graph in which edge thickness is proportional to absolute similarity in n dimensions and node proximity in two dimensions is optimized for ease of viewing while attempting to ensure that those nodes closely related in the n-dimensional input are also close in the two-dimensional output [15]. up to 500 iterations were performed, stopping earlier if convergence was achieved. for each alignment, the pairwise distance matrix ($m_d$) was also used as input for the r function cmdscale, which uses multi-dimensional scaling to produce a three-dimensional graph from the n-dimensional input, with node proximity again reflecting relative similarity [16].
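the force-directed step described above can be illustrated with a minimal sketch. this is a simplified re-implementation of the fruchterman-reingold idea in plain python, not the qgraph code used in the paper. the function names, the cooling schedule and the distance-to-similarity conversion (`1 - d / max(d)`) are assumptions made for illustration only; the conversion actually used in the paper is not recoverable from the text.

```python
import math
import random


def distance_to_similarity(dist):
    """Assumed conversion (not the paper's equation): s = 1 - d / max(d)."""
    d_max = max(max(row) for row in dist) or 1.0
    return [[1.0 - d / d_max for d in row] for row in dist]


def force_directed_layout(similarity, iterations=500, seed=0):
    """Simplified Fruchterman-Reingold layout in two dimensions.

    All node pairs repel each other; pairs also attract in proportion to
    their similarity, so similar sequences end up close together.
    """
    n = len(similarity)
    rng = random.Random(seed)
    pos = [[rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)] for _ in range(n)]
    k = math.sqrt(4.0 / n)   # ideal node spacing for a 2 x 2 drawing area
    temperature = 0.2        # maximum displacement allowed per iteration
    for _ in range(iterations):
        disp = [[0.0, 0.0] for _ in range(n)]
        for i in range(n):
            for j in range(i + 1, n):
                dx = pos[i][0] - pos[j][0]
                dy = pos[i][1] - pos[j][1]
                dist = max(math.hypot(dx, dy), 1e-9)
                repulsion = k * k / dist
                attraction = similarity[i][j] * dist * dist / k
                scale = (repulsion - attraction) / dist
                disp[i][0] += dx * scale
                disp[i][1] += dy * scale
                disp[j][0] -= dx * scale
                disp[j][1] -= dy * scale
        for i in range(n):
            length = max(math.hypot(disp[i][0], disp[i][1]), 1e-9)
            step = min(length, temperature)
            pos[i][0] += disp[i][0] / length * step
            pos[i][1] += disp[i][1] / length * step
        temperature *= 0.99  # cool down so the layout settles
    return pos
```

in this sketch, calling `force_directed_layout(distance_to_similarity(m_d))` on a distance matrix `m_d` returns one (x, y) pair per sequence; plotting and edge thickness are left out.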
spotfire analyst (tibco spotfire analyst, v.7.12.0, 2018) was used to visualize the output of cmdscale. we define the centroid as a hypothetical protein sequence located at the centre point of the sequence space of an alignment. the real sequence closest to the hypothetical centroid is termed the centroid nearest neighbour (cnn). we calculate the position of the cnn in three ways.

table 1. list of criteria used to select rna-dependent rna polymerases (rdrps) for homology modelling. each row gives a criterion, followed after the colon by its justification.
human-infective virus: importance to human health
ncbi refseq annotated genome: easy retrieval of high quality rdrp sequence
rdrp located at the 3′ end of polyprotein or on its own segment: eliminates unconventional rdrps
at least one solved rdrp at a range of different taxonomic levels, e.g. in same species, same genus, same family, same order: to be used as the templates in homology modelling at different levels of genetic distance

2.4.1. shortest-path centroid nearest neighbour
for a sequence $i \in \{1, 2, \ldots, n\}$ in an alignment of n sequences, its total path length d(i) to the other n-1 sequences may be calculated from the distance matrix $m_d$ as follows:
$$d(i) = \sum_{j=1}^{n} m_d(i,j) \qquad (2)$$
where the self-distance term $m_d(i,i)$ is zero. this term may be omitted to enforce a strict n-1 dimensions for n input sequences, but we leave it in to simplify subsequent calculations. we define i* as the index that minimizes d(i):
$$i^* = \arg\min_i d(i) \qquad (3)$$
the shortest path cnn is therefore sequence i*. for alignments where clusters of closely related sequences exist, giving many values of $m_d(i,j)$ close to zero, this method will tend to place the cnn within a cluster. to overcome this problem, the arithmetic mean and median, respectively, were used to determine the mean cnn and the median cnn. the values of d (equation (2)) may be averaged to produce the mean total path distance $\bar{d}$:
$$\bar{d} = \frac{1}{n} \sum_{i=1}^{n} d(i) \qquad (4)$$
where again n is the total number of sequences in the alignment. we now re-define i* as the index that minimizes $|d(i) - \bar{d}|$:
$$i^* = \arg\min_i |d(i) - \bar{d}| \qquad (5)$$
in the event of the minimized quantity in equation (5) being zero, the mean cnn and the true centroid are identical.
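the three centroid nearest-neighbour definitions of this section can be sketched in a few lines of python. this is an illustrative sketch rather than the authors' code: the function names are hypothetical, ties are returned as lists of indices, and the median of the total path lengths is taken with `statistics.median` (which averages the two middle values when n is even), an assumption about the median step described in the text.

```python
import statistics


def total_path_lengths(dist):
    """d(i): sum of distances from sequence i to all sequences (self term is 0)."""
    return [sum(row) for row in dist]


def _argmin_all(values, tol=1e-12):
    """Return every index whose value is within tol of the minimum."""
    best = min(values)
    return [i for i, v in enumerate(values) if v - best <= tol]


def centroid_nearest_neighbours(dist):
    """Return indices of the shortest-path, mean and median CNNs."""
    d = total_path_lengths(dist)
    mean_d = sum(d) / len(d)
    median_d = statistics.median(d)  # assumption: standard median of d
    return {
        "shortest_path": _argmin_all(d),
        "mean": _argmin_all([abs(x - mean_d) for x in d]),
        "median": _argmin_all([abs(x - median_d) for x in d]),
    }


# Toy example: five "sequences" placed on a line, the last one an outlier.
points = [0, 1, 2, 4, 10]
toy_dist = [[abs(a - b) for b in points] for a in points]
result = centroid_nearest_neighbours(toy_dist)
```

in the toy example the outlier at position 10 pulls the mean total path length upwards, so the three definitions select three different sequences, which mirrors the poor agreement between methods reported for real genera.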
as with all variables using means, the mean cnn is liable to skewing by outliers. we generate a vector d over $i \in \{1, 2, \ldots, n\}$, in which each entry d(i) represents the total path length for sequence i (equation (2)). the values of vector d are then ranked in ascending order $x_{s(1)}$ to $x_{s(n)}$ to produce vector $d_s$. the median cnn is the sequence with value d(i) situated in the middle of the array $d_s$, at d(m), where d(m) is either $d(m_{odd})$ or $d(m_{even})$ for alignments with odd or even numbers of sequences respectively. we now re-define i* as the index that minimizes $|d(i) - d(m)|$ (equation (9)). again, in the event of the minimized quantity in equation (9) being zero, the median cnn and the true centroid are identical. as with all variables using medians, the median cnn is liable to skewing by the presence in the alignment of multiple sequences with the same value of d(i). the choice of solved structures as templates for homology modelling, and the choice of targets to be modelled, within each genus was governed by the following rules:
(1) for each genus the solved structure that covered the highest proportion of the rdrp or rt sequence was chosen as the template for that genus.
(2) if more than one candidate template structure was found at this sequence length, the structure with the lowest resolution value in angstroms (i.e. the finest resolution) was selected. see table 2 for the templates satisfying these two criteria.
(3) within each genus, the sequence with the greatest genetic distance from the template was chosen as the target for homology modelling. see table 3 for the template-target pairs satisfying this criterion.
(4) criterion 3 was applied to find template-target pairs in different genera (see table 4) and different families (see table 5), thus testing the limits of homology modelling at high genetic distances.
homology modelling was carried out using the molecular operating environment (moe v.2016.08, chemical computing group, montreal h3a 2r7, canada).
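rules (1) to (3) for choosing templates and targets can be expressed as a short sketch. the record fields (`id`, `coverage`, `resolution`), the example values and the function names below are hypothetical and serve only to illustrate the selection logic; they are not taken from the paper's data.

```python
def choose_template(structures):
    """Rules (1) and (2): highest sequence coverage, then finest resolution."""
    best_coverage = max(s["coverage"] for s in structures)
    candidates = [s for s in structures if s["coverage"] == best_coverage]
    # A lower value in angstroms means a finer resolution.
    return min(candidates, key=lambda s: s["resolution"])


def choose_target(dist, template_index):
    """Rule (3): the sequence at the greatest genetic distance from the template."""
    row = dist[template_index]
    others = [i for i in range(len(row)) if i != template_index]
    return max(others, key=lambda i: row[i])


# Hypothetical solved structures for one genus (illustrative values only).
example_structures = [
    {"id": "s1", "coverage": 0.95, "resolution": 2.8},
    {"id": "s2", "coverage": 0.95, "resolution": 2.1},
    {"id": "s3", "coverage": 0.60, "resolution": 1.5},
]
# Hypothetical genetic distances; row 0 is the template's sequence.
example_dist = [
    [0.0, 0.2, 0.9],
    [0.2, 0.0, 0.5],
    [0.9, 0.5, 0.0],
]
template = choose_template(example_structures)
target_index = choose_target(example_dist, 0)
```

here `s3` has the finest resolution but is passed over because coverage is ranked first, which is the ordering the rules prescribe.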
ten intermediate models were produced using the amber10:eht forcefield under medium refinement. the model that scored best under the generalised born/volume integral (gb/vi) was selected to undergo further energy minimisation using protonate3d, which predicts the location of hydrogen atoms using the model's 3d coordinates [17, 18]. to assess the stereochemical quality of the homology models produced, ramachandran plots were derived in moe, and used to calculate the proportion of bad outlier φ-ψ angles in the model, after subtraction of the number of outlier φ-ψ angles in the template. generally, outlier angle percentage below 0.05% indicates a very high quality model, and a percentage below 2% indicates a good quality model [19]. models were superposed with their templates in moe and the root-mean-square deviation (rmsd) value derived for the alpha carbons (Cα) in the two structures. generally, an rmsd below 2 Å indicates a good quality model [20]. qualitative model energy analysis (qmean) was used to analyse models using both statistical and predictive methods [21]. the qmean z-score is an overall measure of the quality of the model when compared to similar models from a pdb reference set of x-ray crystallography-solved structures. a z-score of 0 would indicate a model of the same quality as a similar high quality x-ray crystallographic structure, while a z-score below −4.00 indicates a low quality model [22]. maximum likelihood (ml) trees [23] were produced for each genus in mega. the ml tree and the corresponding multiple sequence alignment were input into the ancestral reconstruction server, fastml [24]. the reconstructed sequence for the root of the tree, i.e. the putative common ancestor rdrp or rt sequence for the genus, was used as the target for homology modelling in moe, using the template chosen according to the rules in section 2.5. the reconstructed ancestral sequence was added to the alignment and the force-directed graph re-drawn. fig.
1b, showing the target-template pairs for homology modelling, may be compared with fig. 1c, showing the ancestor-template pairs. our first observation is that there are still large areas of the viral taxonomy where no solved rdrp structures exist. no suitable templates for homology modelling were found within the entire nidovirales order of rna viruses. this order contains several coronaviruses important to human health including severe acute respiratory syndrome-related coronavirus (sars-cov) and middle east respiratory syndrome-related coronavirus (mers-cov) [25]. in the order mononegavirales, vesiculovirus was the only genus with a solved rdrp structure suitable for homology modelling. however, this order contains many medically important viruses such as zaire ebolavirus, hendra henipavirus, measles morbillivirus, and mumps rubulavirus [26]. in the order bunyavirales, phenuiviridae stands out as an important family lacking a solved rdrp, despite it containing various human-infective arboviruses such as rift valley fever phlebovirus and sandfly fever naples phlebovirus [27]. (table 1). fig. 1 shows two-dimensional force-directed graphs of similarity for each genus with more than four rdrp reference sequences (or rt sequences in the case of lentivirus). in principle, it would be possible to draw force-directed graphs for entire families and even orders. however, the input to qgraph is the similarity matrix calculated from the distance matrix, and the distance matrix is calculated in mega from an alignment. once taxonomic distances begin to extend beyond genera, alignment becomes progressively less reliable, with all the downstream statistics tending to degrade as a consequence. we therefore confine our construction of force-directed graphs to intra-genus comparisons. it is evident from fig. 1 that sequences are not necessarily evenly distributed in sequence space.
clustering is noticeable in the genus flavivirus, with two sub-groups and an outlier sequence evident. mammarenavirus also shows division into two sub-groups. by contrast, picobirnavirus has only five relatively equidistant reference sequences, thus producing a highly regular pentagram. similarly, rotavirus has eight reference sequences, with four at each end of a fairly regular cuboid. fig. 1a also shows how the various methods (equations (2) to (9)) for determining the cnn of sequence space for each genus are in poor agreement. only in rotavirus and picobirnavirus are mean and median cnns found in the same sequence. fig. 1a also shows that the best solved structure for the purposes of template choice in homology modelling is rarely close to the centre of sequence space. only in lentivirus is the optimal template also the mean cnn, and only in vesiculovirus is the optimal template a shortest-path cnn. fig. 1b shows the relations of the template-target pairs in sequence space, illustrating how intra-genus homology modelling template-target selection attempts to traverse the largest genetic distance available within the genus. figs. 2 and 3 compare, for genera orthohantavirus and mammarenavirus respectively, the force-directed graphs of fig. 1 with the three-dimensional equivalent output of multidimensional scaling. fig. 2 shows a sequence clustering within orthohantavirus that is not readily apparent in the force-directed graph. the cnns are distributed among four clusters, as there is no sequence close to the geometrical centre of the three-dimensional space, where the notional centroid is located. the solved structure has 10 other sequences in its proximity in the three-dimensional space, roughly equivalent to the lower right quadrant of the two-dimensional force-directed graph. similarly, the shortest-path cnn and mean cnn are both located within another three-dimensional cluster also containing 11 sequences, which is roughly equivalent to the upper right quadrant of the two-dimensional force-directed graph. fig. 3 presents a similar picture for mammarenavirus. the force-directed graph for mammarenavirus has more obvious clustering than that for orthohantavirus, showing a lower-left to top-right split. in the three-dimensional representation, these are equivalent, respectively, to the three clusters on the right and two clusters on the left. as with orthohantavirus, there is no cnn near the geometrical centre of the three-dimensional space, but the cnns are distributed around two clusters. three dimensional representations of all the genera in fig. 1 are available from the link in the raw data section.

table 2. solved structures of rdrps and reverse transcriptase (for hiv-1) selected as templates for homology modelling. all are derived by x-ray crystallography except 5a22 which is a cryo-electron microscopy structure. for protein coverage, templates are marked according to whether they cover more than 90% of the sequence or less. for φ-ψ outliers and qmean z-score, entries are marked as good-quality or poor-quality, determined by the following thresholds: φ-ψ = 2%, qmean z-score = −4.00.

table 3. homology modelling at intra-genus, inter-species level. templates are as given in table 2. targets are the rdrp (or reverse transcriptase for lentivirus) sequences from the reference genome accession numbers given. rmsd: root mean square deviation in angstroms between template and model when superposed in moe. entries are marked as good quality or poor quality, determined by the following thresholds: φ-ψ < 2%; qmean z-score > −4.00; rmsd < 2 Å. a further marker indicates good quality, but using a partial template (see table 1). *imjin thottimvirus was reclassified in 2018 by the international committee on taxonomy of viruses (ictv) in a new genus thottimvirus.

table 4. homology modelling at intra-family, inter-genus level. templates are as given in table 2. targets are the rdrp (or reverse transcriptase for spumavirus) sequences from the reference genome accession numbers given. rmsd: root mean square deviation in angstroms between template and model when superposed in moe. entries are marked as good-quality or poor-quality, determined by the following thresholds: φ-ψ < 2%; qmean z-score > −4.00; rmsd < 2 Å.

table 5. homology modelling at intra-order, inter-family level. templates are as given in table 2. targets are the rdrp (or reverse transcriptase for lentivirus) sequences from the reference genome accession numbers given. rmsd: root mean square deviation in angstroms between template and model when superposed in moe. entries are marked as good-quality or poor-quality, determined by the following thresholds: φ-ψ < 2%; qmean z-score > −4.00; rmsd < 2 Å.

fig. 1. force-directed graph visualisations of similarity of rdrps (or reverse transcriptase for lentivirus) within genera. the genetic distance matrix for each alignment was converted into a similarity matrix (equations (1) and (2)). the fruchterman-reingold algorithm (500 minimisation iterations) was implemented in r package qgraph to produce a force-directed graph. relative similarity is represented by node proximity, and absolute similarity is proportional to edge thickness. the solved structure and the three types of centroid nearest neighbour (cnn) sequences are highlighted. the species names corresponding to the numbered nodes are listed in the supplementary table. cardiovirus has fewer than four reference sequences and is omitted. a: location of solved structure and the three cnns in sequence space (equations (3) to (7)). some genera have two median cnns.
homology modelling was carried out as follows:
(1) intra-genus, inter-species (11 models, table 3)
(2) intra-family, inter-genus (5 models, table 4)
(3) intra-order, inter-family (7 models, table 5)
(4) intra-genus, on reconstructed common ancestor (12 models, table 6)
table 3 shows that homology modelling with template and target within the same genus produced good quality models in most cases, as judged by percentage of φ-ψ outliers and rmsd within the high quality range. only the models for american bat vesiculovirus and tamana bat virus have percentages of φ-ψ outliers outside of the high quality range. qmean, however, is rather more critical of the output, with only the model for porcine picobirnavirus falling within the high quality range. the model for imjin thottimvirus scores eighth best on percentage of φ-ψ outliers and second best on rmsd, despite the re-classification (occurring after the completion of our experimental work) by the ictv of this virus, originally in genus orthohantavirus, into a new thottimvirus genus [28]. it should be noted that the models for imjin thottimvirus, burana orthonairovirus and brazilian mammarenavirus were based on very short template structures (see table 2). table 4 shows that homology modelling with template and target within the same family but different genera still produced good quality models in most cases, as judged by percentage of φ-ψ outliers and rmsd within the high quality range. only the models for lleida bat lyssavirus and macaque simian foamy virus have percentages of φ-ψ outliers outside of the high quality range. however, once again, qmean assesses all models as outside the high quality range. table 5 shows that homology modelling with template and target within the same order but in different families is a far more difficult proposition than at the lower taxonomic levels.
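the three quality tests applied in tables 3 to 5 (φ-ψ outliers below 2%, qmean z-score above −4.00, rmsd below 2 Å) can be sketched as simple checks. the sketch below is illustrative only: the function names and the example values are hypothetical, and the rmsd helper assumes that the two coordinate sets have already been superposed and matched atom for atom, a step that was carried out in moe in the paper.

```python
import math


def ca_rmsd(coords_a, coords_b):
    """RMSD between two equal-length lists of superposed (x, y, z) coordinates."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    squared = sum(
        (a - b) ** 2
        for point_a, point_b in zip(coords_a, coords_b)
        for a, b in zip(point_a, point_b)
    )
    return math.sqrt(squared / len(coords_a))


def quality_flags(outlier_percent, qmean_z, rmsd):
    """Apply the thresholds quoted in the text; True means the test is passed."""
    return {
        "phi_psi": outlier_percent < 2.0,  # percentage of outlier angles
        "qmean": qmean_z > -4.0,           # QMEAN z-score
        "rmsd": rmsd < 2.0,                # angstroms, alpha carbons
    }


# Hypothetical model scores, for illustration only.
flags = quality_flags(outlier_percent=1.5, qmean_z=-4.5, rmsd=1.2)
tests_passed = sum(flags.values())
```

a model with these hypothetical scores would pass the φ-ψ and rmsd tests but fail qmean, the pattern reported for most of the intra-genus models.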
the model for mammalian orthobornavirus 1 fails all three quality tests and only the model for rift valley fever phlebovirus manages to pass two out of three. table 6 shows that modelling the structure of the reconstructed sequence of the common ancestor of each genus produces models of the same standard as intra-genus modelling (compare tables 3 and 6). by contrast with almost all the other models, the qmean scores are within the high quality range, with only two exceptions, the common ancestors of genera rotavirus and vesiculovirus. fig. 1c shows the force-directed graphs with the locations of the ancestral sequences added. table 7 summarises the results of tables 3 to 6 inclusive. as the taxonomical distance increases, production of high quality homology models becomes more difficult. however, modelling the reconstructed ancestral sequence of each genus is typically productive of a better scoring model even than the real sequence targets chosen for intra-genus modelling. fig. 4 shows representative examples of homology models of high and low quality superimposed with their template solved structure along with their corresponding ramachandran plots and qmean quality scores. all homology models in tables 3 to 6 are available from the link in the raw data section. the first objective of this study was to identify viral taxa which are comparatively lacking in solved structures for rna-dependent rna polymerase (rdrp). we observed that the entire order nidovirales, the families bornaviridae, filoviridae and paramyxoviridae within the order mononegavirales, and the family phenuiviridae within the order bunyavirales, fall into this category. additionally, within the genera orthohantavirus, orthonairovirus and mammarenavirus, all within the order bunyavirales, the solved structure available for rdrp covers less than 10% of the protein sequence.
given the medical importance of many viruses within these taxa, and the number of anti-viral drugs that target rdrps, we suggest that they are prioritized for x-ray crystallography to close the "sequence-structure gap". our second objective was to assess how well homology modelling could provide models that might serve for computer-assisted drug discovery of novel anti-viral compounds. to assist in the visualization of sequence space, we produced the first application of force-directed graphs to protein sequences (fig. 1). we also applied multidimensional scaling for comparative purposes (figs. 2 and 3). force-directed graphs enable the visualization of complex data in two dimensions. the three-dimensional visualization produced from multidimensional scaling is visually richer, but this benefit can only be appreciated when a viewing application such as spotfire is available so that the three-dimensional image can be rotated. force-directed graphs convey much of the information in a single image which may be printed on a page or viewed on screen. this two-dimensional collapsing of sequence space also allows for easy simultaneous comparison of multiple datasets, in the present case multiple genera, which cannot readily be performed if separate three-dimensional viewers need to be open. the most common method of visualizing sequence space is the phylogenetic tree. for instance, starting from a distance matrix, agglomerative hierarchical clustering, such as the upgma method [29], can be performed to generate a tree. slightly more sophisticated methods, such as neighbour-joining [30], can generate trees where the branch lengths are proportional to genetic distance. force-directed graphs do not represent genetic distance as accurately as phylogenetic trees, since the distances between nodes, although optimized to reflect relatedness, are constrained by the fruchterman-reingold algorithm to the best representation in two dimensions.
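as an illustration of how a tree can be generated from a distance matrix by agglomerative clustering, the following is a minimal pure-python sketch of upgma. it is not the software used in this study; the function name `upgma`, the nested-tuple tree format and the four-taxon distance matrix are assumptions made purely for illustration.

```python
from itertools import combinations


def upgma(names, dist):
    """Cluster taxa by UPGMA.

    names: list of taxon labels.
    dist:  symmetric matrix (list of lists) of pairwise distances.
    Returns a nested tuple (left, right, height), where height is the
    distance from the internal node down to its leaves.
    """
    # Each active cluster maps an id to (subtree, number_of_leaves).
    clusters = {i: (name, 1) for i, name in enumerate(names)}
    d = {frozenset((i, j)): dist[i][j]
         for i, j in combinations(range(len(names)), 2)}
    next_id = len(names)
    while len(clusters) > 1:
        # Merge the closest pair of active clusters.
        pair = min(d, key=d.get)
        a, b = sorted(pair)
        (tree_a, n_a), (tree_b, n_b) = clusters.pop(a), clusters.pop(b)
        height = d.pop(pair) / 2
        # Distance to every other cluster is the size-weighted average.
        for c in clusters:
            d_ac = d.pop(frozenset((a, c)))
            d_bc = d.pop(frozenset((b, c)))
            d[frozenset((next_id, c))] = (n_a * d_ac + n_b * d_bc) / (n_a + n_b)
        clusters[next_id] = ((tree_a, tree_b, height), n_a + n_b)
        next_id += 1
    return next(iter(clusters.values()))[0]


# Hypothetical distances between four sequences A-D.
toy_names = ["A", "B", "C", "D"]
toy_dist = [
    [0, 2, 6, 10],
    [2, 0, 6, 10],
    [6, 6, 0, 10],
    [10, 10, 10, 0],
]
toy_tree = upgma(toy_names, toy_dist)
```

upgma assumes a constant rate of change along all branches, which is why every leaf ends up at the same depth; neighbour-joining relaxes that assumption.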
however, force-directed graphs again allow easier simultaneous comparison of several data sets than phylogenetic trees. fig. 1 would be impossible to create on a single page if trees were used instead of force-directed graphs. trees represent ancestral sequences as nodes on the tree, with only existing taxa as leaves. force-directed graphs, by contrast, allow ancestral sequences to be represented in the same way as existing ones. fig. 1c shows that ancestral sequences do not necessarily appear as outliers in force-directed graphs. indeed, for genera flavivirus, hepacivirus, orthobunyavirus and orthohantavirus in particular, the insertion of the reconstructed ancestral sequence into the force-directed graph in fig. 1c does not overly distort its original shape in fig. 1a-b. the reason for this becomes apparent when one considers a phylogenetic tree represented in unrooted "star" format. the ancestral sequence is then at the centre of the star topology and it can be seen that the genetic distance from the root to any particular leaf sequence may often be less than for many pairwise leaf sequence combinations. we did not perform calculation of centroid nearest neighbours (cnns) for alignments incorporating reconstructed ancestral sequences, but we are tempted to speculate that many of the ancestral sequences would have been cnns, had they been included.
table 6. homology modelling the common ancestor for each genus. templates are as given in table 2. targets are the reconstructed ancestral rdrp (or reverse transcriptase for lentivirus) sequences. rmsd: root mean square deviation in angstroms between template and model when superposed in moe. good and poor quality are determined by the following thresholds: φ-ψ < 2%; qmean z-score > -4.00; rmsd < 2 Å.
table 7. mean model (or structure) quality. the top line shows the mean quality scores for the solved structures used.
the other lines show the mean quality scores for the models produced at various levels of taxonomic distance between template and target. good and poor quality are determined by the following thresholds: φ-ψ < 2%; qmean z-score > -4.00; rmsd < 2 Å. numbers in brackets indicate the revised scores if the model for imjin thottimvirus is moved out of the intra-genus category and into the intra-family category in the light of its subsequent transfer into the new genus thottimvirus.
fig. 4 (caption fragment; the opening is lost in extraction). ramachandran plots mark outliers with a cross and text label. the z-score graphics show model quality on a sliding scale from low quality to high quality. qmean4 shows the overall z-score, "all atom" shows the average z-score for all of the atoms in the model, "cbeta" the z-score for all cβ carbons, "solvation" is a measure of how accessible the residues are to solvents, and "torsion" is a measure of torsion angle for each residue compared to adjacent residues.
it is important to remember that homology models are theoretical constructions and caution must be exercised in treating them as input material for further experiments. among the various statistics for assessment of model quality, φ-ψ outlier percentage is a measure of the proportion of implausible dihedral angles in the model, and indicates where parts of the model backbone are likely to be incorrectly predicted. nevertheless, it is also important not to become too dependent on statistics such as φ-ψ outlier percentage, as "bad" angles do occasionally occur in solved structures. for instance in the present study, the thresholds of <0.05% for a very high quality model, and <2% for a good quality model, given by lovell et al. [19] would suggest that six of the twelve template solved structures used here (table 2) would not have been assessed as "very high quality" had they been models rather than solved structures.
indeed the templates from indiana vesiculovirus and rotavirus a have more than 0.5% φ-ψ outliers, and also have poor quality scores for qmean. these two structures also have the poorest resolution of any of our templates, at > 3 Å. the poor quality scoring may therefore simply be a consequence of uncertainties in positioning of atoms in these structures. one might reasonably posit that the use of template solved structures having such issues might influence the resulting models to contain the same outliers. however, the model for rotavirus i has a lower level of φ-ψ outliers than its rotavirus a template (table 3). as might be expected, production of high quality models becomes more difficult as the genetic distance between target and template increases, as shown in tables 3-5. nevertheless, even at the level of template-target pairs in separate genera (table 4), the average performance is acceptable, as summarized in table 7. we therefore suggest that homology modelling may be used to produce rdrp models for research use even for genera where no solved structure exists, provided a template structure exists within the same family. here, we provide examples (table 4) of such successful inter-genus, intra-family models for genera coltivirus and parechovirus. our inter-genus models for lyssavirus and spumavirus are slightly less successful. moving to the next taxonomic level, models with template-target pairs in separate families (table 5) are generally less successful. one exception is our model for family phenuiviridae, which is better than some of the intra-family models. this is encouraging, since phenuiviridae is a family without any solved rdrp structure.
homology models have been produced at much larger taxonomic distances than those dealt with here, for instance from bacteria to eukaryotes [31] , so it should be stressed that we make no claim for the generality of our findings outside of the viral orders under consideration, or for proteins other than rdrp. multi-domain proteins in particular, may produce higher quality models for some domains than others. one surprising result was the high quality of the models of reconstructed ancestral sequences (table 6 , summarized in table 7 ). as previously discussed, this may be due to the fact that the ancestral sequence is, assuming a regular molecular clock, potentially equally related to all descendent members of its genus. in this paper, we calculated centroid nearest neighbours (cnns) as the central points in sequence space for each genus (fig. 1) . a reconstructed ancestral sequence may also be considered as a candidate central point. the value of central points is that they may serve as targets that could be used to make models representative of their genus as a whole. for instance, the shortest-path, mean and median cnns of genus orthohantavirus are sequences 16, 22 and 7 (see supplementary table for a list of sequences for each genus) , representing sin nombre orthohantavirus, rockport orthohantavirus and cao bang orthohantavirus respectively. the partial solved structure used as the template for modelling in the genus orthohantavirus in the present paper is from hantaan orthohantavirus (5ize, see table 2 ) and the target used, imjin thottimvirus (sequence 27 in orthohantavirus panel of fig. 1) , is now classified as belonging to a new genus thottimvirus (table 3) . the three cnns, sin nombre orthohantavirus, rockport orthohantavirus and cao bang orthohantavirus are 71%, 64% and 75% identical to 5ize respectively, whereas imjin thottimvirus is only 58% identical. 
the latter was of course chosen to test the effectiveness of intra-genus homology modelling over as wide a genetic distance as possible (see section 2.5). for the performance of subsequent experimental procedures on orthohantavirus rdrps, for instance docking to discover novel anti-viral compounds, a homology model corresponding to one of the three cnns mentioned above or to the reconstructed ancestor (table 6) would be the preferred target, along with the existing solved structure. where a solved rdrp structure exists in a genus, it should be used. however, if that solved structure is not a cnn, a homology model of a cnn or ancestral sequence should be produced for comparative purposes. where no solved rdrp structure exists in a genus, a structure from another genus in the same family may be used. on the basis of our investigations, we recommend a procedural flowchart for selection of an rdrp structure for further study, for instance docking to discover novel anti-viral compounds, in any rna virus genus of interest (fig. 5). where a solved structure exists within a genus, it is the obvious choice for further experiments. however, where that solved structure is far from any of the cnn sequences of the genus, as judged by the force-directed graph, a cnn may also be homology modelled for comparative purposes, using the existing solved structure as a template. any differential performance of the solved structure and the homology model in, for instance, a docking experiment, may give clues as to the generality of conclusions derived from the solved structure alone. a reconstructed ancestral rdrp may also be used as an alternative to, or in addition to, a cnn. the limits of homology modelling would appear, on the basis of the results presented here, to be at the intra-family, inter-genus level. template-target pairs in different viral families are unlikely to be of practical use, as the predicted quality of the resulting models is low.
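the selection procedure described in the preceding paragraph can be written down as a small decision function. this is a reading of the prose only (fig. 5 itself is not reproduced here), and the function name, argument names and returned strings are all hypothetical.

```python
def choose_rdrp_structures(solved_in_genus, solved_is_cnn, solved_in_family):
    """Suggest which structures or models to take forward for a genus.

    solved_in_genus:  a solved RdRp structure exists in the genus.
    solved_is_cnn:    that structure is close to a centroid nearest
                      neighbour (CNN) of the genus.
    solved_in_family: a solved structure exists in another genus of
                      the same family.
    Returns a list of recommendations; an empty list means no template
    within the family is available.
    """
    if solved_in_genus:
        picks = ["solved structure from the genus"]
        if not solved_is_cnn:
            # Model a more central sequence for comparison.
            picks.append("homology model of a CNN or reconstructed "
                         "ancestor, using the solved structure as template")
        return picks
    if solved_in_family:
        return ["homology model built on a template from another "
                "genus in the same family"]
    # Inter-family templates gave low predicted quality in this study.
    return []
```

for example, a genus with a solved structure that lies far from its cnns would return two entries, so that both can be compared in a subsequent docking experiment.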
our models were produced using moe, and we have not performed comparisons using other modelling tools, such as swiss-model [31] or modeller [32]. we feel that it is unlikely that significant differences in output would be produced, but when the object of the exercise is drug discovery, we recommend that the protocol in fig. 5 be implemented using several alternative modelling software packages. crystallographic structural genome projects are badly needed to close the sequence-structure gap. in the meantime, systematic attempts to fill the gaps via homology modelling may be useful. however, for many taxa (all of the order nidovirales and much of mononegavirales) the paucity of solved structures to act as templates remains a serious obstacle.
all code, inputs and outputs are available from: https://doi.org/10.17635/lancaster/researchdata/276.
[1] a three-dimensional model of the myoglobin molecule obtained by x-ray analysis
[2] protein modeling: what happened to the "protein structure gap"?
[3] the high throughput sequence annotation service (ht-sas) - the shortcut from sequence to true medline words
[4] the evolution and emergence of rna viruses
[5] crystal structure of the full-length japanese encephalitis virus ns5 reveals a conserved methyltransferase-polymerase interface
[6] molecular docking revealed the binding of nucleotide/side inhibitors to zika viral polymerase solved structures
[7] using bioinformatics tools for the discovery of dengue rna-dependent rna polymerase inhibitors
[8] epidemiological characteristics of human-infective rna viruses
[9] the rcsb protein data bank: integrative view of protein, gene and 3d structural information
[10] reference sequence (refseq) database at ncbi: current status, taxonomic expansion, and functional annotation
[11] mafft: iterative refinement and additional methods
[12] mega7: molecular evolutionary genetics analysis version 7.0 for bigger datasets
[13] muscle: multiple sequence alignment with high accuracy and high throughput
[14] network visualizations of relationships in psychometric data
[15] graph drawing by force-directed placement
[16] some properties of classical multidimensional scaling
[17] protonate 3d: assignment of ionization states and hydrogen coordinates to macromolecular structures
[18] the generalized born/volume integral implicit solvent model: estimation of the free energy of hydration using london dispersion instead of atomic surface area
[19] structure validation by calpha geometry: phi,psi and cbeta deviation
[20] on the accuracy of homology modeling and sequence alignment methods applied to membrane proteins
[21] qmean: a comprehensive scoring function for model quality assessment
[22] toward the estimation of the absolute quality of individual protein structure models
[23] evolutionary trees from dna sequences: a maximum likelihood approach
[24] fastml: a web server for probabilistic reconstruction of ancestral sequences
[25] sars and mers: recent insights into emerging coronaviruses
[26] taxonomy of the order mononegavirales: second update
[27] emerging phleboviruses
[28] taxonomy of the order bunyavirales: second update
[29] construction of phylogenetic trees for proteins and nucleic acids: empirical evaluation of alternative matrix methods
[30] the neighbor-joining method: a new method for reconstructing phylogenetic trees
[31] swiss-model and the swiss-pdbviewer: an environment for comparative protein modeling
[32] modeller: generation and refinement of homology-based protein structure models
supplementary data to this article can be found online at https://doi.org/10.1016/j.jmgm.2019.07.014.
key: cord-266794-oyppubq5 authors: zhang, dachuan; zhang, tong; liu, sheng; sun, dandan; ding, shaozhen; cheng, xingxiang; cai, pengli; ren, ailin; han, mengying; liu, dongliang; jia, cancan; gong, linlin; zhang, rui; xing, huadong; tu, weizhong; chen, junni; hu, qian-nan title: sars2020: an integrated platform for identification of novel coronavirus by a consensus sequence-function model date: 2020-09-01 journal: bioinformatics doi: 10.1093/bioinformatics/btaa767 sha: doc_id: 266794 cord_uid: oyppubq5 motivation: the 2019 novel coronavirus outbreak has significantly affected global health and society. thus, predicting biological function from pathogen sequence is crucial and urgently needed. however, little work has been performed to identify viruses by the enzymes that they encode, and which are key to pathogen propagation. results: we built a comprehensive scientific resource, sars2020, that integrates coronavirus-related research, genomic sequences, and results of anti-viral drug trials. in addition, we built a consensus sequence-catalytic function model from which we identified the novel coronavirus as encoding the same proteinase as the severe acute respiratory syndrome virus. this data-driven sequence-based strategy will enable rapid identification of agents responsible for future epidemics. availability: sars2020 is available at http://design.rxnfinder.org/sars2020/. supplementary information: the 2019 novel coronavirus (2019-ncov) outbreak is an ongoing pandemic. as of 20 may 2020, 4,900,647 cases were confirmed, and 320,107 deaths were attributed to the virus. on 30 jan. 2020, the world health organization declared the outbreak a public health emergency of international concern. identification of the virus is crucial for public health authorities to contain the spread of the disease and for researchers to find methods to cure the disease (wang, et al., 2020) . the genome sequence of the 2019-ncov became available on 10 january 2020 (wu, et al., 2020) . 
however, sequence alone is insufficient for accurate identification because pathogens are not defined by "taxonomy". to circumvent this limitation, we built an integrated 2019-ncov scientific resource platform and a consensus sequence-catalytic function model with which we developed novel methodology to analyze pathogen sequences for catalytic functions. this model predicted that the 2019-ncov has an enzymatic activity unique to sars viruses. we systematically collected reports of coronavirus-related research, genomic sequences, biochemical reactions, government policies, media public opinion, and anti-viral drugs in clinical trial (table s1 , hu, et al., 2011; khan, et al., 2020; shu and mccauley, 2017) . this information was used to build sars2020, an integrated scientific resource about 2019-ncov, to provide foundation data for researchers in various fields. for data quality, we imposed strict evaluation and validation criteria. all 2019-ncov related data were checked one-by-one to ensure authenticity. in addition, we integrated a consensus sequence-function model (zhang, et al., 2020) , a genome browser (ham, et al., 2012) , and a catalytic function annotation tool (dawson, et al., 2017) into the platform to assist in the research of novel viruses. sequence-function model: we adopted a consensus strategy to annotate enzymatic functions of biological sequences. for sequence function annotation, the family classification method captures common properties from the samples and extracts their feature vectors using machine learning algorithms, then merges the sequences into clusters or families. this consensus strategy enables efficient integration of these computational resources to maximize the accuracy and comprehensiveness of enzyme function prediction. web server: sars2020 runs on a linux server under a nginx environment. the backend program and algorithm were written in python using the django framework in combination with mysql to manage the data. 
bootstrap, css, and javascript were used to implement the frontend data presentation and interactions. identification of 2019-ncov: we obtained the coding sequences of 2019-ncov from ncbi (nc_045512) and constructed a gene model from sequence based on an interpolated markov model. we used the long-orfs tool from glimmer3 (delcher, et al., 2007) to identify the coding regions of bacterial, archaeal, and viral genomes. protein translation of coding regions was performed with biopython (cock, et al., 2009). then we used a consensus sequence-catalytic function model provided by sars2020 to analyze the pathogen sequence for likely catalytic functions. the sars2020 system is an integrated scientific resource platform about 2019-ncov. at present, the system includes ~60,000 units of information. it provides powerful assistance for scientists to grasp the progress of 2019-ncov research and to share data. sars2020 is also a platform to assist in the identification of new viruses. we analysed the 2019-ncov genome by the method described above. all predicted catalytic functions were derived from orf1ab (geneid: 43740578), which seems to encode multiple proteins (fig 1). the most likely predicted catalytic activity was sars coronavirus main proteinase, whose enzyme commission (ec) number is 3.4.22.69. this prediction suggested that 2019-ncov was most likely a sars virus, and this result was consistent with the conclusion of the international committee on taxonomy of viruses. at the same time, we also predicted other possible catalytic functions in the 2019-ncov genome, including rna-directed rna polymerase (ec: 2.7.7.48), dolichyl-phosphate-mannose-protein mannosyltransferase (ec: 2.4.1.109), nad+ adp-ribosyltransferase (ec: 2.4.2.30), and ubiquitinyl hydrolase 1 (ec: 3.4.19.12). these predicted functions will provide valuable reference for further study of biological activity and pathogenesis of the 2019-ncov.
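the coding-region detection and translation steps described above can be illustrated with a much simpler stand-in. the sketch below is not glimmer3 (it uses no interpolated markov model) and does not call biopython; it only scans the three forward reading frames for atg-to-stop open reading frames and translates them with the standard genetic code. the function names and the toy sequence are assumptions made for illustration.

```python
# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(dna):
    """Translate a DNA coding sequence, stopping at the first stop codon."""
    dna = dna.upper().replace("U", "T")
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE.get(dna[i:i + 3], "X")  # X marks unknown codons
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


def find_orfs(dna, min_codons=3):
    """Return (start, end, protein) for ATG..stop ORFs on the forward strand.

    start and end are 0-based, end-exclusive, and end includes the stop codon.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and CODON_TABLE.get(codon) == "*":
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, translate(dna[start:i])))
                start = None
    return orfs


# Toy sequence, not taken from any real genome.
toy_orfs = find_orfs("CCATGGCTAAAGGTTAACC")
```

a real gene finder would also scan the reverse strand and score candidate orfs statistically; this sketch only shows the basic bookkeeping of frames, start codons and stop codons.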
we built an integrated platform to assist 2019-ncov research, and we proposed a novel consensus sequence-function model for using genome sequence data to identify unknown species. our data-driven sequence-based strategy will enable rapid identification of constantly emerging pathogens.
biopython: freely available python tools for computational molecular biology and bioinformatics
cath: an expanded resource to predict protein function through structure and sequence
identifying bacterial genes and endosymbiont dna with glimmer
design, implementation and practice of jbei-ice: an open source biological part registry platform and tools
rxnfinder: biochemical reaction search engines using molecular structures, molecular fragments and reaction similarity
phylogenetic analysis and structural perspectives of rna-dependent rna-polymerase inhibition from sars-cov-2 with natural products
global initiative on sharing all influenza data - from vision to reality
human intestinal defensin 5 inhibits sars-cov-2 invasion by cloaking ace2
author correction: a new coronavirus associated with human respiratory disease in china
bio2rxn: sequence-based enzymatic reaction predictions by a consensus strategy
key: cord-010260-8lnpujip authors: anthonsen, henrik w.; baptista, antónio; drabløs, finn; martel, paulo; petersen, steffen b.
title: the blind watchmaker and rational protein engineering date: 1994-08-31 journal: j biotechnol doi: 10.1016/0168-1656(94)90152-x sha: doc_id: 10260 cord_uid: 8lnpujip in the present review some scientific areas of key importance for protein engineering are discussed, such as problems involved in deducting protein sequence from dna sequence (due to posttranscriptional editing, splicing and posttranslational modifications), modelling of protein structures by homology, nmr of large proteins (including probing the molecular surface with relaxation agents), simulation of protein structures by molecular dynamics and simulation of electrostatic effects in proteins (including ph-dependent effects). it is argued that all of these areas could be of key importance in most protein engineering projects, because they give access to increased and often unique information. in the last part of the review some potential areas for future applications of protein engineering approaches are discussed, such as non-conventional media, de novo design and nanotechnology. nature has evolved using several types of random mutations in the genetic material as a fundamental mechanism, thereby creating new versions of existing proteins. by natural selection nature has given a preference to organisms with proteins which directly or indirectly made them better adapted to their environment. thus nature works like a blind watchmaker, trying out an endless number of combinations. this may seem to be an inefficient approach by industrial standards, but nevertheless nature has been able to develop some highly complex and sophisticated designs, simply by the power of natural selection over millions of years, occurring in a large number of parallel processes. by virtue of reproduction several copies of each organism have been able to test the effect of different mutations in parallel. 
it is quite probable that the mutation frequency was higher in ancient species (doolittle, 1992), although it is still possible to find highly mutable loci in genes involved in adaptation to the environment (moxon et al., 1994).
enzymes have been used by man for thousands of years for modification of biological molecules. the use of rennin (chymosin) in rennet for cheese production is a relevant example. and with increased knowledge about proteins, genes and other biological macromolecules scientists started to look at methods for making modified proteins with new or improved properties. at first this was done by speeding up nature's own approach, by increasing the number of mutations (e.g., by using chemicals or radiation) and by using a very strong selection based on tests for specific properties. with the introduction of new and powerful techniques for structure determination and site directed mutagenesis, it is now possible to do rational protein modification. rather than testing out a large number of random mutations, it has become feasible to identify key residues within the protein structure, to predict the effect of changing these residues, to implement these changes in the genetic material, and finally to produce large amounts of modified proteins. this is protein engineering. there are several reviews describing the fundamental ideas in protein engineering, see fersht and winter (1992) for a recent one.
fig. 1. starting with a protein with known sequence and properties, we make a 3-d model of the protein from experimental structure data or by homology. by modelling and simulation we identify mutations that will modify selected properties of the protein (the design part of the process), these mutations are implemented at the dna level and expressed in a suitable organism (the production part of the process), and the success of the design is verified by experimental methods.
the basic protein engineering process is shown in fig.
1 (see also petersen and martel (1994)). in most cases it starts out with an unmodified protein with well-characterised properties. for some reason we want to modify this protein. in the case of an enzyme we may want to make it more stable, alter the specificity or increase the catalytic activity. first we enter the design part of the protein engineering process. based on structural data we create a computer model of the protein. by a combination of molecular modelling and experimental methods the correlation between relevant properties and structural features is established, and changes affecting these properties can be identified and evaluated for implementation. in more and more cases the effect of these changes can be simulated, and the modifications can be optimised with respect to these simulations. as soon as a new design has been established we may enter the production part of the process. the necessary mutations must be implemented in the genetic material, this genetic material is introduced into a production organism, and the resulting modified protein can (in most cases) be extracted from a bioprocess. this protein can be tested with respect to relevant properties, and if necessary it may be used as a basis for re-entering the design part of the protein engineering process. after a few iterations we may reach an optimal design. there are several examples of successful protein engineering projects. protein engineering may be used to improve protein stability (kaarsholm et al., 1993), enhance or modify specificity (getzoff et al., 1992; witkowski et al., 1994), adapt proteins to new environments (arnold, 1993; gupta, 1992), or to engineer novel regulation into enzymes (higaki et al., 1992). in some cases even de novo design of new proteins may be relevant, using knowledge gained from existing structures (kamtekar et al., 1993; shakhnovich and gutin, 1993; ghadiri et al., 1993; ball, 1994).
in a truly multidisciplinary project chymosin mutants with optimal activity at increased ph values compared to wild-type chymosin were designed and produced (pitts et al., 1992). point mutations changing the charge distribution of superoxide dismutase have been used to increase reaction rate by improved electrostatic guidance (getzoff et al., 1992). a project on converting trypsin into chymotrypsin has been important for understanding the role of chymotrypsin surface loops (hedstrom et al., 1992), a serine active site hydrolase has been converted into a transferase by point mutations (witkowski et al., 1994), and mutations in insulin aiming at increased folding stability have given an insulin with enhanced biological activity (kaarsholm et al., 1993). an example of a rational de novo project (as opposed to the random approach used, e.g., in generation of catalytic antibodies) is the design of an enzymatic peptide catalysing the decarboxylation of oxaloacetate via an imine intermediate, in which a very simple design gave imine formation three to four orders of magnitude faster than simple amine catalysts. in some cases it may also be an interesting approach to incorporate nonpeptidic residues into otherwise normal proteins (baca et al., 1993), or to build de novo proteins by assembling peptidic building blocks on to a nonpeptidic template (tuchscherer et al., 1992). it has been shown that incorporation of nonpeptidic residues into β-turns of hiv-1 protease gives a more stable enzyme (baca et al., 1993). the main problem with this approach is how to incorporate the non-standard residues. in the hiv-1 protease case solid-phase peptide synthesis combined with traditional organic synthesis was used; others have suggested that the degeneracy of the genetic code may be used to incorporate novel residues via the standard protein synthesis machinery of the cell (fahy, 1993).
in the present review we will look at the design part of the protein engineering process, with emphasis on some of the more difficult steps, especially homology based modelling in cases with very low sequence similarity, nuclear magnetic resonance (nmr) of very large proteins and modelling of electrostatic interactions. in the last part of the review we will discuss some possible future directions for protein engineering and protein design.
any protein engineering project is based on information about the protein sequence. this information may stem from either direct protein sequencing or a deduced translation of the dna/rna sequence. the amount of information on protein and nucleic acid sequences, as well as on relevant data like 3-d structures and disease-related mutations, is growing at a very rapid pace, and novel databases and computer tools give increased access to these data (coulson, 1993). it is very reasonable to expect that projects like the human genome project will succeed in providing us with sequence information about every single gene in our chromosomes within the next decade. this information will be of key importance for our understanding of the biology, development and evolution of man. it should, however, be kept in mind that the sequence itself may give us little information about regulation of gene expression, i.e., under what conditions genes are expressed, if they are expressed at all.
most protein sequences have been deduced from gene sequences. it is in most cases a priori assumed that a trivial mapping exists between the two sets of information. however, this may not necessarily be the case. in fig. 2, the various steps currently recognised as being of importance for the production of the mature enzyme are shown, and several of these steps may affect the mapping from gene to protein.
fig. 2. after transcription, the mrna may be edited, a process that now has been reported in man, plants and primitive organisms (trypanosoma brucei). the mrna is then translated into a protein sequence. this protein sequence can subsequently be modified, leading to n- or o-glycosylation, phosphorylation, sulfation or the covalent attachment of fatty acid moieties to the protein. at this stage the protein is ready for transport to its final destination, which may be right where it is at the time of synthesis, but the destination may also be extracellular or in secluded compartments such as the mitochondria or lysosomes. in this case the protein is equipped with a signal sequence. after arrival at its destination the protein is processed, often involving proteolytic cleavage of the signal sequence. shorter routes to the functional protein with fewer steps undoubtedly exist as well as routes with interchanged steps of processing. finally, the catabolism of the protein is also part of the process, but has been left out in the figure.
posttranscriptional editing consists of modifications at the mrna level affecting the mapping of information from gene to protein, often involving modification, insertion or deletion of individual nucleotides at specific positions (cattaneo, 1994). currently only speculative models exist for the underlying molecular mechanism(s) for posttranscriptional editing. in the case of mammalian apolipoprotein b two forms exist, both originating from a single gene. the shorter form, apo b48, arises by a posttranscriptional mrna editing whereby cytidine deamination produces a uaa termination codon (teng et al., 1993). in the ampa receptor subunit glur-b mrna editing is responsible for changing a glutamine codon (cag) into an arginine codon (cgg) (higuchi et al., 1993). this editing has a pronounced effect on the ca2+ permeability of the ampa receptor channel, and it seems to be controlled by the intron-exon structure of the rna.
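the effect of the two editing events just mentioned can be checked against the standard genetic code. the sketch below uses a toy mrna (not the apolipoprotein b or glur-b sequence) to show how a single c to u change turns a glutamine codon (caa) into a uaa stop codon and truncates the product, and how cag to cgg replaces glutamine with arginine. the helper names are hypothetical.

```python
# Standard genetic code in RNA letters, built from the UCAG ordering.
BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(mrna):
    """Translate from position 0 until a stop codon or the end."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        aa = CODON[mrna[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


def edit_base(mrna, position, new_base):
    """Return a copy of the mRNA with one nucleotide replaced."""
    return mrna[:position] + new_base + mrna[position + 1:]


# Toy transcript: AUG GCU CAA GGU UAA
unedited = "AUGGCUCAAGGUUAA"
edited = edit_base(unedited, 6, "U")   # CAA -> UAA at the third codon

full_length = translate(unedited)
truncated = translate(edited)
```

the same lookup table confirms the glur-b case, since `CODON["CAG"]` is glutamine (q) and `CODON["CGG"]` is arginine (r).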
similar mrna editing has been reported in the related kainate receptor subunits glur-5 and glur-6, where two additional codons in the first trans-membrane region are altered (sommer et al., 1991; köhler et al., 1993). it is also interesting in this context that certain human genetic diseases have been related to reiteration of the codon cag (green, 1993). mrna editing in plant mitochondria and chloroplasts has also been reported (gray and covello, 1993). here the posttranscriptional mrna editing consists almost exclusively of c to u substitutions. editing occurs predominantly inside coding regions, mostly at isolated c residues, and usually at the first or second position of the codons, thus almost always changing the amino acid compared to that specified by the unedited codon. in trypanosoma brucei some extensive and well-documented cases of posttranscriptional editing have been reported (read et al., 1992; harris et al., 1992; adler et al., 1991). the editing takes place at the mitochondrial transcript level, where a large number of uridine residues are added to or deleted from the mrna, which is subsequently translated. several non-editing processes affecting the transcription/translation steps are also known. although the ribosomes translate the message provided by the mrna in an almost perfect manner (with error rates less than $5 \times 10^{-4}$ per amino acid incorporated), it appears that the mrna in certain cases contains information that forces the ribosome to read the nucleic acid information in a non-canonical fashion (farabaugh, 1993). a special case that may deserve some attention as well is the selenoproteins, where selenocysteine is introduced into the protein by an alternative interpretation of selected codons (böck et al., 1991; yoshimura et al., 1990; farabaugh, 1993). translational frameshifting has been found in retroviruses, coronaviruses, transposons and a prokaryotic gene, leading to different translations of the same gene.
two cases of translational 'hops' have been reported, where a segment of the mrna is skipped by all ribosomes; in the two cases 50 and 500 nucleotides were skipped, respectively (farabaugh, 1993). to our knowledge posttranscriptional editing and related processes are uncommon but definitely present in humans. it is, therefore, important to understand precisely how these mechanisms work, in order to correctly deduce the protein sequence from the gene sequence. the most common posttranslational modifications are side chain modifications like phosphorylations, glycosylations and farnesylations, as well as others. however, some modifications may also affect the (apparent) gene to protein mapping. posttranslational processing may involve removal of both terminal and internal protein sequence fragments. in the latter case an internal protein region is removed from a protein precursor, and the external domains are joined to form a mature protein (hodges et al., 1992; xu et al., 1993). interestingly, all intervening protein sequences reported so far have sequence similarity to homing endonucleases (doolittle, 1993), which also can be found in coding regions of group i introns (grivell, 1994). posttranslational modifications like phosphorylation, glycosylation, sulfation, methylation, farnesylation, prenylation, myristylation and hydroxylation should also be considered in this context. they modify properties of individual residues and of the protein, and may thus make surface prediction, dynamics simulations and structural modelling in general more complex. the residues that are specifically prone to such modifications are tyrosine (phosphorylation and sulfation), serine and threonine (o-glycosylation), asparagine (n-glycosylation), proline and lysine (hydroxylation) and lysine (methylation). in addition, glutamic acid residues can become γ-carboxylated, leading to high affinity towards calcium ions (alberts, 1983).
specific transferases are involved in the modification, e.g., tyrosylprotein sulfotransferases (suiko et al., 1992) and farnesyl-protein transferases (omer et al., 1993). phosphorylation of amino acid residues is an important way of controlling the enzymatic function of key enzymes in the metabolic and signalling pathways. tyrosine kinases phosphorylate tyrosine residues, thus introducing an electrostatic charge at a residue which under normal physiological ph is uncharged. phosphorylation is central to the function of many receptors, such as the insulin and insulin-like growth factor i receptors. given the possibility that several modifications may be introduced in the sequence when we move from gene to mature protein, the task of deducing a protein sequence from the gene sequence may be more complex than we normally assume. although the protein sequence itself is a valuable starting point, the optimal basis for a rational protein engineering project will be a full structure determination of the protein. in many cases this turns out to be an expensive and time-consuming part of the project. most structure determinations are based on x-ray crystallography. this approach may give structures of atomic resolution, but is limited by the fact that stable high quality crystals are needed. many proteins are very difficult to crystallise, in particular many structural and membrane-associated proteins. a large number of important x-ray structures have been published over the last few years, and the structures of the hhai methylase (klimasauskas et al., 1994), the tbp/tata-box complex (kim et al., 1993a; kim et al., 1993b) and the porcine ribonuclease inhibitor (kobe and deisenhofer, 1993) are mentioned as examples only. nmr may be an alternative in many cases, as the proteins can be studied in solution, and for some experiments they can even be membrane associated.
however, nmr is limited to relatively small molecules, and even with incorporation of labelling in the protein the upper limit for a full structure determination using current state of the art methods seems to be close to 30 kda. some novel techniques for studying structural aspects of larger proteins will be discussed (vide infra). representative examples of important nmr structures may be interleukin 1β (clore et al., 1991a), the glucose permease iia domain (fairbrother et al., 1992) and the human retinoic acid receptor-β dna-binding domain (knegtel et al., 1993). cryo electron microscopy (cem) is a relatively new approach to protein structure determination. the resolution of the structures is still lower than that of the corresponding x-ray structures, and a 2-dimensional crystal is a prerequisite. despite this, cem appears to be a very promising approach to structure determination of membrane associated proteins that can form 2-dimensional crystals. cem has been used to study the nicotinic acetylcholine receptor at 9 Å resolution (unwin, 1993) and the atp-driven calcium pump at 14 Å resolution (toyoshima et al., 1993), and in a combined approach using high resolution x-ray data superimposed on cem data the structures of the actin-myosin complex (rayment et al., 1993) and of the adenovirus capsid (stewart et al., 1993) have been studied. the recent structure by kühlbrandt et al. (1994) of the chlorophyll a/b-protein complex at 3.4 Å resolution shows that the resolution of cem is rapidly approaching the resolution of most x-ray protein data. scanning tunnelling microscopy (stm) is another new approach for studying protein structures (amrein and gross, 1992; lewerenz et al., 1992; haggerty and lenhoff, 1993). the method is interesting because of its very high sensitivity, as individual molecules may be examined. the method will give a representation of the surface of the molecule, rather than a full structure determination.
however, it is possible that both cem and stm can be used for identification of protein similarity. if data from these methods show that the overall shape of a protein is similar to some other known high resolution protein structure, then the known structure may be evaluated as a potential template for homology based modelling. we believe that such a model can either be used as an improved starting point for a full structure determination (i.e., for doing molecular replacement on x-ray data), or as a low resolution structure determination by itself.
in homology based modelling a known structure is used as a template for modelling the structure of a homologous sequence, based on the assumption that the structures are similar. this is a very simple and rapid process, compared to a full structure determination. the sequences may be homologous in the strict sense, meaning that there is an evolutionary relationship between the sequences. the same approach may obviously also be used for sequences that are similar, but not necessarily evolutionarily related, and in that case we probably should talk about similarity based modelling. however, in this paper we will use homology based modelling as a general term, especially since the distinction between homology and similarity may be difficult in many cases.
homology based modelling may turn out to be essential for the future of protein engineering. in fig. 3, the number of entries in the swissprot protein sequence database (bairoch and boeckmann, 1992) and the brookhaven protein structure database (bernstein et al., 1977; abola et al., 1987) are shown as a function of time. as we can see, there is a very significant gap between the number of sequences and the number of structures. this gap is in fact even larger than shown in fig. 3, as not all entries in the brookhaven database are unique structures.
a large number of entries are mutants of other structures or identical proteins with different substrates or inhibitors. there has been an exceptional growth in the number of protein structures over the last 2-3 years. however, it is unrealistic to assume that we will be able to get high resolution experimental structures of all known proteins. the structure determination process is too time consuming, and the sequence databases are growing at a far faster pace, as shown in fig. 3, especially as a consequence of several large-scale genome sequencing projects.
on the other hand, it may not really be necessary to do experimental structure determination of all proteins (ring and cohen, 1993). the assumption that similar sequences have similar structures (see fig. 4) has been proved valid several times and it seems to be true even for short peptide sequences as long as they come from proteins within the same general folding class. an interesting case which is to some degree an exception to this rule is the structure of hiv-1 reverse transcriptase (kohlstaedt et al., 1992). two units with identical sequence have similar secondary structure, but very different tertiary structure. however, this seems to be a rather exceptional case.
fig. 4. sequence and structure similarity (structure distance plotted against sequence distance). in most cases similar sequences have similar structures (region 1), and dissimilar sequences (i.e., measured by a standard mutation matrix) have dissimilar structures (region 2). in several cases quite dissimilar sequences have been shown to have very similar structures, at least with respect to individual domains (region 3). in very special cases we may have similar sequences with different structures (region 4), at least with respect to tertiary structure, showing that environment and binding to other proteins may be essential for the final conformation in some cases. however, in most cases it seems to be safe to assume that structures can be found in the lower grey triangle of this graph, indicating that structure is better conserved than sequence.
new approaches to general structure alignment (orengo; holm et al., 1992; alexandrov and go, 1993; lessel and schomburg, 1993) have made it possible to search for structurally conserved domains in proteins with very low sequence similarity (swindells, 1994). this is an important approach, as structure normally is better conserved than sequence (doolittle, 1992). several cases have been identified where the sequences are very different (at least by traditional similarity measures), whereas the three-dimensional structures are surprisingly similar. the identification of a globin fold in a bacterial toxin (holm and sander, 1993), and the similarity between the dsba protein and thioredoxin (martin et al., 1993) are relevant examples. recently the structure of the human serum amyloid p component was shown to be similar to concanavalin a and pea lectin, despite only 11% sequence identity (emsley et al., 1994), and the similarity between hen egg-white lysozyme and a lysozyme-like domain in bacterial muramidase "is remarkable in view of the absence of any significant sequence homology", as noted by thunnissen et al. (1994). this shows that there probably is a limited number of protein folds, and this number must be lower than the number of sequence classes, defined as groups of similar protein sequences. recent estimates show that this number probably is close to 1000 different protein folds (chothia, 1992), and approx. 160 of these folds are known so far (burley, 1994; orengo et al., 1993). this means that rather than full structure determination of a very large number of proteins, it may be sufficient to do structure determination of only a few selected examples of each protein fold, and use this as a basis for homology based modelling of other proteins shown to have the same fold.
homology based modelling of the 3-d structure of a novel sequence can be divided into several steps. first, one or more templates must be identified, defined as known protein structures assumed to have the same fold as the trial sequence. then a sequence alignment between trial sequence and template is defined, and based upon this alignment an initial trial model can be built. this initial model must be refined in several steps, taking care of gap splicing, loops, side chain packing etc. the final model can be evaluated by several quality criteria for protein structures. an example of homology based modelling is the modelling of cinnamyl alcohol dehydrogenase based on the structure of alcohol dehydrogenase (mckie et al., 1993).
the protein folding problem is a fundamental problem in structural biology. this problem can be defined as the ab initio computation of a protein's tertiary structure starting from the protein sequence. this problem has not been solved and appears to be extremely difficult. if we want to solve the problem by computing an energy term for all conformations of a protein, defined by rotation around the $\phi$ and $\psi$ backbone angles of $n$ residues in 10 degree steps, we have to evaluate $36^{2(n-1)}$ alternatives, even without considering the side chains. for a peptide with 15 residues this corresponds to about $10^{44}$ conformations. a hypothetical computer with $10^{6}$ processors, each processor running at $10^{15}$ hz (the frequency of uv light) and completing the energy evaluation of one conformation per cycle, would need $3 \times 10^{15}$ years in order to test all conformations. the estimated age of the universe is $14 \times 10^{9}$ years. a more realistic approach is the use of molecular dynamics or monte carlo methods for simulation of protein folding.
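the enumeration estimate above can be checked with a few lines of python. this is a sketch only, with hypothetical function names; it assumes 36 states per backbone angle (10 degree steps), 2(n-1) angles for n residues, one energy evaluation per processor cycle, and a year of 365.25 days. the exact count for 15 residues is about 4 x 10^43, which the text rounds up to 10^44; the rounded value is what gives the quoted 3 x 10^15 years, while the exact count gives roughly 10^15 years, the same order of magnitude.

```python
import math

SECONDS_PER_YEAR = 365.25 * 24 * 3600


def backbone_conformations(n_residues, step_degrees=10):
    """number of phi/psi combinations for n residues on a regular angle grid."""
    states_per_angle = 360 // step_degrees
    return states_per_angle ** (2 * (n_residues - 1))


def years_to_enumerate(n_conformations, processors=10**6, rate_hz=10**15):
    """time needed if every processor evaluates one conformation per cycle."""
    seconds = n_conformations / (processors * rate_hz)
    return seconds / SECONDS_PER_YEAR


count = backbone_conformations(15)
print(f"conformations for 15 residues: 10^{math.log10(count):.1f}")
print(f"years with the exact count:    {years_to_enumerate(count):.1e}")
print(f"years with the rounded 10^44:  {years_to_enumerate(10**44):.1e}")
```

either figure exceeds the age of the universe by about five orders of magnitude, which is the point of the argument: exhaustive enumeration is not a usable strategy even for a short peptide.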
however, it is still very difficult to use this as an ab initio approach, both because folding is a very slow process compared to a realistic simulation time scale, and also because it is very difficult to distinguish between correctly and incorrectly folded structures using standard molecular mechanics force fields (novotny et al., 1984). a possible alternative approach may be to generate potential folds on a simplified lattice representation of possible residue positions (covell and jernigan, 1990; crippen, 1991). however, this approach is still very experimental.
some progress has been achieved in the area of secondary (rather than tertiary) structure prediction (benner and gerloff, 1993). studies of local information content indicate that 65% match may be an upper limit for single-sequence prediction methods (rao et al., 1993), whereas methods taking homology data into account may probably raise this limit to approx. 85%. methods based on neural networks and combinations of several prediction schemes seem to give good predictions, and especially methods using homology data from multiple alignments may give predictions at 70% match or better in many cases (salzberg and cost, 1992; boscott et al., 1993; rost and sander, 1993a; rost et al., 1993; levin et al., 1993). also methods taking potential residue-residue interactions into account, like the hydrophobic cluster analysis (hca), may be used for identification of potential secondary structure elements (woodcock et al., 1992). it has been shown that by restricting the prediction to a consensus region with stable conformation it is possible to make very reliable predictions (rooman and wodak, 1992). in one case, neural networks were shown to be capable of returning a limited amount of information on the tertiary structure (bohr et al., 1993).
fig. 5. structure retrieval by secondary structure. a flow chart for structure retrieval by secondary structure (right side) compared to retrieval by sequence (left side). please see the text for details. in this example the secondary structure library was generated using dssp (kabsch and sander, 1983), the secondary structure was predicted with the phd program (rost et al., 1994), and fasta (pearson, 1990; pearson and lipman, 1988) was used to search the secondary structure library and the nrl-3-d databases (namboodiri et al., 1988; george et al., 1986). only the secondary structure based method was able to identify the hla class i structure as similar to the class ii structure. the ribbon representation of the hla class i antigen binding region used in this figure was generated with molscript (kraulis, 1991).
[...] be inconsistent, compared to the more sophisticated classification which can be achieved by a trained expert. recent studies show that the average agreement between alternative assignment methods used on identical structures is close to 65% for three methods (colloc'h et al., 1993), or 79% if only two methods are compared (woodcock et al., 1992). vadar is a new classification method which aims at a better agreement between manual and automatic assignment (wishart et al., 1994); to what degree this may influence prediction systems remains to be seen.
over the last few years it has been realised that the inverse folding problem is much easier to solve (bowie et al., 1991; blundell and johnson, 1993; bowie and eisenberg, 1993). the inverse folding problem can be defined as follows: given a known protein structure, identify all protein sequences which can be assumed to fold in the same way. a large number of protein structures must be available in order to use this as a general approach, as the relevant protein fold has to be represented in the database in order to be identified. however, with a limited number of possible folds actually used by nature, a complete database of all folds appears to be possible. some information about possible folding classes can be derived from experimental data.
circular dichroism can be used as a crude way of measuring the relative amounts of secondary structure in a protein. classification methods based on amino acid composition can be used for classification of proteins into broad structural classes (zhou et al., 1992; chou and zhang, 1992; dubchak et al., 1993). this information may limit the number of different folds which have to be evaluated. it is also possible that such information may be used to improve the performance of other methods, although data on secondary structure prediction of all-helical proteins seem to indicate that the gain may be small (rost and sander, 1993b). however, for a unique identification of folding class more sensitive methods are needed, and the most useful one is probably some kind of protein sequence library search. in order to identify the folding class we have to search a database of known protein structures with our trial sequence. the problem is that standard methods for sequence retrieval may not be sensitive enough in all cases. if the sequences are similar, then retrieval is trivial. however, we know that there are cases where structures are known to be similar despite very different sequences. how can these cases be identified in a reliable way? the most promising approaches are based on methods for describing the environment of each residue (bowie et al., 1991; eisenberg et al., 1992; overington et al., 1992; ouzounis et al., 1993; wilmanns and eisenberg, 1993; lüthy et al., 1994). this description can be used for generating a profile, showing to what degree each residue is found in a similar environment in other structures, and this profile can be used as a basis for sequence alignment and library searches. similar property profiles can also be used for searching database systems of protein structures (vriend et al., 1994).
a very simple approach can be used if we accept the hypothesis that protein sequences representing structures with a similar linear distribution of secondary structure elements may fold in a similar way. we can then create a sequence type library of known structures where the residues are coded by secondary structure codes rather than residue codes (see fig. 5). given the sequence of a protein with unknown 3-d structure, we can use a secondary structure prediction method and translate the sequence into a secondary structure description. if we define a suitable 'mutation' matrix describing the probability of interconversion between different secondary structure elements, then a standard library search program like fasta (pearson and lipman, 1988; pearson, 1990) can be used in order to identify potential template structures. the example shown in fig. 5 is the identification of hla class i as a suitable candidate for homology modelling of hla class ii. the sequence similarity is very low, 11% sequence identity in the antigen binding region (based on alignment of the structures), and especially for this region most sequence based methods will retrieve a large number of alternative sequences before any of the class i molecules. however, for the secondary structure based approach the hla class i sequences are retrieved as top candidates. the structure prediction did not include any information about the hla class ii structure, which recently has been published (brown et al., 1993).
it should be mentioned that the 11% sequence identity score is not significantly higher than the score from a random alignment of sequences. if we, for each sequence in the swissprot protein sequence library, align it against a sequence selected at random from the same library (alignment without gaps, using the full length of the shortest sequence, and starting the alignment at a random position within the longest sequence), then the average percentage of identical residues is (6 ± 6)% at 3 standard deviations.
the identification method using secondary structure is based on an assumption which has to be examined more closely, and the implementation of it is very crude. much work can be done on the secondary structure prediction, the 'mutation' matrix and the search method. it will probably improve the performance to use a position-dependent gap penalty, where most gaps are placed in loop regions rather than in helices or strands. however, the method is very simple to implement and test, as the necessary tools and data are already available in most labs.
4.2. sequence alignment
as described in the introduction, a crucial feature in molecular evolution has been the parallel exploration of several different mutations. although mechanisms like horizontal gene transfer and intragenic recombination may have been important as key steps in the evolution of new proteins, the most common mechanism seems to have been gene duplication followed by mutational modification (doolittle, 1992). this means that especially multiple sequence alignment can give essential information about the mutation studies already performed by nature. conserved residues are normally conserved because they have an important structural or functional role in the protein, and identification of such residues will thus give vital information about structure and activity of a protein.
several tools have been developed for multiple alignment. a very attractive one is macaw (schuler et al., 1991), which will generate several alternative alignments of a given set of regions, and in a very visual way help the user to identify a reasonable combination of (sub)alignments. an even more general tool is multim (drabløs and [...]). here all possible alignments, based on short motifs, are shown simultaneously, and the user is free to identify potential similarities even in cases with low sequence identity and very dispersed motifs. this is possible because of the superior classification potential of the human brain compared to most automatic approaches. the method includes an option for probability based filtering of motifs, and an example of a multim alignment is shown in fig. 6.
fig. 6. multim alignment. alignment of inositol triphosphate specific phospholipase c β1 from rat (pip1_rat) against three other pip sequences from rat. each horizontal bar represents a sequence, marked in 50 residue intervals. black lines connecting the bars represent well conserved motifs found in all sequences, in this case subsequences of 8 residues where at least 4 residues are completely conserved in all 4 sequences. it is very easy to identify two well conserved regions, annotated as region x and y in the swissprot entries, despite a 400 residue insertion in two of the sequences. this insertion contains sh2 and sh3 domains (pawson, 1992). it is an interesting observation that the extra c-terminal domain of the pip1_rat sequence shows a weak similarity to myosin and tropomyosin sequences.
however, it is important to realise that in standard sequence alignment we are trying to solve a three-dimensional problem (residue interactions) by using an essentially one-dimensional method (alignment of linear protein sequences). as a consequence important conserved through-space interactions may not be evident from a standard sequence alignment. a good example can be found in the alignment of lipases (schrag et al., 1992). in fig. 7, the sequence alignment of residues in a structurally conserved core of three lipases (rhizomucor miehei lipase (derewenda et al., 1992), candida antarctica b lipase (a. jones, personal communication) and human pancreatic lipase (winkler et al., 1990)) is shown. the active site residues, ser (s), asp (d) and his (h), are shown as black boxes. the ser and his residues are at identical positions.
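the motif criterion quoted for the multim alignment in fig. 6 (windows of 8 residues in which at least 4 positions are completely conserved in all sequences) can be sketched in python as below. this is not the multim implementation; it is a brute-force illustration with hypothetical names and toy sequences, and it assumes that "completely conserved" means the same residue at the same offset within the window in every sequence. windows are allowed to start at different positions in different sequences, which is what lets conserved regions be found despite large insertions.

```python
def matching_positions(window_a, window_b):
    """offsets at which two equally long windows carry the same residue."""
    return frozenset(p for p, (a, b) in enumerate(zip(window_a, window_b)) if a == b)


def conserved_motifs(reference, others, width=8, min_conserved=4):
    """return (ref_start, other_starts, conserved_offsets) for every motif hit.

    a hit is one window per sequence such that at least min_conserved
    offsets hold the same residue in the reference and in all other sequences.
    """
    hits = []
    for i in range(len(reference) - width + 1):
        ref_window = reference[i:i + width]
        # partial solutions: (window starts chosen so far, offsets still conserved)
        partial = [((), frozenset(range(width)))]
        for seq in others:
            extended = []
            for starts, conserved in partial:
                for j in range(len(seq) - width + 1):
                    shared = conserved & matching_positions(ref_window, seq[j:j + width])
                    if len(shared) >= min_conserved:
                        extended.append((starts + (j,), shared))
            partial = extended
            if not partial:  # no window in this sequence keeps enough offsets
                break
        hits.extend((i, starts, sorted(shared)) for starts, shared in partial)
    return hits


# toy sequences (hypothetical) sharing one motif at different positions
ref = "TTTTGDSGGPLVTTTT"
seq2 = "CCGDSAGPIVCCCCCC"
seq3 = "WWWWWWWGDTGGPMVW"
for hit in conserved_motifs(ref, [seq2, seq3]):
    print(hit)
```

the search cost grows quickly with the number and length of the sequences, so a practical tool would need indexing or filtering of candidate motifs; the sketch is only meant to make the criterion concrete.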
however, the asp residue of the pancreatic lipase is at a very different sequence position compared to the other two lipases. it would be very difficult to identify this as the active site asp from a sequence alignment. if we look at the structural alignment in fig. 8, we see that the positions are structurally equivalent: it is possible for all three lipases to have a highly similar relative orientation of the active site atoms, despite the fact that the alternative asp positions are located at the end of two different β-strands. an improved alignment may be generated if we can incorporate 3-d data for at least one of the sequences in the linear alignment (gracy et al., 1993). however, in order to get a reliable alignment of sequences with low sequence similarity, we have to take true three-dimensional effects into account. this means that if we are able to identify a known 3-d structure as a potential basis for modelling, then the sequence alignment should be done in 3-d using this structure as a template. this can be done by threading the sequence through the structure and calculating pairwise interactions (jones et al., 1992; bryant and lawrence, 1993).
fig. 7. sequence alignment of lipases. alignment of structurally conserved regions of three lipases. for each lipase the solvent accessible surface in % compared to the gxg standard state (grey scale, white is buried and black is exposed), the secondary structure as defined by the dssp program, and the sequence is shown. the position of each subsequence in the full sequence is also shown. the active site residues are shown in white on black. please observe the shift of the active site asp (d) between two very different positions. the alignment was generated using alscript (barton, 1993).
fig. 8. [...] in fig. 6, including parts of the sequences connecting the core regions. the active site asp is able to maintain a similar relative orientation, despite very different sequence positions. the alignment was generated using insight (biosym technologies).
as soon as a template has been identified, and an alignment between this template and a sequence has been defined, a 3-d model of the protein can be generated. we can either use the template coordinates directly, combined with different modelling approaches for the ill-defined regions, or the template can be used as a more general basis for folding the protein by distance geometry (srinivasan et al., 1993) or general molecular dynamics methods. loop regions are often highly variable, and must be treated with special approaches (topham et al., 1993). it is also necessary to consider the orientation of side chains. although the backbone may be well conserved, many residues, especially at the protein surface, will be mutated, as shown in fig. 9. the stability of a protein depends upon an optimal packing of residues, and it is important to optimise side chain conformation if we want to study protein stability and complex formation. a very common approach is the use of rotamer libraries combined with molecular dynamics refinement. recent studies show that this step of the modelling in fact may be less difficult than has been assumed.
a protein model based on homology (or similarity) has to be verified in as many ways as possible, and experimental methods should always be preferred. mutation studies may give valuable information about active site residues and important interactions, and exposed regions may to some degree be identified by using antibodies. however, in many cases the rationale for modelling by homology is the very lack of experimental data related to structure, and we have to use other more general methods for evaluation of models. some of the approaches we already have described for sequence alignment can obviously also be used for evaluation of models. in general, model evaluation can be based on 3-d profiles (lüthy et al., 1992), contact profiles (ouzounis et al., 1993) or more general energy potentials (hendlich et al., 1990; jones et al., 1992; nishikawa and matsuo, 1993). some of these approaches have been implemented as programs for evaluation of structures or models, like procheck (laskowski et al., 1993) and prosa (sippl, 1993a, b). however, in general no model (or even experimental structure) should be trusted beyond what can be verified by experimental methods.
a prerequisite for rational protein engineering is 3-d structure information about the protein. in addition to x-ray crystallography, nmr is the most important method for protein structure determination. x-ray crystallography has several advantages when compared to nmr. solving the crystal structure by x-ray crystallography is usually fast as soon as good crystals of the protein are obtained (even if it may not be so easy to obtain these crystals). it is also possible to determine the structure of very big proteins. the major disadvantage of x-ray crystallography is that it is the crystal structure that is determined. this implies that crystal contacts may distort the structure (chazin et al., 1988; wagner et al., 1987). since active sites and other binding sites usually are located on the surface of the proteins, very important regions of the protein may be distorted. some structures even show large differences between nmr and x-ray structure (frey et al., 1985; klevit and waygood, 1986). the advantage of nmr is that it is dealing with protein molecules in solution, usually in an environment not too different from their natural one. it is possible to study the protein and the dynamical aspects of its interaction with other molecules like substrates, inhibitors, etc. it is also possible to obtain information about apparent pka values, hydrogen exchange rates, hydrogen bonding and conformational changes. all nuclei contain protons, and therefore they carry charge.
some nuclei also possess a nuclear spin. this creates a magnetic dipole, and the nuclei will be oriented with respect to an external magnetic field. the most commonly studied nuclei in protein nmr (1h, 13c and 15n) have two possible orientations, representing high and low energy states. the frequency of the transition between the two orientations is proportional to the magnetic field. at a magnetic field of 11.7 tesla the energy difference corresponds to about 500 mhz for protons. in an undisturbed system there will be an equilibrium population of the possible orientations, with a small difference in spin population between the high and low energy orientation. the equilibrium population can be perturbed by a radio frequency pulse of a frequency at or close to the transition frequency. in addition, the spins will be brought into phase coherence (concerted motion) and a detectable magnetisation will be created. the intensity of the nmr signal is proportional to the population difference between the levels the nuclei can possess. nuclei of the same type in different chemical and structural environments will experience different magnetic fields due to shielding from electrons. the shielding effect leads to different resonance frequencies for nuclei of the same type. the effect is measured as a difference in resonance frequency (in parts per million, ppm) between the nuclei of interest and a reference substance, and this is called the chemical shift. in molecules with low internal symmetry most atoms will experience different amounts of shielding, the resonance signals will be distributed over a well-defined range, and we get a typical nmr spectrum. the process that brings the magnetisation back to equilibrium may be divided into two parts, longitudinal and transverse relaxation. the longitudinal or t 1 relaxation describes the time it takes to reach the equilibrium population. 
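the relation between field strength and transition frequency quoted above (11.7 tesla, about 500 mhz for protons) and the smallness of the equilibrium population difference can be checked numerically. the following is a minimal sketch, not part of the original text: the function names are illustrative, the constants are standard si values, and the temperature of 298 k is an assumption.

```python
import math

# Physical constants in SI units (standard CODATA values).
GAMMA_1H = 2.6752218744e8    # proton gyromagnetic ratio, rad s^-1 T^-1
PLANCK_H = 6.62607015e-34    # Planck constant, J s
BOLTZMANN_K = 1.380649e-23   # Boltzmann constant, J K^-1


def larmor_frequency_hz(b0_tesla, gamma=GAMMA_1H):
    """Transition frequency in Hz, proportional to the field: gamma * B0 / (2 pi)."""
    return gamma * b0_tesla / (2.0 * math.pi)


def population_excess(b0_tesla, temperature_k, gamma=GAMMA_1H):
    """Fractional excess (N_low - N_high) / (N_low + N_high) for a spin-1/2 nucleus."""
    delta_e = PLANCK_H * larmor_frequency_hz(b0_tesla, gamma)
    return math.tanh(delta_e / (2.0 * BOLTZMANN_K * temperature_k))


def chemical_shift_ppm(nu_sample_hz, nu_reference_hz):
    """Chemical shift: frequency offset from the reference in parts per million."""
    return (nu_sample_hz - nu_reference_hz) / nu_reference_hz * 1e6


if __name__ == "__main__":
    nu = larmor_frequency_hz(11.7)
    print(f"1H frequency at 11.7 T: {nu / 1e6:.1f} MHz")
    # 298 K is an assumed sample temperature.
    print(f"population excess at 298 K: {population_excess(11.7, 298.0):.2e}")
    print(f"line 1000 Hz from reference: {chemical_shift_ppm(nu + 1000.0, nu):.2f} ppm")
```

at 11.7 tesla the proton frequency comes out just under 500 mhz, and the population excess is only a few spins in 100,000, which is why the nmr signal is intrinsically weak.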
the transverse or t 2 relaxation describes the time it takes before the induced phase coherence is lost. for macromolecules the t 2 relaxation is always shorter than the t 1 relaxation. short t 2 relaxation leads to broad signals because of poor definition of the chemical shift. most molecules have dipoles with magnetic moment, and the most important cause of relaxation is fluctuation of the magnetic field caused by the brownian motion of molecular dipoles in the solution. how effectively a dipole may relax the signal depends upon the size of the magnetic moment, the distance to the dipole, and the frequency distribution of the fluctuating dipoles. a nucleus may also detect the presence of nearby nuclei (less than three bonds apart), and this will split the nmr signal from the nucleus into more components. several nuclei in a coupling network are called a spin system. by applying radio frequency pulses it is possible to create and transfer magnetisation to different nuclei. it is, as an example, possible to create magnetisation at one nucleus, and transfer the magnetisation through bonds to other nuclei where it may be detected. the pulses are applied in a so-called pulse sequence (ernst, 1992; kessler et al., 1988). the methodology for determination of protein structure by two-dimensional nmr is described in several textbooks and review papers (wagner, 1990; wüthrich, 1986; wider et al., 1984). the standard method is based on two steps: sequential assignment (assignment of resonances from individual amino acids) and distance information (assignment of distance correlated peaks between different amino acids). the first step involves acquiring coupling correlated spectra (cosy, tocsy) in deuterium oxide to determine the spin system of correlated resonances. some amino acids have spin systems that in most cases make them easy to identify (gly, ala, thr, ile, val, leu).
the other amino acids have to be grouped into several classes, due to identical spin systems, even though they are chemically different. the spin systems can be correlated to the nh proton by acquiring cosy and tocsy spectra in water. the assigned nh resonance is then used in distance correlated spectra (noesy) to assign correlations to protons (nh, hα, hβ) at the previous amino acid residue (fig. 10). by combining the knowledge of the primary sequence (which gives the spin system order) with the nmr data collected it is possible to complete the sequential assignment. when the sequential assignment is done, the assignment of short range noe (up to four residues) will give information about secondary structure (α-helix, β-strand). long-range correlations will serve as constraints (together with scalar couplings) to determine the tertiary structure of the protein. excellent procedures describing these steps are available (roberts, 1993; wüthrich, 1986). with large proteins there will be spectral overlap of resonance lines. the problem is partially solved by labelling the protein with 13c and 15n isotopes. triple resonance multidimensional nmr methods (griesinger et al., 1989; kay et al., 1990) may then be applied. the resonances will then be spread out in two more dimensions (13c and 15n) and the problem with overlap is reduced. these methods depend upon the use of scalar couplings to perform the sequential assignment, so the sequential assignment procedure will be less prone to error. the noesy spectra of such large proteins are often very crowded, but four-dimensional experiments like the 13c-13c edited noesy spectrum (clore et al., 1991b) have been designed. such experiments will spread the proton-proton distance correlated peaks by the chemical shift of their corresponding 13c neighbours and reduce the spectral overlap.
secondary structure elements may also be predicted from the chemical shift of 1h and 13c (spera and bax, 1991; williamson and asakura, 1991; wishart et al., 1992). obtaining nmr spectra of proteins has some aspects that should be considered. spectral overlap: as we move to larger proteins the probability of overlap of resonance lines increases. at some point it will become impossible to do sequential assignment due to this overlap. application of 3-d and 4-d multiresonance nmr has made it possible to assign proteins in the 30 kda range (foght et al., 1994; stockman et al., 1992). fast relaxation: as the size of the protein is increased the rate of tumbling in solution is reduced. this leads to a reduced transverse relaxation time (t2), and broadening of the resonance lines in the nmr spectra. the intensities of the peaks are reduced and they may be difficult to detect. the short transverse relaxation time will also limit the length of the pulse sequences it is possible to apply (because there will be no phase coherence left), and multidimensional methods become difficult. the proteins for which it is possible to determine a 3-d structure by nmr or x-ray crystallography are probably a subset of all proteins (wagner, 1993). proteins may have regions with mobility and few cross peaks. the effective size of a protein is often increased by aggregation. the amount of aggregation can often be reduced by reducing the protein concentration. thus, very often the degree of aggregation will determine whether it is possible to assign and solve a protein structure by nmr, by limiting the maximum concentration that may be used. the stability of the proteins is also a major issue. a sample may be left in solution for days, often at elevated temperatures, so denaturation may become a problem.
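the link between a short transverse relaxation time and broad resonance lines can be made quantitative. the sketch below assumes a lorentzian line shape, for which the full width at half maximum is 1/(π t2); this relation is standard nmr theory and is not stated in the text above, and the t2 values used are purely hypothetical.

```python
import math


def lorentzian_linewidth_hz(t2_seconds):
    """Full width at half maximum (Hz) of a Lorentzian line: 1 / (pi * T2)."""
    if t2_seconds <= 0.0:
        raise ValueError("T2 must be positive")
    return 1.0 / (math.pi * t2_seconds)


if __name__ == "__main__":
    # Hypothetical T2 values, chosen only to show the trend:
    # slower tumbling (larger protein) -> shorter T2 -> broader line.
    for t2 in (0.200, 0.050, 0.010):
        width = lorentzian_linewidth_hz(t2)
        print(f"T2 = {t2 * 1e3:5.0f} ms -> linewidth = {width:5.1f} Hz")
```

a tenfold reduction in t2 gives a tenfold broader line, so peak height drops accordingly at constant integrated intensity.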
photo-cidnp (chemically induced dynamic nuclear polarisation) is an interesting technique for the study of surface positioned aromatic residues in proteins (broadhurst et al., 1991; cassels et al., 1978; hore and kaptein, 1983; scheffler et al., 1985). by introducing a dye and exciting it with a laser, it is possible to transfer magnetisation to aromatic residues, where it can be observed. in addition to high-resolution nmr, solid state nmr has also been applied to studies of proteins. studies of active sites and conformation of bound inhibitors yield interesting information. the stability of proteins may be monitored under different conditions by detecting signals from transition intermediates bound to the active site (burke et al., 1992; gregory et al., 1993). structural constraints on transition state conformation of bound inhibitors can be obtained (auger et al., 1993; christensen and schaefer, 1993). structural constraints of the fold and conformation of the amino acid sequence may be gathered by setting upper and lower distances for lengths between specific amino acids (mcdowell et al., 1993). using solid-state nmr it is also possible to study membrane proteins and their orientation with respect to their membrane (killian et al., 1992; ulrich et al., 1992). we expect such studies to give insight into ion channels in membranes (woolley and wallace, 1992). an important mechanism for relaxation in high-resolution nmr is dipolar relaxation. usually this is induced by the spin of nuclei in the immediate vicinity, and it is a function of the size of the dipole. the electron is also a magnetic dipole, and the magnitude of this dipole is about 700 times that of a proton. paramagnetic compounds have an unpaired electron that will interact with nearby protons and increase the relaxation rate of these protons.
fig. 11. outline of the paramagnetic relaxation method. the protons located at the protein surface will be closer to the dissolved paramagnetic relaxation agent than the protons located inside the protein core, hence the resonance lines from protons at the surface will be broadened more than resonance lines stemming from protons located inside the protein.
the widest use of paramagnetic compounds has been of gd3+ bound to specific sites in a protein, but other compounds have also been used (chang et al., 1990; hernandez et al., 1990a, b). this will make it possible to identify resonance lines from residues in the vicinity of the binding site. it is also possible to calculate distances from the paramagnetic atom as the relaxation effect is distance dependent. the paramagnetic broadening effect can also be used with a compound moving freely in solution (drayney and kingsbury, 1981; esposito et al., 1992; petros et al., 1990). in this way residues located on or close to the protein surface will give broadened resonance lines compared to residues in the interior of the protein. this method can be used to measure important noe and chemical shifts inside the protein directly, or it can be used as a difference method to identify resonances at the surface by comparing spectra acquired with and without the paramagnetic relaxation agent (fig. 11). we have used the paramagnetic compound gadolinium diethylenetriamine pentaacetic acid (gd-dtpa) as a relaxation agent. gd-dtpa will increase both the longitudinal and the transverse relaxation rates of protons within the influence sphere. suitable nmr experiments to highlight the relaxation effect may be noesy, roesy and tocsy (bax and davis, 1985; braunschweiler and ernst, 1983). gd-dtpa is widely used in magnetic resonance imaging (mri) to enhance tissue contrast. it is assumed to be non-toxic and we do not expect it to bind to proteins. we used the well-studied protein hen egg-white lysozyme as a test protein.
both the structure and the nmr spectra of this protein are known (diamond, 1974; redfield and dobson, 1988), and the protein is extremely well suited for nmr experiments. in fig. 12, the 1-d 1h-nmr spectrum recorded in the presence and absence of gd-dtpa is shown. although it is evident that there is a selective broadening in the 1-d spectrum, it is also clear that there are problems with overlapping spectral lines. we therefore applied two-dimensional nmr methods, and shown in fig. 13 is the low field region of a noesy spectrum of lysozyme. the region corresponds to the same region as shown in fig. 12. from fig. 13 we see that the signals from w62, w63 and w123 disappear with addition of gd-dtpa, while the signals from w28, w108 and w111 are still observable. by examination of the solvent accessible surface of lysozyme it is evident that the indole nh of w62, w63 and w123 is exposed to solvent, while the indole nh of w28, w108 and w111 is not exposed. this shows that the changes in the spectrum are as expected from the structure data. the appearance of the nh-nh region of the spectrum (fig. 14) also shows the reduction in the number of signals in the gd-dtpa exposed spectrum. this shows that the paramagnetic broadening effect can be used for selective identification of signals from solvent exposed residues in a protein.
one of the fundamental steps in the protein engineering process shown in fig. 1 is the design step, where a correlation between structure and properties is established in order to select potential structural candidates that match new functional profiles. the understanding of this correlation implies a realistic modelling of the physical chemical properties involved in the functional features to be engineered. these features are basically of two types: diffusional and catalytic. any ligand binding to a protein, whether ligand-receptor or substrate-enzyme, is essentially a diffusional encounter of two molecules.
electrostatic interactions are the strongest long-range forces at the molecular scale and, thus, it is not surprising that they are one of the determinant effects in the final part of the encounter (berg and von hippel, 1985). in the case of substrate-enzyme interactions the catalytic step that follows the binding of the substrate seems to be made possible mainly by the presence of electrostatic forces that stabilise the reaction intermediates in the binding site (warshel et al., 1989), from which the product formation may proceed. another and much more basic necessary condition for a successfully engineered protein is that a functional folded conformation is maintained. solvation of charged groups is one of the determinants in protein folding (dill, 1990), so that even the conformation of the protein is electrostatically driven. given the ubiquitous role of electrostatic interactions, it is then obvious that their accurate modelling is an essential prerequisite in the design of engineered proteins. several good reviews exist on protein electrostatics (warshel and russell, 1984; matthew, 1985; rogers, 1986; harvey, 1989; davis and mccammon, 1990; sharp and honig, 1990). this section intends to give a brief overview of the subject. we start by presenting the methods one can use to model electrostatic interactions. the most familiar methodology in biomolecular modelling is certainly molecular mechanics (mm) (either through energy minimisations or molecular dynamics (md)). we point out some of the limitations of mm in the treatment of electrostatic interactions, and the need to use alternative ways of describing the system, such as continuum methods. the computation of ph-dependent properties and some potential extensions of mm are also discussed. finally, we refer to some applications of electrostatic methods relevant to protein engineering.
in mm simulations, electrostatic interactions are usually described with a pairwise coulombic term of the form $q_1 q_2 / (d\,r_{12})$, where $q_1$ and $q_2$ are the charges of the pair of atoms, $r_{12}$ their distance, and $d$ the dielectric constant. $d$ is usually set equal to 1 when the solvent is included. a complete simulation in a sufficiently big box with water molecules should, in principle, give a realistic description of the protein molecule (harvey, 1989). this would be especially true if a force field including electronic polarizability effects (see 6.3.) were available for use with biomolecular systems, which unfortunately is not the case (harvey, 1989; davis and mccammon, 1990). we use the term force field in this context as including both the functional form and parameters describing the energetics of the system, from which the forces are derived. simulations where solvent molecules are not treated explicitly are naturally appealing, since the computation time increases with the square of the number of atoms. several methods have been proposed that attempt to account for solvent effects. the most popular approach is an ad hoc dielectric 'constant' proportional to the distance (e.g., mccammon and harvey, 1987), but different distance dependencies can be used (e.g., solmajer and mehler, 1991). a variety of more elaborate methods were also suggested (northrup et al., 1981; still et al., 1990; gilson and honig, 1991). all these methods should be viewed as attempts to include solvent screening effects in a simplified way. they can be useful when inclusion of water is computationally prohibitive, but they cannot substitute for an explicit inclusion of solvent since, e.g., the existence of hydrogen bonding with the solvent is not properly described by these approaches. mm of biomolecules has, in general, heavy computation needs.
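the pairwise coulombic term and the ad hoc distance-dependent dielectric described above can be illustrated with a short sketch. it is not taken from any particular force field: the function names and the two-charge example are hypothetical, charges are in units of the proton charge, distances in ångström, and the conversion factor of about 332 kcal Å mol^-1 e^-2 is the usual value of the coulomb constant in these units.

```python
import math

# Coulomb constant e^2 / (4 pi eps0) in kcal * angstrom / (mol * e^2).
COULOMB_KCAL = 332.0637


def constant_dielectric(d=1.0):
    """Dielectric 'constant' independent of distance (d = 1 with explicit solvent)."""
    return lambda r: d


def distance_dependent_dielectric(slope=1.0):
    """Ad hoc dielectric proportional to the distance r (in angstrom)."""
    return lambda r: slope * r


def coulomb_energy(atoms, dielectric, cutoff=None):
    """Sum q_i q_j / (d r_ij) over all atom pairs.

    atoms is a list of (charge, (x, y, z)) tuples.
    Returns (energy in kcal/mol, number of pairs evaluated).
    """
    energy = 0.0
    pairs = 0
    for i in range(len(atoms)):
        q_i, pos_i = atoms[i]
        for j in range(i + 1, len(atoms)):
            q_j, pos_j = atoms[j]
            r = math.dist(pos_i, pos_j)
            if cutoff is not None and r > cutoff:
                continue  # pair skipped: this is what a cut-off distance does
            energy += COULOMB_KCAL * q_i * q_j / (dielectric(r) * r)
            pairs += 1
    return energy, pairs


if __name__ == "__main__":
    # Hypothetical ion pair, 3 angstrom apart.
    ion_pair = [(+1.0, (0.0, 0.0, 0.0)), (-1.0, (3.0, 0.0, 0.0))]
    print(coulomb_energy(ion_pair, constant_dielectric(1.0)))
    print(coulomb_energy(ion_pair, distance_dependent_dielectric(1.0)))
```

the double loop visits n(n-1)/2 pairs, which is the quadratic growth in computation time mentioned above; with the distance-dependent dielectric the interaction falls off as 1/r² instead of 1/r, mimicking solvent screening without any solvent molecules.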
the number of water molecules that should be included in order to simulate a typical protein in a realistic way is quite large, especially if one wants to perform md. also, each pair of atoms has its own electrostatic interaction and the number of pairs cannot be lowered by a short cut-off distance (e.g., 7 Å) as in van der waals interactions, since electrostatic interactions are very long range, typically up to 10 Å. mm simulations also have some limitations on the description of the system, since ph and ionic strength effects usually are difficult or impossible to include. the only way to include ph effects is through the protonation state of the residues. each titrable group (in asp, glu, his, tyr, lys, arg, c- and n-terminal) in the protein has two states, protonated or unprotonated. thus, a protein with $n$ titrable groups will have $2^n$ possible protonation charge sets. the best we can do is to choose the set corresponding to the protonation states of model compounds at the desired ph. free ions can be included in md simulations of proteins (levitt, 1989; mark et al., 1991), but it is not clear if the simulated time intervals are long enough to realistically reflect ionic strength effects. another problem with mm is that the understanding it provides of the system (through energy minimisation or md) does not include entropic aspects explicitly, i.e., it does not give free energies directly. there are methods to calculate free energies based on mm potentials (beveridge and dicapua, 1989), but even though several applications have been made on biomolecular systems (for a review see beveridge and dicapua, 1989), they are still too demanding for routine use in systems of this size. thus, when the properties under study are related to free energies rather than energies (which is often the case), mm by itself can only be seen as a first approach.
in summary, although mm simulations can provide some unique information on the structural and dynamical behaviour of biomolecular systems, some limitations exist due to both conceptual and practical reasons, in particular regarding the treatment of electrostatic interactions. fortunately, other methods exist that can provide insight on aspects whose modelling is poor or absent in mm simulations, although at the cost of the atomic detail in the description. there is no 'best' modelling method and we should resort to the several methods available in order to gain an understanding of the system that is as complete as possible. the so-called continuum or macroscopic models assume that electrostatic laws are valid at the protein molecular level and that macroscopic concepts such as dielectric properties are applicable. protein and solvent are treated as dielectric materials where charges are located. these charges may be titrable groups (whose protonation state may vary), permanent ions (structural and bound ions, etc.) or, more recently, permanent partial charges of polar groups. given the dielectric description of the system and the placement of the charges, the problem can be reduced to the solution of the poisson equation (or any equivalent formulation), as can any problem of electrostatics (e.g., jackson, 1975) . the electrostatic potential thus obtained can be used to study diffusional processes or visually compare different molecules (see 6.6.). the simplest continuum model assumes the same dielectric constant inside and outside the protein. typically, a value somewhere between the protein and solvent dielectric constants has been used (sheridan and allen, 1980; koppenol and margoliash, 1982; hol, 1985) . this approach completely ignores the effects of having two very different dielectric regions, but can be used for a first qualitative computation. 
the more common continuum models treat the protein as a low dielectric cavity immersed in a high dielectric medium, the solvent. the way the charges are placed in this cavity and the way the electrostatic problem is solved vary with the particular method. analytical solutions can be obtained for the simplest shapes, such as spheres, but in general the more complex shapes require numerical techniques. in the first cavity model the protein was assumed to be a sphere with the charge uniformly distributed over its surface (linderstrøm-lang, 1924). tanford and kirkwood (1957) proposed a more detailed model in which each charge has a fixed position below the surface. assuming a spherical geometry allows for a simple solution to the electrostatic problem. it is even possible to include an ionic atmosphere that accounts for ionic strength effects (leading to the poisson-boltzmann equation). the effect of ph occurs naturally in the formalism. the energy cost of burying a charge inside the low-dielectric protein (self-energy) is taken to be the same as in small model compounds, since at the time when this method was developed (before protein crystallography) charges were believed to be restricted to the protein surface. this limits the method to proteins without buried charges, unless we have some estimate of the self-energy. there are, obviously, some problems in fitting real, irregular-shaped proteins to a spherical model. some solutions to this problem were proposed, including an ad hoc scaling of interactions based on solvent accessibility (shire et al., 1974), and the placing of more exposed charges in the solvent region (states and karplus, 1987). the inclusion of non-spherical geometries implies the use of numerical techniques, as mentioned above. warwicker and watson (1982) and used the finite differences technique to solve, respectively, the poisson and poisson-boltzmann equations.
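the finite differences idea can be illustrated on a deliberately small problem. the sketch below is not the method of the cited authors: it solves a one-dimensional poisson equation, d/dx(ε dφ/dx) = -ρ, in reduced units on a regular grid with the potential fixed at zero at both ends, using gauss-seidel relaxation. the function name, the grid, and the dielectric values of 4 (cavity) and 80 (solvent) are assumptions chosen for illustration only.

```python
def solve_poisson_1d(eps, rho, h, tol=1e-12, max_iter=100000):
    """Gauss-Seidel solution of d/dx(eps dphi/dx) = -rho on a regular grid.

    eps and rho are lists of grid values, h is the grid spacing, and the
    potential is held at zero on the first and last grid points.
    """
    n = len(eps)
    if len(rho) != n:
        raise ValueError("eps and rho must have the same length")
    phi = [0.0] * n
    for _ in range(max_iter):
        max_change = 0.0
        for i in range(1, n - 1):
            # Dielectric values on the two half-grid points around i.
            eps_w = 0.5 * (eps[i - 1] + eps[i])
            eps_e = 0.5 * (eps[i] + eps[i + 1])
            new = (eps_w * phi[i - 1] + eps_e * phi[i + 1]
                   + rho[i] * h * h) / (eps_w + eps_e)
            max_change = max(max_change, abs(new - phi[i]))
            phi[i] = new
        if max_change < tol:
            break
    return phi


if __name__ == "__main__":
    n, h = 41, 0.05
    # Low-dielectric 'cavity' in the middle of a high-dielectric 'solvent'.
    eps = [4.0 if 15 <= i <= 25 else 80.0 for i in range(n)]
    rho = [0.0] * n
    rho[20] = 1.0 / h  # unit charge concentrated on the central grid point
    phi = solve_poisson_1d(eps, rho, h)
    print("potential at the charge:", phi[20])
    print("potential at the cavity boundary:", phi[15])
```

because the dielectric value is simply stored per grid point, a spatially dependent dielectric costs nothing extra in this formulation, which is the flexibility the finite differences approach offers for irregular shapes.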
self-energies can be included (gilson and honig, 1988), such that the method is fully applicable when buried charges exist. the intrinsic discretization of the system in the finite differences technique makes these methods readily applicable to any kind of spatial dependency of any of the properties involved. the inclusion of a spatially-dependent dielectric constant, for instance, will be relatively simple. other extensions such as additional dielectric regions (ligands, membranes, etc.), possibly with charges, should also be possible. alternative numerical techniques for solving the poisson or poisson-boltzmann equations have also been used, including finite elements (orttung, 1977) and boundary elements (zauhar and morgan, 1985). the dielectric constant in a region comes from the existence of dipoles in that region, permanent or induced. permanent dipoles are due to atomic partial charges (e.g., water dipole, peptide bond dipole). induced dipoles are due to the polarizability of electron clouds. warshel and levitt (1976) represented this electronic polarizability by using point dipoles in the atoms. as pointed out by davis and mccammon (1990) this representation is roughly equivalent to a spatially-dependent dielectric constant. this approach is usually combined with a simplified representation of water by a grid of dipoles (warshel and russell, 1984). ionic strength and ph effects are not considered. all the above methods deal with a particular charge set (see 6.1.), even when ph effects are considered. however, a protein in solution does not exist in a single charge set. we are usually interested in the properties of a protein at a given ph and ionic strength, not at a particular charge set. moreover, if we want to test the available methods, we have to test them against experimental results which usually do not correspond to a specific charge set.
a common test of the accuracy of electrostatic models is their ability to predict pka values of titrable groups in a protein (see 6.6.), obtained via titrations, nmr, etc. these values can be quite different from the ones of model compounds, due to the environment of the groups in the protein. this difference (pka shift) can be of several pk units. the experimentally determined apparent pka (pkapp) is defined as the ph value at which half of the groups of that residue are protonated in the protein solution, i.e., when its mean charge is 1/2 (thus, the equivalent notation pk1/2). then, if we can devise a method to compute the mean charge of the titrable groups at several ph values, we can predict their pkapp values. as mentioned above (see 6.1.), we have $2^n$ possible charge sets. any structural property can, in principle, be computed through a boltzmann sum over all those sets, with each one contributing according to its free energy (taken as the electrostatic energy) (tanford and kirkwood, 1957; bashford and karplus, 1990). the property thus computed is characteristic of the chosen ph value (and ionic strength, if considered) instead of a specific charge set. we are particularly interested in computing the mean charges at a given ph (see last paragraph). a sum with $2^n$ terms is not, however, a trivial calculation in terms of computer time. tanford and roxby (1972) avoided the boltzmann sum by placing the mean charges directly on the titrable groups, instead of using one of the integer sets. this corresponds to considering the titration of the different groups as independent (a mean field approximation; bashford and karplus, 1991). other alternatives to the boltzmann sum are the monte carlo method (beroza et al., 1991), less drastic mean field approximations (yang et al., 1993; gilson, 1993), the 'reduced site' approximation (bashford and karplus, 1991), or even the assumption that the predominant charge set is enough to describe the system (gilson, 1993).
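for a handful of titrable groups the full boltzmann sum over all charge sets can be written out directly. the sketch below is an illustration under stated assumptions, not the implementation of any of the cited methods: each site has a hypothetical intrinsic pka, a charge in its unprotonated form (-1 for an acid, 0 for a base), and pairwise charge-charge interaction energies w[i][j] given in units of kt. the free energy of a charge set is taken, up to an additive constant, as the sum of ln(10)(ph - pka) over protonated sites plus the pairwise interactions between site charges.

```python
import itertools
import math

LN10 = math.log(10.0)


def state_free_energy(state, ph, pk_int, q_unprot, w):
    """Free energy of one protonation state in units of kT (up to a constant)."""
    n = len(state)
    g = sum(x * LN10 * (ph - pk) for x, pk in zip(state, pk_int))
    charges = [q0 + x for q0, x in zip(q_unprot, state)]
    for i in range(n):
        for j in range(i + 1, n):
            g += w[i][j] * charges[i] * charges[j]
    return g


def mean_protonation(ph, pk_int, q_unprot, w):
    """Boltzmann average of the protonation of every site over all 2^n states."""
    n = len(pk_int)
    z = 0.0
    sums = [0.0] * n
    for state in itertools.product((0, 1), repeat=n):
        weight = math.exp(-state_free_energy(state, ph, pk_int, q_unprot, w))
        z += weight
        for i, x in enumerate(state):
            sums[i] += x * weight
    return [s / z for s in sums]


def pk_half(site, pk_int, q_unprot, w, lo=0.0, hi=14.0):
    """pH at which the given site is half protonated, found by bisection."""
    def f(ph):
        return mean_protonation(ph, pk_int, q_unprot, w)[site] - 0.5
    if f(lo) < 0.0 or f(hi) > 0.0:
        raise ValueError("half-protonation point not bracketed by [lo, hi]")
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if f(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    # Two hypothetical acidic groups, intrinsic pKa 4.0, repelling each
    # other by 2 kT when both are charged.
    pk_int = [4.0, 4.0]
    q_unprot = [-1.0, -1.0]
    w = [[0.0, 2.0], [2.0, 0.0]]
    print("mean protonation at pH 4:", mean_protonation(4.0, pk_int, q_unprot, w))
    print("pK1/2 of site 0:", pk_half(0, pk_int, q_unprot, w))
```

the enumeration doubles in cost with every added site, which is why the approximations listed above are needed for real proteins. in the two-site example the repulsion between the charged forms shifts pk1/2 upward from the intrinsic value, a small-scale version of the pka shifts discussed in the text.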
since electrostatic interactions in proteins are typically dominated by titrable groups whose charge is affected by ph, no electrostatic treatment can be complete without taking this effect into account. a simple, although effective, way of doing this is to: (i) compute the electrostatic free energies (e.g., by a continuum method); (ii) compute the mean charge of each titrable group at a given ph (e.g., by a mean field approximation); (iii) use those charges to compute the electrostatic potential (e.g., by a continuum method), which can be displayed together with the protein structure (see the human pancreatic lipase example in section 6.6.). in this way a ph-dependent electrostatic model of the protein can be obtained, which is not possible with usual mm-based modelling techniques. as stated above (see 6.1.), electronic polarizability is not explicitly considered in common force fields. van belle et al. (1987) included the induced dipole formalism (warshel and levitt, 1976) in mm calculations. the electrostatic interactions in the applied force field were simply 'corrected' with additional terms due to inducible dipoles. however, it should be noted that a force field fitted to experimental data without polarizability terms should be fitted again if those terms are included. the protein conformation used in molecular modelling is usually an experimentally based (x-ray, nmr) mean conformation, characteristic of those particular experimental conditions. that conformation may, however, be inadequate for modelling the protein properties at different conditions. in particular, proteins are known to denature at extreme ph conditions. thus, ph-dependent methods such as the continuum methods may give incorrect results when using one single conformation over the whole ph range. indeed, md simulations have shown that the results can be highly dependent on side chain conformation (wendoloski and matthew, 1989).
although overall properties like titration curves did not seem to be very sensitive, individual pka values showed variations of up to 2.0 pk units. as mentioned in section 6.1, mm has the problem of what charge set to use in simulations. instead of using a charge set corresponding to model compounds at the intended ph, one may use the predominant charge set of the protein, determined, e.g., by a continuum method, as suggested by gilson (1993). a different approach to this problem would be to devise a way of including the averaged effect of all charge sets in the mm simulation. we have recently developed a method where a force field is derived which includes the proper averaged effect of all charge sets (a potential of mean force) (to be published). the method depends on the calculation of electrostatic free energies obtained from, e.g., a continuum method. the electrostatic potential, computed in some of the methods referred to above, can help to understand the contribution of electrostatic interactions in the diffusional encounters of proteins with ligands (substrates or not). the diffusional process driven by the electrostatic field can be simulated through brownian dynamics (bd) and diffusion rates may be computed (for references see, e.g., davis and mccammon, 1990). the effect of mutations on the diffusion of superoxide ion into the active site of superoxide dismutase has been studied by this technique (sines et al., 1990) and faster mutants showing a 2-3-fold increase in reaction rate could be designed (getzoff et al., 1992), although this enzyme usually is considered to be 'perfect'. electrostatically driven bd simulations can help to reveal steric 'bottlenecks' (reynolds et al., 1990) and orientational effects (luty et al., 1993). this method can also be applied to study the encounter of two proteins (northrup et al., 1988). visual comparison of electrostatic fields can also provide useful information. soman et al.
(1989) showed that rat and cow trypsins have similar electrostatic potentials near the active site, despite a total charge difference of 12.5 units. as an illustration of this type of comparison, using ph-dependent electrostatics, we have applied the solvent accessibility-modified tanford-kirkwood method (see 6.2.) to the human pancreatic lipase structures with both closed (van tilbeurgh et al., 1992) and open lid (van tilbeurgh et al., 1993), as shown in fig. 15a and b. fig. 15c-f shows surfaces corresponding to an electrostatic potential equal to +1.0 kt/e (where k is the boltzmann constant, t the absolute temperature and e the proton charge). these surfaces correspond to regions where the electrostatic interactions on a charge are roughly of the same magnitude as the thermal effects due to the surrounding solvent, i.e., where charged molecules in solution start to feel electrostatic steering or repulsion. at ph 7 clear differences exist between the closed and open forms, the latter showing a dipolar groove in the presumed binding site region. at ph 4 the molecule is strongly positively charged and most electrostatically differentiated regions have disappeared. given the role of electrostatic interactions in molecular orientation and association (see the beginning of this section (6)), this is expected to markedly affect the interaction with the lipid-water interface. for enzymes the catalytic activity involving a charged residue can be modulated by shifting the pka of that residue. the pka shifts of the active site histidine have been successfully predicted for a number of mutants of subtilisin (loewenthal et al., 1993). one of the main reasons why enzymes are good catalysts is that they stabilise the transition state intermediate (fersht, 1985). for enzymatic reactions that are not diffusion limited, engineering leading to an enhanced stabilisation of the intermediate will result in an increased activity.
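for orientation, the contour level of 1.0 kt/e used for the lipase surfaces above can be converted to more familiar units. this is a minimal sketch using standard si constants; the temperature of 298.15 k is an assumption, since the text does not state the temperature used, and the function names are illustrative.

```python
# Physical constants in SI units (exact values in the 2019 SI).
BOLTZMANN_K = 1.380649e-23          # J K^-1
ELEMENTARY_CHARGE = 1.602176634e-19  # C
AVOGADRO = 6.02214076e23            # mol^-1
JOULE_PER_KCAL = 4184.0


def kt_over_e_millivolt(temperature_k):
    """Potential of 1 kT/e expressed in millivolt."""
    return BOLTZMANN_K * temperature_k / ELEMENTARY_CHARGE * 1e3


def kt_kcal_per_mol(temperature_k):
    """Thermal energy kT per mole of unit charges, in kcal/mol."""
    return BOLTZMANN_K * AVOGADRO * temperature_k / JOULE_PER_KCAL


if __name__ == "__main__":
    t = 298.15  # assumed temperature in kelvin
    print(f"1 kT/e = {kt_over_e_millivolt(t):.1f} mV")
    print(f"1 kT   = {kt_kcal_per_mol(t):.2f} kcal/mol per unit charge")
```

a unit charge sitting on such a surface therefore has an electrostatic energy equal to the thermal energy, roughly 26 mv or 0.6 kcal/mol at room temperature, which is why this level marks where steering or repulsion starts to be felt.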
the induced dipole method was used to compute the activation free energy for different mutants of trypsin and subtilisin (warshel et al., 1989), with some qualitative agreement with the experimental results. the prediction of changes introduced by mutations on redox potentials could also be of interest to protein engineering. prediction of redox potentials has been made with some success (rogers et al., 1985; durell et al., 1990). in plastocyanin the effect of chemically modifying charged groups was also considered (durell et al., 1990). the effect of mutations could also be analysed, as has been done for pka shift calculations (see above). the above examples clearly show that, whatever the particular method used, the modelling of electrostatic interactions in proteins has an important role to play in protein engineering. a highly relevant example is the design of a faster 'perfect' enzyme (getzoff et al., 1992), which also illustrates the combination of different methods (bd and electrostatic continuum methods) that can sometimes be decisive in a modelling study. the science of protein engineering is advancing rapidly, and is emerging in many new contexts, such as metabolic engineering. rational protein engineering is a complex undertaking, and only the groups with sufficient understanding of sequences and 3-d structures can handle the complex underlying problems. predicting protein structure may be difficult, but predicting future developments in a very active branch of science can be hazardous at best. however, we will review a few of the more recent research aspects that we are convinced will be of key importance in the future development of protein engineering. often the substrates or products in an enzymatic process are poorly soluble in an aqueous medium. this may lead to poor yields and difficult or expensive purification steps.
the potential of using other solvents, either pure or in mixture, where substrates and/or products may be soluble has attracted a great deal of attention (tramper et al., 1992; arnold, 1993). dissolving the protein in organic solvents will alter the macroscopic dielectric constant and lead to a much less pronounced difference between the interior and exterior static dielectric behaviour. protein function in such media may be altered and is poorly understood; we can expect significant developments in the future. despite the often dramatic change in dielectric constant when changing the solvent from, e.g., water to an organic substance, the protein 3-d structure can remain virtually intact, as has been documented in the case of subtilisin carlsberg dissolved in anhydrous acetonitrile (fitzpatrick et al., 1993). the hydrogen bonding pattern of the active site environment is unchanged, and 99 of the 119 enzyme-bound structural water molecules are still in place. one-third of the 12 enzyme-bound acetonitrile molecules reside in the active site. many enzymes remain active in organic solvents and in the case of enzyme reactions where the substrate has very poor water solubility, a change to organic solvent can be of major importance (gupta, 1992). an extreme case of a non-conventional medium for enzymatic action is the gas phase. certain enzymes, immobilised on a solid bed, have been shown to be active at elevated temperatures towards selected substrates in the gas phase (lamare and legoy, 1993). obviously the range of substrates that potentially can be used is limited to those that actually can be brought into the gas phase under conditions where the enzyme is still active. enzymes for which such reactions have been studied include hydrogenase, alcohol oxidase and lipases.
the fact that even interfacially activated lipases (such as the porcine pancreatic and the candida rugosa lipases) function with gas phase carried substrate molecules opens up the interesting possibility of studying the role of water in this reaction. protein engineering may be used to enhance enzyme activity in organic solvents (arnold, 1993; chen and arnold, 1993). when dissolving subtilisin e in 60% dimethylformamide (dmf) the kcat/km for the model substrate suc-ala-ala-pro-met-p-nitroanilide drops 333-fold. after ten mutations were introduced, the activity in dmf was restored almost to the level of the native enzyme in water.
fig. 15. electrostatic maps of hpl with closed and open lid. ribbon models of human pancreatic lipase with colipase are shown with closed (left: a,c,e) and open (right: b,d,f) lid. the colipase is shown in blue and the mainly α-helical 'lid' region is highlighted in cyan. the residues of the active site are shown in green. access to the active site pocket seems to be controlled by the conformational state of the lid. electrostatic isopotential contours of ±1.0 kt/e are shown at ph 4 (c,d) and ph 7 (e,f). the negative surfaces are represented in red and the positive surfaces in blue. the models and isopotential contours were produced with insight ii and delphi (biosym technologies, san diego). the ph-dependent charge sets were computed with titra (to be published).
all metabolic conversions in micro-organisms are carried out directly or indirectly by proteins. our ability to manipulate single genes has opened the way to actual control of such processes. we may alter the efficacy of a certain pathway or we may introduce totally new pathways. thus, escherichia coli can be modified in such a way that one can use d-glucose in the e. coli based manufacture of hydroquinone, benzoquinone, catechol and adipic acid (dell and frost, 1993; draths and frost, 1990; frost, 1993).
presently such compounds are produced through organic chemical synthesis using aromatics as one of the reactants. the prospect of producing the same compounds using only microbes and glucose thus has some obvious environmental benefits. we expect to see a virtual surge in the engineering of microorganisms towards the production of rare chemical or biochemical compounds or compounds for which the current synthetic route is costly either economically or from an environmental perspective. the prospect of designing and producing functional protein molecules from scratch is extremely attractive to many visionary scientists. some central questions arise: do we know enough to undertake such tasks, and what goals can we define? screening mutation studies of protein interfaces show that the majority of mutations reduce activity or binding affinity (cunningham and wells, 1993), indicating that most proteins already represent highly optimised designs. the groups active in this area have aimed at constructing certain 3-dimensional folds such as the four helix bundle (felix) (hecht et al., 1990) and histidine-based metal binding sites (arnold, 1993), and even the observation of limited enzymatic activity is regarded as a successful result. protein de novo design of helix bundles may even follow a very simple binary pattern of polar and nonpolar amino acids, as was concluded in a study of four-helix bundle proteins (kamtekar et al., 1993). the helix-helix contact surfaces are mainly hydrophobic, whereas the solvent exposed regions are hydrophilic. many variants conforming to this hydrophobic pattern were generated and two of these proteins were stabilised by 3.7 and 4.4 kcal/mol relative to the unfolded form, thus approaching what is found for many natural proteins. the authors suggest that such a binary pattern may have been important in the early stages of evolution.
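the binary patterning idea described above can be sketched in a few lines. this is an illustrative sketch only: the two residue sets are an assumed polar/nonpolar partition, the example pattern and peptide are hypothetical, and none of it reproduces the actual sequences designed in the cited study.

```python
# assumed residue classes for illustration (one-letter amino acid codes)
NONPOLAR = set("FLIMV")
POLAR = set("EDKNQH")


def binary_pattern(sequence: str) -> str:
    """Map a sequence to a string of 'N' (nonpolar), 'P' (polar) or '?' (other)."""
    symbols = []
    for residue in sequence.upper():
        if residue in NONPOLAR:
            symbols.append("N")
        elif residue in POLAR:
            symbols.append("P")
        else:
            symbols.append("?")
    return "".join(symbols)


def conforms(sequence: str, pattern: str) -> bool:
    """True if the sequence follows the given polar/nonpolar pattern exactly."""
    return binary_pattern(sequence) == pattern


def count_variants(pattern: str) -> int:
    """Number of distinct sequences that satisfy a pattern of 'N' and 'P' sites."""
    total = 1
    for symbol in pattern:
        if symbol == "N":
            total *= len(NONPOLAR)
        elif symbol == "P":
            total *= len(POLAR)
        else:
            raise ValueError(f"unexpected pattern symbol: {symbol!r}")
    return total


if __name__ == "__main__":
    helix_pattern = "PNPPNNPP"  # hypothetical amphipathic stretch
    print(binary_pattern("ELKKLLEE"), conforms("ELKKLLEE", helix_pattern))
    print(count_variants(helix_pattern))
```

the variant count grows multiplicatively with pattern length, which illustrates why a binary code can be satisfied by a very large number of different sequences.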
in our laboratory we have results supporting this conclusion for the trypsin family of proteins, which predominantly has a β-strand based fold. fusion and hybrid proteins may be produced by fusing the genes or gene fragments including a proper linking region between the two genes (argos, 1990). this in principle may allow for combining properties from two different proteins. thus artificial bifunctional enzymes have been produced by fusing the genes for the proteins, e.g., β-galactosidase and galactokinase (bulow, 1990). in a recent paper an elegant hybrid protein concept is described. a hybrid antibody fragment was designed to consist of a heavy-chain variable domain from one antibody connected through a linker region of 5-15 residues to a short light-chain variable domain from another antibody (holliger et al., 1993). the antibody fragments displayed binding characteristics similar to those of the parent antibodies. the prospect of engineering multifunctional antibodies for medical applications is imminent. a hybrid protein between the glucose transporter and the n-acetylglucosamine transporter of e. coli has been produced. the two proteins displayed 40% residue identity. the hybrid protein consisted of the putative transmembrane domain from the glucose transporter and the two hydrophilic domains from the n-acetylglucosamine transporter. the hybrid protein was, somewhat surprisingly, still specific for glucose (hummel et al., 1992). interestingly, several naturally occurring proteins themselves seem to have originated through gene fusion. in the case of hexokinase it is proposed that it originated from a duplication of the glucokinase gene, maintaining even the gene organisation (kogure et al., 1993). several other proteins such as receptor proteins of the insulin family can best be understood as gene fusion products of a kinase domain onto the rest of the receptor (which in itself may consist of several fragments).
with potential medical applications, protein-nucleic acid hybrids have been constructed, where the nucleic acid fragment complemented the sequence of a fragment of mrna that the rnase should be targeted towards. the results obtained confirmed that this approach indeed worked (kanaya et al., 1992). the potential for generating anti-viral agents against, e.g., hiv is obvious. as a consequence of the enormous growth in our understanding of molecular biology and material technology, a new technological sector is emerging which aims at exploring the possible advantages of creating micro-machines and switchable molecular entities. this concept is currently known as nanotechnology (birge, 1992). two concepts that we find particularly interesting are described briefly below. rhodopsin is a very ancient molecular construct; we find rhodopsin-like molecules in a range of roles, all of them associated with its membrane location. proton transport and receptor functions are particularly interesting. bacteriorhodopsin from halobacterium halobium maintains a large ph gradient across the bacterial membrane. this protein complex is coloured, and its colour can be changed by exposing the protein to light of an appropriate frequency. the lifetime of the excited state can be adjusted by adjusting the physical-chemical parameters of the medium the rhodopsin is embedded in (birge, 1992). this protein can be used as a molecular switch in a very broad sense, e.g., as part of a high density memory device. however, changing the colour of a protein molecule is just one example that could be considered. another molecular based switch concept involves the transfer of a molecular ring (paraquat-derived rotaxane ring) between two binding sites (bradley, 1993). currently the transfer is induced by a solvent change, but it is believed that an electrochemical transfer mechanism can be developed as well. similar concepts can probably also be developed for proteins.
the present paper reviews some of the many new developments in protein engineering. the review is not exhaustive; it is simply not possible to do this properly within the limits of this paper. we have tried to review some selected scientific areas of key importance for protein engineering, such as the validity of protein sequence information as well as structural information. sometimes the translation of a gene sequence to an amino acid sequence is not trivial: a range of posttranscriptional editing and splicing events may occur, leading to a functional protein where the amino acid sequence cannot be directly deduced from the gene sequence. in addition, posttranslational modification may provide triggers for other parts of the cell's molecular machinery. we are thus in a situation where the full benefits and profits from projects such as the human genome project may escape us for a while. we have covered some of the recent developments in the modelling of protein structure by homology, which we regard as one of the most strategic areas of development. we will be flooded with sequence information deduced from gene sequences, and in the cases where the deduced amino acid sequences are assumed valid, we have to use homology based structure prediction in most cases. given that the number of protein structure families is expected to be limited, the task is doable. here we should again caution the reader. we have no a priori reason to assume that non-soluble proteins, such as structural proteins, have structures that can be predicted from our limited library of mostly globular, soluble proteins. some structural proteins are gigantic: the cuticle collagen in the riftia worms from deep sea hydrothermal vents has a molecular mass of 2,600 kda (gaill et al., 1991). it is extremely unlikely that a 3-d structure at atomic resolution of such a protein will ever be determined using methods we have available today.
nmr has emerged with surprising speed as a structure determination tool. many excellent reviews have been written on this topic. we have decided to direct the reader's attention to some recent developments that we believe will be of significant importance to the usage of nmr in protein engineering projects. the potential of using nmr to study the solvent exposed outer shell of larger proteins that by far exceed the 30 kda limit mentioned earlier is intriguing. this is particularly so, since most functionality of a protein is a feature of exactly the residues in the outer shell. thus, we can 'peel' the protein, and thereby isolate the spectral information that pertains to the surface only. this simplifies the spectra, and in some cases even allows for a partial assignment of specific residues. recent developments in ph-dependent protein electrostatics have been given special attention here. the similarities and differences within a family of structurally related proteins can only be understood if we are capable of interpreting the consequences of the substitutions, insertions and deletions that mostly occur at the surface of the proteins. when such changes are found and they involve charged residues, this will affect the extent or polarity of the electrostatic fields that the protein molecule is embedded in. we believe that the consequences of charge mutations can to a large extent be predicted through the use of ph-dependent electrostatics, although practical examples are still lacking. to our knowledge the results on the electrostatic consequences of the lid motion in the human pancreatic lipase (vide supra) are among the first such reported. the story of molecular biology is continuously unfolding, and our understanding of our own biology, development and evolution is becoming ever deeper and more detailed. but we are also, once again, discovering that one of the many qualities of nature is endless complexity.
protein data bank.
crystallographic databases -information content, software systems, scientific applications. bonn/cambridge/chester, data commission of the international union of crystallography modification of trypanosoma brucei mitochondrial rrna by posttranscriptional 3' polyuridine tail formation significance of similarities in protein structures (in abstracts of the 5th annual meeting of the protein engineering society of japan) scanning tunneling microscopy of biological macromolecular structures coated with a conducting film an investigation of oligopeptides linking domains in protein tertiary structures and possible candidates for general gene fusion engineering proteins for nonnatural environments solid-state 13c nmr study of a transglutaminaseinhibitor adduct structural engineering of the hiv-1 protease molecule with a/3-turn mimic of fixed geometry the swlss-prot protein sequence data bank polymers made to measure alscript: a tool to format multiple sequence alignments pka's of ionizable groups in proteins: atomic detail from a continuum electrostatic model multiple-site titration curves of proteins: an analysis of exact and approximate methods for their calculation mlev-17-based two-dimensional homonuclear magnetization transfer spectroscopy predicting the conformation of proteins. 
man versus machine diffusion-controlled macromolecular interactions the protein data bank: a computer-based archival file for macromolecular structures protonation of interacting residues in a protein by a monte carlo method: application to lysozyme and the photosynthetic reaction center of rhodobacter sphaeroides free energy via molecular simulation: application to chemical and biomolecular systems research and perspectives catching a common fold seleno protein synthesis: an expansion of the genetic code protein structures from distance inequalities secondary structure prediction for modelling by homology inverted protein structure prediction a method to identify protein sequences that fold into a known three-dimensional structure will future computers be all wet? coherence transfer by isotropic mixing: application to proton correlation spectroscopy a photochemically induced dynamic nuclear polarization study of denatured states of lysozyme three-dimensional structure of the human class ii histocompatibility antigen hla-dr1 an empirical energy function for threading protein sequence through the folding motif preparation of artificial bifunctional enzymes by gene fusion solidstate nmr assessment of enzyme active center structure under nonaqueous conditions forward to the fundamentals study of the tryptophan residues of lysozyme using 1h nuclear magnetic resonance rna duplexes guide base conversions ph dependence of relaxivities and hydration numbers of gadolinium(ill) complexes of linear amino carboxylates ih nmr studies of human c3a anaphylatoxin in solution: sequential resonance assignments, secondary structure, and global fold tuning the activity of an enzyme for unusual environments: sequential random mutagenesis of subtilisin e for catalysis in dimethylformamide proteins. 
one thousand families for the molecular biologist a correlation-coefficient method to predicting protein-structural classes from amino acid compositions solid-state nmr determination of intra-and intermolecular 31p-13c distances for shikimate 3-phosphate and [1-i3c]glyphosate bound to enolpyruvylshikimate-3-phosphate synthase four-dimensional 13c/13c-edited nuclear overhauser enhancement spectroscopy of a protein in solution: application to interleukin 1/3 high-resolution three-dimensional structure of interleukin 1/3 in solution by three-and four-dimensional nuclear magnetic resonance spectroscopy origins of structural diversity within sequentially identical hexapeptides comparison of three algorithms for the assignment of secondary structure in proteins: the advantages of a consensus assignment extracting the information -sequence analysis software design evolves conformations of folded proteins in restricted spaces prediction of protein folding from amino acid sequence over discrete conformation spaces comparison of a structural and a functional epitope electrostatics in biomolecular structure and dynamics identification and removal of impediments to biocatalytic synthesis of aromatics from d-glucose: rate-limiting enzymes in the common pathway of aromatic amino acid biosynthesis the crystal and molecular structure of the rhizomucor miehei triacylglyceride lipase at 1.9 a resolution real-space refinement of the structure of hen egg white lysozyme dominant forces in protein folding complete assignment of aromatic 1h nuclear magnetic resonances of the tyrosine residues of hen lysozyme stein and moore award address. 
reconstructing history with amino acid sequences the comings and goings of homing endonucleases and mobile introns multim -tools for multiple sequence analysis genomic direction of synthesis during plasmid-based biocatalysis free radical induced nuclear magnetic resonance shifts: comments on contact shift mechanism prediction of protein folding class from amino acid composition modeling of the electrostatic potential field of plastocyanin three-dimensional profiles for analysing protein sequence -structure relationships a method to configure protein side-chains from the main-chain trace in homology modelling structure of pentameric human serum amyloid p component nuclear magnetic resonance fourier transform spectroscopy (nobel lecture) probing protein structure by solvent pertubation of nuclear magnetic resonance spectra molecular nanotechnology low resolution solution structure of the bacillus subtilis glucose permease iia domain derived from heteronuclear three-dimensional nmr spectroscopy alternative readings of the genetic code enzyme structure and mechanism. freeman protein engineering enzyme crystal structure in a neat organic solvent ih, 13c and lsn nmr backbone assignments of the 269-residue serine protease pb92 from bacillus alcalophilus polypeptide -metal cluster connectivities in metallothionein 2 by novel i h-113cd heteronuclear two-dimensional nmr experiments design and use of heterologous microbes for conversion of d-glucose into aromatic chemicals. 
enzyme engineering xii molecular characterization of the cuticle and interstitial collagens from worms collected at deep sea hydrothermal vents the protein identification resource (pir) faster superoxide dismutase mutants designed by enhancing electrostatic guidance self-assembling organic nanotubes based on a cyclic peptide architecture multiple-site titration and molecular modeling: two rapid methods for computing energies and forces for ionizable groups in proteins calculation of the total electrostatic energy of a macromolecular system: solvation energies, binding energies, and conformational analysis the inclusion of electrostatic hydration energies in molecular mechanics calculations calculations of electrostatic potentials in an enzyme active site calculating the electrostatic potential of molecules in solution: method and error assessment improved alignment of weakly homologous protein sequences using structural information rna editing in plant mitochondria and chloroplasts human genetic diseases due to codon reiteration: relationship to an evolutionary mechanism the influence of hydration on the conformation of lysozyme studied by solid-state 13c-nmr spectroscopy three-dimensional fourier spectroscopy. application to high-resolution nmr invasive introns enzyme function in organic solvents analysis of ordered arrays of adsorbed lysozyme by scanning tunneling microscopy specific cleavage of pre-edited mrnas in trypanosome mitochondrial extracts treatment of electrostatic effects in macromolecular modeling de novo design, expression and characterization of felix: a four-helix bundle protein of native like sequence converting trypsin to chymotrypsin: the role of surface loops identification of native protein folds amongst a large number of incorrect models. 
the calculation of low energy conformations from potentials of mean force nuclear magnetic relaxation in aqueous solutions of the gd(hedta) complex proton magnetic relaxation dispersion in aqueous glycerol solutions of gd(dtpa) 2-and gd(dota) engineered metalloregulation in enzymes rna editing of ampa receptor subunit giur-b: a base-paired intron-exon structure determines position and efficiency protein splicing removes intervening sequences in an archaea dna polymerase the role of the a-helix dipole in protein function and structure diabodies': small bivalent and bispecific antibody fragments globin fold in a bacterial toxin a database of protein structure families with common folding motifs proton nuclear magnetic resonance assignment and surface accessibility of tryptophan residues in lysozyme using photochemically induced dynamic nuclear polarization spectroscopy a functional protein hybrid between the glucose transporter and the n-acetylglucosamine transporter of escherichia coli classical electrodynamics synthesis, structure and activity of artificial, rationally designed catalytic polypeptides a new approach to protein fold recognition engineering stability of the insulin monomer fold with application to structure-activity relationships dictionary of protein secondary structure: pattern recognition of hydrogenbonded and geometrical features protein design by binary patterning of polar and nonpolar amino acids a hybrid ribonuclease h. a novel rna cleaving enzyme with sequence-specific recognition four-dimensional heteronuclear triple-resonance nmr spectroscopy of interleukin-1/3 in solution two-dimensional spectroscopy: background and overview of the experiments orientation of the valine-1 side chain of the gramicidin transmembrane channel and implications for channel functioning. 
a 2h nmr study co-crystal structure of tbp recognizing the minor groove of a tata element crystal structure of a yeast tbp/tata-box complex two-dimensional 1h nmr studies of histidine-containing protein from escherichia coli. secondary and tertiary structure as determined by nmr hhaimethyltransferase flips its target base out of the dna helix the solution structure of the human retinoic acid receptor-/3 dna-binding domain crystal structure of porcine ribonuclease inhibitor, a protein with leucine-rich repeats evolution of the type ii hexokinase gene by duplication and fusion of the glucokinase gene with conservation of its organization determinants of ca 2+ permeability in both tm1 and tm2 of high affinity kainate receptor channels: diversity by rna editing crystal structure at 3.5 ,~ resolution of hiv-1 reverse transcriptase complexed with an inhibitor the asymmetric distribution of charges on the surface of horse cytochrome c molscript: a program to produce both detailed and schematic plots of protein structures atomic model of plant light-harvesting complex by electron crystallography biocatalysis in the gas phase procheck: a program to check the stereochemical quality of protein structures a new procedure for the detection and evaluation of similar substructures in proteins quantification of secondary structure prediction improvement using multiple alignments molecular dynamics of macromolecules in water direct observation of reverse transcriptases by scanning tunneling microscopy on the ionization of proteins long-range surface charge-charge interactions in proteins assessment of protein models with three-dimensional profiles improving the sensitivity of the sequence profile method brownian dynamics simulations of diffusional encounters between triosephosphate isomerase and glyceraldehyde phosphate: electrostatic steering of glyceraldehyde phosphate conformational flexibility of aqueous monomeric and dimeric insulin: a molecular dynamics study crystal 
structure of the dsba protein required for disulphide bond formation in vivo electrostatic effects in proteins dynamics of proteins and nucleic acids inter-tryptophan distances in rat cellular retinol binding protein ii by solid-state nmr a molecular model for cinnamyl alcohol dehydrogenase, a plant aromatic alcohol dehydrogenase involved in lignification adaptive evolution of highly mutable loci in pathogenic bacteria automated protein structure data bank similarity searches and their use in molecular modeling with development of pseudoenergy potentials for assessing protein 3-d-1-d compatability and detecting weak homologies brownian dynamics of cytochrome c and cytochrome c peroxidase electron transfer proteins molecular dynamics of ferrocytochrome c. magnitude and anisotropy of atomic displacements an analysis of incorrectly folded protein models. implications for structure predictions characterization of recombinant human farnesyl-protein transferase: cloning, expression, farnesyl diphosphate binding, and functional homology with yeast prenyl-protein transferases fast structure alignment for protein databank searching identification and classification of protein fold families direct solution of the poisson equation for biomolecules of arbitrary shape, polarizability density, and charge distribution prediction of protein structure by evaluation of sequencestructure fitness. 
we want to thank christian cambillau, cnrs, marseille, for kindly providing us with pre-release 3-d data of human pancreatic lipase, jerry h. brown, harvard university, for sending us a pre-release dataset for the hla ii structure, alwyn jones, uppsala university, for pre-release 3-d data of candida antarctica b lipase, and john mccarthy, brookhaven national laboratory, for helping us with data on previous pdb releases. the french norwegian foundation (fns 27958) and the norwegian research council (bp 29345) have contributed with financial support to some of the research activities described in this paper. a.b. and p.m. thank junta nacional de investigação científica, portugal, for their grants.
key: cord-311240-o0zyt2vb authors: motayo, babatunde olarenwaju; oluwasemowo, olukunle oluwapamilerin; akinduti, paul akiniyi; olusola, babatunde adebiyi; aerege, olumide t; faneye, adedayo omotayo title: evolution and genetic diversity of sarscov-2 in africa using whole genome sequences date: 2020-07-27 journal: biorxiv doi: 10.1101/2020.07.27.222901 sha: doc_id: 311240 cord_uid: o0zyt2vb the ongoing sarscov-2 pandemic was introduced into africa on 14th february 2020 and has rapidly spread across the continent, causing a severe public health crisis and mortality. we investigated the genetic diversity and evolution of this virus during the early outbreak months using whole genome sequences. we performed recombination analysis against closely related cov and bayesian time-scaled phylogeny, and investigated spike protein amino acid mutations. results from our analysis showed recombination signals between the afrsarscov-2 sequences and reference sequences within the n and s genes. the evolutionary rate of the afrsarscov-2 was 4.133 × 10−4 substitutions/site/year, with a highest posterior density (hpd) interval of 4.132 × 10−4 to 4.134 × 10−4. the time to most recent common ancestor (tmrca) of the african strains was december 7th 2019. the afrsarscov-2 sequences diversified into two lineages, a and b, with b being more diverse, with multiple sub-lineages confirmed by both the maximum clade credibility (mcc) tree and pangolin software. there was a high prevalence of the d614-g spike protein amino acid mutation (82.61%) among the african strains. our study has revealed a rapidly diversifying viral population with the g614 spike protein variant dominating. we advocate for up-scaling ngs sequencing platforms across africa to enhance surveillance and aid control efforts against sarscov-2 in africa.
towards the end of december 2019, chinese authorities, through the world health organization office in china, reported a new pathogen responsible for a series of pneumonia-associated infections in wuhan, hubei province (who 2020). the pathogen was later identified as a novel coronavirus closely related to the severe acute respiratory syndrome virus (sars), with a possible bat origin (zhou et al, 2020). the world health organization named the disease covid-19 (chan et al, 2020), and later declared it a pandemic on 11th march 2020, prompting concerted efforts towards prevention and control worldwide (who 2020). on february 11th 2020 the international committee on taxonomy of viruses (ictv) adopted the name sars-cov-2 following the report of their coronavirus study group (csg, 2020). the virus has been placed in the subgenus sarbecovirus, genus betacoronavirus, subfamily coronavirinae, family coronaviridae (de groot et al, 2013; gorbalenya et al, 2020). coronaviruses are enveloped viruses containing a single-stranded positive-sense rna genome of between 26 kb and 32 kb (masters and perlman 2013). they are responsible for a host of human and animal infections. the betacoronaviruses include the most medically important species, among them several human coronaviruses such as hucov-oc43 and hucov-hku1. the severe acute respiratory syndrome coronavirus (sarscov) and the middle east respiratory syndrome coronavirus (mers) are also members of this group; both are pathogens of high consequence that caused large-scale epidemics and have been shown to be of zoonotic origin (lau et al, 2005; zaki et al, 2012). genomic and structural analyses have revealed that sarscov-2 contains four structural proteins and several non-structural proteins (chen et al, 2020; lu et al, 2020).
the spike protein is the major antigenic protein responsible for initiating infection, via attachment of its receptor binding domain (rbd) to the sarscov/sarscov-2 receptor, angiotensin converting enzyme 2 (ace2) (donnelly et al, 2004; monteil et al, 2020). globally, there had been 5,226,268 confirmed sarscov-2 cases, with 335,218 deaths, as at 21st may 2020 (ecdc 2020). the coronavirus pandemic began in africa on 14th february 2020, in egypt, with an italian who returned into the country (who 2020b). as at 21st may there had been 95,332 cases in africa, with 2,995 deaths and 35,519 recoveries, covering fifty-four countries, with south africa having the highest number of cases at 18,003 (who 2020c). several reports have traced the evolutionary origins of sarscov-2 to sarsrcov from bats (zhou et al, 2020) and pangolins (lam et al 2020). phylogenetic analysis has shown that the virus has diversified through the duration of the pandemic into two major lineages, a and b, with several sub-lineage diversifications (rambaut et al 2020). the majority of these reports were generated using genome sequences of sarscov-2 from america, europe and asia (rambaut et al, 2020). there has been a paucity of data on the genetic evolution of sarscov-2 sequences from africa, despite the increasing number of genome sequence submissions into the gisaid database from africa; there were 97 whole genome sequences available in the gisaid database as at 24th april 2020. this gap in knowledge prompted the conceptualization of this study, which was designed to determine the genetic diversity and evolutionary history of genome sequences of sarscov-2 isolated in africa. full genome sequences with high coverage were downloaded from the global initiative on sharing all influenza data (gisaid) database. as at 24th april there were 97 full genome sequences from africa available in the gisaid database; we downloaded all of them, excluding genomes with low coverage.
a total of 69 high coverage genomes were eventually selected from the african sequences. along with these, 151 high quality full genome sequences were also downloaded from three continents: america (usa), asia (china and south korea) and europe (england, italy and germany). three different datasets were then generated from these sequences. the first dataset consisted of high coverage full genome sequences from africa, along with the sarscov-2 reference genome sequence from wuhan, china, bat and pangolin sars-related reference sequences and the sarscov reference sequence. the second dataset consisted of complete genome sequences from africa, america, asia and europe, while the third dataset consisted of complete spike protein (s) gene sequences from africa, together with bat and pangolin sars-related reference s gene sequences. whole genome sequences downloaded from the gisaid database were aligned using mafft v7.222 (fft-ns-2 algorithm) with default settings (katoh et al, 2019). maximum likelihood phylogenetic analysis was performed using the general time reversible nucleotide substitution model with gamma-distributed rate variation (gtr-γ) (yang et al, 1994) with 1000 bootstrap replicates using iq-tree software (nguyen et al 2015). lineage assignments for the sarscov-2 sequences were conducted using the phylogenetic assignment of named global outbreak lineages tool (pangolin), available at http://github.com/hcov-2019/pangolin (o'toole and mccrone 2020). we analyzed potential recombination events using the recombination detection program (rdp) software (martin et al, 2015). the analysis was conducted on whole genome sequences of identified lineages among the african isolates, using the rdp, bootscan, geneconv, chimaera, siscan, 3seq, and maximum chi-square methods. a putative recombination event was accepted only if three of the above-mentioned methods gave a positive recombination signal (liu et al, 2010).
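the acceptance rule for recombination events described above can be sketched in a few lines of python. this is only an illustration under stated assumptions: the function name `accept_event`, the dictionary of per-method calls, and the reading of "three of the methods" as "at least three" are all hypothetical choices, and the example calls are made up rather than taken from the study.

```python
# Labels for the seven detection methods listed in the text.
METHODS = ("rdp", "bootscan", "geneconv", "chimaera", "siscan", "3seq", "maxchi")


def accept_event(calls, min_methods=3):
    """Return True if at least `min_methods` methods flagged the event.

    `calls` maps a method label to True/False; missing methods count as
    negative. Treating the threshold as "at least three" is an assumption.
    """
    positives = [m for m in METHODS if calls.get(m, False)]
    return len(positives) >= min_methods


# Hypothetical calls for two putative events (not study data).
event_a = {"rdp": True, "bootscan": True, "3seq": True}
event_b = {"rdp": True, "siscan": True}
```

with these made-up inputs, `accept_event(event_a)` is true and `accept_event(event_b)` is false.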
temporal clock signal was analyzed among the aligned sequences using tempest version 1.5 (rambaut et al, 2016). the root-to-tip divergence and sampling dates supported the use of molecular clock analysis in this study. phylogenetic trees were generated by bayesian inference through markov chain monte carlo (mcmc), implemented in beast version 1.10.4 (suchard et al, 2016). we partitioned the coding genes into first+second and third codon positions and applied a separate hasegawa-kishino-yano (hky+g) substitution model with gamma-distributed rate heterogeneity among sites to each partition (hasegawa et al, 1985). the relaxed clock with a gaussian markov random field (gmrf) skyride coalescent prior was selected for the final analysis, after running different models and comparing them by bayes factor, with marginal likelihoods estimated using the path sampling and stepping stone methods implemented in beast version 1.10.4 (suchard et al, 2016). the mcmc chain was run for one hundred million steps with 10% burn-in. results were then visualized with tracer version 1.8 (http://tree.bio.ed.ac.uk/software/tracer/); all effective sample size (ess) values were >200, indicating sufficient sampling. bayesian skyride analysis was carried out to visualize the epidemic evolutionary history using tracer v 1.8. the complete s protein gene sequence of afrsarscov-2 was aligned along with ratg13 btcov and pangolin sarsrcov sequences using mafft (katoh et al, 2015). the alignment was then edited and visualized using bioedit software. the current global sarscov-2 pandemic, otherwise known as covid-19, began on the african continent with a european returnee in egypt on february 14th 2020 (who 2020). it has since spread to virtually all the countries within the african region. this study was based on sequences generated during the early phase of the pandemic in africa, precisely between february 2020 and april 2020.
sixty-nine high coverage full genome sequences from six african countries, namely algeria (3), senegal (20), democratic republic of congo (drc) (35), nigeria (1), ghana (6) and south africa (4), were analyzed. phylogenetic analysis of the african sequences showed clustering within the sarbecovirus sub-genus, forming a sub-cluster with sarsr cov and pcov (figure 1), as previously reported by several workers (zhao et al, 2020; lam et al, 2020). the root-to-tip regression analysis showed a not so strong signal, with a correlation coefficient of 0.995 and r2 = 0.991 (supplementary figure 1). in the recombination analysis of the african sarscov-2 (afrsarscov-2) sequences against reference whole genome sequences of sars, recombination signals were observed between the african sarscov-2 sequences and reference sequences (major recombinant hcov-19/pangolin/guangxi/p4l/2017; minor parent hcov-19/bat/yunnan/ratg13) between the rdrp and s gene regions (figure 2). this result is consistent with a previous report from saudi arabia, which investigated recombination between sarscov-2 and closely related viruses such as sarscov and mers (nour et al, 2020). the evolutionary rate for the afrsarscov-2 isolates during the period under study was 4.133 × 10−4 substitutions/site/year, with a highest posterior density (hpd) interval of 4.132 × 10−4 to 4.134 × 10−4. this is slightly higher than the rate of 3.345 × 10−4 in an earlier report on early outbreak strains from china (li et al, 2020); it is, however, lower than the global sarscov-2 evolutionary rate of 8.0 × 10−4 reported by nextstrain (www.nextstrain.org/ncov/global). the mcc tree of the african sarscov-2 sequences shows that they have evolved into two major lineages, a and b, with lineage b being more diverse.
the majority of the african sarscov-2 sequences clustered within lineage b, while three ghanaian, three congolese, and four senegalese strains clustered along with the reference chinese and south korean strains within lineage a (figure 3). the mcc tree for the dataset containing global reference sequences showed a topology similar to that of the african tree: it was divided into two major lineages, a and b, with lineage b further diversifying into about four sub-lineages, while lineage a seemed to evolve into only two sub-lineages (figure 4). the afrsarscov-2 strains were intermixed with the global sequences within both lineages. lineage b consisted mainly of strains from germany, england, italy and usa, intermixed with african strains, while lineage a consisted mainly of strains from south korea and china, with a few african strains from senegal, ghana and drc. the result of the genotype analysis using the genotyping tool pangolin was largely in conformity with the observed phylogenetic analysis. figure 5 shows a summary of the lineage distribution of the isolates by country of origin using the pangolin genotyping tool. the complete distribution of the strains according to lineage and country is shown in supplementary table 2. from the analysis with pangolin, lineage b.1 was the most commonly encountered and the most widely distributed, consisting of 93 sequences from seven countries, followed by lineage b.2 and lineage b. lineage a had 15 positive sequences from six countries. the majority of the sequences recorded high bootstrap values, with over 70% of the sequences recording a bootstrap value above 80%. this shows that pangolin is a reliable tool with a broad scope of functions, including a user-friendly and interactive representation of the phylogenetic clustering of the identified sub-lineages and lineages by means of graphical images of trees generated using virtually all sarscov-2 sequences available on the gisaid platform as reference.
the genotyping tool was recently introduced, and several reports have utilized it to predict lineage assignments accurately (xavier et al, 2020). the time to most recent common ancestor (tmrca) of the african sarscov-2 strains was december 7th 2019 (november 12th 2019 to december 29th 2019), while the tmrca of all the sequences under analysis was 14th october 2019 (july 27th 2019 to december 17th 2019). the tmrca of the african strains was more recent than that of a similar study, which reported a tmrca of 14th october 2019 among global isolates including chinese isolates (li et al, 2020), but was earlier than that of another recent study investigating the evolutionary dynamics of the ongoing sarscov-2 epidemic in brazil, which reported a tmrca of 10th february 2020 (xavier et al, 2020). the epidemic history of the ongoing outbreak was investigated using the bayesian skyline plot (bsp). the bsp showed a steady increase in viral population as the outbreak progressed over the study period (figure 6). this observation is expected, as the viral sequence population should increase as the infection spreads. a major limitation was the rather small number of sequences analyzed and the very short study duration; therefore our results may not reflect the exact viral population dynamics of the outbreak in africa. the afrsarscov-2 sequences were analyzed for the d614-g mutation within the s1 subunit of the spike protein, which has been reported to contribute to increased transmissibility of sarscov-2 (korber et al, 2020). figure 7 shows a representative amino acid alignment of selected afrsarscov-2 sequences along with reference sequences of btcov ratg13 and pcov. our results revealed a high prevalence of the d614-g mutation among afrsarscov-2: 57/69 sequences (82.61%) carried it, while 12/69 (17.39%) did not. the mutation was recorded in isolates from all african countries analyzed in this study (supplementary figure 2).
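the prevalence figures quoted for this mutation are column frequencies over the 69 african genomes: 57/69 rounds to 82.61% and 12/69 to 17.39%, so the two percentages in this paper are complements of each other. a minimal python sketch of that calculation follows; the helper names `percent` and `residue_prevalence` and the toy alignment are hypothetical and are not the authors' pipeline or data.

```python
from collections import Counter


def percent(count, total):
    """Percentage rounded to two decimals, as reported in the text."""
    return round(100 * count / total, 2)


def residue_prevalence(aligned_seqs, position):
    """Fraction of each residue at a 1-based column of an amino acid alignment."""
    column = [seq[position - 1] for seq in aligned_seqs]
    counts = Counter(column)
    return {residue: n / len(column) for residue, n in counts.items()}


# Hypothetical toy alignment (not study data); column 2 stands in for
# spike position 614, where D is the ancestral residue and G the variant.
toy = ["QDV", "QGV", "QGV", "QGV"]
toy_prevalence = residue_prevalence(toy, 2)
```

for a real analysis the column index would have to be mapped from spike position 614 to the corresponding alignment column, which depends on the gaps in the alignment.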
prior to this report the d614-g spike mutation was found predominantly in europe, accompanied by a high number of cases and a significant mortality rate (pachetti et al, 2020; korber et al, 2020b). the introduction of this strain into africa is quite worrisome, considering the population densities of most african cities and the poor state of public health infrastructure to support medical intervention for symptomatic sarscov-2 cases. although more evidence is still required to determine the extent of the effect of the d614-g mutation on the virulence properties of the virus, current evidence from in vitro studies seems to support the hypothesis of increased transmissibility of this variant of the virus (korber et al, 2020; hu et al, 2020). in conclusion, we have reported the genetic diversity and evolutionary history of sarscov-2 isolated in africa during the early outbreak period. our findings have identified diverse sub-lineages of sarscov-2 currently circulating among africans. we also identified a high prevalence of the d614-g spike protein variant of the virus, which is capable of rapid transmission, in all countries sampled. a major limitation was the relatively low number of sequence submissions available in the gisaid database compared with those of other regions such as europe and asia. we advocate for the upscaling of next generation sequencing (ngs) capacity for whole genome sequencing of sarscov-2 samples across the african continent to support surveillance and control efforts in africa. figure 7.
amino acid alignment of the partial s gene sequences covering amino acid positions 360 to 840 of selected afrsarscov isolates, along with reference sequences of closely related pcov and bat ratg13. the red shaded region represents the receptor binding domain; the blue shaded box represents the d614-g motif, while the empty red box represents the polybasic cleavage site bordering the s1/s2 sub-unit.
novel coronavirus (2019-ncov) situation report - 1, 21
who africa/second case of ncov confirmed in africa
coronavirus disease 2019 (covid-19)
a pneumonia outbreak associated with a new coronavirus of probable bat origin
covid-19 situation worldwide as at 21st may
a familial cluster of pneumonia associated with the 2019 novel coronavirus indicating person-to-person transmission: a study of a family cluster
commentary: middle east respiratory syndrome coronavirus (mers-cov): announcement of the coronavirus study group
isolation of a novel coronavirus from a man with pneumonia in saudi arabia
severe acute respiratory syndrome coronavirus-like virus in chinese horseshoe bats
genomic characterisation and epidemiology of 2019 novel coronavirus: implications for virus origins and receptor binding
inhibition of sars-cov-2 infections in engineered human tissues using clinical-grade soluble human ace2
epidemiological and genetic analysis of severe acute respiratory syndrome
chapter 28, coronaviridae
emerging sars-cov-2 mutation hot spots include a novel rna-dependent-rna polymerase variant
spike mutation pipeline reveals the emergence of a more transmissible form of sars-cov-2
a dynamic nomenclature proposal for sars-cov-2 to assist genomic epidemiology
the d614g mutation of sars-cov-2 spike protein enhances viral infectivity and decreases neutralization sensitivity to individual convalescent sera
tracking changes in sars-cov-2 spike: evidence that d614g increases infectivity of the covid-19 virus
probable pangolin origin of sars-cov-2 associated with the covid-19
identifying sars-cov-2 related coronaviruses in malayan pangolins
rdp4: detection and analysis of recombination patterns in virus genomes. virus evolution
exploring the temporal structure of heterochronous sequences using tempest (formerly path-o-gen)
bayesian phylogenetic and phylodynamic data integration using beast 1.10. virus evolution
dating of the human-ape splitting by a molecular clock of mitochondrial dna
codon usage bias and recombination events for neuraminidase and hemagglutinin genes in chinese isolates of influenza a virus subtype h9n2. archives of virology
evolutionary history, potential intermediate animal host, and cross species analysis of sarscov2
the species severe acute respiratory syndrome-related virus: classifying 2019-ncov and naming it sarscov-2
the ongoing covid-19 epidemic in minas gerais, brazil: insights from epidemiological data and sars-cov-2 whole genome sequencing
maximum likelihood phylogenetic estimation from dna sequences with variable rates over sites: approximate methods
insights into evolution and recombination of pandemic sarscov-2 using saudi arabian sequences. biorxiv preprint
figure 2. boot scan plot of complete genome sequences of afrsarscov-2 sequences analysed with the rdp recombination software. the legend shows the identity of the sequences scanned within the plot; the light blue bars indicate the portions of the genome with recombinant signals in reference to the major and minor recombinant parent sequences.
key: cord-300807-9u8idlon authors: tong, joo chuan; ranganathan, shoba title: 7 infectious disease informatics date: 2013-12-31 journal: computer-aided vaccine design doi: 10.1533/9781908818416.99 sha: doc_id: 300807 cord_uid: 9u8idlon abstract: throughout history, infectious diseases have posed a serious burden to mankind. more recently, there has been an alarming increase in drug-resistant microbes. furthermore, new pathogens are emerging due to microbial evolution and adaptation.
the spread of these diseases is a result of pathogen mutations and changes in human behavior patterns. then, there are diseases that are lurking in the background, waiting for the right conditions before they strike again. in the war against these diseases, we have come to understand the behaviors of microbes in a heterogeneous world and the mechanisms governing disease transmission. these works have profoundly shaped modern knowledge of emerging and re-emerging infections. more recently, computational techniques have led the way into this new era by allowing rapid high-throughput analysis of pathogens which was previously not possible using traditional laboratory techniques. this chapter introduces methods in mathematical modeling, computational biology, and bioinformatics that have been used to study infectious diseases.
epidemics, pandemics, and outbreaks of infectious diseases are regular features of life on earth. in 430 bc, thucydides described the very first pandemic in recorded history: the athenian plague that reportedly killed up to one-half of the citizens of athens. in ad 541-2, an outbreak occurred in the byzantine empire, causing 10 000 deaths every day. the outbreak, named the justinian plague after the reigning emperor justinian i, resulted in over 100 million deaths and wiped out nearly half the inhabitants of the city. in 1348-50, the plague returned to europe under the name of the black death, killing up to 60% of the continent's population. in march 1918, an influenza outbreak was first reported in a us military camp in kansas. the outbreak, later known as the "spanish flu," subsequently spread and infected up to a billion people, or half the world's population at the time, causing some 50 million deaths within six months. all over the world, changes in socio-economic, demographic and environmental factors brought about by urbanization and industrialization have led to the resurgence of old and new infectious diseases. over the past 40 years, there has been an alarming increase in drug-resistant microbes in diseases such as malaria and tuberculosis. furthermore, the world is also witnessing the emergence of more new pathogens due to microbial evolution and adaptation. then, there are diseases that are lurking in the background, waiting for the right conditions before they strike again. in 1999, west nile virus re-emerged in new york and spread across the united states to long island, connecticut, maryland, florida, california, arizona, and colorado, with over 4100 reported cases and 280 associated deaths within a span of five years.
previously known to be a mild disease, the re-emergence of epidemic chikungunya virus (chikv) in africa, indian ocean, south-east asia, pacific, north america, and europe in the past decade has caused severe morbidity with some cases of fatality. in april 2009, a new strain of human influenza a (h1n1) virus containing genes from human, swine and avian influenza a viruses emerged in mexico. over the course of one year, the virus had spread to more than 212 countries and overseas territories or communities, causing more than 15 921 deaths. more recently, in january 2012, a human case of avian influenza a (h5n1) virus infection was reported in china. if history is our guide, we can assume that the threat of these diseases will continue to grow and pose a serious problem to the security of countries worldwide.
similarity between related sequences can give clues to the structure, function, or homology to the common ancestor. computational methods that can compare sequence features are, therefore, particularly useful. sequence alignment is the determination of residue-residue correspondences between two or more character strings, usually preserving the relative order. this method allows us to measure similarity and infer evolutionary relationships between two or more sequences. pairwise sequence alignment is useful for analyzing the degree of similarity between two biological sequences. where more than two sequences are involved, multiple sequence alignment can be used to identify regions of similarity that may help explain functional and/or phenotypic variability. the 2009 h1n1 flu was not the first human pandemic caused by influenza a viruses. it is related to the 1889 russian flu that killed ∼1 million people, the 1918 spanish flu that infected ∼25% of the global population and killed at least 50 million people worldwide, the 1957 asian flu that resulted in ∼2 million deaths and the 1968 hong kong flu that caused ∼1 million deaths.
in cases where the ancestry is unclear, sequence alignment methods can be used to infer their phylogenetic relationships. this includes:
■ identifying globally optimal alignment solutions for studying highly conserved sequences;
■ identifying maximally homologous subsequences among sets of long sequences for detecting distantly related proteins.
in information theory and computer science, four types of metrics are commonly used to measure the edit distance between two strings of characters. they include:
■ the hamming distance, which is the number of positions with mismatched characters between two strings of the same length.
■ the levenshtein distance, which is the minimum number of operations that is needed to transform one string into the other, which may be of different length. an operation can be a deletion, insertion, or substitution of a single character in the strings.
■ the damerau-levenshtein distance, which is the minimum number of operations that is needed to transform one string into the other, which may be of different length. an operation can be a deletion, insertion, or substitution of a single character, or a transposition of two adjacent characters in the strings.
■ the jaro-winkler distance, which is a measure of similarity between two strings using the jaro distance metric. this method first identifies the common characters between two strings of characters. two characters are common if there is an exact match and if the difference in positions between the two strings is less than half the length of the shorter string. once all the common characters are determined, the number of transpositions of common characters is determined and used to compute the jaro similarity. strings that are more similar will have a higher jaro similarity.
in biological systems, certain amino acid changes are more likely to occur than others. for example, a hydrophobic residue is more likely to be replaced by another hydrophobic residue than a hydrophilic residue.
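the first two edit-distance metrics listed above can be sketched in python as follows. this is a minimal illustration, not an optimized implementation: the function names `hamming` and `levenshtein` are chosen here for convenience, and the levenshtein routine uses the standard two-row dynamic programming table with a cost of 1 for every operation.

```python
def hamming(a, b):
    """Number of mismatched positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("hamming distance needs strings of equal length")
    return sum(x != y for x, y in zip(a, b))


def levenshtein(a, b):
    """Minimum number of single-character edits turning `a` into `b`."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(
                prev[j] + 1,         # delete x from a
                curr[j - 1] + 1,     # insert y into a
                prev[j - 1] + cost,  # substitute x by y (or match)
            ))
        prev = curr
    return prev[-1]
```

for example, `hamming("GATTACA", "GACTATA")` is 2 and `levenshtein("ACGT", "AGT")` is 1. every operation here has unit cost, which ignores the fact that some residue changes are more likely than others.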
to account for such transformations, a weight can be assigned to the different edit operations. this can take the form of a matrix that shows the substitution frequencies of observed pairs of amino acid residues. two popular substitution matrices are: ■ the percent accepted mutation (pam) matrices by dayhoff, which measure sequence similarity in closely related species. two sequences 1 pam apart have an average of one accepted point mutation event per 100 amino acids. they need not be 99% identical, as two accepted point mutations can occur at the same position. to analyze sequences that are more divergent, we can use the pam1 matrix as a base for calculating other matrices. this is based on the assumption that repeated mutations would follow the same pattern as those in the pam1 matrix. ■ the blocks substitution matrix (blosum) matrices by henikoff and henikoff, which measure sequence similarity in divergent sequences. the matrices are constructed from the blocks database of aligned conserved regions in divergent protein families. these regions are assumed to be of functional importance. once the substitution matrix is selected, the optimal alignment can be found using dynamic programming algorithms. a related concept is the use of theoretical statistics, such as information entropy, to quantify the rate of information transfer in biological sequences. the shannon entropy is a measure of uncertainty that is associated with a random variable. it is commonly used to assess the variability of microbial proteomes and epitope sequences. for a given alignment, the information content (i.e. entropy) of an amino acid position h(x) is defined by: $H(x) = -\sum_{x} p(x)\,\log_2 p(x)$, where x is one of 20 amino acid residue types. p(x), the probability of occurrence of x, is estimated by f(x), the frequency of the appearance of residue type x within the alignment column: $f(x) = n(x)/l$, where n(x) is the number of appearances of amino acid residue x, and l is the length of the column.
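the per-column entropy described above can be sketched as follows. this is an illustrative sketch under stated assumptions: the logarithm is taken to base 2 (so entropy is in bits), gap characters are treated as just another symbol, and the names `column_entropy` and `alignment_entropy` are hypothetical.

```python
from collections import Counter
from math import log2


def column_entropy(column: str) -> float:
    """Shannon entropy (bits) of one alignment column, with p(x) estimated as n(x)/l."""
    length = len(column)
    counts = Counter(column)  # n(x) for every residue type seen in the column
    return -sum((n / length) * log2(n / length) for n in counts.values())


def alignment_entropy(sequences: list[str]) -> list[float]:
    """Entropy of every column of an alignment given as equal-length strings."""
    return [column_entropy("".join(col)) for col in zip(*sequences)]


if __name__ == "__main__":
    alignment = ["MKVL", "MKIL", "MRVL", "MKAL"]
    for position, h in enumerate(alignment_entropy(alignment), start=1):
        print(position, round(h, 3))
```

a fully conserved column gives an entropy of zero, and a column in which all 20 residue types are equally frequent gives the maximum of log2(20), about 4.32 bits.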
this method has been used to analyze the genetic diversity and antigenic relationships of chikungunya virus (chikv) proteomes from its introduction in 1952 to 2009. the study suggested that chikv is undergoing mild positive selection, with significant amounts of "antigenic switches" clustered over the entire genome. antigenic switches refer to changes in gene expression at a specific site which may abrogate binding to hla molecules or antagonize/interfere with t cell response, leading to cellular immune evasion. an effective way to identify amino acid residues that are involved in virus adaptation is to find interdependencies between mutations in multiple proteins. a simple way to do this is to calculate mutual information (mi) between variable pairs. mi is an information theoretical statistic that measures the strength of association between a pair of variables. the mutual information between two variables a and b is defined by: $MI(A, B) = H(A) + H(B) - H(A, B)$, where h(a) and h(b) are the entropies of a and b, and h(a, b) is their joint entropy. the evolutionary inertia of a pathogen can be qualitatively examined by studying the nucleotide usage patterns at single amino acid sites. the neutral theory of molecular evolution by kimura in 1968 states that most evolutionary changes at the molecular level are caused by random genetic drift of selectively neutral nucleotide substitutions. due to the degeneracy of the genetic code, some point mutations are silent with no amino acid replacements. silent or synonymous substitutions are primarily transparent to natural selection, whereas replacement or non-synonymous substitutions may be a result of strong selective pressure. a simple method to calculate the extent of adaptive evolution at highly variable genetic loci is to compare the fixation rates between nonsynonymous (d_n) and synonymous (d_s) substitutions. the d_n/d_s ratio (ω), otherwise known as the "acceptance rate," provides a sensitive measure of selection pressure at the amino acid level.
ω = 1 indicates neutral expectation, ω < 1 suggests negative (purifying) selection, while ω > 1 suggests positive (diversifying) selection. a group of genes that often shows the ω > 1 relationship is the antigenic genes in human immunodeficiency virus-1, plasmodia, and other parasites. the hemagglutinin gene from influenza a virus is probably one of the fastest evolving genes in terms of the rate of nucleotide substitution, which was estimated at 5.7 × 10^-3 per site per year. this high genetic variation confers a fitness advantage to the pathogen in its attempt to evade host defenses. the simple counting method of nei and gojobori is commonly used for estimating d_n and d_s. however, the reliability of this technique is low when the rate of transitional nucleotide change is higher than that of transversional change. the model-based maximum likelihood (ml) methods such as those proposed by muse and gaut and by goldman and yang represent a viable and widely used alternative for this purpose. the original ml model of goldman and yang assumes a single ω for all lineages and sites, and has been extended to account for variation by allowing ω to vary either across lineages, among substitution sites, or both among sites and among lineages. lineage-specific models assume that ω does not vary among sites, and can detect positive selection for a lineage only if the averaged d_n over all sites is greater than the average d_s. site-specific models, on the other hand, allow ω to vary among sites but not among lineages. as such, these models can detect positive selection at individual sites only if the averaged d_n over all lineages is greater than the average d_s. by allowing ω to vary both among sites and among lineages, the method can be applied to detect positive selection that occurred at a few time points and affects a few sites.
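the distinction between synonymous and nonsynonymous changes, and the reading of ω given above, can be illustrated with a small sketch. it is deliberately much simpler than the nei and gojobori method: it only tallies raw observed differences between two aligned coding sequences, looks only at codon pairs that differ at exactly one position, and does not normalise by the numbers of synonymous and nonsynonymous sites or correct for multiple hits, so the counts are not d_n and d_s themselves. the standard genetic code and uppercase dna input are assumed, and the function names are hypothetical.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def count_differences(seq1: str, seq2: str) -> tuple[int, int]:
    """Return (synonymous, nonsynonymous) counts over single-difference codon pairs."""
    if len(seq1) != len(seq2) or len(seq1) % 3:
        raise ValueError("sequences must be aligned, in frame and of equal length")
    synonymous = nonsynonymous = 0
    for i in range(0, len(seq1), 3):
        codon1, codon2 = seq1[i:i + 3], seq2[i:i + 3]
        if sum(a != b for a, b in zip(codon1, codon2)) != 1:
            continue  # identical codons, or multi-hit codons that need pathway averaging
        if CODON_TABLE[codon1] == CODON_TABLE[codon2]:
            synonymous += 1
        else:
            nonsynonymous += 1
    return synonymous, nonsynonymous


def interpret_omega(omega: float) -> str:
    """Map a d_n/d_s ratio onto the three regimes described in the text."""
    if omega > 1:
        return "positive (diversifying) selection"
    if omega < 1:
        return "negative (purifying) selection"
    return "neutral expectation"


if __name__ == "__main__":
    print(count_differences("TTTGCTAAA", "TTCGCTAGA"))
    print(interpret_omega(0.2))
```

in practice an estimate of ω is rarely exactly 1, so a real analysis would rely on a statistical test, such as the likelihood-based models discussed above, rather than on a direct comparison.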
upcoming challenges for multiple sequence alignment methods in the high-throughput era
founder effects in the assessment of hiv polymorphisms and hla allele associations
prediction and entropy of printed english
hla class i restriction as a possible driving force for chikungunya evolution
complete-proteome mapping of human influenza a adaptive mutations: implications for human transmissibility of zoonotic strains
mining mutation chains in biological sequences
unifying the epidemiological and evolutionary dynamics of pathogens
selection-driven evolution of emergent dengue virus
a method for detecting positive selection at single amino acid sites
adaptsite: detecting natural selection at single amino acid sites
evolutionary rate at the molecular level
selectionism and neutralism in molecular evolution
molecular evolution of mrna: a method for estimating evolutionary rates of synonymous and amino acid substitutions from homologous nucleotide sequences and its applications
sequence relationships among the hemagglutinin genes of 12 subtypes of influenza a virus
simple methods for estimating the numbers of synonymous and nonsynonymous nucleotide substitutions
a likelihood approach for comparing synonymous and nonsynonymous nucleotide substitution rates, with application to the chloroplast genome
a codon-based model of nucleotide substitution for protein-coding dna sequences
a maximum likelihood method for detecting directional evolution in protein sequences and its application to influenza a virus
codon-substitution models for detecting molecular adaptation at individual sites along specific lineages

key: cord-321150-ev6acl7b authors: lam, ha minh; ratmann, oliver; boni, maciej f title: improved algorithmic complexity for the 3seq recombination detection algorithm date: 2017-10-03 journal: mol biol evol doi: 10.1093/molbev/msx263 sha: doc_id: 321150 cord_uid: ev6acl7b

identifying recombinant sequences in an era of large genomic databases is challenging as it requires
an efficient algorithm to identify candidate recombinants and parents, as well as appropriate statistical methods to correct for the large number of comparisons performed. in 2007, a computation was introduced for an exact nonparametric mosaicism statistic that gave high-precision p values for putative recombinants. this exact computation meant that multiple-comparisons corrected p values also had high precision, which is crucial when performing millions or billions of tests in large databases. here, we introduce an improvement to the algorithmic complexity of this computation from o(mn^3) to o(mn^2), where m and n are the numbers of recombination-informative sites in the candidate recombinant. this new computation allows for recombination analysis to be performed in alignments with thousands of polymorphic sites. benchmark runs are presented on viral genome sequence alignments, new features are introduced, and applications outside recombination analysis are discussed. determining whether genomic regions are undergoing homologous recombination is important in all parts of biology and genetics. indeed, recombination has profound consequences for a population's evolutionary trajectory, and it changes our understanding of the evolutionary history of a population as described through phylogenetics (schierup and hein 2000). identifying recombination is especially important in large genomic analyses, as the larger the region being analyzed the higher the chance that recombination will be detected even in a small sample. over the past three decades, methods of identifying recombination from sequence data have focused on detection of clustered polymorphism, excessive homoplasy, low linkage disequilibrium, mosaicism, and incongruent phylogenies. some of these statistical signals have advantages over others in terms of false positive rate, statistical power, speed, and the size of the data set that can be analyzed.
an analysis of sensitivity and specificity can be found in posada and crandall (2001) and a guide to choosing an appropriate method for a given data set can be found in martin et al. (2011). in modern sequence analysis, a major challenge in recombination detection is the size of the data sets themselves. beyond the computational burden, critical but often underappreciated statistical issues arise through the extremely large number of compared nucleotide sequence patterns. with this many comparisons being performed, truly nonrecombinant sequences can exhibit nucleotide patterns that appear recombinant by chance. for this reason, statistical corrections for multiple comparisons are essential to guard against calling spurious recombinants. in an algorithm called 3seq, boni et al. (2007) presented an exact mosaicism statistic for calling recombinants. critically, the exactness of the computation (e.g., calculating p values to a precision of 10^-20 or 10^-30) allows these mosaic signals to remain statistically significant, even when billions of comparisons are being performed and adjusted for multiple comparison. this means that the exact mosaicism statistic implemented in the 3seq software maintains good power properties even on large data sets when statistical correction factors for multiple comparisons are on the order of 10^10 or more. recombination detection methods that detect mosaic signals always take a triplet approach or a quartet approach, positing one sequence as the candidate recombinant, two sequences as the parents, and possibly a fourth sequence as an outgroup. with the parental sequences labeled p and q and the candidate recombinant labeled c, these methods normally use "recombination informative" sites, or simply informative sites, to determine if c is a mosaic of p and q. in 3seq, nucleotide positions on c are labeled informative if the nucleotide in c is identical to one parental sequence but different from the other.
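the definition of an informative site given above translates directly into code. the following is a minimal sketch and not the 3seq implementation: the function name `informative_sites` is hypothetical, the three sequences are assumed to be pre-aligned strings of equal length, and gap or ambiguity characters are compared like any other symbol.

```python
def informative_sites(c: str, p: str, q: str) -> list[tuple[int, str]]:
    """Return (position, parent) pairs where c matches exactly one of the two parents."""
    if not (len(c) == len(p) == len(q)):
        raise ValueError("the three sequences must be aligned to the same length")
    sites = []
    for position, (child, parent_p, parent_q) in enumerate(zip(c, p, q)):
        if child == parent_p and child != parent_q:
            sites.append((position, "P"))  # identical to p, different from q
        elif child == parent_q and child != parent_p:
            sites.append((position, "Q"))  # identical to q, different from p
    return sites


if __name__ == "__main__":
    for position, parent in informative_sites("AAGT", "AACT", "TAGT"):
        print(position, parent)
```

reading the second element of each pair from left to right gives the sequence of p-type and q-type informative sites whose ordering is then tested for clustering.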
if the sequence of m informative sites identical to p and n informative sites identical to q appears nonrandom or clustered, this is an indication that recombination may have occurred.

© the author 2017. published by oxford university press on behalf of the society for molecular biology and evolution. this is an open access article distributed under the terms of the creative commons attribution non-commercial license (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. for commercial re-use, please contact journals.permissions@oup.com

when read from left to right along the sequence, the informative sites can be used to draw a random walk on a set of axes with m up-steps and n down-steps; this is called a hypergeometric random walk (hgrw). a strong descent or ascent in the middle of a hgrw indicates that one type of informative site exhibits clustering, and the properties of the random walk can be used to compute exact probabilities of this occurring. see figure 1 for an example. in this letter, we present a new and faster method of computing these probabilities. the central feature of 3seq was a reduction of an o(2^(m+n)) space-complexity problem into an o(mn^3) problem, for computing the probability x_{m,n,k} that a hgrw with m up-steps and n down-steps achieves a maximum descent of size k exactly. the descent does not need to be k consecutive down-steps. the computations were done via auxiliary variables y_{m,n,k,j}: the probability that a hgrw with m up-steps and n down-steps achieves a maximum descent of size k exactly and the minimum value achieved by the random walk is exactly j units below the origin. the y-variables can be computed recursively (boni et al. 2007) by building a table of size mn^3.
the x-variables are then computed as follows: $x_{m,n,k} = \sum_{j=0}^{k} y_{m,n,k,j}$ (1). by separating out the first and last term in the sum above, and using the y-variable recursions, a nearly direct recursion, equation (2), can be written for the x-variables. the p value for observing a maximum descent of size at least k is defined by $p_{m,n,k} = \sum_{i=k}^{n} x_{m,n,i}$ (3), and the recursions for the p-variables reduce to equation (4). the y-variables in equation (2), since their last two indices are equal, can be computed recursively by building one table of size mn^2. the p-variables can be recursively computed by building a second table of size mn^2. this means that the entire computational procedure of p values can be done with space complexity o(mn^2) instead of the original o(mn^3) presented in boni et al. (2007). all computations were verified against the original approach. this new approach allows larger probability tables to be built more quickly. using the 2007 recursions, a table of size 700 × 700 × 700 was built in 9 h and used 5.1-gb ram (2.6 ghz processor; 16-gb ram). using the recursions above, a table of 1,600 × 1,600 × 1,600 was built in 1 h and 42 min in a 10.4-gb memory footprint. two other noteworthy improvements were made to the algorithm: 1) faster breakpoint calculations by using polymorphic sites only in breakpoint searches, and 2) a repeated subsampling feature that allows for comparison of data sets of different sizes; with this feature one can randomly subsample m sequences from multiple databases or sequence collections, and repeat the process to see how often these subsets exhibit recombination. the new source code and manual can be downloaded from http://mol.ax/3seq. when a p value falls outside the bounds of the table being used, the software substitutes in hogan-siegmund approximations (hogan and siegmund 1986) for the queried p value.

fig. 1. relationship between ordering of informative sites along a genome and a hypergeometric random walk. below each set of axes, the 30 red bars and 30 blue bars show positions on a genome (informative sites) where a putative recombinant sequence is identical to parent p but different from parent q (blue bars), or identical to parent q but different from parent p (red bars). each blue site can be mapped to an up-step in a random walk and each red site can be mapped to a down-step in a random walk, and there is a one-to-one correspondence between the space of informative-site arrangements and the space of hypergeometric random walks. (a) a random arrangement of informative sites, which does not visually suggest that the sequence is a mosaic of putative parents p and q. the arrangement of sites maps to a random walk which stays fairly close to the horizontal axis. this walk's maximum descent is eight steps, and ∼54% of hgrws with 30 up-steps and 30 down-steps have a maximum descent of eight steps or greater. (b) a nonrandom arrangement of informative sites that clearly suggests that the candidate sequence is a mosaic of the two parental sequences p and q. the probability of all the red sites appearing consecutively is 31! × 30!/60!, which is 2.62 × 10^-16. (c) an arrangement of red sites and blue sites that suggests the red sites may be clustered in the middle. when mapping the site arrangement to a hypergeometric random walk, the random walk has a maximum descent of 18 steps. the p value for a maximum descent of 18 steps cannot be written down in closed form but can be calculated from recursion (4). the p value for this maximum descent and for this arrangement of informative sites is 1.8 × 10^-4.

the 3seq maximum descent statistic describes clustering patterns in sequences of binary outcomes, and is therefore not confined to recombination analysis.
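the maximum descent statistic and its exact p value can be illustrated with a short sketch. note that this is not the o(mn^2) recursion of the paper: it is a simpler dynamic program, chosen for readability, that counts the walks whose drop below the running maximum never reaches k and subtracts their share from one. the names `max_descent` and `p_value` are hypothetical, and the recursive formulation is only suitable for small m and n.

```python
from fractions import Fraction
from functools import lru_cache
from math import comb


def max_descent(steps) -> int:
    """Largest drop from any earlier height (including the origin) to a later height."""
    height = peak = worst = 0
    for step in steps:  # each step is +1 (up) or -1 (down)
        height += step
        peak = max(peak, height)
        worst = max(worst, peak - height)
    return worst


def p_value(m: int, n: int, k: int) -> Fraction:
    """Exact P(maximum descent >= k) for a walk with m up-steps and n down-steps."""
    if k <= 0:
        return Fraction(1)
    if k > n:
        return Fraction(0)

    @lru_cache(maxsize=None)
    def avoiding(ups: int, downs: int, drop: int) -> int:
        """Count completions that keep the drop below the running maximum under k."""
        if ups == 0 and downs == 0:
            return 1
        total = 0
        if ups:
            total += avoiding(ups - 1, downs, max(drop - 1, 0))
        if downs and drop + 1 < k:
            total += avoiding(ups, downs - 1, drop + 1)
        return total

    return 1 - Fraction(avoiding(m, n, 0), comb(m + n, n))


if __name__ == "__main__":
    walk = [1 if ch == "a" else -1 for ch in "aabbbab"]
    print(max_descent(walk))
    print(float(p_value(30, 30, 30)))  # all 30 down-steps consecutive
```

for 30 up-steps and 30 down-steps, a maximum descent of 30 requires all down-steps to be consecutive, so the p value equals 31 divided by the number of arrangements, which matches the value of 31! × 30!/60! quoted for panel (b) of figure 1. exact rational arithmetic is used here because probabilities of this size are easily lost to rounding.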
the statistic can be viewed as a generalization of the mann-whitney u statistic, in the sense that outcomes of one type (of a binary outcome variable) do not necessarily have to cluster or rank at the beginning or end of a sequence of data points. the maximum descent of a hgrw can be used to describe the clustering of one particular binary outcome in the middle of a sequence of binary outcomes; in other words, it is a 1d nonparametric clustering statistic. in recombination analysis, this is the clustering of one kind of informative site among all the informative sites (han et al. 2010). to make use of this statistic easier for those working outside the field of recombination, we developed a web calculator (fig. 2) that computes exact p values for clustering in a sequence of binary outcomes, available at http://mol.ax/delta. for example, the sequence "aaaaabbbbabbbabbbaaaa" can be typed in and the calculator reports that the clustering of bs in the middle of the sequence is significant at p = 0.0055. we list two practical example uses of our nonparametric clustering statistic. first, seasonality can be assessed nonparametrically. if a particular population behavior or climatic characteristic (e.g., rain or no rain) can be noted to occur or not occur every day, then an ordered sequence of the days in the year will show if the occurrence of one of the behaviors is clustered and thus if this feature was seasonal in that one year. as a second example, when a process is expected to behave at an intermediate range or when an observation is expected to be made at intermediate values only, this pattern can be tested for nonparametrically. dengue virus does not cause severity for all ages equally.
one's first dengue infection, occurring during childhood, is typically nonsevere; secondary infections, seen in older children and teenagers, have a higher chance of severity, whereas tertiary and subsequent infections, those that would occur in older age groups, are thought to be rare and/or subclinical (gubler 1998; wikramaratna et al. 2010). thus, disease severity in a surveillance system should be seen in the intermediate age ranges, and this can be tested for nonparametrically by noting if each age band is overrepresented or underrepresented in the pool of patients experiencing dengue-like severe disease in a hospital. in fact, since all that is required here is a symptoms description, the identification of a vulnerable age range can be done for any set of symptoms.

fig. 2. screenshot of new online tool that can be used to calculate p values testing the hypothesis of whether one binary outcome clusters in the middle of a (1d) sequence of binary outcomes. one input method is simply typing two characters in a text box (above, "u" for up and "d" for down) and letting the calculator return a p value showing whether one type of character is clustered in the middle. to test whether the other type of character is clustered, the "swap" button can be used. the hypergeometric walk is shown graphically. the exact p value, computed with the methods in this article, is shown. the two hogan-siegmund approximations for this p value are also shown.

to illustrate improved runtimes and memory usage of the new 3seq algorithm, we searched for recombinants among large sequence data sets of dengue virus serotype 2, ebola virus, the coronavirus responsible for middle-east respiratory syndrome (mers) and zika virus; see table 1. full-length zika virus sequences were downloaded from the ncbi viral variation resource (brister et al. 2014) and aligned with muscle v3.8 (edgar 2004). full-length sequences of ebola virus, dengue virus serotype 2, and the coronavirus responsible for mers were downloaded from ncbi and aligned with the online ncbi alignment tools. ebola virus sequences were restricted to human viruses sampled in africa after december 1, 2013. dengue virus serotype 2 was chosen to include a particularly large and polymorphic alignment. as negative controls, we considered segments pb2 and ns from avian influenza a virus, subtype h5n1, originally analyzed in boni et al. (2010); only sequences from the influenza genome sequencing project were included (ghedin et al. 2005) and identical sequences were removed (when identical sequences were not removed, results using the new version of 3seq were identical to the results in table 1 of boni et al. 2010). the new version of the software, run with a p value table of size 1,200 × 1,200 × 1,200, had faster computation times than the previous version and was able to comfortably accommodate alignments with thousands of polymorphic sites. table 1 shows the results of all runs. note that because 3seq evaluates all triplets in a data set, the run time of the algorithm scales as the cube of the number of sequences and linearly with the alignment length. as informative sites can sometimes be clustered in short regions of the genome, 3seq will report these short segments as recombinant. for this reason, an additional column is included in table 1 showing the number of sequences that were identified as recombinant with both inherited regions being longer than 500 nt; if one of the recombinant regions is very short, it is difficult to confirm the recombination results with a phylogenetic analysis of the two identified parental segments.
starting with the analysis on the two negative control data sets, no recombinant segments longer than 500 nt were detected in either avian influenza alignment. both of these runs took <30 s. the genomic alignments of mers and zika virus contained 1,150 and 2,792 polymorphic sites, respectively, and >99.9% of triplets could be tested for mosaicism with exact p values. these runs took <2 min. as expected from a recent analysis by dudas and rambaut (2015), the mers sequence data set was highly recombinant, with 100 out of 164 sequences being identified as such. for zika, 6 out of 157 virus sequences were identified as recombinant, consistent with earlier analyses supporting the presence of recombination in the evolutionary history of zika (faye et al. 2014; zhu et al. 2016); details of the recombinants, parents, and breakpoints are included in the supplementary material online. the ebola virus and dengue virus alignments each contained around 1,000 sequences. the ebola virus data showed no evidence of recombination. the dengue alignment was the most diverse of all the tested data sets with 6,151 polymorphic sites; 99.4% of the triplets in this data set could be evaluated with exact p values. a total of 36 out of 1,108 dengue sequences were identified as recombinant (see supplementary material online). several previous analyses of dengue virus have shown evidence for intraserotype recombination in dengue (worobey et al. 1999; uzcategui et al. 2001; aaskov et al. 2007; waman et al. 2016, 2017). the results presented here, as well as those of waman et al. (2017), suggest that recombination in dengue is infrequent. in general, when recombinants are identified by a mosaicism statistic like the one used by 3seq, a phylogenetic analysis should be performed to ensure that the recombination signal is preserved when the entire evolutionary history of the sample is taken into account.
the size of modern data sets presents two challenges here. first, as the number of available sequences increases, the choice for phylogenetic inference tools drifts to more approximate methods, as thorough explorations of tree space become computationally expensive for large numbers of sequences. this reduces our confidence in phylogenetic incongruence signals that we observe in these data. second, genome-level analyses in highly recombining organisms are likely to result in a subdivision of the genome into many nonrecombinant blocks. inferring phylogenies for all blocks individually will be computationally expensive, as will the subsequent analysis of identifying specific phylogenetic incongruences among the trees. the next generation of recombination detection methods should focus on these computational challenges. supplementary data are available at molecular biology and evolution online.

multiple recombinant dengue type 1 viruses in an isolate from a dengue patient
guidelines for identifying homologous recombination events in influenza a virus
an exact nonparametric method for inferring mosaic structure in sequence triplets
virus variation resource - recent updates and future directions
mers-cov recombination: implications about the reservoir and potential for adaptation
muscle: multiple sequence alignment with high accuracy and high throughput
molecular evolution of zika virus during its emergence in the 20th century
large-scale sequencing of human influenza reveals the dynamic nature of viral genome evolution
dengue and dengue hemorrhagic fever
no observed effect of homologous recombination on influenza c virus evolution
large deviations for the maxima of some random fields
phylogenetic evidence for recombination in dengue virus
analysing recombination in nucleotide sequences
evaluation of methods for detecting recombination from dna sequences: computer simulations
the effect of recombination on the accuracy of phylogeny estimation
recombination in evolutionary genomics
consequences of recombination on traditional phylogenetic analysis
molecular epidemiology of dengue type 2 virus in venezuela: evidence for in situ virus evolution and recombination
genetic diversity and evolution of dengue virus serotype 3: a comparative genomics study
population genomics of dengue virus serotype 4: insights into genetic structure and evolution
the effects of tertiary and quaternary infections on the epidemiology of dengue
widespread intra-serotype recombination in natural populations of dengue virus
comparative genomic analysis of pre-epidemic and epidemic zika virus strains for virological factors potentially associated with the rapidly expanding epidemic

key: cord-102766-n6mpdhyu authors: alam, md. nafis ul; chowdhury, umar faruq title: short k-mer abundance profiles yield robust machine learning features and accurate classifiers for rna viruses date: 2020-06-25 journal: biorxiv doi: 10.1101/2020.06.25.170779 sha: doc_id: 102766 cord_uid: n6mpdhyu

high-throughput sequencing technologies have greatly enabled the study of genomics, transcriptomics and metagenomics. automated annotation and classification of the vast amounts of generated sequence data has become paramount for facilitating biological sciences. genomes of viruses can be radically different from all life, both in terms of molecular structure and primary sequence. alignment-based and profile-based searches are commonly employed for characterization of assembled viral contigs from high-throughput sequencing data. recent attempts have highlighted the use of machine learning models for the task but these models rely entirely on dna genomes and owing to the intrinsic genomic complexity of viruses, rna viruses have gone completely overlooked. here, we present a novel short k-mer based sequence scoring method that generates robust sequence information for training machine learning classifiers.
we trained 18 classifiers for the task of distinguishing viral rna from human transcripts. we challenged our models with very stringent testing protocols across different species and evaluated performance against blastn, blastx and hmmer3 searches. for clean sequence data retrieved from curated databases, our models display near perfect accuracy, outperforming all similar attempts previously reported. on de-novo assemblies of raw rna-seq data from cells subjected to ebola virus, the area under the roc curve varied from 0.6 to 0.86 depending on the software used for assembly. our classifier was able to properly classify the majority of the false hits generated by blast and hmmer3 searches on the same data. the outstanding performance metrics of our model lay the groundwork for robust machine learning methods for the automated annotation of sequence data. author summary: in this age of high-throughput sequencing, proper classification of copious amounts of sequence data remains a daunting challenge. presently, sequence alignment methods are immediately assigned to the task. owing to the selection forces of nature, there is considerable homology even between the sequences of different species, which makes the results of alignment-based searches ambiguous. machine learning methods are becoming more reliable for characterizing sequence data, but virus genomes are more variable than all forms of life and viruses with rna-based genomes have gone overlooked in previous machine learning attempts. we designed novel short k-mer based scoring criteria whereby a large number of highly robust numerical feature sets can be derived from sequence data. these features were able to accurately distinguish virus rna from human transcripts with performance scores better than all previous reports. our models were able to generalize well to distant species of viruses and mouse transcripts.
the model correctly classifies the majority of false hits generated by current standard alignment tools. these findings strongly imply that this k-mer score based computational pipeline forges a highly informative, rich set of numerical machine learning features and similar pipelines can greatly advance the field of computational biology. viruses are numerous [1] and only a handful have been thoroughly characterized thus far [2].
as we have entered, and slowly progress through, the age of automated, high-throughput genomics, generation of sequence data itself is no longer of much concern, but rather accurate annotation of

the bacterial 16s rrna gene tremendously aids the study of bacterial phylogeny [4]. the lack of an omnipresent gene or genome segment that can be ubiquitously used to extract phylogenetically relevant data about viruses complicates the study of virus evolution [5, 6]. there are additional levels to the structural complexity of viral genotypes. viral genomes stand out from those of all other genomic entities as they can be either single, double or gapped dna or rna molecules with positive, negative or ambi-sense mechanism of genomic encoding, whereas all known realms of life exclusively rely on dna as the genetic material. these categories of classification are built on top of the original baltimore scheme [7] that sought to characterize viral genomes with regard to expression of genetic information [8, 9].

in the search for viruses in metagenomic data, researchers have noted that the usefulness of homology based search tools is quickly exhausted [6, 10]. most assembled contigs likely to be of viral origin are short and fragmented, with no guarantee of containing the coding regions that some of these tools rely on [11]. a number of software programs have been developed for the purpose of identifying viral sequences that have integrated into host genomes [12] [13] [14] [15]. all machine learning models built for identification and characterization of viral sequences from cell samples or metagenomic data thus far rely on dna sequences [16, 17]. rna viruses comprise a major group having great clinical and scientific importance [18], presently made even more evident by the global pandemic caused by the 2019 novel sars coronavirus [19].
it has been demonstrated that rna-seq data can be a very promising avenue for improving knowledge on rna viruses when leveraged by carefully designed algorithms [20]. here, we present a novel short k-mer based scoring function that can be applied to sequences of different types to achieve impressive discriminatory capacity from classic machine learning models. we constructed a number of classifiers that rely on the profile of short genetic elements across the genome to classify rna viruses from cellular mrna and further classify them into positive or negative strand rna viruses. we tested the models rigorously using stringent cutoff parameters and many variable test sets including noisy de-novo assembly data of cells cultured with ebola virus. we evaluated the performance of our selected model against nucleotide blast, protein blast and protein family hmmer3 search. in spite of the stiff testing protocols, our k-mer based numerical features proved to be informative and robust datapoints for the classification of biological sequences by machine learning algorithms.

a total of 5460 features were contrived using the sequence scoring formula for k-mer elements of length 1 to 6. the feature columns were tested for data normality by several methods. q-q plots of the data columns were suggestive of a normal distribution (not shown but can be found on the git repository), but more robust statistical normality tests such as the kolmogorov-smirnov test and shapiro-wilk test strongly indicated a non-normal distribution for all feature columns. our features were therefore not suitable for univariate statistical analysis for feature selection. scikit-learn's in-built tree-based feature importance values derived from our trained random forest classifier were applied for feature selection on two levels.
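the figure of 5460 features quoted above follows from enumerating every k-mer over the alphabet a, c, g, t for k = 1 to 6 (4 + 16 + 64 + 256 + 1024 + 4096). the sketch below shows that enumeration together with a simplified per-k-mer score: the number of windows that differ from the k-mer at exactly j positions, divided by the sequence length. the function names, the substitution-only mismatch model (gaps are not handled) and the length normalisation are assumptions made for illustration; this is not the scoring formula used in the study.

```python
from itertools import product

ALPHABET = "acgt"


def all_kmers(k_min=1, k_max=6):
    """Return every k-mer over a, c, g, t for k_min <= k <= k_max."""
    return [
        "".join(letters)
        for k in range(k_min, k_max + 1)
        for letters in product(ALPHABET, repeat=k)
    ]


def mismatch_count(seq, kmer, j=0):
    """Count windows of seq that differ from kmer at exactly j positions.

    Assumption: only substitutions are modelled; gaps are not handled.
    """
    k = len(kmer)
    hits = 0
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if sum(a != b for a, b in zip(window, kmer)) == j:
            hits += 1
    return hits


def kmer_features(seq, j=0, k_min=1, k_max=6):
    """Map each k-mer to its mismatch-tolerant count divided by len(seq).

    Assumption: dividing by sequence length is an illustrative
    normalisation, not the formula from the paper.
    """
    seq = seq.lower()
    if not seq:
        raise ValueError("empty sequence")
    return {
        kmer: mismatch_count(seq, kmer, j) / len(seq)
        for kmer in all_kmers(k_min, k_max)
    }
```

with these definitions, `len(all_kmers())` evaluates to 5460, which matches the feature count stated in the text. the brute-force loop is quadratic in practice and is only meant for short example sequences.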
similarly, scikit-learn's optimized lasso model applied to our trained logistic regression classifier with l1 penalty was also used for feature selection on two succeeding levels. using a combination of both our tree-based selections and lasso, we obtained a total of 6 levels of feature importance categories (4 from the trees and lasso themselves and 2 more for each method leading into the other on a second level), each assigning a value of 1 to selected features and 0 to omitted ones (fig 1). the top 3 feature sets ranked by importance from these 6 combined levels of selection are presented below.

summed feature score 5: caatc, tccaat, gcaatc, ttgac, cgtagt, acggtt, ttga, gttggt, catcaa, atacgg, atcgta, cgataa, cgatta, gttgat, cgat, cgatc, cggtt, gcgtac, cgcgtt, ttgcg, cgtcga, caatct, atcaac, attgg, atagcg, gatc, ttgcgc, ggttg, tcggtt, gtcaat, atcaat, ttcgac, ccgtta, ccgata, tcga, gcgtta

summed feature score 4: atcata, acaatc, cgta, tcgt, atggta, ggtagt, tggt, ctgt, gtcga, agctg, gtcaa, aaataa, tcctg, cagc, tatccg, ctgga, caatt, tcggta, aggaa, atcaa, ataacg, tccaac, aattgg, aaatg, tggttc, aacc, ttgatg, ggcgta, tggtt, cgttgc, gtcgac, tcaac, caga, tcaatt, cagag, tcaat, aggttg, tcaacc, cgcgta, atccaa, ggtacg, gtcata, attggt, gcatac, ggtatg, gattgg, gtacgc, aaccga, tcgcaa, ttttta, acatcg, gcgtaa, tttttt, accggt, gttgag, taacac, ataggg, cgcaa, ccgcaa, cggttg, tatggt, aacggt, tagggt, ctaacg, cggta, gttgt, ggttc, acgata, ttcgca, ctcgat, tcgagt, aaaatg, tccat, gtcaaa, aataaa, ttggcg, gttgc, ggtcaa, ctgta

the summed feature score is obtained by summing the total number of times the feature appears in the six filtered feature sets we obtained from our selection algorithm.

the macro-averaged f1-score for the model was 88% and the binary f1-score was 81.83%. among the others, the decision tree models with 5460 and 194 features and the naïve bayes model trained with 68 features were the only models to obtain binary f1-scores below 95%.
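the two f1 variants reported above differ in how per-class scores are combined: the binary f1 is computed for one designated positive class only, whereas the macro-averaged f1 is the unweighted mean of the f1 scores of both classes. a minimal sketch of that arithmetic from confusion-matrix counts follows. the counts are invented for illustration and are not taken from the study, and which class is treated as positive is left as an assumption.

```python
def f1_score(tp, fp, fn):
    """F1 for one class from true positives, false positives, false negatives."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def binary_and_macro_f1(tp, fp, fn, tn):
    """Return (binary F1 of the positive class, macro-averaged F1)."""
    f1_pos = f1_score(tp, fp, fn)
    # For the negative class the roles swap: true negatives are its hits,
    # missed positives (fn) are its false positives, and fp are its misses.
    f1_neg = f1_score(tn, fn, fp)
    return f1_pos, (f1_pos + f1_neg) / 2


# Invented counts for an imbalanced test set (not data from the study).
binary_f1, macro_f1 = binary_and_macro_f1(tp=80, fp=20, fn=20, tn=880)
```

on an imbalanced test set the macro average is pulled up by the well-classified majority class, which is one reason a macro f1 can exceed the binary f1 of the minority class.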
all other models displayed resounding accuracy on both their cross-validation and test sets.

auroc of 0.95 (fig 4). the f1-macro score was 0.89 and the f1-binary score was 0.83 (table 2 part 2). performance of this magnitude on novel divergent data concerns one to be critical of the

to stand out to be good discriminators in biological data [15, 23], particularly because of their resistance to overfitting. our 68-feature random forest model displayed impressive classification accuracy and an auroc of 0.96 on this divergent test set (fig 4).

correctly classified 101 of these false positives giving us an accuracy of 75% on the unary data.

when feeding the entirety of the human cell line assembly into our classification pipeline we achieved an accuracy of 63% for transcripts of all sizes and a much-improved value of 95% when taking transcripts of the same length range as our training data (table 3)

owing to the minimized skew of the separate classes (fig 5). the classification metrics from all instances are presented in table 3.

(table 3). to further investigate the reduced performance of the model, we assessed the performance of the 10 assembly programs separately. this revealed that performance varied significantly between the different assembly programs that joined the contigs (fig 6).

genomes. by plotting the frequency of proteins discovered against the protein length cutoff threshold, it was evident that at a length of about 100 amino acids, non-genic random orf numbers significantly subside (fig 7). using this 100-amino-acid cutoff for 6-frame in-silico

(table 4). in the same manner, we sorted out 2346 uninformative vfams from the high performance vfams vfam a available at http://derisilab.ucsf.edu/software/vfam/. using hmmsearch on our informative vfams against the de-novo assembly dataset yielded 26003 false hits.
our model was able to correctly classify 14089 of these false hits as human transcripts as noted in table 4.

scale the model, we went beyond tree-based methods and employed lasso regularization. lasso stems from linear regression models and selects features by penalizing the coefficients of the variables. using a combination of these selection methods on multiple levels, we were able to shrink the model down to only a few features while only minimally compromising performance.

overfitting is always a concern when applying ml models to biological data. the high accuracy of the classifiers initially suggested a case of overfitting in the model. the bootstrap aggregating, or bagging, meta-algorithm employed in random forest classification is particularly resistant to high variance and overfitting: many random sampling events generate subsets of data known as bootstrap samples, isolated decision trees are trained on these bootstrap samples to construct the random forest ensemble, and a majority vote from the members of the ensemble generates predicted classifications for test sets. a single decision tree lacks this aggregation step and, when grown to a large depth, is more prone to high variance and overfitting. since our random forest models performed noticeably better than the decision tree model, it would lead us to believe that overfitting does not account for the fidelity of our models.

and the prevalence of many viral genomic remnants in the genomes of host organisms. we can also reasonably conclude that protein-based methods such as protein blast or pfams are not suited to the task of taxonomic classification of greatly divergent rna sequences.
given the nonuniformity of the overlap between the results from our classifier and the tools it was compared to, we propose that a combinatorial approach employing multiple tools simultaneously is currently the best option for segregating sequence data by taxonomy (fig 10).

upgrade and models such as ours that facilitate the annotation of the assembled data will be expected to catch up to these advancements.

contriving features using k-mer abundance profiles

we define q ∈ s, where s is our set of all dna sequences q. for all k-mers from k = 1 to k = 6, e ∈ E, and hence E is defined as the set of all possible k-mers of length k. in the scoring equation, j is the total number of allowed mismatches or gaps in the score we obtain for any given sequence q and for every element e, as the sum of all instances where a slice of q, taken from the i-th letter, matches element e while allowing exactly j total mismatches or gaps between that slice and e. the length of each sequence q also appears in the equation.

this final score is used as the pre-scaled value for feature e in sequence q in all constructed machine learning models in this study.

table 6). the reverse complement

here a virus, there a virus, everywhere the same virus? trends microbiol
viruses in soil ecosystems: an unknown quantity within an annual review of virology
emerging view of the human virome
ecological significance of microdiversity: identical 16s rrna gene sequences can be found in bacteria with highly divergent genomes and ecophysiologies
unraveling the viral tapestry (from inside the capsid out). isme j
metagenomic contrasts of viruses in soil and aquatic environments
expression of animal virus genomes
towards the system of viruses
genome replication/expression strategies of positive-strand rna viruses: a simple version of a combinatorial classification and prediction of new strategies
isolation independent methods of characterizing phage communities 2: characterizing a metagenome
virfinder: a novel k-mer based tool for identifying viral sequences from assembled metagenomic data. microbiome
phage_finder: automated identification and classification of prophage regions in complete bacterial genome sequences
prophinder: a computational tool for prophage prediction in prokaryotic genomes
phaster: a better, faster version of the phast phage search tool
prophage hunter: an integrative hunting tool for active prophages
viraminer: deep learning on raw dna sequences for identifying viral genomes in human samples
identifying viruses from metagenomic data using deep learning
the evolutionary history of vertebrate rna viruses
genomic characterisation and epidemiology of 2019 novel coronavirus: implications for virus origins and receptor binding. the lancet
a bioinformatics approach reveals seven nearly rna-virus genomes in bivalve rna-seq data
machine learning for detection of viral sequences in human metagenomic datasets
profile hidden markov models for the detection of viruses within
gene selection and classification of microarray data using random forest
de novo transcriptome assembly: a comprehensive cross-species comparison of short-read rna-seq assemblers. gigascience
blast+: architecture and applications
origins and challenges of viral dark matter
discovering viral genomes in human metagenomic data by predicting unknown protein families
virsorter: mining viral signal from microbial genomic data
discovering viral genomes in human metagenomic data by predicting unknown protein families
16s classifier: a tool for fast and accurate taxonomic classification of 16s rrna hypervariable regions in metagenomic datasets
why are rna virus mutation rates so damn high?
viral metagenomics
third generation sequencing: technology and its potential impact on evolutionary biodiversity research
virus taxonomy: the database of the international committee on nucleic acids research
accelerated profile hmm searches
binpacker: packing-based de novo transcriptome assembly from rna-seq data
bridger: a new framework for de novo transcriptome assembly using rna-seq data
shannon: an information-optimal de novo rna-seq assembler
rnaspades: a de novo transcriptome assembler and its application to rna-seq data
idba-tran: a more robust de novo de bruijn graph assembler for transcriptomes with uneven expression levels
soapdenovo-trans: de novo transcriptome assembly with short rna-seq reads
de novo assembly and analysis of rna-seq data
full-length transcriptome assembly from rna-seq data without a reference genome
oases: robust de novo rna-seq assembly across the dynamic range of expression levels

s1 fig. performance of the 18 classifiers when trained on different numbers of features.
(a) logistic regression classifier on 3 feature sets.
(b) linear svm classifier on 3 feature sets.
(c) rbf-kernel svm classifier on 3 feature sets.
(d) decision tree classifier on 3 feature sets.
(e) random forest classifier on 3 feature sets.
(f) gaussian naïve bayes classifier on 3 feature sets.

all k-meric features ranked by feature importance instated by our feature selection flow-chart
complete results for leave-one-family-out cross-validation procedure
information of sequences used to test the model's ability to generalize
false nucleotide blast hits across whole human transcriptome when queried against the rna virus sequence database
false nucleotide blast hits across whole mouse transcriptome when queried against the rna virus sequence database
s6 table. detailed information on training set sequences

this research received no specific grant from any funding agency in the public, commercial, or not-for-profit sectors. the findings and views expressed are those of the authors alone.

key: cord-252347-vnn4135b authors: lee, wai-ming; kiesner, christin; pappas, tressa; lee, iris; grindle, kris; jartti, tuomas; jakiela, bogdan; lemanske, robert f.; shult, peter a.; gern, james e. title: a diverse group of previously unrecognized human rhinoviruses are common causes of respiratory illnesses in infants date: 2007-10-03 journal: plos one doi: 10.1371/journal.pone.0000966 sha: doc_id: 252347 cord_uid: vnn4135b

background: human rhinoviruses (hrvs) are the most prevalent human pathogens, and consist of 101 serotypes that are classified into groups a and b according to sequence variations. hrv infections cause a wide spectrum of clinical outcomes ranging from asymptomatic infection to severe lower respiratory symptoms. defining the role of specific strains in various hrv illnesses has been difficult because traditional serology, which requires viral culture and neutralization tests using 101 serotype-specific antisera, is insensitive and laborious.
methods and findings: to directly type hrvs in nasal secretions of infants with frequent respiratory illnesses, we developed a sensitive molecular typing assay based on phylogenetic comparisons of a 260-bp variable sequence in the 5' noncoding region with homologous sequences of the 101 known serotypes. nasal samples from 26 infants were first tested with a multiplex pcr assay for respiratory viruses, and hrv was the most common virus found (108 of 181 samples). typing was completed for 101 samples and 103 hrvs were identified. surprisingly, 54 (52.4%) hrvs did not match any of the known serotypes and had 12–35% nucleotide divergence from the nearest reference hrvs. of these novel viruses, 9 strains (17 hrvs) segregated from hrva, hrvb and human enterovirus into a distinct genetic group (“c”). none of these new strains could be cultured in traditional cell lines.

conclusions: by molecular analysis, over 50% of hrvs detected in sick infants were previously unrecognized strains, including 9 strains that may represent a new hrv group. these findings indicate that the number of hrv strains is considerably larger than the 101 serotypes identified with traditional diagnostic techniques, and provide evidence of a new hrv group.

human rhinoviruses (hrvs), members of the picornavirus family, are small nonenveloped viruses with a 7200-base, positive-sense (mrna-sense) rna genome [1]. the first hrv was discovered in 1956 [2, 3], and by 1987, 101 serotypes (1a and 1b to 100) were identified using susceptible cell cultures and specific antisera [4, 5, 6]. multiple epidemiologic studies of serotype circulation conducted between 1975 and 1983 showed that >90% of field isolates could be identified with the 90 serotype-specific antisera prepared before 1973, and many serotypes identified earlier were still circulating [6, 7, 8]. these results suggested hrv serotypes are stable and do not undergo influenza virus-like antigenic drift [7].
hrvs are the most prevalent human respiratory pathogens [8, 9, 10, 11, 12]. annually, hrvs are responsible for >50% of all acute upper respiratory illnesses (common colds), the most frequent human illness. hrv infections occur year round worldwide and are epidemic in early fall and late spring in the temperate regions. hrv infections cause a wide range of clinical outcomes including asymptomatic infections [13, 14, 15, 16, 17], upper respiratory illnesses, and, in children, asthmatics, and other susceptible populations, lower respiratory symptoms [18, 19, 20, 21, 22, 23]. defining the role of specific strains in various hrv illnesses has been difficult because traditional serology requires the isolation of hrv in susceptible cell cultures and neutralization tests against all 101 serotype-specific antisera [6]. this traditional serological method is insensitive, labor intensive and cumbersome [24]. more sensitive and faster molecular methods have been developed for serotyping enteroviruses, which are closely related to hrv [25]. in addition, molecular typing methods have been used to identify the links between illnesses and specific strains of pathogens such as dengue viruses, influenza viruses, human papillomaviruses, hepatitis c viruses, and hiv [26, 27]. molecular typing involves pcr amplification of a portion of the target viral genome, sequencing and phylogenetic analyses. in this report, we analyzed clinical specimens from sick infants with a new molecular method, and identified 26 new hrv strains including 9 that constitute a new hrv group.

the length of the p1-p2 sequences (region between primer sites p1 and p2 in figure 1) varied only slightly between the 101 established serotypes, ranging from 261 to 273 bases. the maximum pairwise nucleotide divergence (%) between all 101 serotypes in this region was 45% (figure 2).
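pairwise nucleotide divergence of the kind quoted above can be computed from aligned sequences as the percentage of compared positions at which two sequences differ. the sketch below assumes the sequences are already aligned to equal length and skips positions where either sequence has a gap. the gap handling and the function names are illustrative assumptions, and the toy sequences are not hrv data.

```python
from itertools import combinations


def percent_divergence(a, b, gap="-"):
    """Percentage of compared (non-gap) aligned positions where a and b differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    differences = 0
    for x, y in zip(a.lower(), b.lower()):
        if x == gap or y == gap:
            continue  # assumption: gapped positions are ignored
        compared += 1
        if x != y:
            differences += 1
    if compared == 0:
        raise ValueError("no comparable positions")
    return 100.0 * differences / compared


def max_pairwise_divergence(aligned):
    """Return (divergence, (name_a, name_b)) for the most divergent pair."""
    scored = [
        (percent_divergence(aligned[p], aligned[q]), (p, q))
        for p, q in combinations(sorted(aligned), 2)
    ]
    return max(scored)


# Toy aligned sequences, invented for illustration.
toy = {"s1": "acgtacgt", "s2": "acgtacga", "s3": "ttgtacga"}
worst, pair = max_pairwise_divergence(toy)
```

percent identity, as used elsewhere in the text, is simply 100 minus this divergence under the same gap convention.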
this result was similar to the maximum pairwise divergence of 101 vp4 sequences (46%) and slightly lower than that of vp1 sequences (54%) [28, 29]. furthermore, 97.5% of all the serotype pairs had >9% pairwise nucleotide divergence. the maximum pairwise divergences (%) of p1-p2 sequences among hrva and hrvb viruses were 33% and 27%, respectively. these results demonstrated the potential utility of this region for differentiating hrv serotypes.

p1-p2 sequences of 101 hrv serotypes clustered into 2 previously defined genetic groups: hrva and hrvb

phylogenetic tree reconstruction confirmed that the 101 p1-p2 sequences clustered into 2 genetic groups, a and b (figure 3). the p1-p2 phylogenetic distribution of the serotypes into groups was identical to that of published trees based on vp1 and vp4-vp2 sequences [28, 29, 30], with the same 76 serotypes in the hrva group and 25 serotypes in the hrvb group (figure 3). the topology of the p1-p2 tree was similar to that of the vp1 and vp4-vp2 trees [28, 29, 30]. these results agreed with previous reports that the nucleotide phylogenies of hrvs are consistent across the whole genome, from 5'ncr through polyprotein to 3'ncr [31].

accurate typing of 79 clinical isolates by phylogenetic tree reconstruction of p1-p2 sequences

to test whether p1-p2 sequences were suitable for hrv serotype identification, we compared the typing results of 79 culturable clinical isolates obtained by p1-p2 sequences with those by nim-1a sequences of vp1. the degenerate primers ev292 and ev222 for pcr amplification of the nim-1a region were not sensitive enough for direct detection of small amounts of hrv in original clinical samples (data not shown), and high titer infected cell lysates of cultured isolates were needed to produce enough pcr product for cloning and sequencing. the length of nim-1a and p1-p2 sequences ranged 329-356 bases and 263-271 bases, respectively.
based on phylogenetic tree reconstruction of nim-1a sequences, the 79 clinical isolates were assigned to 24 different serotypes with highly significant bootstrap values (77-100%, table 1). identical assignment results were obtained with the p1-p2 tree, although a number of assignments (hrv1b, hrv15, hrv85 and one of the 5 hrv89) had low but still significant bootstrap values (60%, 64%, 61% and 62%, respectively). interestingly, the nucleotide divergence between the clinical isolates and the respective reference strain was lower at the p1-p2 region (mean 3.5%, range 0-8%) than at the nim-1a region (mean 9.6%, range 5-13%) (table 1). this result agreed with previous reports that nucleotide sequences were more conserved at the ncrs than the coding region due to the preservation of conserved rna structure elements within the ncr [31, 32].

nasal lavage samples of 181 illnesses from 26 infants with frequent respiratory illnesses were analyzed by respiratory multicode assay [33]. hrv was detected in 108 samples (60%) (table 2). other viruses detected were enterovirus, rsv, adenovirus, coronavirus, influenza a virus, metapneumovirus and parainfluenza virus. among the 108 hrv-positive samples, 80 had only hrv and 28 had coinfection with at least one other respiratory virus. the identity of hrv in 101 of the 108 samples was determined by molecular typing. of the 7 samples that were not typed, 2 had no sample left and 5 did not yield p1-p2 fragments by semi-nested pcr. a total of 103 hrvs were identified in 101 samples. only 2 samples contained 2 different hrvs, indicating a low rate of infection with more than one strain.

of the 103 hrvs, serotypes were assigned to 49 hrvs by phylogenetic tree reconstruction (table 3). the assignments of 45 hrvs were strongly supported by highly significant bootstrap values (>74%). one assignment (hrv20) had a low but still significant bootstrap value (52%).
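the bootstrap values quoted above are read against two cutoffs given in the methods: values above 70% indicate highly significant clustering and values below 50% indicate clustering that is not statistically significant, while assignments falling in between are described in the text as low but still significant. a small helper makes that reading explicit. the function name and labels are illustrative, and the treatment of values exactly at 50% or 70% is an assumption, since the text does not specify it.

```python
def bootstrap_support(percent):
    """Label a bootstrap value (in percent) using the cutoffs from the methods.

    Assumption: values of exactly 50 or 70 fall into the middle category.
    """
    if not 0 <= percent <= 100:
        raise ValueError("bootstrap value must be between 0 and 100")
    if percent > 70:
        return "highly significant"
    if percent < 50:
        return "not significant"
    return "low but significant"


# Values mentioned in the text for individual assignments.
labels = {value: bootstrap_support(value) for value in (77, 64, 52, 49)}
```

with 1000 bootstrap replicates, the same cutoffs correspond to raw counts above 700 and below 500.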
the p1-p2 sequences of these 46 hrvs had 94-100% identity with the respective reference serotypes. three sequences clustered with hrv89 but had poor bootstrap values (<50%) although they had 99% identity with hrv89 reference sequences. this was probably because these 3 hrvs also matched well with hrv36, which has 97% identity with hrv89 in the p1-p2 region. these 46 hrvs were grouped into 23 hrv serotypes; 42 (19 serotypes) were hrva and 4 (4 serotypes) were hrvb viruses (table 3).

figure 1. … and p3) and variable region between p1 and p2 (p1-p2 in red) at the 5'ncr and the pcr fragments used in this study. p1, p2 and p3 are located at bases 163-181, 443-463 and 535-551, respectively, in the hrv16 genome. pcr fragment a (about 900 bp) was used to determine the 5'ncr sequences of all 101 hrv serotypes. it was amplified using pan-hrv pcr forward primer p1-1, which anneals to conserved region p1, and a serotype-specific reverse primer annealing to the 5' end of the vp2 gene (between bases 1000 and 1100). pcr fragment b (about 390 bp) was generated with pan-hrv pcr forward primer p1-1 and reverse primer p3-1. pcr fragment c (about 300 bp) was generated with forward primer p1-1 and an equimolar mixture of reverse primers p2-1, p2-2 and p2-3. the variable sequences of p1-p2 were used for the molecular typing assay. doi:10.1371/journal.pone.0000966.g001

the p1-p2 sequences of 54 hrvs did not cluster with the homologous sequences of any of the 101 serotypes in phylogenetic tree reconstructions (figure 4). these 54 hrvs clustered into 26 new unique strains with a high degree of nucleotide divergence from the respective nearest reference serotypes (mean 24.6%, range 12-35%) and between strains (mean 32.5%, range 10-46%) (table 4). different new hrvs of the same strain had a high degree of identity among themselves (mean 98.4%, range 94-100%, table 4).
seventeen of the new strains clustered with group a hrv (figure 4) and had 12-35% pairwise nucleotide divergence from the nearest reference serotype (table 4). among them, 5 strains (w7, w15, w24, w28 and w36) clustered in a distant branch near serotypes 12, 45 and 78; 7 strains (w6, w10, w11, w12, w17, w23 and w25) clustered in a distant branch near serotypes 51, 65 and 71; strains w8, w9, w20 and w38 formed a new branch; and strain w33 was the lone member of a new branch. moreover, 9 strains (w1, w13, w18, w21, w26, w31, w32, w35 and w37) separated from both hrva and hrvb (figure 4) and had 31-35% pairwise divergence from the nearest reference serotype (table 4). we propose that they represent a new hrv genetic group (hrvc). none of the new strains clustered with hrvb viruses. interestingly, none of the samples containing the new hrv strains produced cpe in standard wi-38 or mrc-5 cell cultures used for the detection and isolation of hrv (data not shown).

human enteroviruses (hevs) are closely related to hrvs [1], and comprise >65 distinct serotypes that include polioviruses, coxsackieviruses a and b, echoviruses and the newer numbered evs. hevs are classified into 5 groups: poliovirus and human enterovirus a-d (hev-a-d), according to both biological and molecular properties. like hrvs, some hevs are upper respiratory pathogens. to determine the relationship of hrvc to hevs, phylogenetic tree reconstruction was performed with the p1-p2 sequences of 9 hrvc strains and respective sequences of all known hev (n = 74). the results indicate that hrvc viruses are distinct from hev: the pairwise divergence of p1-p2 sequences between an hrvc strain and the respective nearest hev ranged from 31% to 35% (figure 5).

in this report we demonstrate that the pool of circulating hrv strains is significantly larger than the collection of 101 known serotypes.
fewer than half of the hrvs detected in these young infants corresponded to previously recognized serotypes (table 3), and 54 were previously unrecognized hrvs belonging to 26 new strains (table 4). moreover, 9 of the strains form a new genetic group "c" that is distinct from the previously defined groups hrva and hrvb and the human enteroviruses (figure 5). the remaining 17 new strains cluster into existing distant or new branches in the hrva group (figure 4). interestingly, none of the new viruses grew in standard tissue culture, which may explain why these viruses were previously undetected.

these new hrv strains were detected with a sensitive molecular method to type hrv directly from the original clinical specimens. this new assay had 3 key components: sensitive pan-hrv primers and semi-nested pcr to amplify the p1-p2 region from cdna prepared from original clinical specimens; a sequence database of the 260-bp p1-p2 region of the 5'ncr of all 101 hrv serotypes to serve as standard references for hrv identification; and phylogenetic tree reconstruction of the new p1-p2 sequences and the 101 homologous reference sequences. phylogenetic tree reconstruction has been shown to be more accurate than the other sequence analysis methods for molecular typing of enteroviruses [34]. interestingly, the 5'ncr is not suitable for typing of enteroviruses. for example, the 5'ncr sequences of coxsackieviruses do not correlate with serotype and vp1 sequence due to frequent recombination at the 5'ncr [35, 36]. in contrast, hrv maintains consistent phylogeny across its genome and has limited recombination [31], and this characteristic enables the 5'ncr sequence (p1-p2) to be used for strain identification. for enteroviruses, vp1 sequencing is commonly used for molecular serotyping because this region contains major antigenic sites that correlate well with serotype [25].
however, vp1 sequences of hrv are less conserved, and degenerate primer pairs such as ev292 and ev222 [24] that target this region were relatively insensitive for pcr amplification.

the 101 established hrv serotypes (hrv1a to hrv100) were discovered and designated between 1956 and 1987. they were isolated using susceptible wi-38 cell cultures and then defined with serotype-specific antisera [4, 5, 6]. multiple epidemiologic studies of serotype circulation conducted between 1975 and 1983 showed that >90% of the field isolates could be identified with the 90 serotype-specific antisera (hrv1a through hrv89) prepared before 1973 and many serotypes identified earlier were still circulating [6, 7, 8, 11]. these results suggested that almost all hrv serotypes had already been identified and new serotypes were not evolving [6, 7]. in fact, only one possible new serotype, hrv-hanks, was reported in the next 20 years (1987 to 2006), and it was subsequently shown to be hrv21 by careful sequence analysis and neutralization testing [30].

recently, studies utilizing molecular techniques instead of culture-based diagnostics have provided evidence of additional strains of hrv [37, 38]. for example, lamson and colleagues identified 8 new hrvs in new york state (ny) by vp4 sequencing. they concluded that these hrvs represented a new genetic clade because they clustered in a branch at the root of the hrva phylogenetic tree [37]. in addition, mcerlean and colleagues obtained the complete genome sequence of a new hrv strain, hrv-qpm, in queensland, australia. they showed by phylogenetic analysis that hrv-qpm was a new member of hrva and belonged to a new genetic sub-lineage of hrva, hrv-a2, and that the 8 new ny hrvs also belonged to hrv-a2 [38].
to compare the identities of these newly reported hrvs with our new strains, we performed phylogenetic tree reconstruction using p1-p2 sequences of our new strains, hrv-qpm, and the 101 established serotypes, and then a similar analysis using vp4 sequences of hrv-qpm, the 8 ny hrvs, and the 101 established serotypes. the p1-p2 phylogenetic tree (not shown) revealed that hrv-qpm and one of our new hrva viruses, w24, were the same strain and thus supported mcerlean's conclusion that hrv-qpm was a group a virus. the vp4 tree (not shown) confirmed mcerlean's finding that hrv-qpm and the 8 ny hrvs belonged to the same cluster within the hrva group. therefore the 8 ny hrvs and hrv-qpm were related to our 17 new hrva strains. in contrast, our 9 hrvc strains form a distinct group separated from both hrva and hrvb (figure 4).

analysis of the 5'ncr of all hrv serotypes reveals that there are both highly conserved sequences (e.g. p1, p2 and p3 primer sites, figure 1) and variable sequences between p1 and p2. the 260-bp p1-p2 region had up to 45% pairwise nucleotide divergence between serotypes, similar to that of vp4 (46%) and vp1 (54%). moreover, 97.5% of all the p1-p2 pairs from distinct serotypes had >9% pairwise nucleotide divergence. despite this inter-serotype variability, the p1-p2 sequence of the 5'ncr was significantly more conserved between clinical isolates and the corresponding reference prototype strain (mean divergence 3.5%) compared to the nim-1a coding sequences of vp1 (mean divergence 9.6%, table 1). these data suggest that 5'ncr sequences may be under greater selective constraint, and it is known that this region contains rna structure elements that are critical for viral replication and translation [32].

in summary, more than half (52%) of the hrvs detected in these young infants were new hrv strains, and many of them clustered into a distinct genetic group c.
these findings indicate that the number of hrv strains has been markedly underestimated by traditional viral culture and serotyping techniques, and also raise additional questions. first, how widespread are these putative "group c" hrv? our study population consisted of a group of infants who experienced frequent illnesses, and additional studies are needed to define the spectrum of hrv infections in unselected populations, and in subjects of other age groups. secondly, is the biology of these novel strains similar to that of other hrv? the inability to culture these viruses suggests that receptor utilization or other growth requirements are distinct. the development of molecular assays for the detection and analysis of respiratory viruses provides important tools for new epidemiologic and mechanistic studies to address these questions. a more complete understanding of the spectrum of respiratory viruses and their biology is essential for efforts directed at prevention and treatment of these common and clinically significant illnesses.
figure 4. this tree was generated as described in figure 3. none of the new strains clustered with hrvb viruses. seventeen new strains (blue) belonged to hrva group, and 9 strains (red) cluster into a new group ("c") that is separate from groups a and b. doi:10.1371/journal.pone.0000966.g004
this study was approved by the human subject committee of the school of medicine and public health, university of wisconsin-madison. written informed consent was obtained from the parents.
for determining the standard reference sequences, infected cell lysates of prototype strains of 101 hrv serotypes were obtained from dr. fred hayden of u. virginia, charlottesville (serotype 1a, 1b, 2 to 10, 12 to 89, 91 to 100 and hanks) and atcc (hrv11 and hrv90). hrv87 was excluded from the reference database because it has been reclassified as a human enterovirus [39, 40]. clinical samples were obtained from two sources.
first, samples were obtained from infants ages 0-1 year participating in a prospective birth cohort study (childhood origins of asthma) in wisconsin to determine the role of viral and host factors in the pathogenesis of asthma [41]. of the 285 children who completed the first year of the study, 27 infants had frequent (≥5) moderate to severe respiratory illnesses, and samples from 26 were available for further study. nasal lavage specimens were collected from 181 illnesses between spring of 1999 and spring of 2001. second, for validation of our molecular typing assay, infected cell lysates of 79 additional hrv clinical isolates were obtained from wisconsin state laboratory of hygiene (wslh). these clinical isolates were recovered from nasal lavages of infants in 1999 and 2000 using wi-38 cell culture and identified by the characteristic cytopathic effect and acid lability of hrv [42].
pairwise sequence alignment, multiple sequence alignment, % identity calculation, distance (divergence) calculation, phylogenetic tree reconstruction and bootstrap analysis were performed using software clustalx 1.8.3 [43, 44]. in a typical analysis, the input sequences were first processed to produce a multiple alignment and matrices of identity and distances (divergence) between all sequence pairs. the distance matrix was used by the neighbor joining method to produce an unrooted phylogenetic tree with branch length proportional to the divergence of the sequences. the confidence of the clustering of sequences was evaluated by bootstrapping (1000 replicates). bootstrap values of >700 (70%) indicate highly significant clustering, whereas values <500 (50%) indicate that the clustering is not statistically significant. the phylogenetic tree with bootstrap values was visualized using software njplot.
preparation of cdna from nasal specimens
cdna preparation was performed as described elsewhere [33].
briefly, nasal fluid (350 µl) was mixed with extraction carriers (glycogen and glycoblue) and 750 µl of trizol ls (invitrogen 10296), vortexed for 10 minutes, supplemented with 230 µl of chloroform, vortexed again for 5 minutes and then microfuged for 5 minutes. the supernatant aqueous phase (~700 µl) was mixed with 600 µl isopropanol and this mixture was incubated at room temperature for 1 hr. the rna precipitate was pelleted by microfugation for 10 minutes, washed once with 75% ethanol, air-dried and then dissolved in 20 µl water. to make cdna, 16 µl of rna solution was mixed with 24 µl of reaction solution containing promega amv-reverse transcriptase, amv-rt buffer, random primers, rnasin and dntps and then incubated at 25 °C for 5 minutes, 42 °C for 10 minutes, 50 °C for 20 minutes, and 85 °C for 5 minutes.
rma is a new high-throughput, multiplex pcr-microsphere flow cytometry assay system for comprehensive detection of common respiratory viruses including rhinoviruses (hrv), enteroviruses, respiratory syncytial viruses (rsv), parainfluenza viruses, influenza viruses, metapneumoviruses, adenoviruses and coronaviruses. details of the rma assay have been previously described [33]. the output signal is expressed as mfi (median fluorescence intensity), and samples with an average signal >6 standard deviations of average negative control signals (typically 400 to 500 mfi) are regarded as positive. the rma is capable of distinguishing closely related hrv and enteroviruses [33].
selection of the target region
to identify a genomic region suitable for molecular typing of hrv, we analyzed all published hrv sequences. these included complete genome sequences of 8 serotypes (1b, 2, 9, 14, 16, 39, 85 and 89) [45, 46, 47, 48, 49, 50, 51, 52] [28, 29, 30], 3d (rna polymerase) sequences of 48 serotypes [53] and partial 5'ncr sequences of 37 serotypes [54, 55, 56].
careful alignment analysis of these sequences showed that only the 5'ncr region had highly conserved sequences (p1, p2 and p3 regions, figure 1) that could be used for making pan-hrv primers, and a long variable sequence (p1-p2, figure 1) suitable for serotype differentiation. however, 5'ncr sequences were available for only a fraction of the 101 serotypes.
cloning and sequencing of the 5'ncr of 101 reference hrv serotypes
to establish a reference sequence database for molecular typing, we cloned and sequenced the 5'ncr of all 101 hrv serotypes. detailed viral rna preparation, rt-pcr, cloning and sequencing procedures have been described elsewhere [33]. briefly, total nucleic acids were prepared from 100 µl of infected cell lysate by phenol extraction and ethanol precipitation. rt (reverse transcription)-pcr was performed in a rt-pcr mix (invitrogen 11922-028) using the following conditions: 30 min at 50 °C, 2 min at 94 °C, 33 cycles of (30 sec at 94 °C, 30 sec at 50 °C, 60 sec at 68 °C) and 5 min at 68 °C. the pcr primer pairs were forward primer p1-1 (caagcacttctgtywcccc) for all serotypes and a serotype-specific reverse primer. primer p1-1 was designed within the conserved p1 region (figure 1). the reverse primer was selected for each of the 101 serotypes within a region corresponding to bases 1000-1100 of hrv16 near the 5' end of the vp2 gene (figure 1) according to published sequences [29]. the pcr product, a dna fragment of about 900 bp covering 75% of the 5'ncr, the complete vp4 gene and about 200 bases of the vp2 gene (pcr fragment a of figure 1), was isolated by agarose gel electrophoresis. after the agarose was removed by phenol extraction, pcr fragments were treated with kinase, ligated to a stui-linearized plasmid vector pmj3, and then transformed into e. coli. three plasmids with the pcr fragment inserted were isolated for each serotype, amplified and purified. each viral dna fragment was completely sequenced (automated dna sequencing facility, u. wisconsin).
the serotype identity of each sequence was verified by matching of its vp4/vp2 sequence to the respective published sequence [29].
semi-nested pcr amplification of p1-p2 region from original clinical specimens
primer p1-1, which amplified the 5'ncr of all 101 serotypes, was chosen as the forward primer. for reverse primers, multiple candidates were designed within highly conserved p2 and p3 regions (figure 1). primers p3-1 (acggacacccaaagtag), p2-1 (ttagccacattcaggggc), p2-2 (ttagccacattcaggagcc) and p2-3 (ttagccgcattcagggg) were selected based on efficient hrv sequence amplification. for the first pcr, 2.5 µl of leftover cdna from the rma assay was added to a tube containing 23 µl platinum pcr supermix hf (invitrogen 12532-016), 1 µl of forward primer p1-1 (25 µM) and 1 µl of reverse primer p3-1 (25 µM). the reaction started with 2 min at 94 °C, followed by 'touchdown' cycling (each cycle: 94 °C for 20 s, annealing at 68 °C down to 52 °C in 2 °C intervals for 30 s, and then 68 °C for 40 s). there were two cycles for each annealing temperature down to 54 °C, followed by 12 cycles at 52 °C, and then a final 5 min at 68 °C. this reaction produces a pcr product of 390 bp (figure 1, fragment b). for the second pcr, 5 µl of the first pcr product was transferred to a new pcr tube containing 50 µl platinum pcr supermix hf, 1 µl of forward primer p1-1 (25 µM) and 1 µl of each reverse primer p2-1 (25 µM), p2-2 (25 µM) and p2-3 (25 µM). the reaction conditions are 2 min at 94 °C, 28 cycles of (20 sec at 94 °C, 30 sec at 52 °C, 40 sec at 68 °C) and 3 min at 68 °C. the final product is a 300 bp dna fragment (figure 1, fragment c). this semi-nested pcr protocol requires only 10 copies of cdna template per sample to produce sufficient product for cloning and does not produce nonspecific product from original clinical specimens (data not shown). fragment c was then purified, cloned, and 3 plasmids containing fragment c were isolated and sequenced for each sample.
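the touchdown program of the first pcr can be written out cycle by cycle. the sketch below is only an illustration of one reading of the protocol text (two cycles at each annealing temperature from 68 °C to 54 °C in 2 °C steps, then 12 cycles at 52 °C); the function name and its parameters are hypothetical and are not part of the published method.

```python
def touchdown_annealing_schedule(start_c=68, last_step_c=54, step_c=2,
                                 cycles_per_step=2, final_c=52, final_cycles=12):
    """Return the annealing temperature (deg C) used in each PCR cycle.

    Assumption: the annealing temperature drops by `step_c` after every
    `cycles_per_step` cycles until `last_step_c`, and the remaining cycles
    are all run at `final_c`.
    """
    schedule = []
    temp = start_c
    while temp >= last_step_c:
        # repeat the same annealing temperature for the stated number of cycles
        schedule.extend([temp] * cycles_per_step)
        temp -= step_c
    # plateau phase at the lowest annealing temperature
    schedule.extend([final_c] * final_cycles)
    return schedule


schedule = touchdown_annealing_schedule()
print(len(schedule))   # total number of amplification cycles
print(schedule)
```

under this reading the first pcr runs 16 touchdown cycles plus 12 plateau cycles, the same total of 28 cycles as the second pcr.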
for 2 samples, 2 different p1-p2 sequences were found, indicating the presence of more than one serotype/strain, so 6 additional plasmids were analyzed.
sequencing of the vp1 and 5'ncr of 79 hrv clinical isolates
to determine whether the p1-p2 variable sequences within the 5'ncr were suitable for the differentiation and identification of hrv, we compared the typing results by p1-p2 sequences with those of nim-1a sequences of vp1. nim-1a is a dominant antigenic site of hrv first identified for hrv14 [1], and the homologous sequences of this region correlate well with serotype of hrvs and enteroviruses [57, 58]. viral rna was isolated from virus-infected culture supernatant of 79 hrv clinical isolates with qiaamp viral rna mini kit (qiagen 52904) as described [24]. to obtain the vp1 sequences, viral rna was first amplified using primer pair ev292 and ev222 as described [24] with modified rt-pcr conditions: 10 min at 38 °C, 40 min at 50 °C, 3 min at 95 °C, 40 cycles of (30 sec at 95 °C, 45 sec at 42 °C, 30 sec at 65 °C) and 1 min at 70 °C. the pcr product (~350 bp) was isolated by agarose gel electrophoresis and then sequenced [24]. to obtain the 5'ncr sequences, viral rna was amplified by rt-pcr using p1-1 and p3-1 primers, cloned and sequenced.
assignment of serotypes and new strain
a phylogenetic tree with bootstrap values and a matrix of % pairwise nucleotide divergence (distance × 100%) were generated for each new sequence (nim-1a or p1-p2) and 101 homologous reference sequences using clustalx with default alignment parameters, which were selected after testing a range of parameters. next, a new isolate or detection was assigned the serotype with which it clustered in the phylogenetic tree with a significant bootstrap value (>50%) [34].
in contrast, if the p1-p2 sequence of a new isolate or detection did not cluster with one of the 101 serotypes in the phylogenetic tree, and had >9% pairwise nucleotide divergence from the nearest reference serotype, it was designated as a new strain (prefixed with w). the threshold pairwise divergence value (9%) for assigning a new strain was determined after considering the pairwise divergence values between the 101 known serotypes (figure 2), and the pairwise divergence values between reference serotypes and clinical isolates of the same serotype (table 1, last column). two types of errors could be made in evaluating two sequences: a) deciding they were the same when they were actually different, and b) deciding they were different when they were actually the same. the probability of these errors was determined empirically using the data shown in figure 2 and table 1, respectively. according to figure 2, 99%, 98%, 97%, 95%, and 90% of the p1-p2 pairs of all 101 serotypes had >7%, >8%, >9%, >10% and >12% pairwise nucleotide divergence, respectively. as shown in table 1 (last column), the pairwise nucleotide divergence of all p1-p2 pairs between clinical isolates and the respective reference serotypes was consistently ≤8%. therefore, a threshold of 9% is associated with probabilities of <3% and <1% for errors a and b, respectively. the original p1-p2 sequences described in this report have been deposited in the genbank sequence database under accession no. eu126663 to eu126789.
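the assignment rule described above can be summarized as a small decision function. this is a minimal sketch, not the authors' software: the function name, its arguments and the "unresolved" outcome for cases the text does not cover are assumptions made for illustration; only the two cutoffs (bootstrap >50%, divergence >9%) come from the text.

```python
def assign_isolate(nearest_serotype, divergence_pct, clusters_with_serotype,
                   bootstrap_pct, bootstrap_cutoff=50.0, divergence_cutoff=9.0):
    """Classify one sequence from its tree placement and pairwise divergence.

    nearest_serotype       -- name of the closest reference serotype
    divergence_pct         -- % pairwise nucleotide divergence to that reference
    clusters_with_serotype -- True if the sequence clusters with a reference
                              serotype in the phylogenetic tree
    bootstrap_pct          -- bootstrap support (%) for that cluster
    """
    if clusters_with_serotype and bootstrap_pct > bootstrap_cutoff:
        # significant clustering with a known serotype
        return ("serotype", nearest_serotype)
    if not clusters_with_serotype and divergence_pct > divergence_cutoff:
        # no cluster and more divergent than the threshold: new strain
        return ("new strain", None)
    # hypothetical fallback: the source text does not describe this case
    return ("unresolved", None)


print(assign_isolate("hrv16", 3.5, True, 98.0))
print(assign_isolate("hrv16", 14.0, False, 0.0))
```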
picornavirus structure and multiplication a cytopathogenic agent isolated from naval recruits with mild respiratory illnesses the isolation of a new virus associated with respiratory clinical disease in humans a collaborative report: rhinoviruses-extension of the numbering system rhinoviruses: a numbering system a collaborative report: rhinoviruses-extension of the numbering system from 89 to 100 rhinovirus infections in tecumseh, michigan: frequency of illness and number of serotypes the common cold rhinoviruses: important respiratory pathogens epidemiology of viral respiratory infections clinical virology a prospective, community-based study on virologic assessment among elderly people with and without symptoms of acute respiratory infection the september epidemic of asthma exacerbations in children: a search for etiology use of polymerase chain reaction for diagnosis of picornavirus infection in subjects with and without respiratory symptoms predominance of rhinovirus in the nose of symptomatic and asymptomatic infants picornavirus infections in children diagnosed by rt-pcr during longitudinal surveillance with weekly sampling: association with symptomatic illness and effect of season respiratory illness caused by picornavirus infection: a review of clinical outcomes picornavirus infections: a primer for the practitioner rhinovirus: more than just a common cold virus rhinovirus respiratory infections and asthma rhinovirus infections: more than a common cold rhinovirus outbreak in a long term care facility for elderly persons associated with unusually high mortality improved molecular identification of enteroviruses by rt-pcr and amplicon sequencing typing of human enteroviruses by partial sequencing of vp1 the application of molecular phylogenetics to the analysis of viral genome diversity and evolution molecular epidemiology and evolution of emerging infectious diseases phylogenetic analysis of human rhinovirus capsid protein vp1 and 2a protease coding 
sequences confirms shared genus-like relationships with human enteroviruses genetic clustering of all 102 human rhinovirus prototype strains: serotype 87 is close to human enterovirus 70 vp1 sequencing of all human rhinovirus serotypes: insights into genus phylogeny and susceptibility to antiviral capsid-binding compounds genome-wide diversity and selective pressure in the human rhinovirus conserved rna secondary structures in picornaviridae genomes highthroughput, sensitive, and accurate multiplex pcr-microsphere flow cytometry system for large-scale comprehensive detection of respiratory viruses molecular identification of enterovirus by analyzing a partial vp1 genomic region with different methods molecular evolution of the human enteroviruses: correlation of serotype with vp1 sequence and application to picornavirus classification genotypic variation in coxsackievirus b5 isolates from three different outbreaks in the united states masstag polymerase-chain-reaction detection of respiratory pathogens, including a new rhinovirus genotype, that caused influenza-like illness in new york state during characterisation of a newly identified human rhinovirus, hrv-qpm, discovered in infants with bronchiolitis enterovirus 68 is associated with respiratory illness and shares biological features with both the enteroviruses and the rhinoviruses human rhinovirus 87 and enterovirus 68 represent a unique serotype with rhinovirus and enterovirus features rhinovirus illnesses during infancy predict subsequent childhood wheezing relationships among specific viral pathogens, virus-induced interleukin-8, and respiratory symptoms in infancy multiple sequence alignment with the clustal series of programs the clustalx windows interface: flexible strategies for multiple sequence alignment aided by quality analysis tools amino acid changes in proteins 2b and 3a mediate rhinovirus type 39 growth in mouse cells molecular cloning and complete sequence determination of rna genome of human 
rhinovirus type 14 the complete nucleotide sequence of a common cold virus: human rhinovirus 14 the nucleotide sequence of human rhinovirus 1b: molecular relationships within the rhinovirus genus role of maturation cleavage in infectivity of picornaviruses: activation of an infectosome complete sequence of the rna genome of human rhinovirus 16, a clinically useful common cold virus belonging to the icam-1 receptor group evolutionary relationships within the human rhinovirus genus: comparison of serotypes 89, 2, and 14 human rhinovirus 2: complete nucleotide sequence and proteolytic processing signals in the capsid protein region sequence analysis of human rhinoviruses in the rna-dependent rna polymerase coding region reveals large within-species variation improved detection of rhinoviruses by nucleic acid sequence-based amplification after nucleotide sequence determination of the 5' noncoding regions of additional rhinovirus strains improved detection of rhinoviruses in clinical samples by using a newly developed nested reverse transcription-pcr assay amplicon sequencing and improved detection of human rhinovirus in respiratory samples sensitive, seminested pcr amplification of vp1 sequences for direct identification of all enterovirus serotypes from original clinical specimens comparison of classic and molecular approaches for the identification of untypeable enteroviruses we thank dr. fred hayden (u. virginia, charlottesville) for generously providing infected cell lysates of prototype hrvstrains. we also thank dr. key: cord-017354-cndb031c authors: janies, d.; pol, d. title: large-scale phylogenetic analysis of emerging infectious diseases date: 2008 journal: tutorials in mathematical biosciences iv doi: 10.1007/978-3-540-74331-6_2 sha: doc_id: 17354 cord_uid: cndb031c microorganisms that cause infectious diseases present critical issues of national security, public health, and economic welfare. 
for example, in recent years, highly pathogenic strains of avian influenza have emerged in asia, spread through eastern europe, and threaten to become pandemic. as demonstrated by the coordinated response to severe acute respiratory syndrome (sars) and influenza, agents of infectious disease are being addressed via large-scale genomic sequencing. the goal of genomic sequencing projects is to rapidly put large amounts of data in the public domain to accelerate research on disease surveillance, treatment, and prevention. however, our ability to derive information from large comparative genomic datasets lags far behind acquisition. here we review the computational challenges of comparative genomic analyses, specifically sequence alignment and reconstruction of phylogenetic trees. we present novel analytical results on two important infectious diseases, severe acute respiratory syndrome (sars) and influenza. sars and influenza have similarities and important differences both as biological and comparative genomic analysis problems. influenza viruses (orthomyxoviridae) are rna based. current evidence indicates that influenza viruses originate in aquatic birds from wild populations. influenza has been studied for decades via well-coordinated international efforts. these efforts center on surveillance via antibody characterization of the hemagglutinin (ha) and neuraminidase (n) proteins of the circulating strains to inform vaccine design. however, we still do not have a clear understanding of (1) various transmission pathways such as the role of intermediate hosts like swine and domestic birds and (2) the key mutation and genomic recombination events that underlie periodic pandemics of influenza. in the past 30 years, sequence data from ha and n loci has become an important data type. in the past year, full genomic data has become prominent. these data present exciting opportunities to address unanswered questions in influenza pandemics.
sars is caused by a previously unrecognized lineage of coronavirus, sars-cov, which like influenza has an rna based genome. although sars-cov is widely believed to have originated in animals, there remains disagreement over the candidate animal source that led to the original outbreak of sars. in contrast to the long history of the study of influenza, sars was only recognized in late 2002 and the virus that causes sars has been documented primarily by genomic sequencing. in the past, most studies of influenza were performed on a limited number of isolates and genes suited to a particular problem. major goals in science today are to understand emerging diseases in broad geographic, environmental, societal, biological, and genomic contexts. synthesizing diverse information brought together by various researchers is important to find out what can be done to prevent future outbreaks [jon03]. thus comprehensive means to organize and analyze large amounts of diverse information are critical. for example, the relationships of isolates and patterns of genomic change observed in large datasets might not be consistent with hypotheses formed on partial data. moreover when researchers rely on partial datasets, they restrict the range of possible discoveries. phylogenetics is well suited to the complex task of understanding emerging infectious disease. phylogenetic analyses can test many hypotheses by comparing diverse isolates collected from various hosts, environments, and points in time and organizing these data into various evolutionary scenarios. the products of a phylogenetic analysis are a graphical tree of ancestor–descendent relationships and an inferred summary of mutations, recombination events, host shifts, geographic, and temporal spread of the viruses. however, this synthesis comes at a price. the cost of computation of phylogenetic analysis expands combinatorially as the number of isolates considered increases.
thus, large datasets like those currently produced are commonly considered intractable. we address this problem with synergistic development of heuristics tree search strategies and parallel computing.
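to give a feel for the combinatorial growth mentioned above, the sketch below counts the distinct unrooted, fully bifurcating tree topologies for n labeled taxa using the standard result (2n-5)!! = 3 × 5 × ... × (2n-5). the formula is not stated in the text and is brought in here as an illustration; the function name is hypothetical.

```python
def count_unrooted_binary_trees(n_taxa):
    """Number of unrooted, fully resolved tree topologies for n labeled taxa.

    Uses the double factorial (2n-5)!!; for fewer than 3 taxa there is
    only one possible topology.
    """
    if n_taxa < 3:
        return 1
    count = 1
    # multiply the odd numbers 3, 5, ..., 2n-5
    for k in range(3, 2 * n_taxa - 4, 2):
        count *= k
    return count


for n in (4, 5, 10, 20, 50):
    print(n, count_unrooted_binary_trees(n))
```

ten taxa already allow about two million topologies, which is why exhaustive evaluation is replaced by heuristic search for datasets with hundreds or thousands of isolates.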
phylogenetics is the study of the evolutionary relationships of genes and organisms, thus providing a retrospective analysis of biological change and adaptation over time. phylogenetic trees are represented by acyclic graphs in which the leaves of these graphs represent the observed biological entities (taxa) being compared (e.g., sequences of genes, genomes, and/or anatomy of individuals, isolates or cultivars, species, or any higher level taxonomic unit). the internal nodes of the tree are interpreted as a nested set of hypothetical evolutionary ancestors of the entities under consideration as depicted in fig. 2.1. once a tree is complete, changes such as mutations and host shifts can be traced along branches of the tree that contain important disease causing strains. this retrospective analysis of features provides means of finding mutations that are diagnostic of pathogens, correlating phenotypes and genotypes, and predicting strains that are important for vaccine design.
fig. 2.1. phylogenetic tree of four taxa labeled v, w, x, and y and two hypothetical ancestors labeled p and q
the classification of organisms dates back to aristotle [ari343]. however, it was only a few decades ago that the theoretical foundations of the field of phylogenetics as it is practiced today were established. the modern school of phylogenetics arose from the application of the ideas, termed cladistics, originally proposed by hennig [hen66]. cladistics led biologists to use shared derived similarities (termed synapomorphies) that distinguish various natural groups of organisms. nested sets of natural groups of organisms based on synapomorphies are then used to discover the evolutionary relationships between organisms and reconstruct patterns of modification in the features of organisms. subsequently, these principles have been used to develop optimization techniques to find the most justifiable sets of synapomorphies in large datasets.
optimization techniques are necessary with most large and real world datasets as they often contain several, often conflicting, evolutionary signals (treated below). in contrast, advocates of another way of thinking, termed phenetics [sne73], group organisms based on gross measures of similarity. groups are based on measures of evolutionary distance rather than the concept of shared derived characters. in modern practice, similarity methods espousing phenetic concepts are used in searches of nucleotide databases and some multiple alignment methods. clustering algorithms in which least distant groups are clustered first and then distant clusters are connected are termed distance methods in phylogenetics. distance methods typically produce a single tree and cannot, on their own, trace patterns of change in the features of organisms as they convert raw data to distances. next we discuss how various viewpoints have influenced methods, algorithms, and implementations in phylogenetics. a wide variety of methods have been proposed in order to infer the phylogenetic relationships of organisms. most methods are based on minimizing edit cost (such as a hamming distance) to transform one string of nucleotides or organismal characters into another. phylogenetic methods can be further classified in two different categories: distance-based and character-based. in this paper, we compare and contrast the applications of distance and character-based methods used in infectious disease research. we illustrate applications of these techniques to study the evolutionary relationships of groups of rna viruses and the patterns of mutations and phenotypic changes that can be reconstructed. among the distance-based methods, the most commonly used is neighbor-joining [sai87]. distance based methods require a precomputed multiple alignment of dna or amino acid sequences drawn from homologous genes. the most similar pair of taxa (as represented by sequences) are clustered.
the clustered pair is then considered as a single taxon and the next most similar pair of taxa is clustered until only the last taxon is joined and the tree is completed. although the use of distance-based methods is relatively common in analysis of organisms that cause infectious disease, several authors have criticized the performance of this method for phylogenetic reconstruction (see [far96]). one strategic flaw of the method is that it is computationally greedy. distance methods form the most similar clusters immediately without considering locally suboptimal paths that may lead to a better global optimum. other methods of phylogenetic analysis focus on characters, which are typically polymorphisms, recognized in columns of aligned nucleotides or amino acids from sequences of interest or investigator-encoded characters of polymorphic phenotypes. character-based methods seek to find the phylogenetic trees that optimize a particular criterion. major optimality criteria include parsimony [far83] and maximum likelihood [fel73, fel81]. bayesian analysis [ran96, li00, hue02] uses a maximum likelihood optimality criterion but incorporates the probability, termed the prior, that a hypothesis is correct in the absence of data. the unifying feature of character-based methods is that they examine many randomly generated trees each representing an evolutionary hypothesis of character transformations and organismal relationships. as a character based analysis progresses, edit costs are calculated for the transformations that candidate trees imply and optimal trees are stored for further consideration and refinement. the concept of optimality can be associated with cladistics or maximum likelihood but not distance methods. distance techniques lack a measure of tree quality and means to compare trees. cladistics employs parsimony as an optimality criterion.
the core concept of cladistics is that the least number of transformations in the data implies the most defensible hypothesis. in cladistics, various edit costs can be applied to different genomic and phenotypic transformations. in the case of weighted parsimony the goal of tree search is to minimize weighted costs. under the maximum likelihood criterion the probability of the data, given the tree, calculated with a model for nucleotide or amino acid substitution is optimized. the related technique of bayesian phylogenetic inference uses maximum likelihood to evaluate trees. bayesian analysis aims to capture a posterior probability distribution of trees. typically the results of a bayesian analysis are displayed not as an optimal tree but rather as the probability that a set of evolutionary relationships is "true," given the prior probabilities, the substitution model, and data. all character-based methods of molecular phylogenetics [cladistics, maximum likelihood (and related bayesian methods)] rely on explicit assumptions about ancestral character states to polarize transformations of phenotypes and genotypes that can be reconstructed from data. an example of such assumptions in character based analyses is the outgroup criterion (treated below). in contrast, distance based methods do not use an outgroup criterion. parsimony is a widely used optimality criterion. this criterion is associated with the concept that simpler explanations provide for more supportable hypotheses. in phylogenetics, the most parsimonious tree(s) is that which implies the minimum number of transformations in sequence and/or phenotypic character states among organisms of interest. the biological justification of this use of parsimony is that descent with modification from a common ancestor is a primary pattern of organismal diversification and the record of transformations can be used to reconstruct that pattern.
as such, the tree(s) that minimize the overall number of independent transformations (convergences or reversals in character state) that are needed to explain the observed data are to be preferred [far83]. recombination and horizontal gene transfer, as seen among rna viruses, are violations of the assumption of ancestor-to-descendent evolution, not of parsimony per se. some novel techniques for discovery and understanding of reassortment and horizontal gene transfer have been developed under the parsimony criterion [wan05, whe05]. the parsimony score for a tree is measured by the number of transformations implied by the tree, known as the tree length [far70]. the tree length is the sum, over all edges, of the hamming distances between the labels at the endpoints of the edge [ric97]. the labels located at the leaves of the tree are the observed characteristics (either genotypic or phenotypic) of the organisms being analyzed. the internal nodes are labeled so as to minimize the tree length of each tree being evaluated. given a tree and a matrix of features or aligned sequences for each taxon, the tree length is calculated using the fitch algorithm [fit71]. this algorithm runs in time polynomial in the amount of data being analyzed (both the number of characters and the number of taxa). thus for a sequence alignment of thousands of taxa, each of which is labeled with thousands of nucleotides, the tree length of a particular tree can be computed using modern implementations of the fitch algorithm [gol03] in fractions of a second. given a tree and a data matrix of sequences and features, the parsimony method can pinpoint the branches on which certain evolutionary events are inferred to occur between ancestor and descendent. in an infectious disease context, these events can be a shift by a viral lineage from animal to human host.
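the tree length calculation described above can be sketched in a few lines. this is a minimal illustration of the fitch procedure for equally weighted, unordered characters, not the optimized implementation cited in the text. the nested-tuple tree encoding, the function names, and the toy sequences are all assumptions made for this example; they are not the data shown in the chapter's figures.

```python
# Minimal Fitch small-parsimony sketch (equal weights, unordered characters).
# Assumptions for illustration: a tree is a nested 2-tuple of leaf names, and
# `seqs` maps each leaf name to an aligned sequence of equal length.

def fitch_site(tree, states):
    """Return (candidate_state_set, cost) for one aligned column."""
    if isinstance(tree, str):                 # leaf: observed state, no cost
        return {states[tree]}, 0
    left_set, left_cost = fitch_site(tree[0], states)
    right_set, right_cost = fitch_site(tree[1], states)
    common = left_set & right_set
    if common:                                # children agree: no new change
        return common, left_cost + right_cost
    # children disagree: one transformation is implied on this part of the tree
    return left_set | right_set, left_cost + right_cost + 1


def tree_length(tree, seqs):
    """Sum the per-column Fitch costs over all aligned positions."""
    n_sites = len(next(iter(seqs.values())))
    return sum(
        fitch_site(tree, {name: seq[i] for name, seq in seqs.items()})[1]
        for i in range(n_sites)
    )


# Hypothetical toy alignment of four taxa.
seqs = {"v": "ACCGTA", "w": "ACTGTA", "x": "AGCGTA", "y": "ACTGAC"}

rooted_a = (("w", "y"), ("x", "v"))       # groups w with y
rooted_b = ("v", ("x", ("w", "y")))       # same unrooted topology, other root
alternative = (("w", "x"), ("y", "v"))    # a different topology

print(tree_length(rooted_a, seqs), tree_length(rooted_b, seqs),
      tree_length(alternative, seqs))
```

in this toy data the two rootings of the same unrooted topology give the same length, while the alternative topology requires one extra transformation at the third position.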
in the case of standard nucleotide sequence analysis, transformation events include substitution mutations (replacement of a given nucleotide by another) and nucleotide insertion and deletion mutations. some analyses invoke more complex parsimony models with weighted recombination and horizontal transfer events, as well as differentially weighting certain classes of mutation such as transversions (pyrimidine-purine shifts), transitions (pyrimidine-pyrimidine or purine-purine shifts), or insertion-deletion events [whe05]. note that in using the fitch algorithm to optimize a phylogenetic tree, both the tree length and the branch in which a particular transformation event is inferred to occur can be calculated in unrooted or rooted trees. the results of these calculations are independent of the root chosen for the tree. for example, in an unrooted tree relating four taxa known from their nucleotide sequences (see fig. 2.2), the fitch algorithm can be used to identify a specific branch of a tree in which a transformation occurs (e.g., a mutation between nucleotides c and t of the third sequence position occurring in the only internal edge of the tree in fig. 2.3). however, the polarity of a transformation event is dependent on how the tree is rooted. inferring polarity of change requires an external criterion, termed the outgroup. in character-based methods explicit assumptions of ancestral character states are set up by the investigator via the designation of at least one taxon as the outgroup. a good outgroup is known to be closely related to the taxa of interest (termed the ingroup). however, the outgroup must be clearly not a member of the ingroup. the underlying logic of the outgroup criterion is that the transformation events that occurred at the evolutionary origin of the ingroup can be identified by comparison to modern organisms of another clade with which the ingroup shares a common ancestor.
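the differential weighting of mutation classes mentioned above can be made concrete with a small sketch. the classification follows the definitions in the text (transitions stay within purines or within pyrimidines, transversions cross between them). the numeric weights and the function names are hypothetical and chosen only for illustration; real analyses justify such costs explicitly.

```python
# Classify aligned residue pairs and sum hypothetical weighted edit costs.
# Assumes uppercase A, C, G, T and "-" for a gap.

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

# Hypothetical weights, not values recommended by the source text.
COSTS = {"match": 0, "transition": 1, "transversion": 2, "indel": 4}


def substitution_class(a, b):
    """Label one aligned pair as match, indel, transition, or transversion."""
    if a == b:
        return "match"
    if "-" in (a, b):
        return "indel"
    same_class = {a, b} <= PURINES or {a, b} <= PYRIMIDINES
    return "transition" if same_class else "transversion"


def weighted_cost(seq1, seq2, costs=COSTS):
    """Weighted cost between two aligned sequences of equal length."""
    return sum(costs[substitution_class(a, b)] for a, b in zip(seq1, seq2))


# A/G is a transition, G/T a transversion, and -/A an insertion-deletion.
print(weighted_cost("ACGT-", "GCTTA"))
```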
the common ancestor is a hypothetical organism that provides a baseline set of character states from which polarity determinations can be made. thus the outgroup method, like bayesian inference, incorporates some previous knowledge of the relationships of the organisms. if the phylogenetic results show that the ingroup includes some members of the outgroup, the previous knowledge must be reevaluated. the outgroup taxon is included in the data matrix of the phylogenetic analysis and the entire data set is analyzed simultaneously. the phylogenetic position and relationships of the outgroup are determined by the optimality criterion. in the case of the parsimony method, the outgroup is treated as any other taxon and is placed in the tree in the position that minimizes tree length. once the phylogenetic affinities of the outgroup are established, the outgroup can be used to root the tree, and the polarities of the transformations can be established (note the unidirectional arrows in fig. 2.4). if chosen carefully, the outgroup will not be clustered with any of the ingroup. model-based methods can also be used in reconstruction of ancestral character states (e.g., [cha00, thr04]). to communicate the choice of outgroup taxon or taxa and clarify the relationships of the taxa, character-based trees are often drawn as directed acyclic graphs with the root positioned on the branch between the outgroup and ingroup. in the example diagrammed in fig. 2.4, a mutation is inferred to occur in the third sequence position from the ancestral state of c to the derived state of t. the presence of a t in the third sequence position is a synapomorphy, a derived character state that can be used to distinguish the members of the group formed by the taxa w and y. in contrast, the presence of a c in the third sequence position in taxa x and v cannot be used to distinguish these taxa since a c is also present in the third position in the outgroup.
in this case the third position c is a primitive similarity of x and v, or a symplesiomorphy. the other mutations occurring in sequence positions 2, 5, and 6 are found only in one taxon and thus cannot be used to infer relationships. these are termed autapomorphies. sequence position 4 is inferred to have not changed in this example and is thus of no value in discovering groups. in cladistics only the shared derived characteristics, synapomorphies, are used to diagnose a group. although other criteria have been proposed to root phylogenetic trees, the outgroup criterion is the least arbitrary. as a result, outgroup rooting is widely used for character-based phylogenetic analyses [nix94]. problems of the outgroup criterion. as seen, the use of the outgroup taxon provides an informative way to test hypotheses on the content of natural groups and to root the phylogenetic tree in a way that allows interpretation of the polarity of change of evolutionary events. however, the choice of an outgroup taxon is key to the success of this method. if the nucleotide sequences of a candidate outgroup are divergent from the sequences of the ingroup taxa, the phylogenetic position of the outgroup might be hard to establish [whe90]. therefore, the choice of the outgroup requires judicious selection and searches for organisms that (1) are safely outside the ingroup but (2) have comparable data [whe90]. the cases shown in figs. 2.1-2.4 are based on the simplifying assumption that the genes sequenced for the taxa of interest have equal numbers of residues (i.e., amino acid or nucleotide sequences of the same length). frequently, in empirical studies of related organisms, homologous genes have sequences with different numbers of residues. sequence length variation occurs in both coding and noncoding loci. the causes can be genetic drift, mutation, recombination, or horizontal transfer events.
the phylogenetic analysis of molecular sequences, like that of all other comparative data, is based on schemes of putative homology that are then tested via phylogenetic analysis. unlike some other data types, however, putative homologies in molecular data are not directly observable. sequences from various organisms are often unequal in length. hence, the correspondences among sequence positions are not evident and some sort of procedure is required to determine which regions are homologous. this procedure is typically multiple sequence alignment. alignment inserts gaps to make the putatively corresponding residues line up into columns. these columns (characters) comprise the matrix used to reconstruct cladograms. the matrix is then submitted to phylogenetic analysis in the same manner as other forms of data such as morphological characters scored by an investigator. thus the primary reason to create an alignment in phylogenetics is strongly operational: to make it possible to submit these data to standard phylogeny programs that were designed to handle column vectors of morphological characters. nevertheless, alignment followed by tree search is the standard procedure. two major options are currently available to analyze sequence data in a phylogenetic framework: a two-step analysis or a one-step analysis. phylogenetic analysis of large genomic datasets can present several nested np-complete problems: multiple alignment, tree search, and in some cases, gene order and complement differences among organisms. just as in distance methods, in most character-based methods, alignments are precomputed before any phylogenetic analysis. the alignment procedure (first step) is usually done through algorithms that produce a matrix from the raw dna sequences of the organisms being analyzed (fig. 2.5). this data set is then analyzed (second step) in order to find the optimal tree (see sect. 2.1.4).
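the way alignment inserts gaps so that corresponding residues fall into columns can be illustrated for the simplest case of two sequences. the sketch below uses needleman-wunsch global alignment, a standard dynamic programming method that the text does not name; multiple alignment programs build on the same idea but are considerably more involved. the cost values (match 0, mismatch 1, gap 2) and the function name are assumptions for illustration only.

```python
# Pairwise global alignment (Needleman-Wunsch) with edit costs: lower is better.
# Cost values are hypothetical defaults chosen for this sketch.

def global_align(a, b, match=0, mismatch=1, gap=2):
    """Return (total_cost, aligned_a, aligned_b) for sequences a and b."""
    n, m = len(a), len(b)
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i * gap                  # a prefix aligned to gaps only
    for j in range(1, m + 1):
        cost[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            cost[i][j] = min(cost[i - 1][j - 1] + sub,   # align two residues
                             cost[i - 1][j] + gap,       # gap in b
                             cost[i][j - 1] + gap)       # gap in a

    # Trace back from the bottom-right cell to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
            if cost[i][j] == cost[i - 1][j - 1] + sub:
                out_a.append(a[i - 1])
                out_b.append(b[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and cost[i][j] == cost[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return cost[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


total, row_a, row_b = global_align("ACGT", "AGT")
print(total)
print(row_a)
print(row_b)
```

each position of the two returned rows is one column of the alignment; stacking such rows for all taxa gives the character matrix that is passed to tree search in the two-step procedure.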
the multiple alignment procedure ranges from easy in many coding loci to very difficult in noncoding loci such as functional rnas and genes containing introns [mor97]. in the case of some protein coding loci the alignment may be a nonissue if there are no significant length differences in sequences. however, various investigators who employ different primer sets and editing styles often produce sequences of various lengths. leading and trailing gaps produced by experimental artifact should not be counted in tree length calculations. results of multiple alignment of functional rnas and genes containing introns can be sensitive to parameter choices [fit83]. important parameters include the addition order of taxa, relative costs of various classes of mutations (transversions, transitions, insertion-deletion), and differential costs applied to opening or extending regions of insertion-deletions. analyses of different alignments of the same raw sequences can lead to different trees irrespective of tree search procedures [mor97]. in such cases investigators must search parameter space [whe95, phi00] or otherwise justify their assumptions during alignment [gra03] just as they are required to justify optimality criteria used during tree search. several researchers have noted that performing phylogenetic analysis in two steps is not consistent with the goal of finding the most parsimonious solutions due to the interdependence of multiple alignment and tree estimation [phi00, jan02]. in fact, popular multiple alignment programs such as clustal [tho94] use a guide tree to construct the alignment. therefore, methods have been proposed to make a simultaneous estimation of the optimal sequence alignment and the optimal phylogenetic tree [san83]. a modern implementation of the one-step concept in poy [whe05], termed direct optimization, allows unaligned sequence data to be analyzed without precomputing an alignment.
in direct optimization, sequence data are aligned as various trees are built and their optimality is assessed. thus for each tree considered in a search, various sets of homology statements for the sequence data are considered. one advantage of direct optimization is that the outgroup need not be designated by the investigator. poy allows for randomization of the outgroup taxon and thus adds rigor to the search for optimal trees and homology statements. in some implementations of character-based methods where prealignment is necessary the outgroup can be randomized by scripting a series of analyses, e.g., tnt [gol03]. one important difference is that in molecular data a rigorous tree search on a prealigned dataset with unordered characters should lead to the same tree length irrespective of outgroup choice, whereas in direct optimization the homology statements and hence tree length can be dependent on outgroup choice. several groups are developing algorithms for simultaneous estimation of alignment and phylogenetic trees. methods for a one-step phylogenetic analysis have been developed using maximum likelihood [tkf92, fle05] and parsimony optimality criteria [whe96], as well as for bayesian analysis [red05]. although the one-step approach has the appeal of using a unified and epistemologically consistent method of alignment and tree estimation, the time and space requirements for computation are considerable. this problem of tree-based alignment is known to be np-complete [wan94]. in this situation, genes that vary in length (as most noncoding and intron-containing genes do) present a huge number of possible hypothetical ancestral sequences even for a single binary tree. during a phylogenetic analysis many trees will be examined and compared. for s taxa and l nucleotides per taxon, the cost of computation per tree ranges from $(s-l)l^2$ to $(s-2)l^3$, depending on the heuristics applied. the memory requirements scale in proportion to $l^3$.
fortunately, procedures such as the optimized diagonal transition algorithms described by ukkonen [ukk85] abate the space and time dependence on l, the number of nucleotides [whe05]. phylogenetic analysis under the parsimony criterion is based on an objective function (tree length). tree length is used to evaluate the optimality of each phylogenetic tree considered. however, finding the optimal phylogenetic tree (among all possible topologies) is an np-hard problem [fg82] that resembles the steiner tree problem. the combinatorial optimization problem of phylogenetic analysis consists of finding the optimal solution from a very large number of possible trees. the number of possible trees increases dramatically with the number of organisms being analyzed [fel78]. the number of possible (unrooted) bifurcating phylogenetic trees (t) for a given set of organisms increases following $t = \prod_{i=3}^{s} (2i - 5)$, where s is the number of organisms (leaves) of the phylogenetic tree. therefore, the number of possible phylogenetic trees is extremely large even for trees with a moderate number of organisms. as stated above, a phylogenetic analysis consists of evaluating topologies in order to find the optimal solution [i.e., the tree(s) with the minimum length]. it is interesting to note that the computing time of an exhaustive evaluation of all possible trees for a fixed number of taxa will increase nearly linearly with the number of characters (e.g., length of dna sequences) because the fitch algorithm [fit71] for evaluating the tree length of a particular topology works in polynomial time. however, the excessively large number of possible phylogenetic trees of 20 or more organisms (see table 2.1) makes exhaustive evaluation of all phylogenetic trees intractable. note on multiple optimal trees.
frequently, in phylogenetic analysis based on an optimality criterion, there are multiple trees that score the same minimum for tree length or likelihood. (table note: an order of magnitude is given for the number of trees with more than 10 organisms.) in the set of known optimal trees, the transformations may be differentially distributed and different organismal groups may be implied. therefore, the trees in this set must be considered equally valuable. these cases represent alternative hypotheses (i.e., phylogenetic trees) that are equally supported by the available data and can be summarized through a consensus tree. several kinds of techniques for consensus estimation exist. the strict consensus tree is one of the most frequently used. a strict consensus calculation represents a tree that has all the edges shared by all the known optimal trees. see [swo91] for further information on various consensus trees. two algorithms can be applied to perform exhaustive searches that evaluate (explicitly or implicitly) all possible phylogenetic trees in order to find the optimal tree. the first of these is exhaustive enumeration, which computes the optimality value (e.g., tree length) of every possible phylogenetic tree and selects the tree (or trees) with the minimal value. the second method is the branch and bound algorithm [hen82] that implicitly evaluates all possible trees but avoids, in practice, computing all possible trees (see [sea96]). in current phylogenetic software packages (e.g., [swo02, gol03]) this algorithm can be applied to data sets of up to 20 (or 25) organisms and guarantees to find the optimal tree (or trees) for a given phylogenetic data matrix. for analysis of data sets with a larger number of organisms, the number of trees is prohibitively large for conducting an exhaustive search. in modern biology, interesting data sets consider hundreds to thousands of organisms.
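the growth that rules out exhaustive search can be checked directly. the sketch below uses the standard count of unrooted bifurcating trees on s labeled leaves, the product of (2i - 5) for i from 3 to s; the function name is an assumption for this example.

```python
# Count unrooted bifurcating trees on s labeled leaves: product of (2i - 5)
# for i = 3..s. Python integers are arbitrary precision, so no overflow occurs.

def num_unrooted_trees(s):
    if s < 3:
        raise ValueError("the count is defined here for s >= 3 leaves")
    count = 1
    for i in range(3, s + 1):
        count *= 2 * i - 5
    return count


for s in (4, 5, 10, 20):
    print(s, num_unrooted_trees(s))
```

ten organisms already give about two million trees, and twenty give more than 10^20, which is why branch and bound or heuristic search replaces enumeration as data sets grow.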
thus the problem of phylogenetic tree search is compute bound and must be approached through heuristic searches. in these tree searches, a large number of phylogenetic trees are evaluated and the best solution is kept as known estimate of minimum tree length. some early examples of heuristic tree searches include the algorithm to compute wagner trees [far70] . the wagner algorithm creates a phylogenetic tree of three taxa and progressively adds organisms, attaching them to the branch that generates the minimal increase in tree length at that step. this stepwise procedure is conducted until the last organism is added to the tree. although this procedure usually results in a tree that has a suboptimal tree length, in most cases this score is significantly better than that obtained with a random choice among all possible topologies. as various starting points are used for building wagner trees, this aspect of phylogenetic analysis can be considered a type of monte carlo randomization. given one or many wagner trees, the next standard heuristic refinement techniques that would be typically applied in tree search are known as branch swapping or hill-climbing procedures (see [sea96] ). this class of refinement procedures consists of performing minor rearrangements of branches in the starting tree. each wagner tree is modified by pruning a subtree and reattaching it to a different branch of the remaining tree. the tree length of the modified tree is then calculated. if the modified tree has a shorter tree length it is kept in a buffer of new candidate trees. branch swapping is applied to all wagner and candidate trees until the algorithm converges. when no further rearrangements can improve the current topology the branch swapping is finished. the results of the branch-swapping algorithm depend on the quality of the starting point (i.e., wagner tree). 
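the stepwise addition that produces a wagner tree can be sketched as follows. this is a simplified illustration under stated assumptions, not the algorithm as implemented in any cited package: trees are nested 2-tuples, tree length is scored with a small equal-weights fitch routine included so the block is self-contained, and the function names and toy sequences are hypothetical. branch swapping would then start from the tree returned here.

```python
import random

# Stepwise-addition (Wagner-style) tree building, scored by Fitch tree length.
# Assumptions: nested 2-tuple trees, equal-length aligned sequences in `seqs`.

def fitch_length(tree, seqs):
    """Tree length under equal-weights parsimony (Fitch)."""
    n_sites = len(next(iter(seqs.values())))
    total = 0
    for i in range(n_sites):
        def walk(node):
            if isinstance(node, str):
                return {seqs[node][i]}, 0
            (ls, lc), (rs, rc) = walk(node[0]), walk(node[1])
            common = ls & rs
            return (common, lc + rc) if common else (ls | rs, lc + rc + 1)
        total += walk(tree)[1]
    return total


def insertions(tree, taxon):
    """Yield each tree made by attaching `taxon` to one branch of `tree`."""
    yield (tree, taxon)                        # attach on the branch above
    if not isinstance(tree, str):
        left, right = tree
        for new_left in insertions(left, taxon):
            yield (new_left, right)
        for new_right in insertions(right, taxon):
            yield (left, new_right)


def wagner_tree(seqs, order=None, rng=random):
    """Build a tree by adding taxa one at a time at the cheapest branch."""
    taxa = list(order) if order else rng.sample(sorted(seqs), len(seqs))
    tree = ((taxa[0], taxa[1]), taxa[2])       # the only 3-taxon topology
    for taxon in taxa[3:]:
        tree = min(insertions(tree, taxon),
                   key=lambda candidate: fitch_length(candidate, seqs))
    return tree


# Hypothetical toy alignment.
seqs = {"v": "ACCGTA", "w": "ACTGTA", "x": "AGCGTA", "y": "ACTGAC"}
tree = wagner_tree(seqs, order=["v", "w", "x", "y"])
print(tree, fitch_length(tree, seqs))
```

passing no `order` draws a random addition sequence, which is the source of the different starting points used across replicates.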
in many cases, the tree resulting from the application of branch swapping to a wagner tree is a local optimum that cannot be further improved by swapping. therefore, multiple replicates (100s to 1,000s) of independent wagner trees followed by swapping are typically performed. at the end of these stages of analysis, the best trees found in all the replicates are kept as a set representing topologies at the known minimum length. replication of wagner builds plus swapping (or random-restart hill climbing) is the most widely used routine implemented in most software packages (e.g., [swo02, gol03]). one major drawback of wagner builds plus swapping is that this procedure is subject to finding only local optima. finding the globally optimal tree(s) for a dataset of >20 taxa is an np-hard problem [fg82]. however, performing multiple replicates of this procedure can provide a relative degree of confidence if the minimum length tree(s) converge at the same tree length from numerous independent starting points [gol99]. replication of wagner builds plus swapping is usually efficient for data sets smaller than a hundred taxa. because of advances in automated dna sequencing technology, the size of modern comparative data sets far exceeds the limits for which these techniques are efficient analytical tools for phylogenetic analysis. large phylogenetic problems are becoming increasingly common across the life sciences due to the prevalence of high throughput nucleotide sequencing technology. large data sets are of interest to biologists because they provide a rich context of phenotypes and genotypes and permit worldwide and longitudinal sampling of genomes. these large phylogenetic problems will become increasingly common in the years ahead. thus, phylogenetic methods suited to large datasets will have important consequences not only for the study of organismal classification and evolution, but also for many aspects of public health (see sect. 2.2).
furthermore, in an operational context, strong organismal sampling has been shown to correlate with improved performance of phylogenetic methods [hil96, poe98, ran98, zwi02, hil03]. the dauntingly high cost of computation of large-scale phylogenetic analysis stunted this line of research. however, in recent years two main lines of research have provided efficient tools to analyze large phylogenetic datasets: the development of new algorithms and the use of parallel computing. several researchers have combined groups of algorithms into heuristic tree search strategies that have proven to be efficient for phylogenetic analyses of hundreds to thousands of organisms under the parsimony criterion. these heuristic search strategies are based on basic monte carlo and hill climbing techniques with the addition of other classes of algorithms including simulated annealing [gol99], data perturbation [nix99], divide-and-conquer [gol99, ros04], and genetic algorithms [moi99, gol99]. similar search strategies that combine several layers of algorithms have been employed using other optimality criteria such as maximum likelihood [lew98, sal01, lem02, bra02]. the judicious application of various algorithms has provided efficient solutions for the analysis of datasets of several hundred organisms [gol99] on a single cpu [teh03]. in particular, the successive combination of hill-climbing, genetic, and simulated annealing algorithms of tree search has produced a drastic speedup in comparison to other strategies [gol99]. efficient implementations of these algorithms have recently become available in software packages [gol03]. the need for phylogenies depicting the evolutionary relationships of datasets consisting of thousands of taxa has prompted the synergistic implementation of efficient heuristic tree search strategies and parallel computing hardware.
an increasing number of researchers are developing software suited for parallel computing using beowulf class clusters [ste00]. beowulf clusters are simply arrays of commodity pcs and switches enabled by scalable, open source operating systems (e.g., linux) and message passing software (e.g., pvm or mpi). although the advantages of parallel computing in phylogenetics and multiple alignment have been clear for some time [whe94], the means to exploit this potential for research gain were not broadly and economically available until the beowulf concept was developed by the end of the 1990s. alignment and tree search problems are naturally suitable for parallel computing. phylogenetic researchers quickly realized the opportunity presented by beowulf computing [cer98, jan01]. finding an optimal phylogeny requires the evaluation of the same objective function on a large number of alternative trees. because many trees can be examined concurrently and independently, several authors have implemented phylogenetic tree searches in parallel [jon95, sne00, cha01, gol02, bra02, sta02]. these implementations use the parsimony and maximum likelihood optimality criteria as well as bayesian analysis. researchers have also used parallelism to speed up one-step phylogenetic analysis [jan01] and multiple alignment [whe94, li03]. originally, phylogenetics was considered relevant only to taxonomic and evolutionary studies. however, the ability to identify conserved and divergent regions of genomes is becoming critical for numerous disciplines in biology and medicine. these fields include vascular genomics [rub03], ecology [sil97], physiology [car94], pharmacology [sea03], epidemiology [ros02], developmental biology [whi03], and forensics [bud03]. phylogenetics has even been used in the successful criminal prosecution of a doctor who attempted to cause hiv infection in his former girlfriend via blood products taken from an hiv patient under his care [met02].
here we focus on cases in which phylogenetic analyses have helped researchers to understand the evolution and spread of infectious diseases. we provide exemplar cases in which phylogenetic analyses of viral genomes have been crucial to understanding complex patterns of transmission among animal and human hosts: severe acute respiratory syndrome (sars) [ksi03] and influenza [web92]. emergent infectious diseases often evolve via zoonosis, i.e., shifts of an animal pathogen to a human host. in fact, most category a pathogens and potential agents of bioterrorism and more than 75% of emergent diseases have zoonotic origins [tay01, fra02]. a typical set of tests for the hypothesis of an animal host of a disease might be (1) experimentally exposing the candidate host animals to isolated viruses and ascertaining whether infection and viral shedding occur [mar04]; or (2) surveying populations of animals for antibodies indicating exposure to the virus [gua03]. these activities often provide model organisms for vaccine and drug development, data on seroprevalence, and sequence data for viruses isolated from various candidate hosts. phylogenetic analysis of genomes is a complement to laboratory and survey studies with a distinct advantage. with phylogenetics, the researcher is not restricted to testing a single hypothesis for a specific candidate host in each experiment. provided with sequence data for a diverse set of candidate hosts, a researcher performing a single phylogenetic analysis makes a vast number of comparisons, thus evaluating simultaneously many alternative hypotheses. these hypotheses include the evaluation of pathways of transmission among several hosts and the polarity of the transmission events. for example, experimentalists report that small carnivores in chinese markets have been exposed to sars-cov [gua03] and the virus can infect domestic cats [mar04].
on their own these data do not necessarily reconstruct the history of the zoonotic and genomic events that underlie the sars epidemic. furthermore, whether or not interspecies transmission is observed or enhanced under controlled laboratory conditions, phylogenetic research is distinct as it can address whether the genomic record has evidence to support a hypothesis for a particular transmission pathway. for example, if phylogenetic analysis reveals multiple independent events of human to avian transmission of influenza viruses without intermediate hosts such as swine, that provides a strong argument to reevaluate the hypothesis that pigs serve as "mixing vessels" for avian and human viruses leading to influenza epidemics [sch90]. in many cases, host shifts occur via recombination between two ancestral pathogen genomes to produce a chimeric descendent. epidemics can occur when, subsequent to recombination, a lineage of pathogens establishes itself in a new population of hosts, vectors, or reservoir species that can amplify and distribute the pathogen [mor95]. a host shift can require key mutations and rearrangement of the pathogen genome to infect cells of new hosts followed by adaptation to novel regulatory machinery. phylogenetics can reconstruct genomic changes at the level of each nucleotide and unravel parental and descendent strains in recombination mediated host shifts [wan05]. influenza is a widespread respiratory disease caused by an rna virus (orthomyxoviridae). the influenza virus has been traditionally divided into three major types: a, b, and c. influenza viruses of type a are known from many strains that infect both mammal and avian hosts, whereas the other two types are primarily known from humans. influenza a is characterized by antigenic subtypes (see sect. 2.3).
influenza is interesting from both epidemiological and evolutionary points of view due to the interplay between genetic changes in the viral population and the immune system of hosts [ear02]. there are two basic hypotheses on how influenza a viruses escape the immune response in the host population to cause epidemics: (1) antigenic drift, meaning that random point mutations produce novel influenza strains that succeed and persist if they can infect and spread among hosts; (2) antigenic shift, meaning that genes derived from two or more influenza strains reassort, thus creating a novel descendent genome with a constellation of genes that can infect and spread among hosts. in both scenarios zoonosis is often involved. in the case of antigenic shift the ancestry of only a fraction of the influenza genes may be zoonotic. two major classes of influenza epidemics are recognized in humans: seasonal outbreaks and large-scale epidemics known as pandemics [web92]. seasonal influenza is a significant public health concern causing 36,000 deaths and 200,000 hospitalizations in the united states in an average year [ger05]. the elderly and children account for many of these severe cases of seasonal influenza. much of the population has partial immunity to seasonal influenza strains that are typically descendents of strains circulating in previous years. pandemics are often caused by infection, replication, and transmission among the human population with influenza strains of zoonotic origin to which few people have prior immunity. pandemics are rare but can affect the entire human population, irrespective of an individual's predisposition to respiratory diseases. in fact, the 1918 pandemic disproportionately affected young adults [tau06], suggesting that older adults may have had some immunity. there have been three major influenza a pandemics, 1918 (h1n1), 1957 (h2n2), and 1968 (h3n2).
the pandemic of 1918 is estimated to have killed tens to a hundred million people worldwide and 675,000 in the united states [tau01]. the asian flu pandemic of 1957 and the hong kong flu pandemic of 1968 were less severe, but caused tens of thousands of deaths in the united states [hhs04]. all of these pandemic strains are thought to have originated in wild birds [web92]. the 1957 and 1968 strains are believed to be the results of antigenic shift. however, recent studies suggest that the h1n1 influenza virus that caused the pandemic of 1918 was entirely of avian origin rather than a human-avian reassortant [tau05]. other researchers have countered that the 1918 h1n1 strains had a more commonly accepted route to infection of human populations by reassortment in mammals [gib06; ant06]. pandemics can theoretically occur with any strain of influenza. most influenza infections since 1968 have been attributed to influenza a h3n2 or h1n1 strains. however, there have been several recent reports of novel human infections from avian strains of influenza with subtypes thought to occur rarely in humans. several cases of human infection with viruses of subtype h7 of avian origin have recently occurred in canada [twe04] and the netherlands [koo04]. avian influenza viruses of antigenic subtypes h5 and h7 can be found as low or high pathogenic forms depending on the severity of the illness they cause in poultry. thus far, influenza h9 virus has only been identified as strains with low pathogenicity [lin00]. alarmingly, highly pathogenic strains of influenza a with an h5n1 subtype have spread rapidly among various species of birds in china, southeast asia, russia, india, the middle east, africa, and eastern and western europe [who07a]. these h5n1 influenza a strains share common ancestry with the outbreak of h5n1 that led to a massive chicken cull and six human deaths in hong kong in 1997 [li04].
between 2003 and september 10, 2007, there have been 328 cases of h5n1 infection and 200 deaths among humans [who07b]. there are several instances of h5n1 infection of felids and swine in asia. there is scant evidence of human-to-human transmission in thailand [ung05] and indonesia [yan 2007]. if the lethality of h5n1 in human cases drops, the virus might spread rapidly and without being detected. many predict an upcoming avian influenza pandemic with devastating human and economic costs. in the united states alone, it is projected that 15-35% of the population will be affected and the costs could range from 71.6 to 166.5 billion united states (us) dollars [ger05]. although vaccine production can in theory be modified to include h5n1 strains [dut05], the genomes of interest are moving targets. it remains unknown whether the descendants of the contemporary h5n1 virus will achieve efficient human-to-human transmission and whether this will occur via incremental mutations or a more punctuated reassortment-mediated change. thus phylogenetics is a key technology to track the evolution of h5n1 and compare those changes to genomic and zoonotic events that underlie pandemics. the viruses of influenza type a are classified into various subtypes that represent differences in the antigenic reaction of two key glycoproteins: hemagglutinin (ha) and neuraminidase (na). these proteins reside on the surface of the virion and play key roles in recognition and infection of susceptible hosts (ha) and viral replication (na). these surface proteins are the primary antigens recognized by the host immune system [web92]. the subtypes of influenza a are labeled according to the reaction of standard monoclonal antibodies to these ha and na proteins, provided by the us centers for disease control to laboratories participating in the world health organization's (who) surveillance program [hhsb].
although this number will soon expand, there are currently 16 different antigenic subtypes recognized for ha (labeled from h1 to h16) and 9 different antigenic subtypes of na (from n1 to n9). thus, a subtype of influenza virus type a is labeled with the numbers associated with its ha and na proteins (e.g., h3n2, the most common subtype found in humans). since 1948, influenza viruses have been the focus of a coordinated surveillance program organized by the who [who05]. the hemagglutinin gene (ha) is the major target of the influenza surveillance. this program helps track predominant strains to inform the development of new vaccines. influenza viruses are sampled worldwide through the national influenza centers located in 54 countries [who05]. many of the viral isolates sampled by these programs are sequenced for the hemagglutinin gene, although there has been an increasing interest in sampling complete influenza genomes [ghe05, obe06]. an extensive record of hemagglutinin sequences of the influenza viruses type a isolated since 1902 is publicly available. these data provide a unique set of challenges and opportunities for phylogenetics. the geographically wide and temporally long sampling of viral isolates provides an unprecedented opportunity to study evolutionary patterns underlying the spread and host range of an infectious disease. however, as described earlier, large datasets present an enormous search space of possible evolutionary scenarios to be evaluated. the availability of nucleotide sequences of influenza viruses has prompted numerous research groups to attempt reconstruction of the phylogenetic history of these viruses (e.g., [bus99, yua02, fer03, bus04]). these groups draw on data from currently circulating strains as well as from historically important strains gathered from archival tissue samples.
examples of archival tissues that have provided data of interest to the 1918 epidemic include lung biopsies of deceased soldiers, victims frozen in alaskan permafrost [tau97], and waterfowl collected for the smithsonian in 1916-1917 [fan02]. phylogenetic analysis of seasonal influenza sequence data has been used to classify nucleotide substitution mutations. in many codons of the ha gene, mutations that produce a change in protein sequence are more frequent than those that do not [bus99]. this finding indicates that selective pressures imposed by the immune system of the hosts can drive the evolution of some codons of ha. thus an evolutionary perspective can illuminate functional studies of infectious disease [ear02]. as noted, phylogenetics has been widely used to understand the history of influenza epidemics, host shifts, and evolutionary interactions with the host's immune system (see sect. 2.3.1). however, most phylogenetic analyses of influenza thus far have used only fractions of the dataset of influenza nucleotide sequences in the public domain. the sequences in the public domain are largely ha, but recently whole genomes have been produced. the institute for genomic research (tigr) is rapidly sequencing and releasing into the public domain thousands of influenza genomes under the microbial sequencing center (msc) program sponsored by the national institute of allergy and infectious disease (niaid) [ghe05]. st. jude children's research hospital in memphis has contributed a significant increase in the number of avian influenza genome sequences [obe06]. most existing phylogenetic analyses of influenza have focused on the phylogenetic relationships of particular subgroups of influenza type a, such as the h5n1 subtype (e.g., [li04]) or the h3n2 subtype (e.g., [bus99]). these analyses have provided useful information but have depicted a disjoint picture of the evolution of the major lineages of influenza.
in contrast, another study attempted a broader subtype scope but included only a single viral isolate as an exemplar of each subtype [suz02]. that study thus failed to include an extensive sampling of strains. poor strain sampling can have a negative impact on the performance of phylogenetic methods (see sect. 2.1.5) and does not test whether the subtypes are natural groups (i.e., monophyletic). a very recent study has used whole genomes of 136 isolates drawn from a variety of avian influenza subtypes [obe06]. here we show results of a comprehensive phylogenetic analysis based on hemagglutinin dna sequences of 2,359 viral isolates. these sequences include representatives of the 16 different subtypes of the hemagglutinin protein of influenza type a, recorded worldwide by the world health organization surveillance program. the analyzed viruses range from isolates collected as early as 1902, through tissues of patients who died during the 1918 spanish flu epidemic, to recently sequenced isolates from the 2004 seasonal flu and h5n1 outbreak. the analyzed dna sequences also represent a broad range of host organisms, including multiple species of wild and domestic birds, humans, swine, horses, felids, and whales. an inclusive phylogenetic analysis with a large number of taxa requires the use of efficient tree search strategies (see sect. 2.1.5) and the use of multiple computers dedicated to the phylogenetic analysis. the cost of computation is tied primarily to the number of strains, not nucleotides. thus the inclusion of whole genomes does not contribute significantly to the compute-bound nature of phylogenetic analysis. however, the inclusion of whole genome data does increase memory demands. this 2,359 isolate dataset was analyzed with a parallelization of the tree search strategy implemented in recently developed software for parsimony analysis [gol03].
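to make the dependence on the number of strains concrete, the following python sketch counts the distinct unrooted, fully bifurcating trees on n labeled taxa using the standard (2n-5)!! formula. the function name `num_unrooted_trees` is a hypothetical helper introduced only for illustration; it is not part of the parsimony software used for the analysis.

```python
def num_unrooted_trees(n_taxa: int) -> int:
    """Count unrooted, fully bifurcating trees on n_taxa labeled leaves: (2n-5)!!."""
    if n_taxa < 3:
        return 1  # one or two taxa can be connected in only one way
    count = 1
    # Multiply the odd factors 3, 5, ..., 2n-5.
    for factor in range(3, 2 * n_taxa - 4, 2):
        count *= factor
    return count


if __name__ == "__main__":
    for n in (4, 5, 10, 50):
        print(n, num_unrooted_trees(n))
    # Only the number of decimal digits is printed for the full dataset size.
    print("digits for 2,359 taxa:", len(str(num_unrooted_trees(2359))))
```

the count depends only on the number of taxa. sequence length never enters the formula, which is consistent with the remark that longer sequences mainly raise memory demands rather than enlarge the search space.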
the results of this analysis are used here to illustrate two new uses: longitudinal analyses of patterns of zoonotic transmission and assessment of surveillance quality. our results on the relationships of ha subtypes, shown in fig. 2.6, have similarities with the results of suzuki and nei [suz02], including the clades ((h8 h12) h9), ((h15 h7) h10), ((h4 h14) h3), and (((h2 h5) h1) h6). however, the positions of h13 and h11 differ in our trees due to our inclusion of h16. moreover, the relationship of these clades to one another differs in our assessments. our tree has a staircase shape with ((h8 h12) h9) basal most, whereas suzuki and nei's [suz02] tree has a symmetrical shape with no clear basal group.
fig. 2.6. phylogeny of hemagglutinin (ha) sequences representing 2,358 isolates of influenza a, with a single sequence of influenza b as outgroup. to summarize the source tree we have condensed each subtype clade into a single branch. the numbers of isolates included in the full tree are presented as the numerals above each branch. the numerals below each branch represent jackknife support values (0 worst to 100 best). sequence and character data were drawn from genbank (www.ncbi.nlm.nih.gov) and the influenza sequence database (www.flu.lanl.gov).
influenza a viruses from wild aquatic birds have been identified as the source of influenza viruses isolated from birds of the order galliformes (e.g., turkeys, grouse, quails, pheasants, domestic chickens, and their ancestral stock the jungle fowl) [web92]. direct human infection by avian strains of influenza a is considered rare [lip04]. after the discovery of receptors for both avian and mammalian strains of influenza in the trachea of pigs, it was hypothesized that domestic swine act as intermediate hosts in which human and avian viruses can recombine [sch90]. this mechanistic hypothesis of viral transmission is widespread.
however, as discussed above, a number of events of suspected direct transmission of avian influenza viruses to humans have been reported [lip04, ung05]. hypotheses on the relative frequency of host shifts can be made on a phylogenetic tree through the optimization [fit71] of a character with states representing the various hosts of the viral isolates under consideration (see sect. 2.2.1). we performed this analysis on our tree of 2,359 ha sequences and found that most of the internal nodes close to the root are optimized as having an avian origin (fig. 2.7). thus the results of this analysis are consistent with the hypothesis of an avian origin of all influenza type a viruses [web92]. these results also show that most major lineages of influenza a that infect domestic birds originated in aquatic birds. this is compatible with the hypothesis that wild aquatic birds are the natural reservoir of influenza viruses of type a [web92]. however, the pattern of host shifts resulting from our study of 2,359 ha sequences seems to be much more complex than previously thought [gam90, lip04]. for instance, in many cases, after the spread of influenza type a viruses into domestic bird and mammal populations (including humans), some derived lineages later spread back to aquatic birds. furthermore, the results indicate that direct shifts from avian to human hosts have occurred 18-27 times independently in different lineages (without observed intermediate hosts). it must be noted that the possibility of an intermediate host in avian-to-human transmission events cannot be completely rejected. it is possible that an intermediate host existed in nature but was not sampled by the surveillance program and therefore not included in the analysis. however, based on the available evidence, it seems that host shifts from birds to humans have been frequent in the evolutionary history of influenza type a.
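the optimization of a host character described above can be sketched in python with fitch's parsimony algorithm. this is a minimal illustration under stated assumptions: the tree is a fixed, rooted, binary nested tuple, the six isolate names and their hosts are invented, and the functions `fitch_down` and `fitch_up` are hypothetical helpers rather than the procedure actually run on the 2,359-sequence tree.

```python
def fitch_down(tree, hosts):
    """Post-order pass: return ((state_set, children), n_changes) for a nested-tuple tree."""
    if isinstance(tree, str):                      # leaf: the observed host
        return ({hosts[tree]}, ()), 0
    left, left_cost = fitch_down(tree[0], hosts)
    right, right_cost = fitch_down(tree[1], hosts)
    shared = left[0] & right[0]
    if shared:                                     # children agree: no extra change
        return (shared, (left, right)), left_cost + right_cost
    # Children disagree: keep the union and count one host shift.
    return (left[0] | right[0], (left, right)), left_cost + right_cost + 1


def fitch_up(node, parent_state=None, shifts=None):
    """Pre-order pass: fix one most-parsimonious state per node and list the host shifts."""
    shifts = [] if shifts is None else shifts
    states, children = node
    # Keep the parent's host when allowed; otherwise break ties alphabetically.
    state = parent_state if parent_state in states else sorted(states)[0]
    if parent_state is not None and state != parent_state:
        shifts.append((parent_state, state))
    for child in children:
        fitch_up(child, state, shifts)
    return shifts


# Hypothetical six-isolate tree; names and hosts are invented for illustration.
tree = ((("duck1", "chicken1"), ("human1", "duck2")), ("swine1", "human2"))
hosts = {"duck1": "avian", "chicken1": "avian", "human1": "human",
         "duck2": "avian", "swine1": "swine", "human2": "human"}

annotated, n_changes = fitch_down(tree, hosts)
shifts = fitch_up(annotated)
print(n_changes, shifts)
```

the second pass returns only one of possibly several equally parsimonious reconstructions. a different tie-breaking rule can change how many shifts are read as avian-to-human, which is one plausible reason for reporting a range such as 18-27, although the text does not say how that range was obtained.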
moreover, avian-to-human shifts are more common than swine-to-human shifts in the history of influenza. multiple direct avian-to-human shifts appear to occur in the case of the putative pandemic strains of influenza a (subtype h5n1) that have spread across eurasia since 1997 [who05]. in addition to being highly pathogenic, these h5n1 strains have independently infected other hosts such as felids and pigs in several instances.
fig. 2.7. two character optimizations on the tree for hemagglutinin (ha) sequences representing 2,358 isolates of influenza a, with an influenza b outgroup at the root. the top tree has an optimization of the character "ha antigenic subtype". the lower tree depicts optimization of the character "host". character data were drawn from genbank (www.ncbi.nlm.nih.gov) and the influenza sequence database (www.flu.lanl.gov). optimizations and tree graphics were made with mesquite (www.mesquiteproject.org). for better visualization contact the authors for files in scalable pdf format.
phylogenetics is practiced by most as a historical science; however, several researchers have noted that aspects of the tree shape may be used in predicting future genetic lineages of influenza against which it is important to design vaccines [gre04]. notable among these assertions are the studies of the shape of influenza a phylogeny as viewed through the hemagglutinin (ha) gene [bus99, fer02]. the ha gene codes for a surface glycoprotein of the virion responsible for binding to sialic acid on host cell surface receptors. at a genomic level, lineages of influenza are constantly changing due to mutation, which occurs at high rates in rna viruses. the extinction of evolutionary lineages of viruses is common when hosts have become immune to them or when susceptible hosts are in short supply [gre04]. this process of constant replacement of influenza lineages produces a characteristic coniferous shape in a phylogeny reconstructed from ha sequences [bus99].
the "conifer" metaphor refers to the hypothesis that influenza ha is constantly changing but there is limited diversity at any time [fer02]. thus an influenza ha tree appears to be formed by addition of strains to the apex of the tree's trunk, which contains the contemporary "infectious" viruses, rather than to more basal, presumably "extinct" lineages to which hosts are immune. other groups of researchers have used the assumption that there is limited influenza a diversity at any one time to downplay the utility of phylogenetic approaches [plo02]. as an alternative to phylogenetics, which they consider difficult, these groups make predictions based on the size of various clusters of related isolates, termed "swarms" [plo02]. several groups, whether using trees or swarms, have identified putatively dominant strains of influenza to predict the genetic makeup of future viral populations [plo02, bus99]. if these assumptions were never violated, the diversity of a previous year's flu season could be assessed, forthcoming strains predicted, and the results used to inform vaccine design. in practice, the cdc uses a mixture of viral strains composed of h1n1 and h3n2 of influenza a and an influenza b virus [pal06]. notably the h5n1 strain (or any of the other avian strains with potential to infect humans) is currently not considered in the vaccine that is seasonally administered to civilians in the united states. the ability to predict influenza viral strains that will affect human and animal populations is important. however, prediction methods and experimental designs that are relevant to those methods are in their infancy. current surveillance programs are focused on detection of antigenically novel strains. as such, surveillance programs are not designed as ecological experiments to quantitatively measure strain-specific incidence and cluster size.
furthermore, the current sample of influenza diversity may be biased by partial genomic sequencing, differences in effort within various geographic and political boundaries, focus on certain subtypes of interest, and differential efforts over time due to variable public concern. recent papers using whole genome data have indicated that the conifer-like growth assumption of ha-based phylogenies that has been central to predictive models of h3n2 seasonal influenza [bus99, fer02] may be violated. full genome analysis of h3n2 has shown that there are multiple co-circulating lineages, some of which may be overlooked by vaccine designs [hol05, ghe05]. similarly, our large-scale analysis of 2,359 ha sequences shows that many subtypes and lineages within subtypes of influenza are circulating and being exchanged among human and animal populations at any one time (fig. 2.7). in addition to providing hypotheses on the relationships of a group of organisms, phylogenetic trees imply a temporal order of the successive internal nodes (i.e., the times at which a single evolutionary lineage splits, producing two independent descendant lineages). minimal estimates of the dates at which these evolutionary splits occur can be obtained through the analysis of the times at which the descendant organisms (leaves) are known to occur. these estimates can be computed with the implementation of an irreversible sankoff character in which the cost of transformation between two character states represents the amount of time elapsed between the times of appearance of two terminal taxa [pn01]. influenza a viral sequences are named with the host, locale, and year in which each isolate was sampled by the surveillance program. several methods exist to measure the correlation between the temporal dates of sampled organisms and the relative order they show in the phylogenetic tree. here we adapt the manhattan stratigraphic metric (msm*) to influenza surveillance.
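the irreversible date character described above can be sketched in python as follows. several points are assumptions made for illustration rather than details given in the text: each internal node is dated by its earliest sampled descendant, the length of the date character is the sum of elapsed years along all branches, and msm* is taken as the ratio of the smallest length attainable on any tree (latest minus earliest sampling year) to the length observed on the tree at hand. the function names, isolate labels, and years are hypothetical, and the significance test against a random expectation is not shown.

```python
def date_and_length(tree, year):
    """Return (earliest sampling year below this node, summed years along its branches)."""
    if isinstance(tree, str):                      # leaf: year the isolate was sampled
        return year[tree], 0
    results = [date_and_length(child, year) for child in tree]
    # Irreversible character: a node cannot be younger than any of its descendants.
    node_year = min(child_year for child_year, _ in results)
    length = sum(sub_length + (child_year - node_year)
                 for child_year, sub_length in results)
    return node_year, length


def msm_star(tree, year):
    """Ratio of the minimum possible date-character length to the observed length."""
    _, observed = date_and_length(tree, year)
    minimum = max(year.values()) - min(year.values())
    return minimum / observed if observed else 1.0


# Hypothetical isolates and sampling years, invented for illustration.
year = {"a": 1997, "b": 2003, "c": 2005}
congruent = (("c", "b"), "a")      # branching order follows sampling order
incongruent = (("a", "c"), "b")    # oldest isolate nested with the youngest

print(msm_star(congruent, year), msm_star(incongruent, year))
```

values near 1 indicate that the branching order closely follows the order of sampling dates, while lower values indicate long unsampled intervals implied by the tree. the `year` dictionary is assumed to contain exactly the isolates present in the tree.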
the msm* was originally developed to assess the quality of the fossil record [pn01] (table 2.2). however, the msm* is simply a quantitative measure of how well the available data reflect the diversification pattern of the taxa present in the optimal phylogenetic trees and is thus of general utility. an extensive sampling of sequences, such as the one gathered for the study of 2,359 isolates, is critical to comparatively assess quality of surveillance in various regions, among various strains, and over periods of time. our results show that the correlation between branching pattern and dates of viral isolation is good in that it significantly differs from a random expectation. this is true over the entire tree as well as when some individual lineages are measured. however, the relative quality of surveillance differs markedly between lineages. one example of differential surveillance quality occurs in two closely related groups of avian influenza of the h5 hemagglutinin subtype. one group in this example contains the highly pathogenic h5n1 strains that currently circulate in eurasia, the middle east, and africa. this large clade has been the focus of intense surveillance since the discovery of widespread infection among wild and domestic birds and some avian-to-human transmission [yua02, ung05]. the h5n1 viral isolates form a sister clade to the h5n2 isolates known from domestic and wild birds in the americas. the number of available hemagglutinin sequences of h5n2 is less than one fifth of the number of ha sequences for h5n1. this in itself represents a measure of the surveillance intensity devoted to these two groups of avian influenza. however, even if the number of sequences is normalized at 100 sequences to perform the msm* test, the surveillance quality of the h5n1 clade is far superior to that of the h5n2 clade. we can also use visualization techniques to assess surveillance quality.
typically branches of a phylogenetic tree are scaled to depict the number of mutations or other character changes assigned to each branch. however, we have adapted this use of branch scaling to reflect the number of years that have passed between sampling of related isolates rather than mutations or characters. compare fig. 2.8, which has short branch lengths reflecting good surveillance quality, with fig. 2.9, which has long branch lengths implying poor surveillance quality. cases in which there is poor correlation between the date of sampling of a given isolate and its inferred date of origin would indicate that the surveillance program is failing to closely monitor the persistence of diverse lineages of influenza (fig. 2.9). no matter the type of phylogenetic perspective they may espouse, most virologists produce the same basic data by surveying putative host animals and patients with antibodies, then isolating and sequencing partial or whole genomes of various viruses detected in hosts. molecular phylogenetic analyses of the nucleotide or inferred amino acid sequence data from various viral isolates can then be used to reconstruct the history of the transmission events of the virus among hosts. the fundamental belief associated with this research program is that the branching pattern of the phylogeny will reveal a temporal series of transformations when a character of interest, such as the host, is optimized on the viral phylogeny. most virology researchers rely on distance methods. the most popular distance method among virologists is neighbor-joining (nj) [sai87]. distance methods require a precomputed multiple alignment of dna or amino acid sequences drawn from homologous genes of the viral strains of interest. then, in nj, the most grossly similar pair of isolates (as represented by sequences) is clustered. the clustered pair is then considered as a single taxon and the next most similar pair of taxa is clustered, and so on until only two taxa remain and are joined.
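the clustering loop described above can be sketched in python. note one difference from the simplified description: the standard nj criterion of saitou and nei [sai87] does not pick the pair with the smallest raw distance but the pair minimizing a corrected score, q(i,j) = (n-2)d(i,j) - r(i) - r(j), where r is each taxon's summed distance to all others. the function name, the `node0`-style labels for internal nodes, and the five-taxon distance matrix are hypothetical choices made for illustration.

```python
from itertools import combinations


def neighbor_joining(names, dist):
    """names: taxon labels; dist: {frozenset({a, b}): distance}.
    Returns the unrooted tree as a list of (node, node, branch_length) edges."""
    d = dict(dist)
    nodes = list(names)
    edges = []
    next_id = 0
    while len(nodes) > 2:
        n = len(nodes)
        # Summed distance from each active node to all other active nodes.
        r = {a: sum(d[frozenset((a, b))] for b in nodes if b != a) for a in nodes}
        # Pick the pair minimizing the Q criterion (first pair wins ties).
        i, j = min(combinations(nodes, 2),
                   key=lambda p: (n - 2) * d[frozenset(p)] - r[p[0]] - r[p[1]])
        dij = d[frozenset((i, j))]
        length_i = dij / 2 + (r[i] - r[j]) / (2 * (n - 2))
        u = f"node{next_id}"
        next_id += 1
        edges += [(u, i, length_i), (u, j, dij - length_i)]
        # Distances from the new internal node to every remaining node.
        for k in nodes:
            if k not in (i, j):
                d[frozenset((u, k))] = (d[frozenset((i, k))] + d[frozenset((j, k))] - dij) / 2
        nodes = [k for k in nodes if k not in (i, j)] + [u]
    a, b = nodes
    edges.append((a, b, d[frozenset((a, b))]))     # join the last two nodes
    return edges


# Hypothetical additive distance matrix for five taxa.
names = ["a", "b", "c", "d", "e"]
pairs = {("a", "b"): 5, ("a", "c"): 9, ("a", "d"): 9, ("a", "e"): 8, ("b", "c"): 10,
         ("b", "d"): 10, ("b", "e"): 9, ("c", "d"): 8, ("c", "e"): 7, ("d", "e"): 3}
dist = {frozenset(p): float(v) for p, v in pairs.items()}

for edge in neighbor_joining(names, dist):
    print(edge)
```

the result is a list of edges with no designated root, which matches the point that distance methods by themselves impart no polarity to the inferred transformations.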
in distance methods no outgroups are proposed and no assumptions of ancestral character states are considered. as a result, polarity of transformations can only be inferred as from dissimilar to similar. distance methods output a single unrooted, star-shaped graph. nevertheless, when preparing figures, some investigators who use distance methods choose to impart directionality by selecting an edge of the graph to serve as a root of the tree. the choice of root is crucial in depicting the polarity of host shifts and depicting clades. the rooting step has been executed variably by researchers comparing sequence data from covs isolated from humans with sequence data from covs isolated from small carnivores. in the case of guan et al. fig. 1a), rooted on a clade comprised of two sars-cov isolates, one from human and the other from a carnivore ([son05] their fig. 1b), and in a regression analysis a date for a common ancestor of sars-cov isolated from humans is calculated using a human basal group ([son05] their fig. 3). in papers comparing sequence data from sars-like cov recently isolated from bats to that from humans and small carnivores, lau et al. [lau05] do not root their trees (their fig. 2) and li et al. [li05] force the root position on their drawing such that one of the bat sequences is ancestral (their fig. s4 of the supplemental material). thus, although all these studies employ distance methods, the researchers use various, often facultative means to infer the animal origins of sars-cov. other methods of phylogenetic analysis focus on characters, states, edit costs for changes among states, outgroup assumptions, and polarity of change among states, rather than on gross similarity as in distance methods. characters can be polymorphisms recognized in columns of aligned nucleotides or amino acids from sequences of interest or phenotypic states such as host, date of isolation, or antigenic subtype.
another feature of most character-based methods that differs from nj is that character-based methods examine many randomly generated trees (each representing an evolutionary hypothesis of character transformations and organismal relationships). thus the concepts of optimality and hypothesis testing are tightly associated with cladistic and maximum likelihood inference. optimal trees represent more defensible hypotheses. moreover, character-based methods of molecular phylogenetics rely on explicit choices of outgroup to make assumptions about ancestral character states and thus polarize transformations of phenotypes and genotypes that can be reconstructed from data. in order to make an explicit assumption of ancestral character states, the investigator designates at least one taxon as the outgroup. the outgroup method originated in cladistics [wat81] and has become central to phylogenetic inference [nix94]. if chosen carefully, the outgroup estimates baseline character states in sequence and phenotypes such that transformations (such as host shifts) can be reliably inferred. to illustrate the choice of outgroup taxon or taxa and clarify the relationships of the organisms, character-based trees are often rooted between the outgroup and ingroup. just as in distance methods, in most character-based methods, sequence data are aligned before the phylogenetic analysis. novel implementations, termed direct optimization, allow unaligned sequence data to be analyzed without precomputing an alignment [whe96]. in direct optimization, sequence data are aligned as various trees are built and their optimality is assessed (using maximum likelihood and cladistic optimality criteria as specified by the investigator). thus for each tree a specific alignment that is optimal for that tree is constructed.
one additional advantage of direct optimization is that the outgroup need not be designated by the investigator but can instead be randomized during the search for optimal trees and alignments. in some implementations of character-based methods where prealignment is necessary, the outgroup can be randomized by scripting a series of analyses. outgroup randomization enables analyses of taxa where previous knowledge of ingroup/outgroup relationships is lacking or is among the hypotheses the investigator wants to test via tree search. large-scale phylogenetic analyses are particularly useful to study global problems of infectious disease. however, phylogenetic analysis of large numbers of organisms and whole genomes is an extremely challenging computational problem. recent advances in heuristic tree search algorithms, alignment methods, and parallel computing strategies have pushed upward the limits of taxon sampling considered tractable. the analysis of large datasets is interesting not only because it presents computational challenges but also because it is leading to new knowledge about natural phenomena. for example, in the recent past, researchers working on small datasets argued that influenza had limited diversity at any one time and that this should allow us to predict which strains are important for vaccine design. on the contrary, with large datasets, we find that there are multiple co-circulating lineages at any one time. thus, large datasets and means to analyze them are important for future vaccine design. character-based approaches to phylogenetics provide a wide variety of tools that can be used to better understand the evolutionary processes underlying the spread of infectious diseases. we have discussed here only some of the wide array of applications that phylogenetics and outgroup criteria can have in genomic studies of infectious disease.
we look forward to the continued synergistic development of technological and scientific means to better understand infectious and zoonotic disease. molecular virology: was the 1918 flu avian in origin? arising from genetic algorithms and parallel processing in maximum-likelihood phylogeny inference positive selection on the h3 hemagglutinin gene of human influenza virus a forensics and mitochondrial dna: applications, debates, and foundations influenza as a model system for studying the crossspecies transfer and evolution of the sars coronavirus predicting the evolution of human influenza a phylogeny and physiology of drosophila opsins parallel implementation of dnaml program on message-passing architectures recreating ancestral proteins hitch-hiking: a parallel heuristic search strategy, applied to the phylogeny problem the chinese sars molecular epidemiology consortium: molecular evolution of the sars coronavirus during the course of the sars epidemic in china therapeutic and vaccine manufacturers working to combat the avian flu ecology and evolution of the flu avian influenza virus sequences suggest that the 1918 pandemic virus did not acquire its hemagglutinin directly from birds methods for computing wagner trees the logical basis of phylogenetic analysis parsimony jackknifing outperforms neighbor-joining maximum likelihood and minimum-step methods for estimating trees from data on discrete characters the number of evolutionary trees evolutionary trees from dna sequences: a maximum likelihood approach predicting evolutionary change in the influenza a virus ecological and immunological determinants of influenza evolution the steiner problem in phylogeny is np-complete towards defining the course of evolution: minimum change for a specific tree topology long term trends in the evolution of h(3) ha1 human influenza type a optimal sequence alignments simultaneous statistical alignment and phylogeny reconstruction in the emergence of zoonotic diseases: 
understanding the impact on animal and human health phylogenetic analysis of nucleoproteins suggests that human influenza a viruses emerged from a 19th century avian ancestor pandemic planning and preparedness feldblyum and 14 others, large-scale sequencing of human influenza reveals the dynamic nature of viral genome evolution was the 1918 pandemic caused by a bird flu? arising from parallel searches of large datasets tnt: tree analysis using new technologies, software package distributed by the authors and available at analyzing large datasets in reasonable times: solutions for composite optima data exploration in phylogenetic inference: scientific, heuristic, or neither unifying the epidemiological and evolutionary dynamics of pathogens isolation and characterization of viruses related to the sars coronavirus from animals in southern china phylogenetic systematics branch and bound algorithms to determine minimal evolutionary trees pandemics and pandemic scares in the 20th century part 2 public health guidance for state and local partners is sparse taxon sampling a problem for phylogenetic inference? 
mr. farhat habib m.s. was instrumental to build and maintain cluster computers. rebecca allen and jiarui lian helped to organize phenotype and genetic data. the authors have no competing interests, financial or otherwise. dp acknowledges the national science foundations (nsf) support of the mathematical biosciences institute (mbi) and the mbi. dp and dj acknowledge the support of the department of biomedical informatics and the ohio state university medical center. dj acknowledges that this material is based upon work supported by, or in part by, the us army research laboratory and the us army research office under contract/grant number w911nf-05-1-0271 and nsf 0531763. key: cord-274056-9t3kneoo authors: abd elwahaab, marwa a.; abo-elkhier, mervat m.; abo el maaty, moheb i. title: a statistical similarity/dissimilarity analysis of protein sequences based on a novel group representative vector date: 2019-05-08 journal: biomed res int doi: 10.1155/2019/8702968 sha: doc_id: 274056 cord_uid: 9t3kneoo similarity/dissimilarity analysis is a key way of understanding the biology of an organism by knowing the origin of the new genes/sequences. sequence data are grouped in terms of biological relationships. the number of sequences related to any group is susceptible to be increased every day. all the present alignment-free methods approve the utility of their approaches by producing a similarity/dissimilarity matrix. although this matrix is clear, it measures the degree of similarity among sequences individually.
in our work, a representative of each of three groups of protein sequences is introduced. a similarity/dissimilarity vector, based on the group representative, is evaluated instead of the ordinary similarity/dissimilarity matrix. the approach is applied to three selected groups of protein sequences: beta globin, nadh dehydrogenase subunit 5 (nd5), and spike protein sequences. a cross-group comparison is carried out to confirm the distinctness of each group. a qualitative comparison of our approach with previous articles and with the phylogenetic tree of these protein sequences demonstrates its utility. sequence comparison is used to study structural and functional conservation and evolutionary relations among sequences. the similarity/dissimilarity of biological sequences matters because of its relationship with structure and function: proteins with similar sequences usually have similar structures. the rate at which new sequences are added to the databases is increasing exponentially [1]. comparing these new sequences to those with known functions is a key way of understanding the biology of an organism. thus, sequence analysis can be used to assign function to genes and proteins by studying the similarities between the compared sequences. many tools and techniques provide sequence comparison. sequence comparison methods can be classified into alignment-based methods and alignment-free methods [2, 3]. alignment-based methods assign scores to the different possible alignments and pick the alignment with the highest score. the algorithms perform either global or local alignment [4] [5] [6]. blast [7] and fasta [8] are the most widely used applications. alignment-based methods become computationally expensive when multiple sequences are aligned at the same time. a wide range of scoring systems has been proposed, such as the amino acid substitution scoring matrices pam and blosum for protein alignment [9].
alignment-free approaches overcome the limitations of alignment-based methods. graphical representation approaches are one family of them. graphical representations are usually accompanied by a numerical characterization and then a descriptor that describes each protein sequence. a similarity/dissimilarity analysis is then done on these descriptors by evaluating the euclidean distance or the correlation angle between them; the smaller the euclidean distance or correlation angle, the more similar the sequences. many graphical representations of dna and protein primary sequences have been proposed. some other approaches characterize protein sequences numerically without a previous graphical representation (nongraphical representation methods) [10, 11]. in this article, an alignment-free method is introduced. it is considered a nongraphical representation method. three groups of protein sequences are selected to illustrate our approach. they are beta globin, nadh dehydrogenase subunit 5 (nd5), and spike protein sequences. they are selected as each group has sequences of a similar range of lengths.

table 1
no.  species     id        length
1    human       aaa16334  147
2    chimpanzee  caa26204  125
3    gorilla     caa43421  121
4    mouse       caa24101  147
5    rat         caa29887  147
6    gallus      caa23700  147
7    opossum     aaa30976  147

table 2 (only the last row survives)
     opossum     np_007105 602

the most common sequences of each group are selected. the selected sample is assumed to be unbiased and the population distribution of each group is assumed to be normal. therefore, the selected sample represents the group, and statistics can be used to estimate the population's parameters. the adjacency vector is introduced as a novel descriptor for protein sequences. it is computed for each sequence in the selected samples of the three groups. a reference vector is then computed for each group. this vector acts as a representative of the group. the degree of similarity of each sequence is measured against its group's representative vector.
so, a similarity/dissimilarity vector is constructed instead of the ordinary similarity/dissimilarity matrix. our approach is independent of the protein sequence length. it does not require any previous graphical representation. it is a mathematically simple approach. the protein sequences used in this article are listed in tables 1, 2, and 3. the sequences are downloaded from the national center for biotechnology information (ncbi), "https://www.ncbi.nlm.nih.gov/", as fasta files. these fasta files are imported into wolfram mathematica 8, where all the results and figures are produced. the phylogenetic tree of these protein sequences is also created by the basic local alignment search tool (blast), "https://blast.ncbi.nlm.nih.gov/blast.cgi". table 1 shows the 1st sample set, which consists of beta globin protein sequences from seven species. their lengths range from 121 to 147. this sample set was used before in [12]. table 2 shows the 2nd sample set, which consists of nine nd5 protein sequences. their lengths range from 602 to 610. this sample set was used before in [12-25]. table 3 shows the 3rd sample set, which consists of 29 spike protein sequences. their lengths range from 1162 to 1447. these viruses are coronaviruses. they are classified into four classes: class i includes the porcine epidemic diarrhea virus (pedv) and the transmissible gastroenteritis virus (tgev); class ii includes the bovine coronavirus (bcov), human coronavirus oc43 (hcov-oc43), and the murine hepatitis virus (mhv); class iii contains the infectious bronchitis virus (ibv); the others are severe acute respiratory syndrome coronaviruses (sars-cov). this sample set was used before in [26]. in this approach, a new vector is suggested as a descriptor of a protein sequence. this vector is called the adjacency vector (a_xy); x refers to the species' protein sequence and y refers to its related group.
it counts the occurrences of all possible pairwise adjacencies obtained by reading the protein primary sequence from left to right. the protein sequence is composed of the 20 common amino acids, written here by their one-letter codes in the order "a," "r," "n," "d," "c," "q," "e," "g," "h," "i," "l," "k," "m," "f," "p," "s," "t," "w," "y," and "v," which is alphabetical by three-letter code (ala, arg, asn, asp, and so on). therefore, the adjacency vector (a_xy) consists of 400 elements. every 20 consecutive elements are related to one amino acid: the first 20 elements are related to amino acid "a," the second 20 to "r," the third 20 to "n," and so on in the order given above. we borrow our idea from the 20 × 20 adjacency matrix [27]. the adjacency vector counts the occurrences of each pair. in other words, it counts the number of times that each pair is repeated along the sequence. if a pair does not exist, its value in the adjacency vector is zero.

table 4
aa ar an ad ac aq ae ag ah ai al ak am af ap as at aw ay av
1  0  1  0  0  0  0  1  4  0  3  0  0  1  0  0  1  0  1  2

tables 5, 6, and 7: only header fragments survive ("va", "aa ar", and "va", respectively).

for example, consider the adjacency vectors of two short segments of a yeast (saccharomyces cerevisiae) protein [16, 19, 22-24, 28]:
protein i: "wtfesrndpakdpvilwlnggpgcssltgl"
protein ii: "wffesrndpandpiilwlnggpgcssftgl"
the two protein sequences are each composed of 30 amino acids. reading the sequence from left to right, protein i is converted to 29 adjacent pairs: wt, tf, fe, es, sr, rn, nd, dp, pa, ak, kd, dp, pv, vi, il, lw, wl, ln, ng, gg, gp, pg, gc, cs, ss, sl, lt, tg, gl. in the same way, protein ii is converted to 29 adjacent pairs: wf, ff, fe, es, sr, rn, nd, dp, pa, an, nd, dp, pi, ii, il, lw, wl, ln, ng, gg, gp, pg, gc, cs, ss, sf, ft, tg, gl. for example, the "nd" pair has a count of one in protein i and two in protein ii.
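the pair counting described above can be sketched in python as follows. this is a minimal illustration, not the authors' mathematica code: the names `AMINO_ACIDS`, `PAIR_INDEX`, and `adjacency_vector` are hypothetical, and silently skipping pairs that contain a non-standard residue is an assumption made only for this sketch.

```python
from itertools import product

# amino acid order used in the text (one-letter codes)
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"

# all 400 ordered pairs in vector order: "AA", "AR", "AN", ..., "VV"
PAIRS = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]
PAIR_INDEX = {pair: i for i, pair in enumerate(PAIRS)}


def adjacency_vector(sequence):
    """Count every overlapping adjacent pair, reading left to right."""
    seq = sequence.upper()
    vector = [0] * len(PAIRS)
    for left, right in zip(seq, seq[1:]):
        index = PAIR_INDEX.get(left + right)
        if index is not None:  # assumption: ignore non-standard residues
            vector[index] += 1
    return vector


protein_i = "WTFESRNDPAKDPVILWLNGGPGCSSLTGL"
protein_ii = "WFFESRNDPANDPIILWLNGGPGCSSFTGL"
a_i = adjacency_vector(protein_i)
a_ii = adjacency_vector(protein_ii)

print(len(a_i), sum(a_i))                            # 400 29
print(a_i[PAIR_INDEX["ND"]], a_ii[PAIR_INDEX["ND"]])  # 1 2
```

because the vector always has 400 elements, two sequences of different lengths can be compared directly, which is the length independence claimed for the approach.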
the "dp" pair has a count of two in both protein i and protein ii. the "sl" and "lt" pairs have a count of one in protein i and zero in protein ii. our approach is applied to three selected groups of protein sequences. the groups are beta globin, nd5, and spike protein sequences, as listed in tables 1, 2, and 3, respectively. the most common protein sequences are selected in each group. the selected sample is assumed to be unbiased and the population distribution of each group is assumed to be normal. therefore, the three selected samples can represent the three groups. the samples consist of seven beta globin, nine nd5, and 29 spike protein sequences. seven adjacency vectors for the beta globin proteins, nine for the nd5 proteins, and 29 for the spike proteins are evaluated. for example: (1) the first 20 elements of the adjacency vector of the human beta globin protein sequence (a_human beta globin) are shown in table 4. (2) the last 20 elements of the adjacency vector of the gorilla nd5 protein sequence (a_gorilla nd5) are shown in table 5. the adjacency vector describes each protein sequence individually within its group. this article also provides a descriptor for the group itself. the median vector is selected to play the role of the group representative (gr_y); y refers to the group. it acts as a reference vector for each group. the median is a robust measure of central tendency: it separates the higher half of the sample's data from the lower half and, unlike the average, it is not sensitive to extreme values. the group representative vector (gr_y) is also composed of 400 elements. each element is the median of the corresponding elements of all adjacency vectors in the sample that represents the group. the representative vectors of the beta globin, nd5, and spike protein sequences are computed. for example: (1) the first 20 elements of the beta globin representative vector (gr_beta globin) are shown in table 6.
(2) the last 20 elements of the nd5 representative vector (gr_nd5) are shown in table 7. (3) the first 20 elements of the spike protein representative vector (gr_spike proteins) are shown in table 8.

table 8
aa ar an ad ac aq ae ag ah ai al ak am af ap as at aw ay av
9  1  3  5  2  4  3  6  0  7  6  2  1  3  6  4  7  1  5  3

a similarity/dissimilarity vector is introduced instead of the regular similarity/dissimilarity matrix [10, 11]. the similarity/dissimilarity matrix is a square symmetric matrix with zeros on its main diagonal. to evaluate this matrix, the degree of similarity between each protein sequence and every other sequence in the same group must be measured. if the 1st row represents human and the 2nd row represents gorilla, the similarity of all species to human is measured in the 1st row, then the similarity of all species to gorilla is measured in the 2nd row, and so on. the number of calculations for this matrix equals $\sum_{k=1}^{n} (k - 1) = n(n - 1)/2$, where n is the number of compared species. the similarity/dissimilarity vector is suggested to save time and reduce the number of calculations. it is a vector whose number of elements equals the number of protein sequences in the selected sample of each group. it measures the degree of similarity between each protein sequence's adjacency vector and the group representative vector. in other words, it measures the degree of similarity between each protein's descriptor and the "group representative." it is simpler than the matrix, since it is calculated only once for each sequence. the number of calculations for this vector equals n, where n is the number of compared species. to measure the degree of similarity, we suggest two methods: (ii) the 2nd method.
compute the angle between each sequence's adjacency vector (a_xy) and the group representative vector (gr_y), in radians, by $\theta_{xy} = \arccos\left( \frac{a_{xy} \cdot gr_{y}}{\lVert a_{xy} \rVert \, \lVert gr_{y} \rVert} \right)$. for the beta globin protein sequences, seven species are selected in our sample set: human, chimpanzee, gorilla, mouse, rat, gallus, and opossum, as listed in table 1. there are seven adjacency vectors corresponding to them. the group representative gr_beta globin is evaluated from these seven adjacency vectors. therefore, the similarity/dissimilarity vector has seven elements. the 1st element corresponds to human, the 2nd element corresponds to chimpanzee, and so on, in the same order as in table 1. the nd5 and spike protein groups are treated in the same way, following the order of tables 2 and 3, respectively. the similarity/dissimilarity vectors corresponding to the beta globin, nd5, and spike protein sequences are given in tables 9, 10, and 11, respectively, based on the two methods discussed before. the results in table 9 show that the magnitude (computed for each species x) cannot measure the degree of similarity/dissimilarity well among all beta globin sequences: human, chimpanzee, and gorilla have the same value, 0.5568, while the similarity between mouse and rat is well measured and the dissimilarity between opossum and human is very clear. the angle successfully measures similarity/dissimilarity among all the species, as shown in figure 1. closer values of either the magnitude or the angle mean greater similarity. the results in table 10 show that both the magnitude and the angle measure the degree of similarity/dissimilarity well among the nd5 protein sequences, as shown in figure 2. it is obvious that pigmy chimpanzee, common chimpanzee, human, and gorilla are very similar. the results also show the similarity of blue whale and fin whale, and of mouse and rat, as pairs, and the dissimilarity between human and opossum. these results agree with [13, 14, 16, 18, 19, 21-25].
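the group representative (element-wise median) and the angle-based similarity/dissimilarity vector described above can be sketched as follows. this is a minimal python illustration under stated assumptions: the function names are hypothetical, the toy descriptors have 4 elements instead of 400, and only the angle measure is shown because the definition of the magnitude measure is not given in this text.

```python
import math
from statistics import median


def group_representative(vectors):
    """Element-wise median of equally long descriptor vectors."""
    return [median(column) for column in zip(*vectors)]


def angle(u, v):
    """Angle in radians between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    cosine = dot / (norm_u * norm_v)
    # clamp to [-1, 1] to guard against floating-point rounding
    return math.acos(max(-1.0, min(1.0, cosine)))


def similarity_vector(vectors):
    """One angle per sequence: n calculations instead of n(n - 1)/2."""
    gr = group_representative(vectors)
    return [angle(v, gr) for v in vectors]


# toy descriptors; real adjacency vectors have 400 elements
toy_group = [[2, 0, 1, 3], [2, 1, 1, 3], [0, 5, 1, 0]]
print(group_representative(toy_group))  # [2, 1, 1, 3]
print(similarity_vector(toy_group))
```

in the toy group the third descriptor is an outlier, yet the median representative stays close to the first two, which illustrates why the median is preferred over the average here. note that `angle` would fail on an all-zero vector; the sketch does not handle that case.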
the results in table 11 show that both the magnitude and the angle classify the three classes of viruses and the sars-covs well, each as a single coherent class, with the sole exception of the "mhvjhm" virus. this virus belongs to class ii, but our approach cannot classify it well. the classification of the 29 spike proteins into classes by our approach is illustrated in figure 3. the mhvjhm virus is the only wrongly classified sequence; it is colored red. despite the wrong classification of the mhvjhm virus, our approach corrects the broken classification of class i in [26]. according to the results in tables 9, 10, and 11, the angle is the preferred measure, as shown in figures 1, 2, and 3. the group representative vector (gr_y) carries the information of its group. a cross-group comparison is done to demonstrate the distinctness of each group. tables 9, 10, and 11 are evaluated by comparing each group's sample set of protein sequences with its own group representative vector. tables 12, 13, 14, and 15 are evaluated by comparing each group's sample set of protein sequences with another group's representative vector. the similarity/dissimilarity analysis among the seven beta globin sequences measured according to (gr_nd5) is illustrated in table 12 and shown in figure 4. the similarity/dissimilarity analysis among the nd5 sequences measured according to another group's representative vector is illustrated in table 13 and shown in figure 5. the similarity/dissimilarity analysis among the beta globin sequences measured according to (gr_spike) is illustrated in table 14 and shown in figure 6. the similarity/dissimilarity analysis among the nd5 sequences measured according to another group's representative vector is illustrated in table 15 and shown in figure 7. the results show a large distortion, which confirms the individuality of each group. the phylogenetic tree is a branching diagram showing the evolutionary relationships among various biological species based upon similarities and differences in their sequences.
a qualitative comparison between our results and the phylogenetic tree of the protein sequences is used to demonstrate the utility of our approach. agreement between the results and the phylogenetic trees means agreement with the naïve measure of sequence similarity (sequence homology). the basic local alignment search tool (blast) is used to draw the phylogenetic trees. the phylogenetic trees of the seven beta globin sequences, the nine nd5 sequences, and the 29 spike protein sequences are illustrated in figures 8, 9, and 10, respectively. the qualitative comparison of the results of tables 9, 10, and 11 with figures 8, 9, and 10 shows the utility of our work, especially of the angle results. the proposed method is an alignment-independent method. an adjacency vector is suggested as a descriptor of any protein sequence. it does not require any graphical representation. a group representative vector is introduced to represent each group of protein sequences. a similarity/dissimilarity vector is produced instead of the regular similarity/dissimilarity matrix. the similarity/dissimilarity analysis is done by two methods. our approach is applied to three sample sets from three groups of protein sequences. each sample has a different range of lengths from the others. our approach does not depend on protein sequence length: it successfully measured similarity/dissimilarity among sequences of different lengths. it is mathematically very simple. a cross-group comparison is introduced to demonstrate the distinctness of each group. the results confirmed the utility of our approach compared with previous articles and with the phylogenetic tree obtained by the blast program. we hope to extend the method to include ambiguous amino acid residues and nonstandard amino acids. we also hope to include the analysis of partial or gapped sequences. all data are described in the manuscript in section 2 under the title "dataset." in this section, we illustrate the data in three tables: tables 1, 2, and 3.
we also mention in the 1st paragraph of dataset that the data are downloaded from "gene bank." all data files have the extension ".fasta". the authors declare that they have no conflicts of interest.

[1] dna sequence comparison by a novel probabilistic method
[2] linear regression model of short k-word: a similarity distance suitable for biological sequences with various lengths
[3] sequence comparison via polar coordinates representation and curve tree
[4] a general method applicable to the search for similarities in the amino acid sequence of two proteins
[5] identification of common molecular subsequences
[6] an improved algorithm for matching biological sequences
[7] basic local alignment search tool
[8] rapid and sensitive protein similarity searches
[9] amino acid substitution matrices from protein blocks
[10] graphical representation of proteins
[11] similarity/dissimilarity calculation methods of dna sequences: a survey
[12] 3-d maps and coupling numbers for protein sequences
[13] a novel descriptor for protein similarity analysis
[14] similarity analysis of protein sequences based on 2d and 3d amino acid adjacency matrices
[15] a new method to analyze protein sequence similarity using dynamic time warping
[16] a 2d graphical representation of protein sequence and its numerical characterization
[17] graphical representation and similarity analysis of protein sequences based on fractal interpolation
[18] adld: a novel graphical representation of protein sequences and its application
[19] comparative analysis of protein primary sequences with graph energy
[20] uc-curve: a highly compact 2d graphical representation of protein sequences
[21] the graphical representation of protein sequences based on the physicochemical properties and its applications
[22] f-curve, a graphical representation of protein sequences for similarity analysis based on physicochemical properties of amino acids
[23] a novel method of 2d graphical representation for proteins and its application
[24] 3d graphical representation of protein sequences and their statistical characterization
[25] novel numerical characterization of protein sequences based on individual amino acid and its application
[26] similarities/dissimilarities analysis of protein sequences based on pca-fft
[27] on novel representation of proteins based on amino acid adjacency matrix
[28] a sequence-segmented method applied to the similarity analysis of long protein sequence

supplementary materials: a figure that summarizes our approach, submitted under the name of graphical abstract.

key: cord-287634-64zqe4cz authors: al-ssulami, abdulrakeeb m.; azmi, aqil m.; hussain, muhammad title: codseqgen: a tool for generating synonymous coding sequences with desired gc-contents date: 2020-01-31 journal: genomics doi: 10.1016/j.ygeno.2019.02.002 sha: doc_id: 287634 cord_uid: 64zqe4cz abstract identification of regulatory elements is essential for understanding the mechanism behind regulating gene expression. these regulatory elements, located in or near genes, bind to proteins called transcription factors to initiate the transcription process. their occurrences are influenced by the gc-content or nucleotide composition. for generating synthetic coding sequences with a pre-specified amino acid sequence and a desired gc-content, there exist two stochastic methods, multinomial and maximum entropy. both methods rely on the probability of choosing a synonymous codon for a specific amino acid. although the latter behaves in an unbiased manner, the sequences it produces do not exactly obey the gc-content constraint. in this paper, we present an algorithmic solution that produces coding sequences that follow exactly a primary amino acid sequence and a desired gc-content. the proposed tool, namely codseqgen, relies on randomly selected smaller subsets that are traversed using the backtracking approach.
the two regulatory elements, promoters, which are near the coding region, and enhancers, which are further upstream or downstream, are dna sequences located in the non-coding regions that bind to proteins called transcription factors. these interactions enable rna polymerase to transcribe the related gene into mrna, which in turn is translated into a specific protein. identifying such sequences is important for understanding the mechanism of regulating gene expression. the occurrences of regulatory elements are highly influenced by common features such as gc-content [1], di-nucleotide profile [2], and codon bias [3, 4]. thus, identifying over/under-represented regulatory elements or genome-scale patterns relies on generating random sequences that obey the pre-specified amino acid sequence and gc-content constraints. there are many tools for generating random sequences under various constraints. the well-known ones include sms [5], fabox [6], and genrgens [7]. the sms tool generates random coding sequences of a specific length given the translation table; its primary goal is evaluating the results of sequence analysis. fabox is used to construct random dna sequences with a predefined nucleotide composition. genrgens creates random sequences with several models, such as markov chains, hidden markov models, weighted context-free grammars, and others. the sequences generated by genrgens are essentially used for structural motif evaluation. although these tools generate random dna and coding sequences, none of them is capable of producing coding sequences given both the amino acid sequence and the gc-content. generating random coding sequences under complicated constraints is computationally expensive, because each amino acid is encoded by one to six codons. thus, for a particular protein sequence of length n, we have to test at most 6^n coding sequences to find those satisfying the desired amino acid sequence and gc-content constraints.
that is, the time complexity is exponential in the length of the amino acid sequence. currently, there are two solutions to this problem, multinomial [8, 9] and nullseq [10]. the multinomial method chooses the synonymous codon c of amino acid a with probability p_a(c) given the nucleotide composition. the probability p_a(c) is computed as a normalized product of the probabilities of the individual nucleotides within the codon c. as shown in [10], the multinomial method generates unbiased coding sequences. a more restricted method was presented recently, which the authors named nullseq. nullseq [10] uses the maximum entropy approach, where the synonymous codon usage probability is derived from a strict function that expresses the expected gc-content of the reference amino acid sequence. however, the majority of the resulting coding sequences do not exactly satisfy the gc-content constraint. in this paper, we present codseqgen, the first exact solution for producing synonymous coding sequences with exactly the desired gc-content. the proposed method uses the backtracking approach, which is smarter than an exhaustive search: only promising candidate solutions are tracked instead of all possible combinations. the rest of the paper is organized as follows. section 2 covers the implementation details of codseqgen. the results of testing our method on different sets of proteins and a comparison against nullseq are in section 3. the last section concludes the paper. our proposed method utilizes the backtracking approach. backtracking is more efficient than exhaustive search when the search space is very large, because not all combinations of raw components are inspected. this technique is similar to depth-first search of a tree and adds one component to the candidate solution at a time.
as soon as a constraint violation is detected, the algorithm discards the current candidate and backtracks to the parent to seek another candidate solution; see, e.g., [11]. since the reference amino acid sequence is long, it is infeasible to apply the backtracking approach to it directly, because a very large amount of memory would be required to generate the coding sequences. to tackle this problem, we devised a trick that permutes the indices of the reference amino acid sequence. random subsets of indices are selected without overlap and backtracked to produce coding sequences whose gc-content is randomly distributed over the whole sequence; fig. 1 depicts our methodology. formally, let p refer to the reference amino acid sequence of length n, and p′ refer to the corresponding dna coding sequence of length 3n. assuming φ is the gc-content of p′, the problem is to generate a random set of synonymous coding sequences, each with gc-content φ, that translate to p. let Φ be the set of indices of p, where Φ = {0, 1, … , n − 1}. maintaining the set of indices makes the amino acids of the reference sequence easy to access. according to the methodology in fig. 1, Φ is divided into smaller subsets, each of size s, with the last subset possibly smaller than s. thus, the total number of subsets is γ = ⌈n/s⌉. each subset g_i is constructed by randomly selecting s indices, where the subsets are disjoint. in other words, the constructed subsets must satisfy three conditions. first, the union of all constructed subsets equals Φ, the set of indices of the reference amino acid sequence p (eq. (1)). second, the sum of all subsets' sizes equals the length of p (eq. (2)). finally, the subsets are disjoint (eq. (3)).

$\bigcup_{i=1}^{\gamma} g_i = \Phi$ (1)
$\sum_{i=1}^{\gamma} |g_i| = n$ (2)
$g_i \cap g_j = \emptyset, \quad i \neq j$ (3)

setting s = 10 means that an array of size 59,049 (3^10, about three synonymous codons per amino acid) is required on average to hold the coding subsequences.
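the partition of the index set into disjoint random subsets can be sketched as follows. this is a minimal python illustration, not the published implementation: the function name `partition_indices` and the use of a single shuffle to draw the subsets are assumptions made for the sketch.

```python
import math
import random


def partition_indices(n, s, rng=None):
    """Split indices 0..n-1 into disjoint random subsets of size s.

    The last subset holds the remaining n mod s indices when s does
    not divide n.
    """
    rng = rng or random.Random()
    indices = list(range(n))
    rng.shuffle(indices)  # one random permutation of the index set
    return [indices[i:i + s] for i in range(0, n, s)]


subsets = partition_indices(n=23, s=10, rng=random.Random(0))

# the three conditions stated in the text
assert sorted(i for g in subsets for i in g) == list(range(23))  # union
assert sum(len(g) for g in subsets) == 23                        # sizes
assert len(subsets) == math.ceil(23 / 10)                        # count

print([len(g) for g in subsets])  # [10, 10, 3]
print(3 ** 10)                    # 59049
```

shuffling once and slicing guarantees disjointness by construction, so no explicit check for overlapping indices is needed while the subsets are built.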
the worst case occurs when all the indices in the subset point to leucine, serine, or arginine, each of which has six synonymous codons. in this case, the search space is large: 6^10 = 60,466,176 coding subsequences. with backtracking and the gc-content constraint, however, this large search space is pruned to about 100,000, which is feasible within memory limitations. with more complicated constraints such as nucleotide composition, the search space is further pruned to about 10,000. our algorithm reads the amino acid symbols one by one from left to right; each time an amino acid is replaced with one of its synonymous codons, the gc-content constraint is checked. if the total gc-content so far for subset g_i is greater than c_i and the end of the sequence has not been reached, the current path is discarded and the algorithm backtracks to find a more promising path. fig. 2 illustrates the backtracking with a tree of height s + 1. the nodes at level 2 represent the codons of the first amino acid in the subset, and the variable c_i refers to the gc-content so far, where each node has its own c_i that accumulates the gc-content of the path from the second level to the node. algorithms 1 and 2 further elaborate our method. note that the total gc-content of all subsets equals φ. in lines 3-13 of algorithm 1, the gc-content is provided either by an input coding sequence that corresponds to the primary amino acid sequence or by a gc-content value in the interval (0, 100). in the second case, an initial coding sequence is created and adjusted to contain the desired gc-content. unlike stochastic methods, codseqgen is easy to modify to accommodate more complicated constraints, such as nucleotide composition and di-nucleotide profile, by simply replacing the gc-content constraint with a new constraint. we tested codseqgen on a set of proteins from various species: human, saccharomyces cerevisiae s288c, bovine papular stomatitis virus, zika virus, sars coronavirus, and hantavirus.
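the backtracking search over one small subset, as described above, can be sketched as follows. this is a hedged python illustration and not the paper's algorithms 1 and 2: the names `CODONS`, `gc`, and `backtrack` are hypothetical, the codon table is the standard genetic code, the target is expressed as a count of g/c bases, and the extra lower-bound pruning (`max_rest`) is an addition assumed for this sketch.

```python
import random
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# standard genetic code: amino acid -> list of synonymous codons
CODONS = {}
for codon, aa in zip(("".join(c) for c in product(BASES, repeat=3)), AMINO):
    CODONS.setdefault(aa, []).append(codon)


def gc(codon):
    """Number of G or C bases in a codon."""
    return sum(base in "GC" for base in codon)


def backtrack(protein, target_gc):
    """Yield every coding sequence with exactly target_gc G/C bases."""
    # largest GC count still reachable from position i (extra pruning)
    max_rest = [0] * (len(protein) + 1)
    for i in range(len(protein) - 1, -1, -1):
        max_rest[i] = max_rest[i + 1] + max(gc(c) for c in CODONS[protein[i]])

    def extend(i, gc_so_far, chosen):
        if gc_so_far > target_gc or gc_so_far + max_rest[i] < target_gc:
            return  # constraint violated: backtrack to the parent node
        if i == len(protein):
            yield "".join(chosen)
            return
        for codon in CODONS[protein[i]]:
            chosen.append(codon)
            yield from extend(i + 1, gc_so_far + gc(codon), chosen)
            chosen.pop()

    yield from extend(0, 0, [])


solutions = list(backtrack("MLK", 3))
print(len(solutions))  # 5
print(random.choice(solutions))
```

for the toy peptide "mlk" only 5 of the 1 × 6 × 2 = 12 synonymous sequences contain exactly three g/c bases; drawing one of the surviving solutions at random mirrors the random selection step of the method.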
table 1 lists these proteins with their ncbi accession numbers and their sizes in terms of the number of amino acids. for each protein, the gc-content is computed for the reference coding sequence. for instance, the reference coding sequence of human titin protein has a gc-content of 44.04%, whereas zika virus's reference coding sequence has 49.60%, and so on. in addition, we measured the possible range of the gc-content for each amino acid sequence. this range indicates the minimum and maximum gc-content that any coding sequence for that protein can have. as an example, the primary amino acid sequence of human titin protein has the possible range 30.42-66.67% of gc-content. therefore, it is infeasible to create coding sequences of gc-content less than 30.42% or over 66.67%. we ran both tools, codseqgen and nullseq [10], to generate 1000 coding sequences given the primary amino acid sequence and the target gc-content of the reference coding sequence. results in table 2 show that codseqgen is more accurate, generating 1000 sequences with exactly the desired gc-content of the reference coding sequence, while nullseq produced coding sequences that do not exactly match the targeted gc-content. for example, the coding sequences that nullseq generated for titin protein do not match the target gc-content of 44.04% of its reference coding sequence but rather vary in the range 43.54-44.68%. scatter plots in fig. 3 and fig. 4, for human titin protein and zika virus protein respectively, show the gc-contents of the 1000 random sequences generated by both tools. these figures show that the sequences generated by codseqgen satisfy the gc-content constraint: all sequences have the precise gc-contents of 44.04% and 49.60%, as shown by the red vertical lines in fig. 3 and fig. 4, respectively.
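the feasible gc-content range of an amino acid sequence, as used above for titin, follows from choosing the gc-poorest or gc-richest synonymous codon at every position. the sketch below assumes the standard genetic code and ignores the stop codon; the function name `gc_range` is hypothetical and the toy peptides are not taken from table 1.

```python
from itertools import product

# standard genetic code, codons enumerated in TCAG order
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = {}
for codon, aa in zip(("".join(c) for c in product(BASES, repeat=3)), AMINO):
    CODONS.setdefault(aa, []).append(codon)


def gc(seq):
    """Number of G or C bases in a nucleotide string."""
    return sum(base in "GC" for base in seq)


def gc_range(peptide):
    """Return (min %, max %) gc-content over all synonymous coding sequences."""
    lowest = sum(min(gc(c) for c in CODONS[aa]) for aa in peptide)
    highest = sum(max(gc(c) for c in CODONS[aa]) for aa in peptide)
    total_bases = 3 * len(peptide)
    return 100 * lowest / total_bases, 100 * highest / total_bases


low, high = gc_range("MK")   # M = ATG; K = AAA or AAG
assert abs(low - 100 / 6) < 1e-9 and abs(high - 200 / 6) < 1e-9
```

any target gc-content outside the returned interval cannot be met by a synonymous sequence, which is why such targets are infeasible.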
since codseqgen generates random synonymous coding sequences having the exact gc-content, it is important to test how gc-content is distributed over complete sequences. this was tested by dividing the complete sequence into subsets and then computing the gc-contents for all subsets. thus, standard deviation (std) and average (avg) are computed for each generated sequence. finally, a range is computed over 1000 random coding sequences. the same experiment was repeated for 1000 random coding sequences generated by nullseq. table 3 lists the ranges of std and avg for both tools over 1000 generated coding sequences in addition to the std and avg for the reference coding sequences. since there is a single reference coding sequence for each protein, no ranges are shown for it. a smaller std means that all subsets have similar gc-content and a larger std means that gc-content is non-uniformly distributed over subsets. thus, larger std and avg ranges mean that a greater diversity of random sequences is produced. both tools generate sequences with good std and avg ranges. however, random sequences by nullseq have overall gc-contents smaller or greater than the targeted gc-content, which causes the ranges of std and avg for the subsets' gc-content to be a bit larger. graphically, figs. 5-6 depict the gc-content distribution for subsets over five randomly selected sequences on titin protein by codseqgen (fig. 5) and nullseq (fig. 6). figs. 7-8 show another example. as the results show, codseqgen is a powerful tool for achieving the exact target gc-content in all generated synonymous coding sequences, with the gc-content distributed more randomly over the whole generated sequences. moreover, our tool can be adjusted easily when we have more complicated constraints and it can be useful in generating random rna structures with pre-specified constraints. in this paper, we presented a tool called codseqgen.
the proposed tool produces synonymous coding sequences following a pre-specified amino acid sequence and desired gc-content. our approach uses the backtracking technique to produce exact coding subsequences and these subsequences are aggregated to produce the desired synonymous coding sequences. the proposed tool will help researchers in identifying accurate protein-dna binding sites and producing sequences with more complicated constraints.

table 2. the gc-content achieved by nullseq [10] vs codseqgen (our approach) for 1000 generated synonymous coding sequences based on primary reference coding sequences.

codseqgen executable is available for free download at: https://github.com/abdulrakeeb/codseqgen

[1] gc-content evolution in bacterial genomes: the biased gene conversion hypothesis expands
[2] codon pair bias is a direct consequence of dinucleotide bias
[3] codon influence on protein expression in e. coli correlates with mrna levels
[4] codon usage affects the structure and function of the drosophila circadian clock protein period
[5] the sequence manipulation suite: javascript programs for analyzing and formatting protein and dna sequences
[6] fabox: an online toolbox for fasta sequences
[7] genrgens: software for generating random genomic sequences and structures
[8] the 'effective number of codons' used in a gene
[9] accounting for background nucleotide composition when measuring codon usage bias
[10] nullseq: a tool for generating random coding sequences with desired amino acid and gc contents
[11] introduction to the design and analysis of algorithms

the authors extend their appreciation to the deanship of scientific research at king saud university for funding this work through research group no. rg-1439-067.
key: cord-267500-x3u9i1vq authors: rose, rebecca; constantinides, bede; tapinos, avraam; robertson, david l; prosperi, mattia title: challenges in the analysis of viral metagenomes date: 2016-08-03 journal: virus evol doi: 10.1093/ve/vew022 sha: doc_id: 267500 cord_uid: x3u9i1vq genome sequencing technologies continue to develop with remarkable pace, yet analytical approaches for reconstructing and classifying viral genomes from mixed samples remain limited in their performance and usability. existing solutions generally target expert users and often have unclear scope, making it challenging to critically evaluate their performance. there is a growing need for intuitive analytical tooling for researchers lacking specialist computing expertise and that is applicable in diverse experimental circumstances. notable technical challenges have impeded progress; for example, fragments of viral genomes are typically orders of magnitude less abundant than those of host, bacteria, and/or other organisms in clinical and environmental metagenomes; observed viral genomes often deviate considerably from reference genomes demanding use of exhaustive alignment approaches; high intrapopulation viral diversity can lead to ambiguous sequence reconstruction; and finally, the relatively few documented viral reference genomes compared to the estimated number of distinct viral taxa renders classification problematic. various software tools have been developed to accommodate the unique challenges and use cases associated with characterizing viral sequences; however, the quality of these tools varies, and their use often necessitates computing expertise or access to powerful computers, thus limiting their usefulness to many researchers. in this review, we consider the general and application-specific challenges posed by viral sequencing and analysis, outline the landscape of available tools and methodologies, and propose ways of overcoming the current barriers to effective analysis. 
in the last decade, at least seven separate viral outbreaks have caused tens of thousands of human deaths (woolhouse, rambaut, and kellam 2015), and the ever-increasing density of livestock, rate of habitat destruction, and extent of human global travel provide a fertile environment for new pandemics to emerge from host switching events (delwart 2007; fancello, raoult, and desnues 2012), as was the case for sars, ebola, middle east respiratory syndrome (mers), and influenza-a (h1n1) (castillo-chavez et al. 2015). at present we have a limited grasp of the extent of viral diversity present in the environment: the 2014 database release from the international committee for the taxonomy of viruses classified just 7 orders, 104 families, 505 genera, and 3286 species (http://www.ictvonline.org/virustaxonomy.asp); yet, one study estimated that there are at least 320,000 virus species infecting mammals alone (anthony et al. 2013).

© the author 2016. published by oxford university press. this is an open access article distributed under the terms of the creative commons attribution non-commercial license (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. for commercial re-use, please contact journals.permissions@oup.com

high throughput (or so-called 'next generation') sequencing of viruses during the most recent outbreaks of mers in saudi arabia (gire et al. 2014; carroll et al. 2015; park et al. 2015) and ebola in west africa (quick et al. 2016) has facilitated rapid identification of transmission chains, rates of viral evolution, and evidence of the zoonotic origin of these outbreaks. access to such information during initial stages of an outbreak would offer invaluable insight into when, where, and how an epidemic might emerge, informing intervention and mitigation measures or even stopping it altogether.
a major step towards this goal is therefore to identify existing zoonotic and environmental pathogens with pandemic potential. this is a significant undertaking, demanding considerable investment and close collaboration between government, ngos and academia, for example, the usaid program predict http://www.vetmed.ucdavis.edu/ohi/predict/index.cfm, as well as on the ground surveillance by local authorities and scientists in areas of the world most at risk. the characterization of unknown viral entities in the environment is now possible with modern sequencing; however, current tooling for exploiting these data represents a practical and methodological bottleneck for effective data analysis. practically, most available software tools are inaccessible to the majority of potential users, demanding expertise and computing resources often lacked by the researchers from diverse backgrounds involved in sample collection, sequencing, and analysis. there is a need for robust and intuitive analytical tools without requirements for fast internet connectivity, which may be unavailable in remote or developing regions. more fundamentally, the intended scope of published analytical tools and workflows is often less than clear, and given the diverse applications of viral sequencing, it can be difficult to gauge the relevance of newly published tools without first testing them. for example, a fast sequence classifier might fail entirely to detect a novel strain of a well-characterized virus, and equally might perform well with illumina sequences yet deliver poor results for data generated with the ion torrent platform. furthermore, results arising from these analyses should be replicable, intelligible, and useful to the end user, with provision for quality control and error management. software tools that target expert users should be tested, documented and robustly distributed as packages or containers so as to streamline the processes of installation and generating results. 
methodologically, most genomic sequence analysis software is not well suited for viral genomes. generic tools that are able to address the challenges posed by viral sequences are often applicable only in limited circumstances. choosing between approaches is made difficult by an abundance of disparate yet functionally equivalent methodologies and, in general, a lack of rigorous benchmarks for viral datasets. while there is much ongoing research in this area, both the sensitive detection of previously characterized viruses and viral discovery remain key challenges open for innovation. here we survey the landscape of available approaches for analyzing both known and unknown viruses within genomic and metagenomic samples, with focus on their practical and methodological suitability for use by a broad spectrum of researchers seeking to characterize viral metagenomes.

within metagenomes the proportion of viral nucleic acids is typically far lower than that of host or other microbes, limiting the amount of signal available for analysis after sequencing. to (ruby, bellare, and derisi 2013) . alternatively, pcr amplification may be used to generate an abundance of specific viral sequences present in a sample, a widely used strategy which was employed in the identification and analysis of mers coronavirus (zaki et al. 2012; cotten et al. 2013, 2014), although effective primer design can be challenging in the presence of high genomic diversity in the target viral species. conversely, an excess of sequencing coverage can lead to the construction of overly complex and unwieldy de novo assembly graphs in the presence of high genomic diversity, reducing assembly quality. using in silico normalisation (crusoe et al. 2015), excess coverage may be reduced by discarding sequences containing redundant information.
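the idea behind in silico normalisation can be sketched as follows: a read is kept only while the k-mers it contains are still rare among the reads kept so far. this is a simplified illustration of median k-mer abundance filtering and not the implementation cited above; the function names, the exact in-memory counter, and the values of `k` and `cutoff` are assumptions chosen for a toy example.

```python
from collections import Counter
from statistics import median


def kmers(seq, k):
    """All overlapping subsequences of length k."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def normalise(reads, k=4, cutoff=2):
    """Keep a read only if its median k-mer count so far is below `cutoff`."""
    counts = Counter()
    kept = []
    for read in reads:
        read_kmers = kmers(read, k)
        if not read_kmers:
            continue                      # read shorter than k
        if median(counts[x] for x in read_kmers) < cutoff:
            kept.append(read)
            counts.update(read_kmers)     # only kept reads contribute coverage
    return kept


reads = ["ACGTACGT"] * 5 + ["TTTTGGGG"]
kept = normalise(reads)
# the highly covered read is capped, the rare read survives
assert kept == ["ACGTACGT", "ACGTACGT", "TTTTGGGG"]
```

redundant copies of abundant sequences are dropped while low-coverage sequences are retained, which is the property that makes the reduced dataset cheaper to assemble.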
this approach increases analytical efficiency when dealing with high coverage sequence data, and we have shown that it can benefit de novo assembly of viral consensus sequences. another in silico strategy for increasing analytical efficiency by discarding unneeded data is to filter sequences from known abundant organisms through alignment with one or more reference genomes using an aligner or specialist tool (approaches reviewed in daly et al. 2015) . there are several sequencing technologies in widespread use that are capable of reading hundreds of thousands to billions of dna sequences per run (reuter, spacek, and snyder 2015) . the current market leader, illumina, manufactures instruments capable of generating billions of 150 base pair (bp) paired end reads (see 'glossary') per run, with read lengths of up to 300 bp. the illumina short read platform is widely used for analyses of viral genomes and metagenomes, and, given sufficient sequencing coverage, enables sensitive characterization of lowfrequency variation within viral populations (e.g. hiv resistance mutations as low as 0.1% (li et al. 2014) ). ion torrent (thermofisher) is capable of generating longer reads than illumina at the expense of reduced throughput and a higher rate of insertion and deletion (indel) error (eid et al. 2009 ). single molecule real-time sequencing commercialized by pacific biosciences (pacbio) produces much longer (>10 kbp) reads from a single molecule without clonal amplification, which eliminates the errors introduced in this step. however, this platform has a high (10%) intrinsic error rate, and remains much more expensive than illumina sequencing for equivalent throughput. the nanopore platform from oxford nanopore technologies, which includes the pocket sized minion sequencer, also implements long read single molecule sequencing, and permits truly real-time analysis of individual sequences as they are generated. 
although more affordable than pacbio single molecule sequencing, the nanopore platform also suffers from high error rates in comparison with illumina (reuter, spacek, and snyder 2015). however, the technology is maturing rapidly and has already demonstrated potential to revolutionize pathogen surveillance and discovery in the field, as well as enabling contiguous assembly of entire bacterial genomes at relatively low cost (feng et al. 2015; quick et al. 2015; hoenen et al. 2016). hybrid sequencing strategies using both long and short reads leverage the ability of long reads to resolve repetitive dna regions while benefitting from the high accuracy of short reads, at the expense of additional sequencing, library preparation and data analysis (madoui et al. 2015).

the reconstruction of sequencing reads into full length genes and genomes can be performed by means of either reference-based alignment or de novo assembly, a decision dependent on experimental objectives, read length, quality and data complexity. in reference-based approaches, reads are mapped to similar regions of a supplied template genome, a well-studied and computationally efficient process implemented with a suffix array index of the reference genome. in contrast, de novo assembly is computationally exhaustive but important in cases where either a target genome is poorly characterized or reconstruction of genomes of a priori unknown entities in metagenomes is sought, such as in surveillance studies. for short read data, the increased sequence length afforded by assembly can be necessary to distinguish members of highly conserved gene families from one another. assembly is also widely used for generating whole genome consensus sequences to facilitate analyses of viral variation, and is a typical starting point for analyses of diverse populations of well-characterized viruses. even where long reads are available, assembly plays an important role in mitigating the high error rates associated with single molecule sequencing technologies, yielding accurate consensus sequences from inaccurate individual reads.

glossary
contigs: contiguous nucleotide sequences assembled from multiple overlapping reads.
coverage: the number of times a genome (or part thereof) has been sequenced.
de bruijn graph: a network of nodes and edges, where each edge represents a k-mer found in the collection of reads, and each node represents either the prefix or suffix of the k-mer.
de novo assembly: reconstruction of short sequences into longer sequences (or contigs), without use of a reference sequence.
digital signal processing data transformation: analytical techniques for transforming sequential data into a domain representative of data features.
discrete fourier transform: a spectral analysis technique for identifying sine and cosine frequency components in numerical signal data.
discrete wavelet transform: a spectral analysis technique for decomposing data to its frequency and spatial components.
k-mer: a subsequence of length k. many genomic analyses involve decomposition of sequences into all possible subsequences of a specified length k.
numerical sequence representation: numerical mapping of nucleotide sequences, permitting the application of signal processing transformation approaches.
paired-end reads: reads generated from both 5′ and 3′ ends of the same dna molecule. depending on the length of the molecule and that of the reads, these pairs may or may not overlap in the middle.
read overlap graphs: a network of nodes and edges, where each edge represents a read and each vertex represents an overlap between two nodes.
reference-based alignment: orientation/alignment of reads with respect to a specified reference sequence.
scaffolds: dna sequences comprising contigs with gaps between them, often generated using read pairing information.
suffix array: a sorted array of all suffixes of a string, such as a dna sequence, enabling efficient sequence comparison.
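to make the suffix array index used in reference-based mapping concrete, the sketch below builds one for a short string and uses binary search to locate exact matches of a read. it is a naive illustration only: the function names are hypothetical, construction by sorting whole suffixes does not scale to real genomes, and practical read mappers rely on compressed variants of this index and tolerate mismatches.

```python
def suffix_array(text):
    """Start positions of all suffixes of `text`, in lexicographic order."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find(text, sa, pattern):
    """Return sorted start positions where `pattern` occurs exactly in `text`."""
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:                          # leftmost suffix whose prefix >= pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    hits = []
    while lo < len(sa) and text[sa[lo]:sa[lo] + m] == pattern:
        hits.append(sa[lo])                 # matching suffixes are adjacent in the array
        lo += 1
    return sorted(hits)


reference = "GATTACATTA"
index = suffix_array(reference)
assert find(reference, index, "TTA") == [2, 7]
assert find(reference, index, "GGG") == []
```

because all suffixes sharing a prefix sit next to each other in the sorted array, each lookup needs only a logarithmic number of comparisons followed by a short scan.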
modern de novo assemblers generally leverage either de bruijn graphs or read overlap graphs as part of the approach known as overlap layout consensus (olc). figure 1 illustrates the differences between the two methods. olc assemblers use the similarity of whole reads in order to construct a graph wherein each read is represented by a node, and subsequently merge overlapping reads into consensus contigs (deng et al. 2015). olc is relatively time and memory intensive, scaling poorly to millions of reads and beyond. however, the fewer, longer reads generated by emerging single molecule sequencing technologies tend to be well suited to olc assembly, which can be easily implemented to tolerate long and noisy sequences (compeau, pevzner, and tesler 2011). older, notable de novo assemblers implementing olc include cap3 (huang and madan 1999) and celera (http://www.jcvi.org/cms/research/projects/cabog/overview/), while mhap (berlin et al. 2015), canu (berlin et al. 2015), and miniasm (li 2016) represent the current state of the art. there also exist a number of olc assemblers intended for use with viral sequences: vicuna was designed for short, nonrepetitive and highly variable reads from a single population (yang et al. 2012), and price (ruby, bellare, and derisi 2013) iteratively assembles low to moderate complexity metagenomes (e.g. runckel et al. 2011; grard et al. 2012) using a similar algorithm to the actively developed consensus assembler iva (hunt et al. 2015), which like vicuna is designed for single virus populations rather than metagenomes (see table 1 for additional details on programs). a de bruijn or k-mer graph represents a set of reads in terms of its k-mer composition, where k-mers are subsequences of a length k specified by the user. each k-mer is assigned to an edge in a graph, where the nodes are the prefixes and suffixes of length k − 1 of the k-mer.
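the de bruijn graph definition above translates almost directly into code: every k-mer becomes an edge from its (k − 1)-length prefix to its (k − 1)-length suffix. the sketch below is a toy construction under that definition; the function name `de_bruijn` is hypothetical and no error correction or coverage tracking is attempted.

```python
from collections import defaultdict


def de_bruijn(reads, k):
    """Map each (k-1)-mer node to the set of (k-1)-mer nodes it points to."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # edge: prefix of the k-mer -> suffix of the k-mer
            graph[kmer[:-1]].add(kmer[1:])
    return graph


graph = de_bruijn(["ACGT", "CGTA"], k=3)
# the k-mer CGT occurs in both reads but is stored only once
assert dict(graph) == {"AC": {"CG"}, "CG": {"GT"}, "GT": {"TA"}}
```

storing each distinct k-mer once, regardless of how many reads contain it, is what keeps memory use low when coverage is high.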
the assembler identifies the path through the graph in which each edge is visited only once (reviewed in compeau, pevzner, and tesler 2011) . de bruijn graphs are much more efficient to construct than overlap graphs and are suited to large numbers of short reads, and where coverage is high, since redundant k-mers occupy negligible random access memory (ram). however, with this efficiency comes a lack of error tolerance in identifying overlaps, less tolerance of repeated sequences in comparison to overlap graphs, and a loss of read coherence, meaning that k-mers originating from different reads may be co-assembled. examples of assemblers using de bruijn graphs include soapdenovo (luo et al. 2012 ), allpaths figure 1 . two widely used methodologies in de novo assembly of short reads. reads are not represented explicitly within a de bruijn graph; they are instead decomposed into distinct subsequence 'words' of length k, or k-mers, which can be linked together via overlapping k-mers to create an assembly graph. in olc, a pairwise comparison of all reads is performed, identifying reads with overlapping regions. these overlaps are used to construct a read graph. next, overlapping reads are bundled into aligned contigs in what is referred to as the layout step, before finally the most likely nucleotide at position is determined through consensus. this figure is simplified to demonstrate the theory for the assembly of single genomes; note that the process has additional complexities for the reconstruction of metagenomes. (butler et al. 2008) , spades (bankevich et al. 2012) , and abyss (simpson et al. 2009 ). typical de novo assemblers are designed to reconstruct genomes with uniform sequencing coverage across their length. this is problematic for metagenomes (including viromes) where coverage typically varies considerably both among different genomes and within individual genomes. to address this problem, dedicated metagenome assemblers have been developed. 
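the path that visits each edge of a de bruijn graph exactly once is an eulerian path, and a small sketch of finding one is given below. it assumes error-free k-mers from a single sequence that form a connected graph containing such a path; the function name `eulerian_assembly` is hypothetical, and real assemblers add graph simplification steps that are not shown.

```python
from collections import defaultdict


def eulerian_assembly(kmers):
    """Spell the sequence along a path that uses every k-mer edge exactly once."""
    graph = defaultdict(list)
    indegree = defaultdict(int)
    for kmer in kmers:
        graph[kmer[:-1]].append(kmer[1:])
        indegree[kmer[1:]] += 1
    # start at the node with one more outgoing than incoming edge, if any
    start = next((n for n in list(graph) if len(graph[n]) - indegree[n] == 1),
                 next(iter(graph)))
    stack, path = [start], []
    while stack:                          # iterative hierholzer traversal
        node = stack[-1]
        if graph[node]:
            stack.append(graph[node].pop())
        else:
            path.append(stack.pop())
    path.reverse()
    return path[0] + "".join(node[-1] for node in path[1:])


assert eulerian_assembly(["TTA", "GAT", "ACA", "ATT", "TAC"]) == "GATTACA"
```

when a sequence contains repeats longer than k − 1, several different eulerian paths exist, which illustrates why de bruijn assemblers tolerate repeats poorly.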
omega (haider et al. 2014 ) is an olc-based method that uses a minimum cost flow analysis of the olc graph to generate initial contigs, merging these to create longer contigs and scaffolds using mate-pair information. genovo (laserson, jojic, and koller 2011) is another olcbased method that generates a probabilistic model for the dataset and subsequently uses an iterative approach to reconstruct the most likely genome contigs. megahit (li et al. 2015) prioritizes speed, leveraging a succinct de bruijn graph to rapidly reconstruct high complexity metagenomes, such as those of soil or seawater, on a single computer. noteworthy is the iterative de bruijn graph assembler spades, which although not initially intended for metagenome assembly, has been widely adopted for its effectiveness in assembling variable coverage metagenomes of limited complexity. metaspades (nurk et al. 2016 ) is a metagenome-specific release of the spades pipeline with refinements to its graph simplification and repeat resolution algorithms, counterintuitively capable of leveraging rare strain information so as to improve its consensus reconstruction capabilities. other de bruijn graph metagenome assemblers based on their genomic counterparts include ray-meta (boisvert et al. 2012) , metamos (treangen et al. 2013) , metavelvet (namiki et al. 2012; afiahayati, sato, and sakakibara 2015) , and idba-ud (peng et al. 2012) . for example, unlike the genome assembler velvet, metavelvet's de bruijn graph is decomposed into many subgraphs (using coverage difference and graph connectivity), and scaffolds are built independently for each subgraph. metavelvet-sl addresses limitations with metavelvet, using supervised learning to detect and classify chimeric nodes within the de bruijn graph. 
idba-ud partitions a de bruijn graph into isolated components, constructs a multiple alignment, and subsequently identifies variation within these partitions using multiple depth relative thresholds to remove erroneous k-mers. ray meta (boisvert et al. 2012 ) extends the massively distributed assembly model of ray to variable coverage metagenomes, while metamos (treangen et al. 2013 ) is both a metagenomic extension and successor to the amos genome assembler. we recently proposed a method based on numerical sequence representations and digital signal processing data transformation (spdt) approaches to reduce the size of working datasets, permitting fast and sensitive read alignment and de novo assembly of diverse viral populations (tapinos et al. 2015) . spdt methods, such as the discrete fourier transform (dft) (agrawal, faloutsos, and swami 1993) , and discrete wavelet transform (dwt) (percival and walden 2006) (fig. 2) , are used to reduce sequences into lower dimensional space, preserving only prominent data characteristics. analysis is subsequently performed with these lower dimensionality transformations, enabling faster data comparison. since spdt methodologies such as the fourier and wavelet transforms are applicable only to numerical sequences, nucleotide sequences must first be numerically transformed with one of several techniques including real number representations (chakravarthy et al. 2004 ), complex number representations (anastassiou 2001) , the dna walk (lobry 1996) , and the voss method (voss 1992) . although metagenome assemblers generally outperform single genome assemblers in reconstructing different genomes simultaneously, the complexity of this task stipulates their tendency to collapse variation at or beneath strain level into consensus sequences. 
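as an illustration of the numerical representation and transformation steps mentioned above, the sketch below maps a nucleotide sequence to four binary indicator signals (the voss representation) and computes a naive discrete fourier transform, keeping only the first few coefficients. it is not the method of tapinos et al.; the function names, the summed power spectrum, and the choice to keep the leading coefficients are assumptions made for the example.

```python
import cmath


def voss(seq):
    """Four binary indicator signals, one per nucleotide."""
    return {base: [1.0 if x == base else 0.0 for x in seq] for base in "ACGT"}


def dft(signal):
    """Naive O(n^2) discrete fourier transform."""
    n = len(signal)
    return [sum(x * cmath.exp(-2j * cmath.pi * f * t / n)
                for t, x in enumerate(signal))
            for f in range(n)]


def reduced_spectrum(seq, keep):
    """Power summed over the four signals, truncated to `keep` coefficients."""
    spectra = [dft(signal) for signal in voss(seq).values()]
    power = [sum(abs(s[f]) ** 2 for s in spectra) for f in range(len(seq))]
    return power[:keep]


features = reduced_spectrum("ACGT" * 3, keep=4)
assert len(features) == 4
assert abs(features[1]) < 1e-6          # no period-12 component
assert abs(features[3] - 36) < 1e-6     # strong period-4 component
```

comparing short feature vectors like `features` instead of full-length sequences is the sense in which such transforms reduce dimensionality.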
even to this end, their effectiveness may be limited as a consequence of extreme variation within specific rna virus populations due to mutation and recombination, and low and/or uneven sequencing coverage across a particular genome. furthermore, it should be noted that de novo assembly is particularly sensitive to the quality of input sequences, meaning that problems during sample extraction, enrichment and library preparation can be highly detrimental to downstream analyses. of key importance therefore are quality control methods for detecting, and where appropriate correcting, problems associated with contamination (darling et al. 2014; orton et al. 2015) , primer read-through and low quality reads (reviewed in leggett et al. 2013 ). viral genomes and metagenomes comprising high intraspecific variation can be challenging targets for assembly, giving rise to complex assembly graphs and fragmented assemblies. this is often the case for clinical samples from hiv and hepatitis c patients, in which high rates of mutation and long durations of infection can contribute to extreme population divergence, but can also be observed in environmental samples. where such diversity exists, alignment based probabilistic population reconstruction approaches can be effective, permitting the reconstruction of individual viral variants into 'haplotypes' exceeding read length. this problem has been well studied, and tools such as shorah, qure, and predicthaplo (giallonardo et al. 2014 ) are designed for haplotyping viral populations. shorah (zagordi et al. 2011 ) extracts local alignments of a specified window length, reconstructs haplotypes for each 'cluster' in that window, and removes mutations from sequences in the cluster not matching the reconstructed haplotype using a model-based probabilistic clustering algorithm. qure (prosperi and salemi 2012; prosperi et al. 
2013 ) removes nucleotide substitutions and indels with a poisson model and reconstructs haplotypes using a heuristic algorithm based on a multinomial distribution. both approaches have the advantage of reporting probabilities for the reconstructed haplotypes. predicthaplo is notable for taking into account the read pairing information in illumina data. a limitation of all of these approaches; however, is their reliance upon a single reference sequence with which to perform the initial alignment, a process which assumes a degree of sequence similarity which may not always be observed in diverse regions, such as regions encoding envelope proteins, of rna virus genomes. this can be mitigated through construction of a data-specific template through iterative reference mapping and consensus refinement strategies (archer et al. 2010; b rinda, boeva, and kucherov 2016) . other possibilities for broader utility of these approaches include the use of multiple viral reference sequences, either through consideration of multiple linear sequences or by direct alignment of sequences to a variation graph [https://github.com/vgteam/ vg], an emerging approach for modeling genomic variation. sequence classification is one of the most studied problems in computational biology, and taxonomic assignment is a key objective of metagenome analysis. all classification methods, to some extent, depend upon detecting similarity between a query sequence and a collection of annotated sequences. classification may be undertaken using either unassembled reads or the reconstructed contigs arising from the assembly process. 
the computational requirements of available approaches vary dramatically according to their ability to detect homology in divergent sequences; for example, exact k-mer matching approaches permit rapid sequence classification, yet typically struggle to identify divergent sequences of viral origin, while high-sensitivity protein alignment searches may be prohibitively slow, especially in application to entire sequencing datasets. some of the more contemporary and speed-optimized taxonomic assignment approaches also have high ram requirements, limiting scope for their use with readily available computer hardware. the output of sequence homology search tools is not itself easily interpreted, requiring post-processing in order to yield meaningful classifications. retroactive taxonomic assignment using these results is non-trivial, requiring additional database lookups, for example, to determine a conservative 'lowest common ancestor' (lca) taxon shared by all matches for each query sequence. this kind of complexity necessitates the integration of different tools within application-specific 'pipelines'. viral identification approaches typically depend on similarity searches against a database using an aligner such as blast (altschul et al. 1990). comprehensive databases (e.g. genbank) or smaller custom databases containing, for example, only viral sequences of interest may be used, although the latter can generate misleading results. provide (ghosh et al. 2011) uses virus-specific alignment parameters and thresholds to assign viruses at different taxonomic levels from blast matches to a protein database. virome (wommack et al. 2012) is a multifaceted tool integrating results from searches of several sequence and function databases. megan (huson et al. 2011) is a generally applicable metagenomic classifier, which uses blast results to infer the lca for a given sequence and provides functional analyses through a graphical interface.
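the lowest common ancestor step described above can be sketched with a small parent lookup table. the taxonomy below is a hypothetical toy fragment written for this example, and the function names are assumptions; real pipelines resolve lineages against a full taxonomy database.

```python
# hypothetical toy taxonomy: child -> parent
PARENT = {
    "root": None,
    "viruses": "root",
    "coronaviridae": "viruses",
    "betacoronavirus": "coronaviridae",
    "sars-cov": "betacoronavirus",
    "mers-cov": "betacoronavirus",
    "filoviridae": "viruses",
    "ebolavirus": "filoviridae",
}


def lineage(taxon):
    """Path from a taxon up to the root, most specific first."""
    path = []
    while taxon is not None:
        path.append(taxon)
        taxon = PARENT[taxon]
    return path


def lowest_common_ancestor(taxa):
    """Most specific taxon shared by the lineages of all matched taxa."""
    taxa = list(taxa)
    shared = set(lineage(taxa[0]))
    for taxon in taxa[1:]:
        shared &= set(lineage(taxon))
    # walking up from the first taxon, the first shared node is the deepest one
    return next(t for t in lineage(taxa[0]) if t in shared)


assert lowest_common_ancestor(["sars-cov", "mers-cov"]) == "betacoronavirus"
assert lowest_common_ancestor(["sars-cov", "ebolavirus"]) == "viruses"
```

a query whose matches span distant taxa is therefore assigned conservatively to a high-level taxon rather than to any single species.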
automatic pipelines which combine various homology search strategies to identify a final set of viral reads include virushunter (zhao et al. 2013), a perl script that automates viral identification using blast prior to assembly; metavir (roux et al. 2011), a web application that compares users' datasets to published viral sequences; and virsorter (roux et al. 2015), which identifies prophages and viruses by comparison with custom datasets. with the exception of web applications, however, these are not intuitive tools for the majority of users, requiring manual configuration and installation of software dependencies. furthermore, similarity search approaches are in general extremely resource-intensive, and performing sensitive blast-like database searches with millions of reads is intractable without use of specialist computational resources.

to address this problem, tools have emerged leveraging optimized search algorithms and prebuilt databases so as to increase the tractability of classifying millions of reads. for example, kraken (wood and salzberg 2014) and clark (ounit et al. 2015) are fast exact k-mer matching approaches that use prebuilt databases of viruses, bacteria, human, and fungi, although custom databases may also be built. one codex is a proprietary web-based metagenome analysis platform with an integrated k-mer matching engine (similar to that of kraken) which is fast, very easy to use, and free for academic use (minot, krumm, and greenfield). lambda (hauswedell, singer, and reinert 2014) and diamond (buchfink, xie, and huson 2015) are sensitive and heavily optimized blast-like aligners which leverage alphabet reduction to permit protein searches three to five orders of magnitude faster than blast, offering prebuilt database indexes for common applications.
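
the idea behind exact k-mer matching can be illustrated with a small sketch. it is not the algorithm of kraken or clark: it assumes a toy index that keeps only k-mers unique to one reference and classifies a read by majority vote, and the reference sequences, labels and function names are all hypothetical.

```python
from collections import Counter
from typing import Dict, Iterator, Optional


def kmers(seq: str, k: int) -> Iterator[str]:
    """Yield every overlapping k-mer in `seq`."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def build_index(references: Dict[str, str], k: int) -> Dict[str, str]:
    """Map each k-mer to a label, dropping k-mers shared by several labels."""
    index: Dict[str, str] = {}
    ambiguous = set()
    for label, seq in references.items():
        for kmer in set(kmers(seq, k)):
            if kmer in index and index[kmer] != label:
                ambiguous.add(kmer)
            index[kmer] = label
    for kmer in ambiguous:
        del index[kmer]
    return index


def classify(read: str, index: Dict[str, str], k: int) -> Optional[str]:
    """Assign the label with the most exact k-mer hits, or None if there are none."""
    votes = Counter(index[km] for km in kmers(read, k) if km in index)
    return votes.most_common(1)[0][0] if votes else None


# Hypothetical toy references (not real viral sequences).
REFERENCES = {"virus_a": "ATGCGTACGTTAGC", "virus_b": "ATGCGTTTGACCAG"}
K = 5
INDEX = build_index(REFERENCES, K)

print(classify("CGTACGTTA", INDEX, K))  # exact substring of virus_a
print(classify("CGAACGATA", INDEX, K))  # same region with two substitutions
```

the second read differs from the first at only two positions, yet no 5-mer survives intact, so it is left unclassified. this is the behaviour described above: lookups are fast, but divergent sequences are easily missed.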
although exhaustive blast-like methods can detect homology in divergent sequences, these methods are in general limited by the relatively few validated viral sequences deposited in public databases, the high diversity within viral families which can obscure relatedness, and the lack of a defined set of core genes common to all viruses that can be used to distinguish species (e.g. the 16s gene for bacteria) (fancello, raoult, and desnues 2012). these features make it difficult to assign similarity thresholds for classification that are applicable to all potential viruses in a sample (simmonds 2015). comparison methods that do not rely on sequence similarity include phylopythia (mchardy et al. 2007), which uses nucleotide frequencies to classify reads, and phymm (brady and salzberg 2009), which uses interpolated markov models to find variable length oligonucleotides that characterize species in the ncbi refseq database. although these approaches are less accurate than blast searches, phymmbl (brady and salzberg 2011) combines phymm and blast and outperforms either one on its own. alignment-free comparison approaches, for example, based on dinucleotide frequencies, codon usage patterns, or small but conserved regions of family-wide ubiquitous genes, may be more robust to the limitations of the database than sequence similarity searches. these features may also reduce the computation required and highlight evolutionary relationships otherwise obscured by high sequence variability.

a fundamental challenge in the classification of viral sequences with any of these methods remains their limited representation within curated sequence databases. while the rate at which new viruses are being added to ncbi's refseq collection has increased considerably, from a yearly average of 0.34 species/day in 2010 to 2.5 species/day in 2015 (fig. 3), our documented understanding of the extent of viral diversity remains superficial (anthony et al. 2013).
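
as a rough illustration of an alignment-free comparison based on dinucleotide frequencies, the sketch below builds a 16-element frequency profile per sequence and compares profiles by euclidean distance. the choice of distance, the function names and the example sequences are assumptions made for illustration; none of the tools cited above is claimed to work this way.

```python
from itertools import product
from math import sqrt
from typing import List

# All 16 dinucleotides in a fixed order, so profiles are comparable.
DINUCLEOTIDES = ["".join(p) for p in product("ACGT", repeat=2)]


def dinucleotide_profile(seq: str) -> List[float]:
    """Return relative frequencies of the 16 dinucleotides in `seq`."""
    seq = seq.upper()
    counts = {d: 0 for d in DINUCLEOTIDES}
    total = 0
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair in counts:  # skip pairs containing ambiguous bases such as N
            counts[pair] += 1
            total += 1
    return [counts[d] / total if total else 0.0 for d in DINUCLEOTIDES]


def euclidean_distance(p: List[float], q: List[float]) -> float:
    """Distance between two frequency profiles of equal length."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


# Hypothetical toy sequences with very different composition.
at_rich = dinucleotide_profile("ATATATATCG")
gc_rich = dinucleotide_profile("GCGCGCGCAT")
print(euclidean_distance(at_rich, gc_rich))
```

because the profile has a fixed length regardless of sequence length, no alignment is needed and the comparison cost grows only linearly with the input.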
reads of true viral origin are therefore liable to be missed in many cases. the rate of database growth also highlights the need to maintain frequently updated search indexes for sequence classification, construction of which often demands specialist servers equipped with hundreds of gigabytes of ram. even if up-to-date indexes are maintained inside a public repository, their file sizes are substantial, demanding users have access to a fast internet connection. consequently, complete outsourcing of sequence classification to remote web services is a compelling prospect for those with adequate internet connections but without powerful computing hardware, increasing scope for conducting analyses with portable computers.

we see several barriers to realizing the goal of active, on-the-ground surveillance and early detection of viruses with epidemic potential.

1. the emergence of virus-specific assembly and metagenomic tools is a relatively recent phenomenon, with many of the methodologies in use today repurposing one or more existing algorithms. these tools mostly target a small audience of expert users and, as with most research software, decay after initial release due to a lack of ongoing funding, poor software development practices and/or authors' change of circumstances (duck et al. 2016). there is a need for a better balance between research software presenting novel methodologies and sustainably developed, documented and tested software distributed through robust and user friendly channels such as package managers so as to increase the useful life of viral informatics software. researchers and granting agencies should consider the importance of this step and allocate resources accordingly.

2. democratisation of routine analyses through development of user friendly, locally installable software and remote web services is critical. preconfigured cloud virtual machines offer a convenient, low cost way to run analyses, yet must permit straightforward sequence database and software version updates so as to remain relevant after their initial release.

3. maintaining up to date indexes of large sequence databases is a problem all classification tools must address, stipulating access either to powerful computers for index construction or the ability to download the prebuilt indexes over a fast connection. furthermore, classification of viral sequences is critically dependent upon the quality of curated viral databases such as refseq, to which submitting newly discovered sequences can be prohibitively time consuming. a solution might involve the creation of a central database containing for any given sequencing project both raw reads as well as filtered, assembled and/or annotated reads, and analysed using a single central pipeline. on a regular basis, the database could report sequences and corresponding metadata for unclassified 'dark matter', which is often discarded and yet is likely to contain sequences belonging to novel pathogens. by combining the dark matter from multiple studies, trends within these unclassified reads may be identified that could lead to greater power to identify new biological entities.

4. benchmarking of software also remains an open problem within the field, which lacks standardized test datasets that are used across multiple studies. often benchmarking datasets are chosen to highlight the advantages of the method under study, and therefore may be quite specific for a given application. thus the field needs to agree upon a set of standard, well-characterized reference datasets for virus-focused studies.

the future of the field is promising, with emerging technologies showing potential to eliminate certain challenges.
single molecule sequencing, for example, permits the sequencing of whole viral genomes as single reads, with forthcoming portable and smartphone operated sequencers promising potentially revolutionary analyses in the field. innovative analytical approaches are constantly being published, and it is evident that the motivation, creativity and expertise needed to meet these challenges exist within the community. broader communication among developers and end users is essential, and in conjunction with well-funded international initiatives directed at this goal, intelligent viral surveillance could soon be realized.

metavelvet-sl: an extension of the velvet assembler to a de novo metagenomic assembler utilizing supervised learning
efficient similarity search in sequence databases
basic local alignment search tool
genomic signal processing
a strategy to estimate unknown viral diversity in mammals
the evolutionary analysis of emerging low frequency hiv-1 cxcr4 using variants through time-an ultra-deep approach
spades: a new genome assembly algorithm and its applications to single-cell sequencing
assembling large genomes with single-molecule sequencing and locality-sensitive hashing
ray meta: scalable de novo metagenome assembly and profiling
phymm and phymmbl: metagenomic phylogenetic classification with interpolated markov models
dynamic read mapping and online consensus calling for better variant detection
fast and sensitive protein alignment using diamond
allpaths: de novo assembly of whole-genome shotgun microreads
temporal and spatial analysis of the 2014-2015 ebola virus outbreak in west africa
beyond ebola: lessons to mitigate future pandemics
autoregressive modeling and feature analysis of dna sequences
how to apply de bruijn graphs to genome assembly
full-genome deep sequencing and phylogenetic analysis of novel human betacoronavirus, emerging infectious diseases
the khmer software package: enabling efficient nucleotide sequence analysis, f1000res
host subtraction, filtering and assembly validations for novel viral discovery using next generation sequencing data
phylosift: phylogenetic analysis of genomes and metagenomes
viral metagenomics, reviews in medical virology
an ensemble strategy that significantly improves de novo assembly of microbial genomes from metagenomic next-generation sequencing data
a survey of bioinformatics database and software usage through mining the literature
real-time dna sequencing from single polymerase molecules
computational tools for viral metagenomics and their application in clinical research
nanopore-based fourth-generation dna sequencing technology
provide: a software tool for accurate estimation of viral diversity in metagenomic samples
full-length haplotype reconstruction to infer the structure of heterogeneous virus populations
genomic surveillance elucidates ebola virus origin and transmission during the 2014 outbreak
a novel rhabdovirus associated with acute hemorrhagic fever in central africa
omega: an overlap-graph de novo assembler for metagenomics
lambda: the local aligner for massive biological data
nanopore sequencing as a rapidly deployable ebola outbreak tool, emerging infectious disease
cap3: a dna sequence assembly program
iva: accurate de novo assembly of rna virus genomes
integrative analysis of environmental sequences using megan4
genovo: de novo assembly for metagenomes
sequencing quality assessment tools to enable data-driven informatics for high throughput genomics
megahit: an ultra-fast single-node solution for large and complex metagenomics assembly via succinct de bruijn graph
minimap and miniasm: fast mapping and de novo assembly for noisy long sequences
comparison of illumina and 454 deep sequencing in participants failing raltegravir-based antiretroviral therapy
a simple vectorial representation of dna sequences for the detection of replication origins in bacteria
soapdenovo2: an empirically improved memory-efficient short-read de novo assembler
genome assembly using nanopore-guided long and error-free dna reads
accurate phylogenetic classification of variable-length dna fragments
fast and sensitive taxonomic classification for metagenomics with kaiju
one codex: a sensitive and accurate data platform for genomic microbial identification
metavelvet: an extension of velvet assembler to de novo metagenome assembly from short sequence reads
metaspades: a new versatile de novo metagenomics assembler
distinguishing low frequency mutations from rt-pcr and sequence errors in viral deep sequencing data
clark: fast and accurate classification of metagenomic and genomic sequences using discriminative k-mers
ebola virus epidemiology, transmission, and evolution during seven months in sierra leone
idba-ud: a de novo assembler for single-cell and metagenomic sequencing data with highly uneven depth
hiv haplotype inference using a propagating dirichlet process mixture model
empirical validation of viral quasispecies assembly algorithms: state-of-the-art and challenges
rapid draft sequencing and real-time nanopore sequencing in a hospital outbreak of salmonella
virsorter: mining viral signal from microbial genomic data
price: software for the targeted assembly of components of (meta) genomic sequence data
temporal analysis of the honey bee microbiome reveals four novel viruses and seasonal prevalence of known viruses, nosema, and crithidia
methods for virus classification and the challenge of incorporating metagenomic sequence data
abyss: a parallel assembler for short read sequence data
alignment by numbers: sequence assembly using compressed numerical representation
metamos: a modular and open source metagenomic assembly and analysis pipeline
evolution of long-range fractal correlations and 1/f noise in dna base sequences
virome: a standard operating procedure for analysis of viral metagenome sequences
kraken: ultrafast metagenomic sequence classification using exact alignments
lessons from ebola: improving infectious disease surveillance to inform outbreak management
de novo assembly of highly diverse viral populations
shorah: estimating the genetic diversity of a mixed sample from next-generation sequencing data
isolation of a novel coronavirus from a man with pneumonia in saudi arabia
identification of novel viruses using virushunter-an automated data analysis pipeline

the virogenesis project receives funding from the european union's horizon 2020 research and innovation program under grant agreement no 634650. bede constantinides receives funding through a biotechnology and biological sciences research council (bbsrc) doctoral training program and avraam tapinos receives funding from a bbsrc project grant, bb/m001121/1. we thank katrina lithgoe and two anonymous reviewers for their helpful edits and suggestions.

conflict of interest: none declared.

key: cord-302798-q0mbngqy authors: ge, junwei; gu, shanshan; cui, xingyang; zhao, lili; ma, dexing; shi, yunjia; wang, yuanzhi; lu, taofeng; chen, hongyan title: genomic characterization of circoviruses associated with acute gastroenteritis in minks in northeastern china date: 2018-06-14 journal: arch virol doi: 10.1007/s00705-018-3908-5 sha: doc_id: 302798 cord_uid: q0mbngqy
mink circovirus (micv), a virus that was newly discovered in 2013, has been associated with enteric disease. however, its etiological role in acute gastroenteritis is unclear, and its genetic characteristics are poorly described. in this study, the role of circoviruses (cvs) in mink acute gastroenteritis was investigated, and the micv genome was molecularly characterized through sequence analysis. detection results demonstrated that micv was the only pathogen found in this infection.
micvs and previously characterized cvs shared genome organizational features, including the presence of (i) a potential stem-loop/nonanucleotide motif that is considered to be the origin of virus dna replication; (ii) two major inversely arranged open reading frames encoding putative replication-associated proteins (rep) and a capsid protein; (iii) direct and inverse repeated sequences within the putative 5ʹ region; and (iv) motifs in rep. pairwise comparisons showed that the capsid proteins of micv shared the highest amino acid sequence identity with those of porcine cv (pcv) 2 (45.4%) and bat cv (batcv) 1 (45.4%). the amino acid sequence identity levels of rep shared by micv with batcv 1 (79.7%) and dog cv (dogcv) (54.5%) were broadly similar to those with starling cv (51.1%) and pcvs (46.5%). phylogenetic analysis indicated that micvs were more closely related to mammalian cvs, such as batcv, pcv, and dogcv, than to other animal cvs. among mammalian cvs, micv and batcv 1 were the most closely related. this study could contribute to understanding the potential pathogenicity of micv and the evolutionary and pathogenic characteristics of mammalian cvs.

circoviruses (cvs) are small nonenveloped icosahedral viruses measuring 16-26 nm in diameter and possessing a single-stranded circular dna genome with a length of 1.7-2.1 kb and a major structural protein [1]. in the current taxonomy release, the family circoviridae is divided into two genera: circovirus and cyclovirus [1, 2]. the most recent release of the universal virus database of the international committee on taxonomy of viruses reported that the genus circovirus includes 29 recognized species whose members infect mammals and birds (ictv, https://talk.ictvonline.org/taxonomy/p/taxonomy-history?taxnode_id=20175768, released in july 2017). cvs have been identified in numerous species associated with various clinical disorders, including asymptomatic infections and lethal diseases [3-5].
in birds, cv infections are associated with lethargy, weight loss, respiratory distress, diarrhea, and poor race performance [6-8]. in geese, these infections cause death. in other animals, these infections induce different clinical signs, including respiratory, vesicular, hemorrhagic, and gastroenteritic manifestations and reproductive failure [3, 9]. cv infections are also related to lymphoid depletion and immunosuppression, which likely increase the severity of secondary infection [10]. mink circovirus (micv) is a novel pathogen that was found in cases of minks with diarrhea in dalian city, liaoning province, china, in 2013 [11]. despite the discovery of cv infection in minks, the pathogenic role of micv in single or polymicrobial infections has not been determined, and the prevalence and economic importance of this virus have yet to be elucidated. in this study, micv was detected and molecularly characterized from the collected feces of minks with acute gastroenteritis in harbin city, heilongjiang province, in northeastern china in march 2015. the complete genome sequences of micv strains were compared with those of other cv strains, including recent cvs detected in birds and animals, to enhance our understanding of their genetic and biological differences.

in march 2015, an outbreak of acute gastroenteritis occurred on three farms under one cooperative in harbin, northeastern china. minks had a sudden onset of gastrointestinal symptoms, and their clinical signs included severe diarrhea, vomiting, lethargy, anorexia, dehydration, and rough fur. their stool was thin or soft and watery. more than 2000 minks were owned by the cooperative in the reproduction period, and the same feed and immune procedures were used. a 5-year vaccination protocol against mink enteritis virus (mev) and canine distemper virus (cdv) was completed. fewer than 10 minks died after 1 week of illness, and most of those that died were malnourished.
the other minks completely recovered within 5-7 days after showing clinical signs. local veterinary clinicians collected three or five diarrheic fecal samples from each farm and immediately submitted them to our laboratory in the animal teaching hospital of the college of veterinary medicine, northeast agricultural university. one mink carcass (after flaying) was frozen at −20 °c to −30 °c after 2 days of storage at 10 °c and sent to our department for necropsy and laboratory investigations. samples from the heart, liver, spleen, lung, kidney, and intestine of the carcass were collected and homogenized in dulbecco's modified eagle's medium supplemented with 5% fetal calf serum and 1000 iu of penicillin, 1000 mg of streptomycin, and 10 mg of amphotericin b per ml. tissue homogenates were clarified by centrifugation at 2500×g for 10 min.

the fecal samples were immediately examined for common enteric bacterial, protozoan, and viral pathogens, including salmonella spp., clostridium spp., campylobacter spp., shiga-toxin-producing escherichia coli and enterotoxigenic e. coli, coccidium spp., cryptosporidium spp., mink aleutian disease virus [12], mev [13], cdv [14], caliciviruses [15], rotaviruses [16, 17], coronaviruses [18], and astroviruses [19].

standardized procedures were carried out to isolate bacteria commonly associated with enteritis in vitro. the samples were plated on 5% sheep blood agar (biocell biotech., co., zhengzhou, china) and cultured aerobically or anaerobically at 37 °c for 48 h for pathogen detection. bacteriological investigations were conducted in accordance with standard biochemical procedures, and bacterial strains were identified using the biolog microbial id system (geneiii, biolog inc. usa). feces or intestinal contents were examined to detect intestinal parasites by zinc sulfate flotation. stools and intestinal sections were subjected to ziehl-neelsen staining to identify cryptosporidium spp.
dna/rna extracts were screened to detect enteric viral pathogens that are common in minks. tissue homogenates or fecal suspensions of each sample were prepared by diluting the feces 1:10 in 0.01 m phosphate-buffered saline (ph 7.2). the suspensions were then vortexed for 30 s and centrifuged at 1000×g for 15 min. the supernatant (200 μl) was used for dna/rna extraction using a tianamp virus dna kit (tiangen biotech. co., beijing) or trizol ls (gibco-brl, life tech.) in accordance with the manufacturer's protocol. nucleic acid templates were stored at −80 °c until use. nucleic acid templates were also extracted from healthy mink feces as a negative control. the primers and the pcr and rt-pcr conditions used to detect viral pathogens have been described elsewhere (table 1).

pcr was performed as described previously to determine the full-length genome sequence of the newly identified micv strain, which we have named "heb15" [20, 21]. the pcr product was analyzed via electrophoresis using an agarose gel and visualized by staining with ethidium bromide. the amplified product from the positive sample was 600 bp long. the complete genome was amplified from the positive sample obtained via pcr using the primers cv-3, cv-4, repf, and repl. the target dna bands were extracted from the gel using a qiagen gel extraction kit in accordance with the manufacturer's instructions. the purified pcr products were cloned into the pmd18-t simple vector (takara, japan), and the plasmids were introduced into e. coli dh5α using a standard transformation technique. at least three independent plasmid clones were analyzed, confirmed, and sequenced by comate biotech. co., changchun, china. sequence reads were assembled using seqman (dnastar inc., madison, wi, usa), and the genome sequence of micv was analyzed with the aid of ncbi (http://www.ncbi.nlm.nih.gov) and embl (http://www.ebi.ac.uk) analysis tools.
multiple alignments of nucleotide and predicted amino acid sequences were performed using clustalw in mega 6.06 [22], and the sequences were compared with those of other members of the family circoviridae, which were extracted from the genbank database. putative open reading frames (orfs) were identified using orf finder (ncbi; http://www.ncbi.nlm.nih.gov/gorf/gorf.html). the nucleotide sequences of different orfs were aligned with the cognate sequences of the reference micv strain dl13 and other cvs through translation-based alignment. the hairpin and stem-loop structures were identified using the mfold web server (http://unafold.rna.albany.edu/?q=mfold). furthermore, direct and inverted repeat sequences within the intergenic regions were determined using an oligonucleotide repeat finder (http://wwwmgs.bionet.nsc.ru/mgs/programs/oligorep/inpform.htm). phylogenetic trees based on the full-length genome sequence and the nucleotide sequences of replicase and capsid protein genes were constructed in mega 6.06 using the neighbor-joining method, with statistical support assessed by bootstrapping with 1000 replicates. other tree-building methods, including maximum parsimony and maximum likelihood, were used to confirm the topology of the neighbor-joining tree. the following reference cv strains were used in the phylogenetic analysis: micv isolate dl13 (nc_023885.

of the 12 samples tested, 11 (91.7%) were positive for micv. no evidence of mixed infection with different pathogens was found in the analyzed samples. the heart, kidney, and gut samples of the deceased mink tested positive in the pcr targeting the replicase gene of micv. e. coli isolated from the liver of the mink carcass harbored the iuta and sfa virulence genes and were pathogenic to mice. the complete sequence was deposited in the genbank database under accession no. kx268345.
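
the bootstrap support mentioned in the methods rests on resampling alignment columns with replacement. the sketch below shows only that resampling step, under the assumption of a small hypothetical alignment; it does not reproduce mega's implementation, and the tree-building step that would follow each replicate is omitted.

```python
import random
from typing import List


def bootstrap_alignment(alignment: List[str], rng: random.Random) -> List[str]:
    """Resample alignment columns with replacement, keeping rows in register."""
    n_cols = len(alignment[0])
    if any(len(row) != n_cols for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    # Draw as many column indices as the alignment has columns.
    cols = [rng.randrange(n_cols) for _ in range(n_cols)]
    return ["".join(row[c] for c in cols) for row in alignment]


# Hypothetical toy alignment of three sequences (not real circovirus data).
ALIGNMENT = ["ACGTACGT", "ACGTTCGT", "ACCTACGA"]
rng = random.Random(0)  # fixed seed so the replicates are reproducible
REPLICATES = [bootstrap_alignment(ALIGNMENT, rng) for _ in range(1000)]

print(len(REPLICATES), REPLICATES[0])
```

in a full analysis a tree would be inferred from each replicate, and the support for a clade would be the percentage of replicate trees that contain it.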
the complete genome of micv strain heb15 comprises 1753 nucleotides as a covalently closed circular dna and has a nucleotide composition of 25.50% a, 26.81% t, the genome sequence of micv isolate heb15 was compared individually to those of other cvs, and was found to be 20.3% identical to that of cocv and 66.6% identical to that of batcv 1. by contrast, micv was more closely related to batcv 1 (66.6% nucleotide sequence identity) than to dogcvs and gocvs (20.3.1%-55%).

nucleotide sequence analyses indicated that the micv heb15 strain exhibited a genome organization similar to that of previously characterized cvs, such as pig cvs (pcvs) and bfdv [9, 23]. sequence features included a distinctive inverted repeat sequence that forms a thermodynamically stable stem-loop with a highly conserved nonanucleotide motif (tagtattac) at its top, where rolling-circle replication (rcr) of the virus dna strand has been postulated to initiate. accordingly, the convention for nucleotide numbering and labeling of orfs has been adopted from previous reports [10]. thus, the a residue, which is found in the eighth position in the nonamer and lies immediately downstream of the rcr cleavage site, is regarded as nucleotide position 1.

the micv genome was found to contain seven orfs potentially encoding proteins containing more than 50 amino acids. similar to pcv 1, pcv 2, and bfdv, the micv genome possessed two major orfs, one in the viral sense orientation (v1, encoding the viral replicase) and one on the complementary strand in the opposite orientation (c1, encoding the capsid protein). fig. 1 presents the genome locations of the major orfs in micv. the largest orf, v1, is located at nucleotide position 48-941 and is postulated to encode the replication-associated protein (rep), with a predicted length of 297 aa. this value was within the range of 289 amino acids (bfdv) to 317 amino acids (pigeon cv, [picv]) observed in other cvs.
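
the relationship between the v1 coordinates and the predicted rep length quoted above can be checked with a short calculation. the sketch assumes that the stated coordinates are 1-based, inclusive, and include the stop codon; the helper name `orf_length_aa` is hypothetical.

```python
def orf_length_aa(start: int, end: int, includes_stop: bool = True) -> int:
    """Convert 1-based inclusive ORF coordinates into a protein length in aa."""
    n_nt = abs(end - start) + 1  # abs() also handles complementary-strand ORFs
    if n_nt % 3 != 0:
        raise ValueError("ORF length is not a multiple of three")
    codons = n_nt // 3
    # Assumption: the final codon is a stop codon and is not translated.
    return codons - 1 if includes_stop else codons


# v1 (rep) orf of micv heb15, nt 48-941: 894 nt, 298 codons, 297 aa.
print(orf_length_aa(48, 941))
```

under that assumption the 894-nt orf yields 297 residues, in agreement with the length given in the text.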
the rep gene of heb15 contained one nucleotide mismatch when compared with dl13, and the substitution resulted in a nonsynonymous codon change. the predicted rep of micv shared between 42.1% (gucv) and 79.7% (batcv 1) amino acid sequence identity with those of other cvs. pairwise comparisons showed that the amino acid identity levels of the putative v1 orf product shared by micv with batcv 1 (79.7%), dogcv (54.5%), and swan cv (51.1%) were broadly similar to those with pcvs (46.5%). a sequence alignment of the putative rep of micv with those of known members of the genus circovirus identified several highly conserved amino acid motifs, including wwdgy (aa 195-199 in micv), ddfygw (aa 208-213), and dryp (aa 224-227) [24, 25]. a closer examination of the v1 orf amino acid sequences showed that the amino acid motifs associated with rcr and the p-loop were present in the sequences of micv and those of bfdv and pcvs. the motifs related to rcr activity [26] include ftinn, which occurs at aa 13 to 17 in micv, and tphlqg, which appears at aa 49 to 54 in micv. csk is the sequence of the third rcr motif located at aa 91 to 93 in the micv rep. g*gks is the fourth motif, which is putatively associated with dntpase activity [27], at aa 171 to 175 in the micv rep.

the c1 orf, which is the putative cap gene of micv (nt 1032-1715), encodes a protein of 227 amino acids. this value was the smallest and was below the range of 233 (pcv 1 and pcv 2) to 276 amino acids (stcv) observed in other cvs (table 2). the predicted cap of micv shared between 20.7% (ducv) and 45.4% (pcv 2 and batcv) amino acid identity with those of other cvs. after pcv 2 and batcv (45.4%), the next-highest amino acid sequence identity was shared with pcv 1 (43.6%). micv also shared higher levels of identity with mammalian cvs, such as pcvs (43.6% and 45.4%) and batcv 1 (45.4%), than with avian cvs (20.7%-26.0%). similar to other animal cvs, micv had a putative capsid protein with an amino terminus containing an arginine (r)-rich stretch with a length of 30 aa. the cap nucleotide sequences of micv heb15 and the two other strains of micv, namely, dl13 and hebei13, were 99.3% and 99.7% identical, respectively. furthermore, a high degree of identity (99.3%-99.7%) was observed between the nucleotide sequences of micv strains. the cap gene contained five nucleotide mismatches compared with the dl13 isolate. three of the five substitutions were transversions, and all of the substitutions were synonymous: the alignment of the entire cap protein sequence of micv heb15 with the published sequences of dl13 and hebei13 showed no variation at all.

fig. 1 diagram of the circovirus genome isolated from mink. the nucleotide position is indicated for each orf. rep, replicase; cap, capsid protein. the hairpin-like palindromic structure, the origin of replication, is shown on the right.

database searches with the micv sequence revealed that none of the minor orfs shared significant amino acid sequence similarity with any of the orfs specified by other cvs. however, a putative orf, postulated to use an alternative ttg start codon, occurred on the complementary strand (nt 887-504), which shared weak homology with the c2 orf of ducv (26%) and was in a similar location to those of previously characterized cvs.

like other cvs, the heb15 genome was found to contain two intergenic noncoding regions located between the replicase and the capsid protein genes (table 2). the 5′ and 3′ intergenic regions of micv were 85 and 90 nt in length, respectively (table 2). in the 5′ intergenic region of micv, the conserved nonanucleotide motif of micv was flanked by a potential stem-loop with 11 base pairs in the stem.
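
the intergenic region lengths follow directly from the orf coordinates once the circularity of the genome is taken into account. the sketch below uses the genome length (1753 nt) and the v1 (nt 48-941) and c1 (nt 1032-1715) coordinates given in the text; the helper `circular_gap` is a hypothetical name and the coordinates are assumed to be 1-based and inclusive.

```python
def circular_gap(end_of_first: int, start_of_second: int, genome_length: int) -> int:
    """Nucleotides lying strictly between two features on a circular genome."""
    # The modulo handles the case where the gap wraps past the last nucleotide.
    return (start_of_second - end_of_first - 1) % genome_length


GENOME_LENGTH = 1753
REP = (48, 941)    # v1 orf, viral strand
CAP = (1032, 1715)  # c1 orf, complementary strand (coordinates on the viral strand)

# 3' intergenic region: nt 942-1031, between the ends of the two orfs.
three_prime_ir = circular_gap(REP[1], CAP[0], GENOME_LENGTH)
# 5' intergenic region: nt 1716-1753 plus nt 1-47, wrapping around position 1.
five_prime_ir = circular_gap(CAP[1], REP[0], GENOME_LENGTH)

print(five_prime_ir, three_prime_ir)
```

the 5′ region spans the numbering origin, which is why a simple subtraction of coordinates would give a negative value without the modulo step.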
in addition, the intergenic region of micv possessed two tandemly repeated decamers ggcacacctc (nt 13-22 and 23-32) adjacent to the potential stem-loop. in the 3′ intergenic region, a 45-nt stretch showed 97.78% (44/45) nt sequence identity to its counterpart in batcv 1 (isolate xor), which was recently detected in bats by metagenomic analysis of tissue samples [28]. potential poly(a) signals appropriately located for use with possible transcripts containing the v1 and c1 orfs were identified in the micv genomes by analogy to similar sequences in the pcv and bfdv genomes [23, 24]. potential poly(a) signals for the v1 orf were determined at nt 924-929 (aataaa) and nt 937-943 (attaaaa). in the complementary strand, three possible signals for the c1 orf were identified at nt 944-950 (aataaaa), nt 961-968 (gataaaaa), and nt 1030-1036 (cttaaaa).

a phylogenetic analysis based on the complete genome sequences of the two micv strains and representative cvs was performed to investigate the genetic relationship between micv and other cvs (fig. 2). in the neighbor-joining tree, the micv strains were grouped with mammalian cvs, which comprised batcv 1, dogcv, and pcvs, and these were clearly separated from bird cvs. micvs were more closely related to pcv 1 and pcv 2 than to dogcv (fig. 2a). a phylogenetic analysis based on rep sequences (fig. 2b) demonstrated that the two micv strains were grouped with batcvs and were more closely related to dogcv than to pcv 1 and pcv 2.
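
the identity figure quoted above for the 45-nt stretch of the 3′ intergenic region (44 of 45 positions, 97.78%) is a simple proportion of matching columns. a minimal sketch follows, assuming two pre-aligned sequences of equal length; the example sequences are hypothetical stand-ins, not the micv or batcv 1 sequences, and gap columns would count as mismatches here.

```python
def percent_identity(a: str, b: str) -> float:
    """Percentage of positions at which two aligned sequences agree."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    if not a:
        raise ValueError("sequences must not be empty")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


# Hypothetical 45-nt sequences differing at a single position.
seq_a = "ACGT" * 11 + "A"
seq_b = "T" + seq_a[1:]

print(round(percent_identity(seq_a, seq_b), 2))
```

published identity values depend on how gaps and alignment ends are treated, so a figure computed this way is only comparable with others derived under the same convention.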
in trees constructed based on the cap gene, the micvs were tightly intermingled with the batcv 1 isolate and were close to pcv 1 and pcv 2, but dogcv was grouped into a separate cluster (fig. 2c). these results were verified using the maximum-likelihood and maximum-parsimony methods, which yielded the same tree topology (data not shown).
virus | genome (nt) | intergenic regions (nt) | nonanucleotide motif and adjacent sequence | cap (aa) | rep (aa)
 |  |  | tag tat tac ggc aca cctc | 227 | 297
pcv1 | 1759 | 82 / 36 | tag tat tac cgg cag c | 233 | 312
pcv2 | 1768 | 83 / 38 | aag tat tac cgg cag cac ctc | 233 | 314
bfdv | 1993 | 145 / 234 | tag tat tac ggg gca ccg | 247 | 289
picv | 2037 | 90 / 171 | tag tat tac gga ccc ac | 273 | 317
cacv | 1952 | 77 / 249 | cag tat tac gga gcc ac | 250 | 290
ficv | 1962 | 71 / 307 | tag tat tac tgg aac c | 249 | 291
gucv | 2035 | 207 / 72 | tag tat tac ggg gcc at | 245 | 305
gocv | 1821 | 132 / 54 | tat tat tac gta ctc cg | 250 | 293
ducv | 1996 | 111 / 232 | tag tat tac tac tcc g | 257 | 292
dogcv | 2063 | 135 / 203 | tag tat tac cacag | 270 | 303
batcv | 1862 | 86 / 171 | tag tat tac cac ttc ggca | 238 | 295
stcv | 2063 | 79 / 293 | cag tat tac gga gcc a | 276 | 289
swcv | 1783 | 107 / 39 | tat tat tac actac | 251 | 293
racv | 1898 | 86 / 204 | gag tat tac gga gcc | 243 | 291
micv circulated on mink farms in dalian city, liaoning province, china, in 2013 [11]. farmers observed disease in mev-vaccinated minks in 1978. clinical signs included diarrhea, lethargy, anorexia, pale muzzle, and unkempt fur. on farms with diseased minks, all of the animals were affected, and 70%-80% showed clinical signs. however, most of the animals recovered, and fewer than 10% of the affected minks died. micv was found in the liver, digestive tract, and fecal specimens of minks and was suggested to be responsible for diarrhea [11]. however, its pathogenic role remains unknown, particularly its infection or coinfection process. in the present study, we investigated a major outbreak in a large population. samples were tested for other causative agents, including madv, mev, cdv, coronavirus, calicivirus, rotavirus, bacterial pathogens, and parasites.
after these pathogens were excluded, micv was detected alone, and was not found in combination with other key pathogens, such as madv, mev, and cdv. thus, our findings suggested a possible association between micv and mink acute enteritis. consistent with the detection results presented by lian et al. [11], our findings showed that the virus was located in the gut and feces. the virus was also found in the heart and kidney, but not in the liver. the sites where carcasses and feces were collected in this study were approximately 1000 km away from where lian et al. obtained their samples [11], and the isolates were therefore of different geographical origin. moreover, the mink breed, disease symptoms, and disease duration differed. as such, the sustained direct transmission of micv among minks was excluded, and micv heb15 was considered a new strain. the data gathered in this study expanded the known geographical distribution of micv. unfortunately, the carcass of the micv-infected mink was not stored properly by the resident veterinarians and underwent repeated cycles of freezing and thawing, making it unsuitable for histopathological examination. for this reason, valuable information on tissue localization and alterations of micv could not be collected.
fig. 2 phylogenetic analysis of the genome sequences of mink circovirus and other circoviruses, using the neighbor-joining method, with 1,000 bootstrap replicates (mega 6.06). (a) analysis of the whole genome sequence. (b) analysis of the rep gene sequence. (c) analysis of the cap gene sequence. only bootstrap support values greater than 60% are shown. the bar indicates the genetic distance. the sequence of the circovirus isolated in this study is indicated by a star. other sequences were obtained from genbank; accession numbers of those sequences are included in the tree.
the present study could contribute to our knowledge of the pathogenic potential of micv and its association with mink enteritis if our results were corroborated by further reports. the presence of micv might help to explain the different disease outcomes and severity often observed in madv-/mev-infected animals. our results indicate that minks with acute gastroenteritis should be screened for micv and established pathogens, such as e. coli and mev. this study was conducted to identify the viral agents associated with acute diarrhea in minks, and no attempt was made to determine the primary cause of death. to date, attempts to isolate micv have been unsuccessful. current surveillance methods are limited to viral detection using molecular assays, and no seroprevalence information is available. as a consequence, the identification and characterization of the virus depend on genetic approaches such as pcr and sequence determination and analysis. in this study, the nucleotide sequence of the micv genome was determined and analyzed. our results showed that the genome sequence of micv isolate heb15 was highly similar to that of micv dl13 in terms of size (1753 nt), nucleotide sequence (99.66% identity), and orf analysis. the examination of other micv sequences from different regions will help to assess the level of genetic diversity. in our study, sequence analysis confirmed that micv genomes displayed the characteristics of members of the genus circovirus, and the common features included their genome organization, the presence of a potential stem-loop and conserved nonanucleotide motif postulated to be the origin of viral dna replication, and major orfs and repeats [26, 27]. conserved amino acid motifs, including wwdgy, ddfygwlp, and dryp, which have unknown functions, were also recognized within rep associated with rcr and dntpase activity.
these motifs could be utilized to design primers that could amplify cv-specific dna for discovery of new cvs in other hosts. additional micv surveillance in animals is required to clarify cv epidemiology. further efforts are necessary to identify and characterize the infection or coinfection processes. future studies should investigate experimental infections to obtain conclusive information on the pathogenic role of micv and to monitor the circulation of these viruses and their effects on mink populations. this study could contribute to our knowledge about the pathogenic potential of micv and its association with acute gastroenteritis in minks. the complete genome sequence of micv strain heb15 is reported here to help elucidate the epidemiological, evolutionary, and pathogenic characteristics of mammalian cvs. the availability of the micv genome sequence also provides a basis for the development of molecular reagents that could be used to identify other novel cvs that infect other mammalian species. 
[1] ictv virus taxonomy profile: circoviridae
[2] circovirus in tissues of dogs with vasculitis and hemorrhage
[3] genomic characterization of a circovirus associated with fatal hemorrhagic enteritis in dog
[4] porcine circovirus type 2 (pcv2) infections: clinical signs, pathology and laboratory diagnosis
[5] avian circovirus diseases: lessons for the study of pmws
[6] circovirus-like infection in a pigeon
[7] particles resembling circovirus in the bursa of fabricius of pigeons
[8] animal circoviruses
[9] porcine circoviruses-small but powerful
[10] circoviruses: immunosuppressive threats to avian species: a review
[11] novel circovirus from mink
[12] the relationship between capsid protein (vp2) sequence and pathogenicity of aleutian mink disease parvovirus (adv): a possible role for raccoons in the transmission of adv infections
[13] detection of mink enteritis virus by loop-mediated isothermal amplification (lamp)
[14] alfieri af (2006) detection of canine distemper virus by reverse transcriptase-polymerase chain reaction in the urine of dogs with clinical signs of distemper encephalitis
[15] design and evaluation of a primer pair that detects both norwalk- and sapporo-like caliciviruses by rt-pcr
[16] identification of group a rotavirus gene 4 types by polymerase chain reaction
[17] evidence of interspecies transmission and reassortment among avian group a rotaviruses
[18] molecular characterization of a new species in the genus alphacoronavirus associated with mink epizootic catarrhal gastroenteritis
[19] molecular characterization of a novel astrovirus associated with disease in mink
[20] molecular epidemiology of mink circovirus
[21] cloning and sequence analysis of the n gene of porcine epidemic diarrhea virus ljb/03
[22] mega6: molecular evolutionary genetics analysis version 6.0
[23] psittacine beak and feather disease virus nucleotide sequence analysis and its relationship to porcine circovirus, plant circoviruses, and chicken anaemia virus
[24] genome sequence determinations and analyses of novel circoviruses from goose and pigeon
[25] identification of a novel circovirus in australian ravens (corvus coronoides) with feather disease
[26] conserved sequence motifs in the initiator proteins for rolling circle dna replication encoded by diverse replicons from eubacteria, eucaryotes and archaebacteria
[27] rep protein of tomato yellow leaf curl geminivirus has an atpase activity required for viral dna replication
[28] virome profiling of bats from myanmar by metagenomic analysis of tissue samples reveals more novel mammalian viruses
conflict of interest: the authors declare no conflicts of interest.
ethical approval: this article does not contain any studies with human participants performed by any of the authors. this study was performed in accordance with the recommendations in the guide for the care and use of laboratory animals of the ministry of health, china. prior to experiments, the protocol of the current study was reviewed and approved by the institutional animal care and use committee of northeast agricultural university (approved protocol number 2014
key: cord-279528-41atidai authors: abo-elkhier, mervat m.; abd elwahaab, marwa a.; abo el maaty, moheb i. title: measuring similarity among protein sequences using a new descriptor date: 2019-11-22 journal: biomed res int doi: 10.1155/2019/2796971 sha: doc_id: 279528 cord_uid: 41atidai the comparison of protein sequences according to similarity is a fundamental aspect of today's biomedical research. with the development of sequencing technologies, the number of protein sequences in the public databases is increasing exponentially. well-known sequence comparison methods are alignment based. they generally give excellent results when the sequences under study are closely related, but they are time consuming. herein, a new alignment-free method is introduced. our technique depends on a new graphical representation and descriptor. the graphical representation of protein sequence is a simple way to visualize protein sequences.
the descriptor compresses the primary sequence into a single vector composed of only two values. our approach gives good results with both short and long sequences within a short computation time. it is applied to nine beta globin, nine nd5 (nadh dehydrogenase subunit 5), and 24 spike protein sequences. correlation and significance analyses are also introduced to compare our similarity/dissimilarity results with those of other approaches and with sequence homology.
information encoded in the genome of any organism plays a central role in defining the life of that organism. the nucleotide sequence that forms any gene is translated into its corresponding amino acid sequence. this sequence of amino acids becomes functional only when it adopts its tertiary structure. experimental methods such as x-ray diffraction and nuclear magnetic resonance are considered authoritative ways of obtaining protein structure and function. these experimental methods are very expensive and time consuming. therefore, computational methods for predicting protein structure have become very useful. proteins with similar sequences are usually homologous, typically displaying similar 3d structure and function. sequence alignment is the first step of 3d structure prediction for protein sequences. sequence comparison approaches are classified into alignment-based and alignment-free methods. blast (basic local alignment search tool) and clustalw are the most widely used computer programs for alignment-based approaches [1] [2] [3]. the results of these programs provide an approximate solution to the protein alignment problem. on the other hand, many alignment-free approaches have been proposed for sequence comparison. most biological sequence analysis methods still have weaknesses, including low precision and long running times [4, 5]. similarity/dissimilarity analysis of biological sequences is used to extract information stored in the protein sequence. many mathematical schemes have been proposed to this end.
graphical representations of biological sequences identify the information content of any sequence to help biologists choose another complex theoretical or experimental method. graphical representation provides not only visual qualitative inspection of gene data but also mathematical characterizations through objects such as matrices. some 2d and 3d graphical representations are created by selecting a geometrical object that is used to describe nucleic acid bases or residues [6] [7] [8] [9] [10]. others are based on assigning vectors of two or three components to nucleic acid bases or amino acids [11] [12] [13] [14] [15] [16] [17]. adjacency matrices are also introduced in some articles [18] [19] [20] [21], where an exact solution is obtained to the protein alignment problem. additional methods use the discrete fourier transform (dft), in which dna sequences are mapped into four binary indicator sequences, followed by the application of the dft to these indicator sequences to transform them into the frequency domain [22, 23]. dynamic representation is used to remove degeneracies in the previously mentioned approaches [24] [25] [26] [27] [28] [29] [30] [31]. another method is based on the simplified pulse-coupled neural network (s-pcnn) and huffman coding, where the triplet code was used as a code bit to transform a dna sequence into a numerical sequence [32]. in this study, we introduce a new alignment-free method for protein sequences. each amino acid in the protein sequence is represented by a number, and a new 2d graphical representation is suggested. a new descriptor is introduced, comprising a vector composed of the mean and standard deviation of the combined (total) values of each protein sequence, ($\bar{a}_t$, $s_{a_t}$). our graphical representation eliminates degeneracy and has no loss of information. it is suitable for both short and long sequences.
as a proof of concept, our approach is applied to nine beta globin protein sequences and nine nd5 (nadh dehydrogenase subunit 5) protein sequences. it can be applied to sequences of any length with the same efficiency. correlation and significance analyses are introduced among our results, along with pid% [15] and clustalw [33], to demonstrate the utility of our approach. all the protein sequences used in this study were downloaded from the national center for biotechnology information.
a new 2d graphical representation is introduced. each amino acid in any protein sequence is represented by the suggested intensity $y_x(i)$ and intensity level $a_x(i)$. the intensity $y_x(i)$ of each amino acid in the sequence depends on its abundance and location in the different sequences. it is calculated using equation (1), where $f_x$ is the frequency of amino acid $x$ in the sequence (the number of occurrences of $x$ divided by $n$), $n$ is the protein sequence length (the number of residues in the protein sequence), and $i$ is the position of each amino acid $x$ in the sequence. then, the intensity level $a_x(i)$ of each amino acid $x$ in the sequence is calculated by using the natural logarithm function, as in equation (2). therefore, each amino acid has its own intensity level, which is a vector of $n$ elements according to equation (2). finally, the combined intensity level of the protein sequence, $a_t(i)$, is obtained by summing the 20 intensity level vectors $a_x(i)$ of the protein sequence, as in equation (3): $a_t(i) = \sum_{x} a_x(i)$, where the sum runs over the 20 amino acids. the combined intensity level $a_t(i)$ is also a vector of $n$ elements. each amino acid has its own graph, so twenty graphs are obtained for each sequence, one for each of the 20 different amino acids. the combined graph is obtained by combining these 20 graphs within a single graph. this combined intensity level is our new 2d graphical representation.
our approach is first applied to two short segments of protein from the yeast saccharomyces cerevisiae:
protein i: "wtfesrndpakdpvilwlnggpgcssltgl"
protein ii: "wffesrndpandpiilwlnggpgcssftgl"
these two short proteins consist of 30 amino acids each. the two sequences differ at positions 2, 11, 14, and 27. the values $y_x(i)$ and $a_x(i)$ for each amino acid in the two sequences are calculated. for protein i, the g amino acid occurs four times in the protein sequence, at positions 20, 21, 23, and 29. the frequency $f_g$ equals 4/30. by substituting into equations (1) and (2), the results for $y_g(i)$ and $a_g(i)$ are obtained, as presented in table 4. by summing the values of $a_x(i)$ for all amino acids in protein i, the total value $a_t(i)$ is obtained, as shown in figure 1(a). the position $i$ of each amino acid is plotted on the x-axis, and the total intensity level $a_t(i)$ on the y-axis. we next apply our approach to nine beta globin and nine nd5 (nadh dehydrogenase subunit 5) protein sequences, which are listed in tables 1 and 2. the 2d graphical representation for human, chimpanzee, and opossum beta globin protein sequences is illustrated in figure 2, and the representations for fin whale and rat nd5 protein sequences are illustrated in figures 3(a) and 3(b), respectively. we finally apply our approach to 24 coronavirus protein sequences, which are listed in table 3. the 2d graphical representations of tgevg from class i and gd03t0013 from sars_cov are illustrated in figures 4(a) and 4(b), respectively. mathematical descriptors help in recognizing major differences among similar protein sequences quantitatively. a new descriptor for protein sequences is suggested, which is a vector composed of the arithmetic mean $\bar{a}_t$ and standard deviation $s_{a_t}$ of the combined intensity level $a_t(i)$ of the protein sequence.
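the bookkeeping in the worked example above (sequence length, positions of g, the frequency of g, and the positions where the two segments differ) can be reproduced with a short check. this is only an illustrative sketch: the function names are hypothetical, and the sequences are written without any hyphen on the assumption that each segment is 30 contiguous residues, as the text states. the intensity formulas of equations (1) and (2) are not reproduced here.

```python
protein_i = "wtfesrndpakdpvilwlnggpgcssltgl"
protein_ii = "wffesrndpandpiilwlnggpgcssftgl"


def positions(seq, residue):
    """1-based positions at which `residue` occurs in `seq`."""
    return [i for i, aa in enumerate(seq, start=1) if aa == residue]


def frequency(seq, residue):
    """Frequency f_x: occurrences of the residue divided by sequence length."""
    return seq.count(residue) / len(seq)


def differing_positions(a, b):
    """1-based positions at which two equal-length sequences differ."""
    return [i for i, (x, y) in enumerate(zip(a, b), start=1) if x != y]


print(len(protein_i), len(protein_ii))              # 30 30
print(positions(protein_i, "g"))                    # [20, 21, 23, 29]
print(frequency(protein_i, "g") == 4 / 30)          # True
print(differing_positions(protein_i, protein_ii))   # [2, 11, 14, 27]
```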
they are evaluated as the arithmetic mean and the standard deviation of $a_t(i)$ over the $n$ positions of the sequence. this descriptor compresses the information from primary protein sequences into a single vector composed of only two values. the beta globin, nd5, and coronavirus protein sequence descriptors are given in tables 5-7, respectively. table 7 shows that the mean for all 24 coronaviruses is around 38.7, with a range from 38.601 to 38.838, while their standard deviation varies according to their class. they are divided into four classes. the first four viruses belong to class i. the fifth to the ninth coronaviruses belong to class ii. class iii contains the tenth and eleventh viruses. the remaining viruses, from the 12th to the 24th, belong to sars-cov. according to our approach, the standard deviation of class i ranges from 10.94 to 11.17. class ii's standard deviation ranges from 10.68 to 10.77. class iii's standard deviation has values from 10.6271 to 10.6458. sars-cov's standard deviation is approximately 10.58. the resulting standard deviation values of the 24 coronaviruses classify them correctly into the four classes. the ranges of the coronavirus classes according to our approach are shown in figure 5. to compare the species' protein sequences, the euclidean distance between the species' descriptors is evaluated. for example, the human beta globin protein sequence's descriptor is (37.145, 11.505) and the chimpanzee beta globin protein sequence's descriptor is (36.912, 11.586). to measure the degree of similarity between human and chimpanzee, the euclidean distance between these vectors is evaluated. the similarity/dissimilarity matrices of beta globin and nd5 protein sequences are given in tables 8 and 9, respectively. the results in table 8 show that human and chimpanzee sequences are similar. there is also striking similarity between mouse and rat sequences, while human and opossum sequences are clearly dissimilar.
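the descriptor and distance steps described above can be sketched as follows. this is a minimal illustration, not the authors' code: the function names are hypothetical, and the use of the sample standard deviation (n - 1 denominator) is an assumption, since the source equations are not available here. the two descriptors are the human and chimpanzee beta globin values quoted in the text.

```python
import math
import statistics


def descriptor(levels):
    """Return (mean, standard deviation) of a combined intensity-level vector.

    Assumption: sample standard deviation (n - 1 denominator); the paper's
    own definition may use n instead.
    """
    return statistics.mean(levels), statistics.stdev(levels)


def descriptor_distance(d1, d2):
    """Euclidean distance between two (mean, std) descriptors."""
    return math.dist(d1, d2)


human = (37.145, 11.505)   # descriptor quoted in the text
chimp = (36.912, 11.586)   # descriptor quoted in the text
print(round(descriptor_distance(human, chimp), 4))  # 0.2467
```

the printed value is simply the distance implied by the two quoted descriptors; it is not copied from table 8.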
no. | species | id | length
1 | gorilla | caa43421 | 121
2 | chimp | caa26204 | 125
3 | human | aaa16334 | 147
4 | rat | caa29887 | 147
5 | mouse | caa24101 | 147
6 | gutta | ach46399 | 147
7 | duck | caa33756 | 147
8 | gallus | caa23700 | 147
9 | opossum | aaa30976 | 147
pigmy chimpanzee, common chimpanzee, human, and gorilla nd5 protein sequences are similar, while the blue whale is similar to the fin whale, and mouse is similar to rat. as with beta globin, human and opossum remain dissimilar. however, our algorithm does not measure the degree of similarity very well for pigmy chimpanzee. the distance between human and pigmy chimpanzee is 0.1826, while the distance between human and gorilla is 0.0575, as shown in table 9. the results of both tables 8 and 9 are approximately comparable to previous reports [13, 15, 21, 33-39]. we obtained the phylogenetic trees of beta globin and nd5 protein sequences by applying upgma (unweighted pair group method with arithmetic mean). the phylogenetic trees based on tables 8 and 9 are presented in figures 6 and 7, respectively. figure 6 supports the utility of our similarity/dissimilarity analysis for beta globin protein sequences. figure 7 shows our analysis of similarity/dissimilarity of nd5. as noted above, our algorithm does not measure the degree of similarity between pigmy chimpanzee and human well. this is apparent in figure 7: the p. chimp branch should be close to c. chimp. despite this error, the tree shows that human, common chimpanzee, pigmy chimpanzee, and gorilla belong to the same cluster. to check the effect of this error on our algorithm, the results of our algorithm are compared to sequence homology, and a correlation and significance analysis is provided. the results of our algorithm are compared to the sequence homology by two methods. first, we use the smith-waterman algorithm to calculate the number of identical residues in each pair of protein sequences [15].
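the upgma step used earlier in this section to build trees from a distance matrix can be sketched in a few lines. this is a generic textbook version under stated assumptions, not the authors' implementation: the function name `upgma`, the nested-tuple tree format, and the toy labels and distances are all hypothetical and are not taken from tables 8 or 9.

```python
from itertools import combinations


def upgma(names, dist):
    """Cluster labels by UPGMA and return the tree as nested tuples.

    names: list of leaf labels.
    dist:  dict mapping frozenset({a, b}) to the distance between a and b.
    """
    sizes = {name: 1 for name in names}   # current clusters and their leaf counts
    d = dict(dist)                        # working copy, extended as clusters merge
    while len(sizes) > 1:
        # pick the closest pair of current clusters
        a, b = min(combinations(sizes, 2), key=lambda p: d[frozenset(p)])
        merged = (a, b)
        na, nb = sizes.pop(a), sizes.pop(b)
        for c in sizes:
            # size-weighted average equals the mean over all leaf pairs
            d[frozenset((merged, c))] = (
                na * d[frozenset((a, c))] + nb * d[frozenset((b, c))]
            ) / (na + nb)
        sizes[merged] = na + nb
    return next(iter(sizes))


# toy example with made-up distances (not values from the paper)
toy_names = ["A", "B", "C", "D"]
toy_dist = {frozenset(p): 1.0 for p in combinations(toy_names, 2)}
toy_dist[frozenset(("A", "B"))] = 0.1
toy_dist[frozenset(("C", "D"))] = 0.2
tree = upgma(toy_names, toy_dist)
print(tree)  # (('A', 'B'), ('C', 'D'))
```

branch lengths are omitted to keep the sketch short; a full implementation would also record the merge heights.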
the pid% results for the nine beta globin sequences are presented as a similarity/dissimilarity matrix in table 10. a larger pid% indicates more similar protein sequences. a correlation and significance analysis is provided to compare our approach (table 8) with pid% (table 10). the correlation between two sets of data is considered sufficiently strong when the absolute value of the correlation coefficient (r) is greater than 0.7. a negative sign of r indicates that when the first data set increases, the second data set decreases. we then assess statistical significance for correlation coefficient values greater than 0.7 to ensure that they are unlikely to occur by chance. our sample set is composed of nine protein sequences; therefore, we use 7 degrees of freedom. a t-value of 2.365 or greater indicates a probability of less than 0.05 that the result occurred by chance. the correlation coefficients and t-values for our approach are presented in table 11. second, clustalw is a widely used system for aligning any number of homologous nucleotide or protein sequences [33]. the clustalw distance matrix of the nine nd5 protein sequences is given in table 12. correlation and significance analyses are also provided to compare our approach (table 9) with the clustalw results (table 12). the results of the correlation and significance analyses of our approach and other approaches [15, 33] are presented in table 13. our nd5 sample set is also composed of nine protein sequences; therefore, we again use 7 degrees of freedom and a t-value threshold of 2.365. despite the unusual result for pigmy chimpanzee that appeared in table 9, the correlation coefficient for pigmy chimpanzee between our similarity matrix and the clustalw matrix is 0.8811. this value is unlikely to occur by chance, as the t-value equals 4.928, as shown in table 13. the comparison between our results and pid%, clustalw, and the results of other approaches indicates the utility of our approach.
table 11: the correlation and significance analysis between our similarity analysis results of beta globin protein sequences in table 8 and the pid% similarity matrix in table 10.
table 13: correlation and significance analysis of table 9, table 7 in [33], and table 3 in [15] against the clustalw similarity matrix in table 12. columns: correlation coefficient (r) and t-value of our approach, of [33], and of [15] (table 3).
a new graphical representation of protein sequences is introduced. it is the combined intensity level of the 20 amino acids composing any protein sequence. each amino acid in a given protein sequence has its own intensity and intensity level. they are vectors of $n$ elements, where $n$ is the protein sequence length. the combined intensity level is then computed and graphed to represent any protein sequence graphically. our 2d graphical representation effectively displays differences between protein sequences without degeneracies. the graph does not overlap or intersect with itself. our new descriptor is a vector of two elements, which are the mean and standard deviation of the combined intensity level ($\bar{a}_t$ and $s_{a_t}$). a similarity/dissimilarity analysis is carried out by computing the euclidean distance between each pair of species' descriptors. examination of similarity/dissimilarity among nine beta globin, nine nd5, and 24 coronavirus protein sequences provided good results compared to previous approaches. the suggested approach is effective for both short and long sequences, and the computations are very simple. furthermore, loss of sequence information is avoided. correlation and significance analyses with pid% and clustalw are also introduced to show the utility of our approach.
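the significance test used in the correlation analysis above can be sketched with the standard t statistic for a pearson correlation, $t = r\sqrt{n-2}/\sqrt{1-r^2}$. that the authors used exactly this formula is an assumption; the function name is hypothetical. with r = 0.8811 and nine sequences (7 degrees of freedom) it gives a value close to the 4.928 reported in the text.

```python
import math


def correlation_t_value(r, n):
    """t statistic for a Pearson correlation r computed from n paired values."""
    df = n - 2                      # degrees of freedom
    return r * math.sqrt(df) / math.sqrt(1.0 - r * r)


t = correlation_t_value(0.8811, 9)
print(round(t, 2))  # 4.93
```

the small difference from the reported 4.928 is consistent with r having been rounded to four decimals.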
[1] basic local alignment search tool
[2] gapped blast and psi-blast: a new generation of protein database search programs
[3] clustal w: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice
[4] graphical representation of proteins
[5] similarity/dissimilarity calculation methods of dna sequences: a survey
[6] highly compact 2d graphical representation of dna sequences, sar and qsar
[7] unique graphical representation of protein sequences based on nucleotide triplet codons
[8] novel 2-d graphical representation of proteins
[9] representation of protein sequences on latitude-like circles and longitude-like semi-circles
[10] on a geometry-based approach to protein sequence alignment
[11] dna sequence comparison by a novel probabilistic method
[12] 2-d graphical representation of proteins based on physico-chemical properties of amino acids
[13] a 2d graphical representation of protein sequence and its numerical characterization
[14] 3-d maps and coupling numbers for protein sequences
[15] 3d graphical representation of protein sequences and their statistical characterization
[16] dna sequence representation without degeneracy
[17] protein map: an alignment-free sequence comparison method based on various properties of amino acids
[18] on novel representation of proteins based on amino acid adjacency matrix
[19] protein alignment: exact versus approximate. an illustration
[20] very efficient search for protein alignment-vespa
[21] similarity analysis of protein sequences based on 2d and 3d amino acid adjacency matrices
[22] a measure of dna sequence similarity by fourier transform with applications on hierarchical clustering
[23] a new method to cluster dna sequences using fourier power spectrum
[24] descriptors of 2d-dynamic graphs as a classification tool of dna sequences
[25] distribution moments of 2d-graphs as descriptors of dna sequences
[26] similarity studies of dna sequences using genetic methods
[27] 2d-dynamic representation of dna sequences
[28] 3d-dynamic representation of dna sequences
[29] spectral-dynamic representation of dna sequences
[30] four-component spectral representation of dna sequences
[31] 20d-dynamic representation of protein sequences
[32] a novel dna sequence similarity calculation based on simplified pulse-coupled neural network and huffman coding
[33] f-curve, a graphical representation of protein sequences for similarity analysis based on physicochemical properties of amino acids
[34] a novel descriptor for protein similarity analysis
[35] adld: a novel graphical representation of protein sequences and its application
[36] comparative analysis of protein primary sequences with graph energy
[37] the graphical representation of protein sequences based on the physicochemical properties and its applications
[38] a novel method of 2d graphical representation for proteins and its application
[39] novel numerical characterization of protein sequences based on individual amino acid and its application
all data are mentioned clearly in the manuscript in section 2 under the title "dataset, technology, and tools." in this section, we illustrate the data in three tables: tables 1, 2, and 3. we also mention that the data were downloaded from "gene bank." all data files have the extension ".fasta". the authors declare that they have no conflicts of interest.
key: cord-280881-5o38ihe0 authors: wlodawer, alexander; durell, stewart r; li, mi; oyama, hiroshi; oda, kohei; dunn, ben m title: a model of tripeptidyl-peptidase i (cln2), a ubiquitous and highly conserved member of the sedolisin family of serine-carboxyl peptidases date: 2003-11-11 journal: bmc struct biol doi: 10.1186/1472-6807-3-8 sha: doc_id: 280881 cord_uid: 5o38ihe0 background: tripeptidyl-peptidase i, also known as cln2, is a member of the family of sedolisins (serine-carboxyl peptidases). in humans, defects in expression of this enzyme lead to a fatal neurodegenerative disease, classical late-infantile neuronal ceroid lipofuscinosis. similar enzymes have been found in the genomic sequences of several species, but neither systematic analyses of their distribution nor modeling of their structures have been previously attempted. results: we have analyzed the presence of orthologs of human cln2 in the genomic sequences of a number of eukaryotic species. enzymes with sequences sharing over 80% identity have been found in the genomes of macaque, mouse, rat, dog, and cow. closely related, although clearly distinct, enzymes are present in fish (fugu and zebrafish), as well as in frogs (xenopus tropicalis). a three-dimensional model of human cln2 was built based mainly on the homology with pseudomonas sp. 101 sedolisin. conclusion: cln2 is very highly conserved and widely distributed among higher organisms and may play an important role in their life cycles. the model presented here indicates a very open and accessible active site that is almost completely conserved among all known cln2 enzymes. this result is somewhat surprising for a tripeptidase, where the presence of a more constrained binding pocket was anticipated. this structural model should be useful in the search for the physiological substrates of these enzymes and in the design of more specific inhibitors of cln2.
although the existence of tripeptidyl-peptidase i (tpp-i) was first noted over 40 years ago [1], the structural and mechanistic basis of its activity has been largely misunderstood until quite recently. the situation changed after it was shown that tpp-i is identical to an independently characterized enzyme named cln2. it was also demonstrated that mutations leading to abolishment of the enzymatic activity of cln2 were the direct cause of a fatal inherited neurodegenerative disease, classical late-infantile neuronal ceroid lipofuscinosis [2]. this important observation was followed by the identification of cln2 as a serine peptidase [3, 4], without, however, specifying its structural fold and the details of the catalytic site. more accurate placement of cln2 within the context of a family of related enzymes became possible only after high-resolution crystal structures of two bacterial enzymes with a limited sequence similarity to cln2, sedolisin and kumamolisin, became available [5] [6] [7]. these structures defined a novel family of enzymes, now called sedolisins or serine-carboxyl peptidases, that is characterized by the utilization of a fully conserved catalytic triad (ser, glu, asp) and by the presence of an asp in the oxyanion hole [8]. sedolisin and its several variants (e.g., kumamolisin, aorsin [9], and physarolisin [10]) have now been found in archaea, bacteria, fungi and amoebae, whereas the higher organisms seem to contain only variants of cln2 [8]. the physiological role of sedolisins in the lower organisms has not yet been elucidated. despite the potential medical importance of cln2 and related enzymes, no systematic studies of their genomic distribution have been published to date. there are also no published reports of the crystallization of this enzyme.
in the absence of an experimental structure obtained by crystallography or nmr it is sometimes necessary to resort to molecular modeling in order to provide a structural basis for the explanation of the biological properties of an enzyme, and, in particular, to initiate design of its inhibitors. examples of such very successful and useful modeling efforts are provided by hiv protease [11], or very recently by the peptidase from a coronavirus involved in the severe acute respiratory syndrome [12], among others. we have now applied the tools of molecular homology modeling to predicting a structure of cln2 that could be used as a basis for a search for the biological substrates of this family of enzymes and for the design of specific inhibitors.
mammalian enzymes homologous to human cln2 [2, 4] form a subfamily of sedolisins with highly conserved sequences (figure 1). these enzymes are expressed with a prosegment consisting of 195 residues that is cleaved off during maturation, yielding the active catalytic domain. complete sequences are available for cln2 from six species in which it has been found so far (human, macaque, dog, mouse, rat, and cow). the full-length enzymes consist of 563 amino acids arranged in a single polypeptide chain containing both the prosegment and the catalytic domain, with the exception of mouse cln2 that has a single deletion in the prosegment. the overall sequence identity for these enzymes is 81%, whereas the similarity is 92%. a pairwise comparison of the human and mouse enzymes yields 88% identity and 94% similarity, considerably higher than the median 78.5% identity reported for all identified mouse-human orthologs [13]. thus, mammalian cln2 appears to be a highly conserved enzyme.
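the identity and similarity percentages quoted above can be illustrated with a minimal sketch. it assumes two sequences that are already aligned (gaps written as '-'), counts only ungapped columns, and defines similarity through a small, hypothetical set of residue groups. the source does not state which scoring scheme or denominator was used, so this sketch is not expected to reproduce the published values.

```python
# Hypothetical residue groups used to call two different residues "similar".
# This grouping is an assumption for illustration; alignment programs
# typically derive similarity from a substitution matrix instead.
SIMILAR_GROUPS = ["ILVM", "FYW", "KRH", "DE", "NQ", "ST", "AG"]


def percent_identity_similarity(seq_a, seq_b):
    """Return (% identity, % similarity) for two pre-aligned sequences.

    Columns containing a gap ('-') in either sequence are skipped, so the
    denominator is the number of ungapped aligned columns (an assumption).
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")

    compared = identical = similar = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == "-" or b == "-":
            continue  # gap column: not counted
        compared += 1
        if a == b:
            identical += 1
            similar += 1  # identical residues also count as similar
        elif any(a in group and b in group for group in SIMILAR_GROUPS):
            similar += 1

    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return 100.0 * identical / compared, 100.0 * similar / compared


# Toy example (made-up fragments, not taken from the paper's alignments).
identity, similarity = percent_identity_similarity("AKL-", "ARIG")
print(f"identity {identity:.1f}%, similarity {similarity:.1f}%")
```

because similar residues are counted in addition to identical ones, the similarity value can never be lower than the identity value, which matches the pattern of the figures reported in the text.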
in addition to the mammals, cln2-like enzymes are also found in fugu (puffer fish; unannotated record sinfrup00000077297 in the ncbi fugu sequence database, http://www.ncbi.nlm.nih.gov/blast/genome/fugu.html) and zebrafish (contig wz4596.2 in the zebrafish est database, http://fisher.wustl.edu/fish_lab). only a fragment of the sequence of the latter enzyme agreed with the former. however, a few comparatively minor modifications can bring the zebrafish sequence into good agreement with that of the fugu cln2. these modifications include a deletion of a single nucleotide from a run of three, as well as three insertions of repeated nucleotide pairs (figure 2). it must be stressed that these modifications are speculative and may lead to prediction of several incorrect amino acids; however, they bring the two sequences into good global agreement (69% identities and 83% similarities). the available amino acid sequence of the fugu cln2 analog, named by us sedolisin-tpp [8], is also in good agreement with the sequences of the mammalian orthologs (figure 3). the only major difference in the translated amino acid sequence compared to the mammalian and zebrafish enzymes is in the amino terminus of the propeptide region that is shorter by 30 amino acids (not shown). it is very likely that this represents a fault in the assembled sequence rather than a real variation, since the current coding frame is not initiated with a methionine, and a few extra residues are present in the full genomic sequence available from the fugu sequencing consortium (http://genome.jgi-psf.org/fugu6/fugu6.home.html). cln2 is present not only in fish, but also in amphibians, in particular in xenopus tropicalis (a species of frog).
a partial sequence of its sedolisin-tpp (al594774) found in the est database (http://www.sanger.ac.uk/projects/x_tropicalis/blast_server.shtml) spans the middle part of the catalytic domain, without reaching the part of the active site closer to the n terminus that contains the aspartic and glutamic acids that belong to the catalytic triad. however, the sequenced part of the enzyme shows 75% identity with the fugu sedolisin-tpp, and 69% identity with human cln2 (figure 3). sequence similarity to bacterial or fungal sedolisins is much lower, indicating that the enzyme found in frogs might also share the functional properties of the cln2 subfamily.
figure 1. sequence comparisons of mammalian cln2-like enzymes. these sequences correspond to the complete enzymes, including the prosegment. residues forming the active site are shown in yellow on red background, other conserved residues identified as important for the stability of the enzyme are marked with yellow background, residues identical in at least 5 of the structures are green, and residues similar in their character are shown in magenta. the maturation cleavage point generating the n terminus of the active enzyme is marked with black triangles.
the discovery of highly conserved cln2-like enzymes not only in mammals but also in two fish species and one frog species may indicate that these peptidases are universally present in the vertebrates, and that their important role identified in humans [2] and mice [14] might be a more general feature. the medical importance of cln2 and the lack of a crystal structure inspired attempts at protein modeling. the first such model assumed that this enzyme is membrane-bound, with the sequence 271-294 (numbering corresponding to the mature enzyme, figure 1) forming the putative transmembrane anchor [15]. however, in view of the subsequently obtained structures of the fully water-soluble sedolisins, this model was clearly incorrect.
exploiting the sequence similarity between cln2, sedolisin, and kumamolisin (figure 4), we have now used the experimentally obtained structures of the latter two enzymes to form a new, homology-derived model of human cln2. the primary basis of the homology model was the structure of a complex of sedolisin with a covalently-bound inhibitor, pseudo-iodotyrostatin. although it has not been directly shown that this compound can inhibit human cln2, other similar peptides with an aldehyde functionality on their "c-termini" are weak, but detectable, inhibitors of this enzyme (oda, unpublished). it is thus likely that pseudo-iodotyrostatin or a similar inhibitor might work for cln2 as well, although the actual contacts between the inhibitor and the enzyme that are seen in the model have to be treated with caution. another reason for the modeling of a pseudo-iodotyrostatin complex is that cln2 is a tripeptidase, and this inhibitor represents the only experimental structure of a tripeptide analog bound to sedolisin. the r.m.s. deviation between the corresponding cα coordinates of the model of cln2 (figure 5) and the experimental structure of sedolisin is about 1.75 å, not much larger than the experimental difference between sedolisin and kumamolisin. interestingly, the cys327 and cys342 residues in the model were found to be ideally positioned to form a disulfide bond even though this was not part of the design strategy. that this bond likely occurs in the real protein is suggested by the fact that these two cysteines are strictly conserved in all known animal species of cln2 (figures 1 and 3), although they are absent in all known sequences of bacterial sedolisins. thus, if this disulfide were experimentally found to exist in cln2 it would provide support for the correctness of the model.
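for readers unfamiliar with the r.m.s. deviation quoted above, the following minimal sketch shows how such a value is computed from paired cα coordinates, using the standard definition sqrt((1/n) * sum of squared distances). it assumes the two structures have already been superimposed and that equivalent atoms are supplied in the same order; the superposition step itself and the programs actually used by the authors are not reproduced here, and the coordinates are made-up values.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equally long lists of
    (x, y, z) coordinates, paired by position.

    Assumes the coordinate sets are already optimally superimposed;
    no fitting (rotation or translation) is performed here.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")

    squared_sum = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        squared_sum += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(squared_sum / len(coords_a))


# Toy coordinates in angstroms (illustrative values, not from the model).
model_ca = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
template_ca = [(0.0, 0.0, 0.0), (1.0, 0.0, 2.0)]
print(f"r.m.s. deviation: {rmsd(model_ca, template_ca):.2f} angstroms")
```

because the deviations are squared before averaging, a few badly placed loop residues can dominate the value, which is one reason core-only and whole-chain figures are usually reported separately.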
figure 2. corrected gene sequence of the zebrafish cln2. this putative sequence shows the manual corrections that bring it into alignment with the sequence of the fugu enzyme. inserted nucleotides are marked in green and a deleted one in red.
since the principal known activity of cln2 is that of a tripeptidase, it is expected that three substrate-binding pockets, s1 through s3 (using the nomenclature of schechter and berger [16]), should be discernible. residues p1-p3 of the inhibitor that should occupy these pockets are shown in figure 6. all the available structures of the complexes of either sedolisin or kumamolisin with inhibitors contain either a tyrosine or a phenylalanine occupying the s1 pocket. parts of this pocket are fully conserved among different sedolisins, whereas other parts of it differ. the right-hand side of the pocket (in the view used in figure 6) is made of the main chain including residues 164-165 (unless otherwise indicated, the numbering refers to the sequence of the mature human cln2). this part of the main chain is held in place through a hydrogen-bonded interaction with the side chain of thr279, part of the signature sequence sgtsas surrounding the catalytic ser280 and its equivalents in the other enzymes. asp165 itself is also conserved since that residue provides the lining of the oxyanion hole, so it can be safely assumed that this part of the s1 pocket is virtually identical. no side chains point into the pocket from there, though, so its importance is limited to providing a steric barrier and excluding solvent. another wall of the pocket is made up of the main chain of residues 130-132 that flank the conserved gly131. again, this part of the chain does not provide any specific interactions with the p1 residue of the substrate.
figure 3. sequences of the catalytic domains of cln2. complete sequences are shown for cln2 from human, fugu, and zebrafish, together with the partial sequence of putative cln2 in xenopus tropicalis. residues identical in all four enzymes are colored green and those similar are colored magenta. active site residues are marked as in figure 1.
considerable differences are seen, however, at the bottom of the pocket, where the side chains of asp179 in kumamolisin and the equivalent ser190 in sedolisin make hydrogen bonds to the p1 tyrosine of the inhibitor (if present). the equivalent residue in cln2 is thr182, but it is very unlikely that it can assume an orientation that would allow it to make a hydrogen bond to a p1 tyrosine. other polar residues in the vicinity are glu175 in sedolisin and the corresponding asp169 in kumamolisin. however, the residue found here in cln2 is cys170, much less polar than its counterparts. there is also no equivalent to the polar interaction between glu175 and glu171 in sedolisin, since the equivalent of the latter residue in cln2 is ser166, much smaller and pointing away. the side of the s1 pocket that is created by the very flexible side chain of arg179 in sedolisin contains only a much smaller ser174 in cln2, and thus is much more open in the latter protein. this part of the pocket, with the main chain of the protein quite distant from the substrate, is indeed not well conserved in these proteins, with kumamolisin missing it entirely due to a deletion in the corresponding sequence position. in summary, the s1 pocket in cln2 has less polar character than the equivalent pocket in the related proteins, and is lacking direct polar anchors for any side chains that might be present in the substrate. the s2 pocket in cln2 is also quite open and accessible to solvent.
it is most likely larger than the equivalent pockets in either sedolisin or kumamolisin, since these are limited by trp81 in the former and trp129 in the latter (these residues originate from different parts of the backbone in the two enzymes and are not topologically related). tyr130, an equivalent of the latter residue in cln2, is unlikely to come into direct contact with the p2 residue of the substrate due to its greater distance (almost 4 å for the closest atoms).
figure 4. sequence alignment of bacterial and mammalian enzymes. alignment of the sequences of sedolisin, kumamolisin, and human cln2 used in the construction of the model of the latter enzyme. the color scheme is the same as in figure 2.
figure 5. a homology-derived model of human cln2. ribbon diagram of the cα trace of cln2, with the segments that were modeled based on the highly conserved core of sedolisin and kumamolisin (r.m.s. deviation of 1 å) colored in red. side chains of the residues that were found to be mutated in the genes of families of patients with late-infantile neuronal ceroid lipofuscinosis [17] are marked in ball-and-stick.
an important remaining puzzle is that the predicted structure of cln2 does not show any clear limitations of the s3 pocket that could explain the tripeptidase activity of this enzyme. the location of the p3 side chain of the substrate is ambiguous, since it could point in either one of two directions by exchange with the n-terminal amine. the only negatively charged residue of cln2 that is found in this vicinity is asp132. although in the current model the distance between its carboxylate and the nitrogen of the n-terminus of the modeled substrate is about 6 å, these two groups could be brought into hydrogen-bonding range by some allowed changes in the torsion angles of the protein.
such a conformational change would involve breaking of the hydrogen bond between asp132 and ser139. however, this latter interaction is not likely to be structurally crucial since the serine is not absolutely conserved in all cln2-like enzymes.
the location of mutations found through a genetic survey of families of classic late-infantile neuronal ceroid lipofuscinosis patients has been described previously [17]. most such mutations result either in expression of truncated enzyme or in incorrect intron-exon splices. however, some of the mutations lead to single amino acid substitutions in the mature enzyme. such mutations include i92n, e148k, c170r and c170y, v190d, g194e, q227h, r252h, a259e, and s280l (figure 5). only the role of the last mutation is completely clear, since it replaces the catalytic serine of cln2 with a side chain that cannot support its enzymatic activity. no other residues appear to be located in the immediate vicinity of the substrate. residues val190, gly194, and arg252 are very highly conserved not only in cln2 but also in other sedolisins and must play an important structural role. the reasons why the remaining mutations would lead to the loss of enzymatic activity are much less clear, but the wide distribution of these mutations in the structure supports the conclusion that any modifications to cln2 that would abolish or impair its function could lead to the development of the disease.
figure 6. a model of the active site of human cln2. the enzyme is shown in complex with pseudo-iodotyrostatin, a good inhibitor of the sedolisin family of peptidases. only selected residues of the enzyme are explicitly shown on the background consisting of the molecular surface. the stick model of the inhibitor is colored gold and the p1-p3 residues are labeled in black. similar views have been previously published for the experimentally-determined structures of sedolisin and kumamolisin [8]. the figure was prepared using the program dino (http://www.bioz.unibas.ch/~xray/dino).
little is known at this time about biologically-relevant substrates of cln2. various defects that include truncations and single-site mutations in cln2 have been found in the genes of patients that display symptoms of late-infantile neuronal ceroid lipofuscinosis [17]. one of the symptoms of the disease is the accumulation of an autofluorescent material, ceroid-lipofuscin, in lysosomal storage bodies in various cell types, primarily in the nervous system. since a major component of such bodies appears to be intact subunit c of mitochondrial atp synthase, this protein has been implicated as a potential biological target of the protease. it has been shown recently that cln2 can indeed degrade this subunit on its n terminus [18], but the unambiguous proof that this is indeed the most important target is still lacking. cln2 is capable of processing a number of different angiotensin-derived peptides [19], with the efficiency of cleavage dependent on the length of such peptides. the most efficiently processed peptide consisted of 14 amino acids, with the tripeptide asp-arg-val removed from its n terminus. the model of cln2 presented here can easily accommodate this peptide on the p side of the substrate-binding site, although the exact mode of binding of the long p' portion of the substrate remains obscure. the observation that an analogous peptide acetylated on its n terminus cannot be processed supports the postulate that the interactions of the n-terminal amino group with the side chain of asp132 may be the most important feature defining the tripeptidase specificity of cln2. a number of different tripeptides can be serially processed from glucagon, with their sequences varying widely [20]. again, however, all of these tripeptides can be easily accommodated in the substrate-binding site of the cln2 model.
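the n-terminal tripeptidase behavior described above can be sketched as a toy model. the function below is a deliberate simplification under stated assumptions: it removes three residues at a time from a free n terminus, refuses to act when the n terminus is blocked (as reported for the acetylated peptide), and stops when three or fewer residues remain. it ignores the sequence preferences and length dependence mentioned in the text. the function name, the stopping rule, and the 14-residue example sequence (chosen only because it begins with asp-arg-val) are illustrative assumptions, not data taken from the cited studies.

```python
def serial_tripeptides(peptide, cycles=None, n_term_blocked=False):
    """Toy model of serial N-terminal tripeptide release.

    peptide        -- one-letter amino acid sequence, N terminus first
    cycles         -- maximum number of cleavage cycles (None = no limit)
    n_term_blocked -- True for, e.g., an N-acetylated peptide (no cleavage)

    Returns (list of released tripeptides, remaining fragment).
    Assumption: cleavage needs at least one residue after the tripeptide,
    so processing stops when three or fewer residues remain.
    """
    if n_term_blocked:
        return [], peptide

    released = []
    remaining = peptide
    while len(remaining) > 3 and (cycles is None or len(released) < cycles):
        released.append(remaining[:3])  # tripeptide leaving the N terminus
        remaining = remaining[3:]
    return released, remaining


# Illustrative 14-residue peptide starting with Asp-Arg-Val (D-R-V).
example = "DRVYIHPFHLLVYS"
print(serial_tripeptides(example, cycles=1))
print(serial_tripeptides(example, n_term_blocked=True))
```

a single cycle reproduces the removal of one n-terminal tripeptide, while leaving `cycles` unset mimics the serial processing reported for glucagon.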
other potentially biologically relevant substrates include cholecystokinin and possibly other neuropeptides [21] . an intriguing property of cln2 is its reported ability to cleave collagen-related peptides [22] . the tripeptides resulting from such processing include gly-pro-met, gly-pro-arg, and gly-pro-ala. it has been recently reported that kumamolisin, and particularly a closely related protein from alicyclobacillus sendaiensis (kumamolisin-as) can efficiently cleave not only collagen-related peptides, but also native type i collagen [23] . with the substrate-binding site of cln2 resembling that of kumamolisin more than sedolisin (the latter enzyme has low, if any, collagenase activity), the potential collagen-processing role of cln2 might warrant further investigation. since the catalytic machinery of cln2 matches closely that of sedolisin, kumamolisin, and other members of the family of serine-carboxyl peptidases, the enzymatic mechanism of all these enzymes is most likely the same. design of inhibitors specific for cln2 should incorporate the features that have been proven to be important for the related enzymes, such as the placement of an aldehyde functionality capable of making covalent interactions with the catalytic serine, or the utilization of chloromethyl ketone for the same purpose. since the few inhibitors that have been successfully used in the studies of sedolisins are either longer than tripeptides or contain blocking groups on their n termini, new tripeptide-based inhibitors with free n termini are now being synthesized (oda, unpublished). it will be necessary to test the binding properties of different substrates in order to determine the most promising peptide sequences. analysis of the model of cln2 suggests that the size of the s1 subsite is much larger than in either sedolisin or kumamolisin, and thus the use of a large p1 group might be indicated. 
of course, the availability of an experimental crystal structure will make the design of inhibitors easier and we are continuing our efforts to crystallize cln2 from different sources.
three-dimensional, atomic-scale models of cln2 were developed by exploiting the sequence similarity to the sedolisin and kumamolisin proteins (r.m.s. deviation of 1.0 å for 273 pairs of cα atoms in the core of the enzymes). presently, these two enzymes are the only members of the newly-defined sedolisin/serine-carboxyl peptidase family [8] for which the crystal structures have been published [5] [6] [7]. the actual protein data bank ([24]; http://www.pdb.org/) entries used in the modeling were 1ga4.pdb and 1gt9.pdb for sedolisin and kumamolisin, respectively. the first step was to form a global, multiple sequence alignment between all known members of the sedolisin family. studies have shown that incorporating the specific patterns of amino acid residue-type variation and conservation among a family of homologous proteins provides superior results over simple, pair-wise sequence alignment [25]. sequence files representing the different subfamilies were extracted from the non-redundant genbank database [26] using sedolisin, kumamolisin, and the human cln2 sequences as queries to the web-based version of the blast program ([27]; http://www.ncbi.nlm.nih.gov/blast/). initial multiple sequence alignments were formed with the clustalx computer program [28]. as is expected for a family of proteins, highly conserved segments were found aligned to the crystal structure-identified core regions of the sedolisin and kumamolisin sequences. subsequently, the sequences were divided into two groups: those closer to sedolisin than kumamolisin and vice versa. the alignment of these two groups was then manually set by the observed structural alignment of the sedolisin and kumamolisin proteins.
finally, some additional adjustment was required to correct the few places where highly conserved residues of the core regions were slightly out of alignment among different subfamilies of sequences. the model of human cln2 was built using the structure of sedolisin complexed with the inhibitor pseudo-tyrostatin [5, 6] as a template. the reason for this choice is that while different protein models were generally comparable, the chosen inhibitor was most compatible with the tripeptidase character of cln2. with the correspondence of residues specified in the alignments, atomic coordinates were transferred to the target sequence by a variety of methods, including the homology modeling modules of the look/genemine [29] and deepview [30] computer program packages. for the core and active site of the protein, coordinates for identical residues were simply transferred unchanged, whereas special care was required to position the side chains of residues differing from the template. this was first accomplished automatically by the two computer packages, then manually adjusted in the quanta molecular modeling package (accelrys, inc.) to better mimic the templates and optimize the interactions with surrounding residues. a similar two-step approach was used to manifest the insertions and deletions in the variable, loop regions of the protein, where it was necessary to create new backbone as well as side chain coordinates for the models. it should be noted that, for obvious reasons, the conformation of poorly conserved loop regions is generally the least accurate aspect of a homology model. fortunately, these problematic loops will not significantly affect the active site of the model, since only two of them impinge on the boundary of this highly conserved, functional region. the model was finished by performing energy minimization in vacuo with the computer program charmm [31].
this refined the structure by bringing the covalent geometry and non-bonded interactions into agreement with experimentally observed and calculated values. such optimizations included adjusting bond lengths, 3-point angles and 4-point dihedral angles, as well as eliminating atomic overlap and forming salt-bridges and hydrogen bonds. since presently the potential energy functions used to describe the atomic-scale models are not sufficiently comprehensive and accurate, the final energy of the model was not used as an indicator of the realistic quality of the structure. the final quality of the structure was analyzed with the computer program procheck [32]. the structure was deposited at the pdb under accession code 1r60.
studies on the serial extraction of pituitary proteins
association of mutations in a lysosomal protein with classical late-infantile neuronal ceroid lipofuscinosis
tripeptidyl-peptidase i is apparently the cln2 protein absent in classical late-infantile neuronal ceroid lipofuscinosis
the human cln2 protein/tripeptidyl-peptidase i is a serine protease that autoactivates at acidic ph
carboxyl proteinase from pseudomonas defines a novel family of subtilisin-like enzymes
inhibitor complexes of the pseudomonas serine-carboxyl proteinase
the 1.4 å crystal structure of kumamolysin: a thermostable serine-carboxyl-type proteinase
structural and enzymatic properties of the sedolisin family of serine-carboxyl peptidases
aorsin, a novel serine proteinase with trypsin-like specificity at acidic ph
structural and enzymatic characterization of physarolisin (formerly physaropepsin) proves that it is a unique serine-carboxyl proteinase
molecular modeling of the hiv-1 protease and its substrate binding site
coronavirus main proteinase (3clpro) structure: basis for design of anti-sars drugs
mouse gene knockout models for the cln2 and cln3 forms of ceroid lipofuscinosis
a proposed model for the late-infantile neuronal ceroid lipofuscinosis (batten disease) protein cln2
on the size of the active site in proteases. i. papain
mutational analysis of the defective protease in classic late-infantile neuronal ceroid lipofuscinosis, a neurodegenerative lysosomal storage disorder
tripeptidyl peptidase i, the late infantile neuronal ceroid lipofuscinosis gene product, initiates the lysosomal degradation of subunit c of atp synthase
the specificity of lysosomal tripeptidyl peptidase-i determined by its action on angiotensin-ii analogues
purification and characterisation of a tripeptidyl aminopeptidase i from rat spleen
lysosomal degradation of cholecystokinin-(29-33)-amide in mouse brain is dependent on tripeptidyl peptidase-1: implications for the degradation and storage of peptides in classical late-infantile neuronal ceroid lipofuscinosis
partial purification and characterization of an ovarian tripeptidyl peptidase: a lysosomal exopeptidase that sequentially releases collagen-related (gly-pro-x) triplets
collagenolytic serine-carboxyl proteinase from alicyclobacillus sendaiensis strain ntap-1: purification, characterization, gene cloning, and heterologous expression
the protein data bank
sequence comparisons using multiple sequences detect three times as many remote homologues as pairwise methods
gapped blast and psi-blast: a new generation of protein database search programs
the clustal_x windows interface: flexible strategies for multiple sequence alignment aided by quality analysis tools
the genemine system for genome/proteome annotation and collaborative data mining
swiss-model and the swiss-pdb-viewer: an environment for comparative protein modeling
charmm: a program for macromolecular energy, minimization, and dynamics calculations
procheck: program to check the stereochemical quality of protein structures
extensive discussions with dr. a. barrett (wellcome trust sanger institute, hinxton, uk) are gratefully acknowledged. modeling efforts were facilitated by dr. robert jernigan (iowa state university).
this work was supported in part by a grant-in-aid for scientific research (b), 15380072, from the ministry of education, culture, sports, science and technology of japan (to k.o.); by nih grants dk18865 and ai39211 (to b.m.d.); and in part with federal funds from the national cancer institute, national institutes of health, under contract no. no1-co-12400. the content of this publication does not necessarily reflect the views or policies of the department of health and human services, nor does the mention of trade names, commercial products or organizations imply endorsement by the u.s. government.
aw initiated this project and analyzed the genomic distribution of this family of enzymes. srd contributed the modeling of the three-dimensional structure of cln2. ml analyzed the model and compared it to the crystal structures of sedolisins. ho, ko and bmd contributed their experience gained from studies of serine-carboxyl peptidases and the design of their inhibitors, aimed at analysis of substrate-enzyme interactions and enzyme specificity. all authors read and approved the final manuscript.
key: cord-016293-pyb00pt5 authors: newell-mcgloughlin, martina; re, edward title: the flowering of the age of biotechnology 1990–2000 date: 2006 journal: the evolution of biotechnology doi: 10.1007/1-4020-5149-2_4 sha: doc_id: 16293 cord_uid: pyb00pt5 nan
the significance of developing genetic and physical maps of the genome, and the importance of comparing the human genome with those of other species. it also suggested a preliminary focus on improving current technology. at the request of the u.s. congress, the office of technology assessment (ota) also studied the issue, and issued a document in 1987, within days of the nrc report, that was similarly supportive. the ota report discussed, in addition to scientific issues, social and ethical implications of a genome program together with problems of managing funding, negotiating policy and coordinating research efforts.
prompted by advisers at a 1988 meeting in reston, virginia, james wyngaarden, then director of the national institutes of health (nih) , decided that the agency should be a major player in the hgp, effectively seizing the lead from doe. the start of the joint effort was in may 1990 (with an "official" start in october) when a 5-year plan detailing the goals of the u.s. human genome project was presented to members of congressional appropriations committees in mid-february. this document co-authored by doe and nih and titled "understanding our genetic inheritance, the u.s. human genome project: the first five years" examined the then current state of genome science. the plan also set forth complementary approaches of the two agencies for attaining scientific goals and presented plans for administering research agenda; it described collaboration between u.s. and international agencies and presented budget projections for the project. according to the document, "a centrally coordinated project, focused on specific objectives, is believed to be the most efficient and least expensive way" to obtain the 3-billion base pair map of the human genome. in the course of the project, especially in the early years, the plan stated that "much new technology will be developed that will facilitate biomedical and a broad range of biological research, bring down the cost of many experiments (mapping and sequencing), and finding applications in numerous other fields." the plan built upon the 1988 reports of the office of technology assessment and the national research council on mapping and sequencing the human genome. "in the intervening two years," the document said, "improvements in technology for almost every aspect of genomics research have taken place. as a result, more specific goals can now be set for the project." 
the document describes objectives in the following areas: mapping and sequencing the human genome and the genomes of model organisms; data collection and distribution; ethical, legal, and social considerations; research training; technology development; and technology transfer. these goals were to be reviewed each year and updated as further advances occurred in the underlying technologies. the overall budget needs were identified as the same as those estimated by ota and nrc, namely about $200 million per year for approximately 15 years. this came to about $3 billion over the entire period of the project. considering that in july 1990 the dna databases contained only seven sequences greater than 0.1 mb, this was a major leap of faith. the approach was a major departure from the single-investigator, gene-of-interest focus that research had taken hitherto, and it sparked much controversy both before and after its inception. critics questioned the usefulness of genomic sequencing, objected to the high cost, and suggested it might divert funds from other, more focused, basic research. the prime argument in support of the latter position was that there appeared to be far fewer genes than the mass of dna could account for, which suggested that the major part of the sequencing effort would be spent on long stretches of base pairs with no known function, the so-called "junk dna." and that was in the days when the number of genes was presumed to be 80-100,000. if, at that stage, the number had been guessed to be closer to the later estimate of 35-40,000 (subsequently reduced to 20-25,000), the task would have seemed even more foolhardy and less worthwhile to some. however, the ever-powerful incentive of new diagnostics and treatments for human disease beyond what could be gleaned from the gene-by-gene approach and the rapidly evolving technologies, especially that of automated sequencing, made it both an attractive and plausible aim.
charles cantor (1990) , a principal scientist for the department of energy's genome project contended that doe and nih were cooperating effectively to develop organizational structures and scientific priorities that would keep the project on schedule and within its budget. he noted that there would be small short-term costs to traditional biology, but that the long-term benefits would be immeasurable. genome projects were also discussed and developed in other countries and sequencing efforts began in japan, france, italy, the united kingdom, and canada. even as the soviet union collapsed, a genome project survived as part of the russian science program. the scale of the venture and the manageable prospect for pooling data via computer made sequencing the human genome a truly international initiative. in an effort to include developing countries in the project unesco assembled an advisory committee in 1988 to examine unesco's role in facilitating international dialogue and cooperation. a privately-funded human genome organization (hugo) had been founded in 1988 to coordinate international efforts and serve as a clearinghouse for data. in that same year the european commission (ec) introduced a proposal entitled the "predictive medicine programme." a few ec countries, notably germany and denmark, claimed the proposal lacked ethical sensitivity; objections to the possible eugenic implications of the program were especially strong in germany (dickson 1989) . the initial proposal was dropped but later modified and adopted in 1990 as the "human genome analysis programme" (dickman and aldhous 1991) . this program committed substantial resources to the study of ethical issues. the need for an organization to coordinate these multiple international efforts quickly became apparent. thus the human genome organization (hugo), which has been called the "u.n. for the human genome," was born in the spring of 1988. 
composed of a founding council of scientists from seventeen countries, hugo's goal was to encourage international collaboration through coordination of research, exchange of data and research techniques, training, and debates on the implications of the projects (bodmer 1991). in august 1990 nih began large-scale sequencing trials on four model organisms: the parasitic, cell-wall lacking pathogenic microbe mycoplasma capricolum, the prokaryotic microbial lab rat escherichia coli, the simplest animal caenorhabditis elegans, and the eukaryotic microbial lab rat saccharomyces cerevisiae. each research group agreed to sequence 3 megabases (mb) at 75 cents a base within 3 years. a sub-living organism was, in fact, fully sequenced: the complete human cytomegalovirus (hcmv) genome was 0.23 mb. that year also saw the casting of the first salvo in the protracted debate on "ownership" of genetic information, beginning with the more tangible question of ownership of cells. and, as with the debates of the early eighties, which were to be revisited later in the nineties, the respondent was the university of california. moore v. regents of the university of california was the first case in the united states to address the issue of who owns the rights to an individual's cells. diagnosed with leukemia, john moore had blood and bone marrow withdrawn for medical tests. suspicious of repeated requests to give samples because he had already been cured, moore discovered that his doctors had patented a cell line derived from his cells, and so he sued. the california supreme court found that moore's doctor did not obtain proper informed consent, but it also found that moore could not claim property rights over his body. the quest for the holy grail of the human genome both was inspired by the rapidly evolving technologies for mapping and sequencing and subsequently spurred on the development of ever more efficient tools and techniques.
advances in analytical tools, automation, and chemistries as well as computational power and algorithms revolutionized the ability to generate and analyze immense amounts of dna sequence and genotype information. in addition to leading to the determination of the complete sequences of a variety of microorganisms and a rapidly increasing number of model organisms, these technologies have provided insights into the repertoire of genes that are required for life, and their allelic diversity as well as their organization in the genome. but back in 1990 many of these were still nascent technologies. the technologies required to achieve this end could be broadly divided into three categories: equipment, techniques, and computational analysis. these are not truly discrete divisions and there was much overlap in their influence on each other. as noted, lloyd smith, michael and tim hunkapiller, and leroy hood conceived the automated sequencer and applied biosystems inc. brought it to market in june 1986. there is not much doubt that when applied biosystems inc. put it on the market, that which had been a dream became decidedly closer to an achievable reality. in automating sanger's chain termination sequencing system, hood modified both the chemistry and the data-gathering processes. in the sequencing reaction itself, he replaced the radioactive labels, which were unstable, posed a health hazard, and required separate gels for each of the four bases, with a chemistry that used fluorescent dyes of a different color for each of the four dna bases. this system of "color-coding" eliminated the need to run several reactions in overlapping gels. the fluorescent labels also addressed another of the major concerns of sequencing: data gathering. hood integrated laser and computer technology, eliminating the tedious process of information-gathering by hand.
as the fragments of dna passed a laser beam on their way through the gel, the fluorescent labels were stimulated to emit light. the emitted light was transmitted by a lens, and the intensity and spectral characteristics of the fluorescence were measured by a photomultiplier tube and converted to a digital format that could be read directly into a computer. during the next thirteen years, the machine was constantly improved, and by 1999 a fully automated instrument could sequence up to 150,000,000 base pairs per year. in 1990 three groups came up with a variation on this approach, developing what is termed capillary electrophoresis: one team was led by lloyd smith (luckey, 1990), the second by barry karger, and the third by norman dovichi. in 1997 molecular dynamics introduced the megabace, a capillary sequencing machine. not to be outdone, the following year, 1998, the original of the species came up with the abi prism 3700 sequencing machine. the 3700 is also a capillary-based machine, designed to run about eight sets of 96 sequence reactions per day. on the biology side, one of the biggest challenges was the construction of a physical map to be compiled from many diverse sources and approaches in such a way as to ensure continuity of physical mapping data over long stretches of dna. the development of dna sequence tagged sites (stss) to correlate diverse types of dna clones aided this standardization of the mapping component by providing mappers with a common language and a system of landmarks for all the libraries from such varied sources as cosmids, yeast artificial chromosomes (yacs) and other rdna clones. this way each mapped element (individual clone, contig, or sequenced region) would be defined by a unique sts. a crude map of the entire genome, showing the order and spacing of stss, could then be constructed.
the order and spacing of these unique identifier sequences composing an sts map was made possible by development of mullis' polymerase chain reaction (pcr), which allows rapid production of multiple copies of a specific dna fragment, for example, an sts fragment. sequence information generated in this way could be recalled easily and, once reported to a database, would be available to other investigators. with the sts sequence stored electronically, there would be no need to obtain a probe or any other reagents from the original investigator. no longer would it be necessary to exchange and store hundreds of thousands of clones for full-scale sequencing of the human genome-a significant saving of money, effort, and time. by providing a common language and landmarks for mapping, sts's allowed genetic and physical maps to be cross-referenced. with a refinement on this technique to go after actual genes, sydney brenner proposed sequencing human cdnas to provide rapid access to the genes stating that 'one obvious way of finding at least a large part of the important [fraction] of the human genome is to look at the sequences of the messenger rna's of expressed genes' (brenner, 1990) . the following year the man who was to play a pivotal role on the world stage that became the human genome project suggested a way to implement sydney's approach. that player, nih biologist j. craig venter announced a strategy to find expressed genes, using ests (expressed sequence tag) (adams, 1991) . these so called ests represent a unique stretch of dna within a coding region of a gene, which as sydney suggested would be useful for identifying full-length genes and as a landmark for mapping. so using this approach projects were begun to mark gene sites on chromosome maps as sites of mrna expression. to help with this a more efficient method of handling large chunks of sequences was needed and two approaches were developed. 
yeast artificial chromosomes, which were developed by david burke, maynard olson, and george carle, increased insert size 10-fold (david t. burke et al., 1987) . caltech's second major contribution to the genome project was developed by melvin simon, and hiroaki shizuya. their approach to handling large dna segments was to develop "bacterial artificial chromosomes" (bacs), which basically allow bacteria to replicate chunks greater than 100,000 base pairs in length. this efficient production of more stable, large-insert bacs made the latter an even more attractive option, as they had greater flexibility than yacs. in 1994 in a collaboration that presages the snp consortium, washington university, st louis mo, was funded by the pharmaceutical company merck and the national cancer institute to provide sequence from those ests. more than half a million ests were submitted during the project (murr l et al., 1996) . on the analysis side was the major challenge to manage and mine the vast amount of dna sequence data being generated. a rate-limiting step was the need to develop semi-intelligent algorithms to achieve this herculean task. this is where the discipline of bioinformatics came into play. it had been evolving as a discipline since margaret oakley dayhoff used her knowledge of chemistry, mathematics, biology and computer science to develop this entirely new field in the early sixties. she is in fact credited today as a founder of the field of bioinformatics in which biology, computer science, and information technology merge into a single discipline. the ultimate goal of the field is to enable the discovery of new biological insights as well as to create a global perspective from which unifying principles in biology can be discerned. 
there are three important sub-disciplines within bioinformatics: the development of new algorithms and statistics with which to assess relationships among members of large data sets; the analysis and interpretation of various types of data including nucleotide and amino acid sequences, protein domains, and protein structures; and the development and implementation of tools that enable efficient access and management of different types of information. paralleling the rapid and very public ascent of recombinant dna technology during the previous two decades, the analytic and management tools of the discipline that was to become bioinformatics evolved at a more subdued but equally impressive pace. some of the key developments included the needleman-wunsch algorithm for sequence comparison, which appeared as early as 1970, even before recombinant dna technology had been demonstrated; the smith-waterman algorithm for sequence alignment (1981); the fastp algorithm (1985) and the fasta algorithm for sequence comparison by pearson and lipman in 1988; and perl (practical extraction and report language), released by larry wall in 1987. on the data management side several databases with ever more effective storage and mining capabilities were developed over the same period. the first bioinformatic/biological databases were constructed a few years after the first protein sequences began to become available. the first protein sequence reported was that of bovine insulin in 1956, consisting of 51 residues. nearly a decade later, the first nucleic acid sequence was reported, that of yeast alanine trna with 77 bases. just one year later, dayhoff gathered all the available sequence data to create the first bioinformatic database. one of the first dedicated databases was the brookhaven protein databank, whose collection consisted of ten x-ray crystallographic protein structures (acta. cryst. b, 1973).
the year 1982 saw the creation of the genetics computer group (gcg) as a part of the university of wisconsin biotechnology center. the group's primary and much used product was the wisconsin suite of molecular biology tools. it was spun off as a private company in 1989. the swiss-prot database made its debut in 1986 in europe at the department of medical biochemistry of the university of geneva and the european molecular biology laboratory (embl). the first dedicated "bioinformatics" company intelligenetics, inc. was founded in california in 1980. their primary product was the intelligenetics suite of programs for dna and protein sequence analysis. the first unified federal effort, the national center for biotechnology information (ncbi) was created at nih/nlm in 1988 and it was to play a crucial part in coordinating public databases, developing software tools for analyzing genome data, and disseminating information. and on the other side of the atlantic, oxford molecular group, ltd. (omg) was founded in oxford, uk by anthony marchington, david ricketts, james hiddleston, anthony rees, and w. graham richards. their primary focus was on rational drug design and their products such as anaconda, asp, and chameleon obviously reflected this as they were applied in molecular modeling, and protein design engineering. within two years ncbi were making their mark when david lipman, eugene myers, and colleagues at the ncbi published the basic local alignment search tool blast algorithm for aligning sequences (altschul et al., 1990) . it is used to compare a novel sequence with those contained in nucleotide and protein databases by aligning the novel sequence with previously characterized genes. the emphasis of this tool is to find regions of sequence similarity, which will yield functional and evolutionary clues about the structure and function of this novel sequence. 
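the idea of searching for regions of local sequence similarity can be made concrete with a small sketch. the python below is only an illustration under stated assumptions, not the blast implementation: it scans every diagonal between two sequences exhaustively (blast itself uses a faster seeded heuristic), scores ungapped segments with a simple match/mismatch scheme, and keeps the best segment on each diagonal whose score meets a cutoff. the function name `find_hsps`, the scoring values and the threshold are hypothetical choices made for this example.

```python
def find_hsps(query, subject, threshold=5, match=1, mismatch=-1):
    """Return (score, query_start, subject_start, length) tuples, best first.

    Illustrative only: for each diagonal, the maximal-scoring ungapped
    segment is found with a running-sum scan, and it is reported if its
    score meets or exceeds `threshold` (an assumed cutoff).
    """
    hits = []
    n, m = len(query), len(subject)
    # A diagonal is fixed by d = subject_index - query_index.
    for d in range(-(n - 1), m):
        i = max(0, -d)   # first query position on this diagonal
        j = i + d        # corresponding subject position
        run = 0          # score of the segment ending at the current position
        run_start = i
        best, best_start, best_len = 0, i, 0
        while i < n and j < m:
            s = match if query[i] == subject[j] else mismatch
            if run <= 0:
                # A non-positive prefix never helps; restart the segment here.
                run, run_start = s, i
            else:
                run += s
            if run > best:
                best, best_start, best_len = run, run_start, i - run_start + 1
            i += 1
            j += 1
        if best >= threshold:
            hits.append((best, best_start, best_start + d, best_len))
    return sorted(hits, reverse=True)


if __name__ == "__main__":
    for score, q_start, s_start, length in find_hsps("GATTACA", "TTGATTACATT"):
        print(score, q_start, s_start, length)
```

each reported segment pairs two fragments of equal length, one from each sequence, which is why no gaps appear; real search tools add substitution matrices, seeding and statistical significance estimates on top of this basic idea.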
regions of similarity detected via this type of alignment tool can be either local, where the similarity is confined to one region of the sequences being compared, or global, where the similarity is assessed along the entire length of the sequences. the fundamental unit of blast algorithm output is the high-scoring segment pair (hsp). an hsp consists of two sequence fragments of arbitrary but equal length whose alignment is locally maximal and for which the alignment score meets or exceeds a threshold or cutoff score. this system has been refined and modified over the years, the two principal variants presently in use being the ncbi blast and wu-blast (wu signifying washington university). the same year that blast was launched, two other bioinformatics companies were launched. one was informax in bethesda, md, whose products addressed sequence analysis, database and data management, searching, publication graphics, clone construction, mapping and primer design. the second, molecular applications group in california, was to play a bigger part on the proteomics end (michael levitt and chris lee). their primary products were look and segmod, which are used for molecular modeling and protein design. the following year, in 1991, the human chromosome mapping data repository, genome data base (gdb), was established. on a more global level, the development of computational capabilities in general and the internet in particular was also to play a considerable part in the sharing of data and access to databases that rendered the rapidity of the forward momentum of the hgp possible. also in 1991 edward uberbacher of oak ridge national laboratory in tennessee developed grail, the first of many gene-finding programs. in 1992 the first two "genomics" companies made their appearance. incyte pharmaceuticals, a genomics company headquartered in palo alto, california, was formed and myriad genetics, inc. was founded in utah.
incyte's stated goal was to lead in the discovery of major common human disease genes and their related pathways. the company discovered and sequenced, with its academic collaborators (originally synteni from pat brown's lab at stanford), a number of important genes including brca1 and brca2, with mary claire king, epidemiologist at uc-berkeley, the genes linked to breast cancer in families with a high degree of incidence before age 45. by 1992 a low-resolution genetic linkage map of the entire human genome was published and u.s. and french teams completed genetic maps of both mouse and man. the mouse with an average marker spacing of 4.3 cm as determined by eric lander and colleagues at whitehead and the human, with an average marker spacing of 5 cm by jean weissenbach and colleagues at ceph (centre d'etude du polymorphisme humaine). the latter institute was the subject of a rather scathing book by paul rabinow (1999) based on what they did with this genome map. in 1993, an american biotechnology company, millennium pharmaceuticals, and the ceph, developed plans for a collaborative effort to discover diabetes genes. the results of this collaboration could have been medically significant and financially lucrative. the two parties had agreed that ceph would supply millennium with germplasm collected from a large coterie of french families, and millennium would supply funding and expertise in new technologies to accelerate the identification of the genes, terms to which the french government had agreed. but in early 1994, just as the collaboration was to begin, the french government cried halt! the government explained that the ceph could not be permitted to give the americans that most precious of substances for which there was no precedent in law -french dna. 
rabinow's book discusses the tangled relations and conceptions such as, can a country be said to have its own genetic material, the first but hardly the last franco-american disavowal of détente (paul rabinow, 1999) . the latest facilities such as the joint genome institute (jgi), walnut creek, ca are now able to sequence up to 10mb per day which makes it possible to sequence whole microbial genomes within a day. technologies currently under development will probably increase this capacity yet further through massively parallel sequencing and/or microfluidic processing making it possible to sequence multiple genotypes from several species. nineteen ninety-two saw one of the first shakeups in the progress of the hgp. that was the year that the first major outsider entered the race when britain's wellcome trust plunked down $95 million to join the hgp. this caused a mere ripple while the principal shake-ups occurred stateside. much of the debate and subsequently the direction all the way through the hgp process was shaped by the personalities involved. as noted the application of one of the innovative techniques, namely ests, to do an end run on patenting introduced one of those major players to the fray, craig venter. venter, the high school drop out who reached the age of majority in the killing fields of vietnam was to play a pivotal role in a more "civilized" but no less combative field of human endeavor. he came onto the world stage through his initial work on ests while at the national institute of neurological disorders and stroke (ninds) from 1984 to 1992. he noted in an interview with the scientist magazine in 1995, that there was a degree of ambiguity at ninds about his venturing into the field of genomics, while they liked the prestige of hosting one of the leaders and innovators in his newly emerging field, they were concerned about him moving outside the nind purview of the human brain and nervous system. 
ultimately, while he proclaimed to like the security and service infrastructure this institute afforded him, that same system became too restrictive for his interests and talent. he wanted the whole canvas of human-gene expression to be his universe, not just what was confined to the central nervous system. he was becoming more interested in taking a whole genome approach to understanding the overall structure of genomes and genome evolution, which was much broader than the mission of ninds. he noted, with some irony, in later years that the then current nih director harold varmus had wished in hindsight that nih had pushed to do a similar database in the public domain, clearly in venter's opinion varmus was in need of a refresher course in history! bernadine healy, nih director in 1994, was one of the few in a leadership role who saw the technical and fiscal promise of venter's work and, like all good administrators, it also presented an opportunity to resolve a thorny "personnel" issue. she appointed him head of the ad hoc committee to have an intramural genome program at nih to give the head of the hgp (that other larger than life personality jim watson) notice that he was not the sole arbitrator of the direction for the human genome project. however venter very soon established himself as an equally non-conformist character and with the tacit consent of his erstwhile benefactor. he initially assumed the mantle of a non-conformist through guilt by association rather than direct actions when it was revealed that nih was filing patent applications on thousands of these partial genes based on his ests catalyzing the first hgp fight at a congressional hearing. nih's move was widely criticized by the scientific community because, at the time, the function of genes associated with the partial sequences was unknown. critics charged that patent protection for the gene segments would forestall future research on them. 
the patent office eventually rejected the patents, but the applications sparked an international controversy over patenting genes whose functions were still unknown. interestingly enough, despite nih's reliance on the est/cdna technique, venter, who was now clearly venturing outside the ninds mandated rubric, could not obtain government funding to expand his research, prompting him to leave nih in 1992. he moved on to become president and director of the institute for genomic research (tigr), a nonprofit research center based in gaithersburg, md. at the same time william haseltine formed a sister company, human genome sciences (hgs), to commercialize tigr products. venter continued est work at tigr, but also began thinking about sequencing entire genomes. again, he came up with a quicker method: whole genome shotgun sequencing. he applied for an nih grant to use the method on haemophilus influenzae, but started the project before the funding decision was returned. when the genome was nearly complete, nih rejected his proposal, saying the method would not work. in a triumphal flurry in late may 1995, and with a metaphorical nose-thumbing at his recently rejected "unworkable" grant, venter announced that tigr and collaborators had fully sequenced the first free-living organism, haemophilus influenzae. in november 1994, controversy surrounding venter's research escalated. access restrictions associated with a cdna database developed by tigr and its rockville, md.-based biotech associate, human genome sciences (hgs) inc., including hgs's right to preview papers on resulting discoveries and to have first options to license products, prompted merck and co. inc. to fund a rival database project. in that year also britain "officially" entered the hgp race when the wellcome trust plunked down $95 million (as mentioned earlier). the following year hgs was involved in yet another patenting debacle forced by the rapid march of technology into uncharted patent law territory.
on june 5, 1995 hgs applied for a patent on a gene that produces a "receptor" protein that was later called ccr5. at that time hgs had no idea that ccr5 was an hiv receptor. in december 1995, u.s. researcher robert gallo, the co-discoverer of hiv, and colleagues found three chemicals that inhibit the aids virus, but they did not know how the chemicals worked. in february 1996, edward berger at the nih discovered that gallo's inhibitors work in late-stage aids by blocking a receptor on the surface of t-cells. in june of that year, in a period of just 10 days, five groups of scientists published papers saying ccr5 is the receptor for virtually all strains of hiv. in january 2000, schering-plough researchers told a san francisco aids conference that they had discovered new inhibitors. they knew that merck researchers had made similar discoveries. as a significant valentine in 2000, the u.s. patent and trademark office (uspto) granted hgs a patent on the gene that makes ccr5 and on techniques for producing ccr5 artificially. the decision sent hgs stock flying and dismayed researchers. it also caused the uspto to revise its definition of a "patentable" drug target. in the meantime haseltine's partner in rewriting patenting history, venter, turned his focus to the human genome. he left tigr and started the for-profit company celera, a division of pe biosystems, the company that at times, thanks to hood and hunkapiller, led the world in the production of sequencing machines. using these machines, and the world's largest civilian supercomputer, venter finished assembling the human genome in just three years. following the debacle with the then nih director bernadine healy over patenting the partial genes that resulted from est analysis, another major personality-driven event occurred in that same year.
watson strongly opposed the idea of patenting gene fragments fearing that it would discourage research, and commented that "the automated sequencing machines 'could be run by monkeys.' " (nature june 29, 2000) with this dismissal watson resigned his nih nchgr post in 1992 to devote his full-time effort to directing cold spring harbor laboratory. his replacement was of a rather more pragmatic, less flamboyant nature. while venter maybe was described as an idiosyncratic shogun of the shotgun, francis collins was once described as the king arthur of the holy grail that is the human genome project. collins became the director of the national human genome research institute in 1993. he was considered the right man for the job following his 1989 success (along with lap-chee tsui) in identifying the gene for the cystic fibrosis transmembrane (cftr) chloride channel receptor that, when mutated, can lead to the onset of cystic fibrosis. although now indelibly connected with the topic non-plus tout in biology, like many great innovators in this field before him, francis collins had little interest in biology as he grew up on a farm in the shenandoah valley of virginia. from his childhood he seemed destined to be at the center of drama, his father was professor of dramatic arts at mary baldwin college and the early stage management of career was performed on a stage he built on the farm. while the physical and mathematical sciences held appeal for him, being possessed of a highly logical mind, collins found the format in which biology was taught in the high school of his day mind-numbingly boring, filled with dissections and rote memorization. he found the contemplation of the infinite outcomes of dividing by zero (done deliberately rather than by accident as in einstein's case) far more appealing than contemplating the innards of a frog. 
that biology could be gloriously logical only became clear to collins when, in 1970, he entered yale with a degree in chemistry from the university of virginia and was first exposed to the nascent field of molecular biology. anecdotally it was the tome what is life?, penned by the theoretical physicist father of molecular biology, erwin schrödinger, while exiled in trinity college dublin in 1942, that was the catalyst for his conversion. like schrödinger he wanted to do something more obviously meaningful (for less than hardcore physicists at least!) than theoretical physics, so he went to medical school at unc-chapel hill after completing his chemistry doctorate in yale, and returned to the site of his road to damascus for post-doctoral study in the application of his newfound interest in human genetics. during this sojourn at yale, collins began working on developing novel tools to search the genome for genes that cause human disease. he continued this work, which he dubbed "positional cloning," after moving to the university of michigan as a professor in 1984. he placed himself on the genetic map when he succeeded in using this method to put the gene that causes cystic fibrosis on the physical map. while a less colorful, in-your-face character than venter, he has his own personality quirks; for example, he pastes a new sticker onto the back of his motorcycle helmet every time he finds a new disease gene. one imagines that particular piece of real estate is getting rather crowded. interestingly it was not these four hundred pound us gorillas who proposed the eventually prescient timeline for a working draft but two from the old power base. in meetings in the us in 1994, john sulston and bob waterston proposed to produce a 'draft' sequence of the human genome by 2000, a full five years ahead of schedule. while agreed by most to be feasible, it meant a rethinking of strategy and involved focusing resources on larger centers and emphasizing sequence acquisition.
just as important, it asserted the value of draft-quality sequence to biomedical research. discussion started with the british-based wellcome trust as possible sponsors (marshall e. 1995). by 1995 a rough draft of the human genome map was produced showing the locations of more than 30,000 genes. the map was produced using yeast artificial chromosomes, and some chromosomes, notably the littlest, 22, were mapped in finer detail. these maps marked an important step toward clone-based sequencing. the importance was illustrated in the devotion of an entire edition of the journal nature to the subject (nature 377: 175-379, 1995). the duel between the public and private faces of the hgp progressed apace over the next five years. following release of the mapping data, some level of international agreement was reached on sequence data release and databases. the parties agreed that primary genomic sequence should be in the public domain to encourage research and development and to maximize its benefit to society, that it should be rapidly released on a daily basis with assemblies of greater than 1 kb, and that the finished annotated sequence should be submitted immediately to the public databases. in 1996 an international consortium completed the sequence of the genome of the workhorse yeast saccharomyces cerevisiae. data had been released as the individual chromosomes were completed. the saccharomyces genome database (sgd) was created to curate this information. the project collects information and maintains a database of the molecular biology of s. cerevisiae. this database includes a variety of genomic and biological information and is maintained and updated by sgd curators. the sgd also maintains the s. cerevisiae gene name registry, a complete list of all gene names used in s. cerevisiae. in 1997 a new, more powerful diagnostic tool termed snps (single nucleotide polymorphisms) was developed.
snps are changes in single letters in our dna code that can act as markers in the dna landscape. some snps are associated closely with susceptibility to genetic disease, our response to drugs or our ability to remove toxins. the snp consortium, although designated a limited company, is a nonprofit foundation organized for the purpose of providing public genomic data. it is a collaborative effort between pharmaceutical companies and the wellcome trust with the idea of making available a widely accepted, high-quality, extensive, and publicly accessible snp map. its mission was to develop up to 300,000 snps distributed evenly throughout the human genome and to make the information related to these snps available to the public without intellectual property restrictions. the project started in april 1999 and was anticipated to continue until the end of 2001. in the end, many more snps, about 1.5 million in total, were discovered than was originally planned. in june 1998 the complete genome sequence of mycobacterium tuberculosis was published by teams from the uk, france, us and denmark. the abi prism 3700 sequencing machine, a capillary-based machine designed to run about eight sets of 96 sequence reactions per day, also reached the market that year. that same year the genome sequence of the first multicellular organism, c. elegans, was completed. c. elegans has a genome of about 100 mb and, as noted, is a primitive animal model organism used in a range of biological disciplines. by november 1999 the human genome draft sequence reached 1000 mb and the first complete human chromosome was sequenced - this first was reached on the east side of the atlantic by the hgp team led by the sanger centre, producing a finished sequence for chromosome 22, which is about 34 million base-pairs and includes at least 550 genes. according to anecdotal evidence when visiting his namesake centre, sanger asked: "what does this machine do then?" 
"dideoxy sequencing" came the reply, to which fred retorted: "haven't they come up with anything better yet?" as will be elaborated in the final chapter the real highlight of 2000 was production of a 'working draft' sequence of the human genome, which was announced simultaneously in the us and the uk. in a joint event, celera genomics announced completion of their 'first assembly' of the genome. in a remarkable special issue, nature included a 60-page article by the human genome project partners, studies of mapping and variation, as well as analysis of the sequence by experts in different areas of biology. science published the article by celera on their assembly of hgp and celera data as well as analyses of the use of the sequence. however to demonstrate the sensitivity of the market place to presidential utterances the joint appearances by bill clinton and tony blair touting this major milestone turned into a major cold shower when clinton's reassurance of access of the people to their genetic information caused a precipitous drop in celera's share value overnight. clinton's assurance that, "the effort to decipher the human genome will be the scientific breakthrough of the century -perhaps of all time. we have a profound responsibility to ensure that the life-saving benefits of any cutting-edge research are available to all human beings." (president bill clinton, wednesday, march 14, 2000) stands in sharp contrast to the statement from venter's colleague that " any company that wants to be in the business of using genes, proteins, or antibodies as drugs has a very high probability of running afoul of our patents. from a commercial point of view, they are severely constrained -and far more than they realize." (william a. haseltine, chairman and ceo, human genome sciences). the huge sell-off in stocks ended weeks of biotech buying in which those same stocks soared to unprecedented highs. 
by the next day, however, the genomic company spin doctors began to recover ground in a brilliant move which turned the clinton announcement into a public relations coup. all major genomics companies issued press releases applauding president clinton's announcement. the real news, they argued, was that "for the first time a president strongly affirmed the importance of gene based patents." and the same bill haseltine of human genome sciences positively gushed as he happily pointed out that he "could begin his next annual report with the [president's] monumental statement, and quote today as a monumental day." as distinguished harvard biologist richard lewontin notes: "no prominent molecular biologist of my acquaintance is without a financial stake in the biotechnology business. as a result, serious conflicts of interest have emerged in universities and in government service" (lewontin, 2000). away from the spin doctors, perhaps eric lander best summed up the herculean effort when he opined that for him "the human genome project has been the ultimate fulfilment: the chance to share common purpose with hundreds of wonderful colleagues towards a goal larger than ourselves. in the long run, the human genome project's greatest impact might not be the three billion nucleotides of the human chromosomes, but its model of scientific community." (ridley, 2000)
6. gene therapy
the year 1990 also marked the passing of another milestone that was intimately connected to one of the fundamental drivers of the hgp. the california hereditary disorders act came into force and with it one of the potential solutions for human hereditary disorders. w. french anderson in the usa reported the first successful application of gene therapy in humans. 
the first successful gene therapy for a human disease was achieved for severe combined immune deficiency (scid) by introducing the missing gene, adenosine deaminase (ada), into the peripheral lymphocytes of a 4-year-old girl and returning the modified lymphocytes to her. although the results are difficult to interpret because of the concurrent use of polyethylene glycol-conjugated ada, commonly referred to as pegylated ada (pgla), in all patients, strong evidence for in vivo efficacy was demonstrated. ada-modified t cells persisted in vivo for up to three years and were associated with increases in t-cell number and ada enzyme levels. t cells derived from transduced pgla were progressively replaced by marrow-derived t cells, confirming successful gene transfer into long-lived progenitor cells. ashanthi desilva, the girl who received the first credible gene therapy, continues to do well more than a decade later. cynthia cutshall, the second child to receive gene therapy for the same disorder as desilva, also continues to do well. within 10 years (by january 2000), more than 350 gene therapy protocols had been approved in the us, and worldwide researchers launched more than 400 clinical trials to test gene therapy against a wide array of illnesses. surprisingly, a disease not typically heading the charts of heritable disorders, cancer, has dominated the research. in 1994 cancer patients were treated with the tumor necrosis factor gene, which codes for a natural tumor fighting protein; this worked to a limited extent. even more surprisingly, after the initial flurry of success little has worked. gene therapy, the promising miracle of 1990, failed to deliver on its early promise over the decade. apart from those examples, there are many diseases whose molecular pathology is, or soon will be, well understood, but for which no satisfactory treatments have yet been developed. 
at the beginning of the nineties it appeared that gene therapy did offer new opportunities to treat these disorders both by restoring gene functions that have been lost through mutation and by introducing genes that can inhibit the replication of infectious agents, render cells resistant to cytotoxic drugs, or cause the elimination of aberrant cells. from this "genomic" viewpoint genes could be said to be viewed as medicines, and their development as therapeutics should embrace the issues facing the development of small-molecule and protein therapeutics such as bioavailability, specificity, toxicity, potency, and the ability to be manufactured at large scale in a cost-effective manner. of course for such a radical approach certain basal level criteria needed to be established for selecting disease candidates for human gene therapy. these include, such factors as the disease is an incurable, life-threatening disease; organ, tissue, and cell types affected by the disease have been identified; the normal counterpart of the defective gene has been isolated and cloned; either the normal gene can be introduced into a substantial subfraction of the cells from the affected tissue, or the introduction of the gene into the available target tissue, such as bone marrow, will somehow alter the disease process in the tissue affected by the disease; the gene can be expressed adequately (it will direct the production of enough normal protein to make a difference); and techniques are available to verify the safety of the procedure. an ideal gene therapeutic should, therefore, be stably formulated at room temperature and amenable to administration either as an injectable or aerosol or by oral delivery in liquid or capsule form. the therapeutic should also be suitable for repeat therapy, and when delivered, it should neither generate an immune response nor be destroyed by tissue-scavenging mechanisms. 
when delivered to the target cell, the therapeutic gene should then be transported to the nucleus, where it should be maintained as a stable plasmid or chromosomal integrant, and be expressed in a predictable, controlled fashion at the desired potency in a cell-specific or tissue-specific manner. in addition to the ada gene transfer in children with severe combined immunodeficiency syndrome, a gene-marking study of epstein-barr virus-specific cytotoxic t cells, and trials of gene-modified t cells expressing suicide or viral resistance genes in patients infected with hiv were conducted in the early nineties. additional strategies for t-cell gene therapy which were pursued later in the decade involve the engineering of novel t-cell receptors that impart antigen specificity for virally infected or malignant cells. issues which still are not resolved include nuclear transport, integration, regulated gene expression and immune surveillance. this knowledge, when finally understood and applied to the design of delivery vehicles of either viral or non-viral origin, will assist in the realization of gene therapeutics as safe and beneficial medicines that are suited to the routine management of human health. scientists are also working on using gene therapy to generate antibodies directly inside cells to block the production of harmful viruses such as hiv or even cancer-inducing proteins. there is a specific connection with francis collins, as his motivation for pursuing the hgp was his pursuit of defective genes beginning with the cystic fibrosis gene. this gene, called the cf transmembrane conductance regulator, codes for an ion channel protein that regulates salts in the lung tissue. the faulty gene prevents cells from excreting salt properly, causing a thick sticky mucus to build up and destroy lung tissue. scientists have spliced copies of the normal genes into disabled adenoviruses that target lung tissues and have used bronchoscopes to deliver them to the lungs. 
the procedure worked well in animal studies; however, clinical trials in humans were not an unmitigated success. because the cells lining the lungs are continuously being replaced, the effect is not permanent and the treatment must be repeated. studies are underway to develop gene therapy techniques to replace other faulty genes: for example, to replace the genes responsible for factor viii and factor ix production, whose malfunctioning causes hemophilia a and b respectively, and to alleviate the effects of the faulty gene in dopamine production that results in parkinson's disease. apart from technical challenges, such a radical therapy also engenders ethical debate. many persons who voice concerns about somatic-cell gene therapy use a "slippery slope" argument. it sounds good in theory, but where does one draw the line? there are many issues yet to be resolved in this field of thorny ethics: "good" and "bad" uses of gene modification, the difficulty of following patients in long-term clinical research, and such. many gene therapy candidates are children who are too young to understand the ramifications of this treatment. conflict of interest is another concern, pitting individuals' reproductive liberties and privacy interests against the interests of insurance companies or society. one issue that is unlikely to ever gain acceptance is germline therapy, the removal of deleterious genes from the population. issues of justice and resource allocation also have been raised: in a time of strain on our health care system, can we afford such expensive therapy? who should receive gene therapy? if it is made available only to those who can afford it, then a number of civil rights groups claim that the distribution of desirable biological traits among different socioeconomic and ethnic groups would become badly skewed, adding a new and disturbing layer of discriminatory behavior. indeed a major setback occurred before the end of the decade in 1999. 
jesse gelsinger was the first person to die from gene therapy, on september 17, 1999, and his death created another unprecedented situation when his family sued not only the research team involved in the experiment (u penn), the company genovo inc., but also the ethicist who offered moral advice on the controversial project. this inclusion of the ethicist as a defendant alongside the scientists and school was a surprising legal move that puts this specialty on notice, as will no doubt be the case with other evolving technologies such as stem cells and therapeutic cloning, that its members could be vulnerable to litigation over the philosophical guidance they provide to researchers. the penn group principal investigator james wilson approached ethicist arthur caplan about their plans to test the safety of a genetically engineered virus on babies with a deadly form of the liver disorder, ornithine transcarbamylase deficiency. the disorder allows poisonous levels of ammonia to build up in the blood system. caplan steered the researchers away from sick infants, arguing that desperate parents could not provide true informed consent. he said it would be better to experiment on adults with a less lethal form of the disease who were relatively healthy. gelsinger fell into that category. although he had suffered serious bouts of ammonia buildup, he was doing well on a special drug and diet regimen. the decision to use relatively healthy adults was controversial because risky, unproven experimental protocols generally use very ill people who have exhausted more traditional treatments, so have little to lose. in this case, the virus used to deliver the genes was known to cause liver damage, so some scientists were concerned it might trigger an ammonia crisis in the adults. 
wilson underestimated the risk of the experiment, omitted the disclosure about possible liver damage in earlier volunteers in the experiment and failed to mention the deaths of monkeys given a similar treatment during pre-clinical studies. a food and drug administration investigation after gelsinger's death found numerous regulatory violations by wilson's team, including the failure to stop the experiment and inform the fda after four successive volunteers suffered serious liver damage prior to the teen's treatment. in addition, the fda said gelsinger did not qualify for the experiment, because his blood ammonia levels were too high just before he underwent the infusion of genetic material. the fda suspended all human gene experiments by wilson and the university of penn subsequently restricting him solely to animal studies. a follow-up fda investigation subsequently alleged he improperly tested the experimental treatment on animals. financial conflicts of interest also surrounded james wilson, who stood to personally profit from the experiment through genovo his biotechnology company. the lawsuit was settled out of court for undisclosed terms in november 2000. the fda also suspended gene therapy trials at st. elizabeth's medical center in boston, a major teaching affiliate of tufts university school of medicine, which sought to use gene therapy to reverse heart disease, because scientists there failed to follow protocols and may have contributed to at least one patient death. in addition, the fda temporarily suspended two liver cancer studies sponsored by the schering-plough corporation because of technical similarities to the university of pennsylvania study. some research groups voluntarily suspended gene therapy studies, including two experiments sponsored by the cystic fibrosis foundation and studies at beth israel deaconess medical center in boston aimed at hemophilia. the scientists paused to make sure they learned from the mistakes. 
the nineties also saw the development of another "high-throughput" breakthrough, a derivative of the other high tech revolution, namely dna chips. in 1991 biochips were developed for commercial use under the guidance of affymetrix. dna chips or microarrays represent a "massively parallel" genomic technology. they facilitate high throughput analysis of thousands of genes simultaneously, and are thus potentially very powerful tools for gaining insight into the complexities of higher organisms including analysis of gene expression, detecting genetic variation, making new gene discoveries, fingerprinting strains and developing new diagnostic tools. these technologies permit scientists to conduct large scale surveys of gene expression in organisms, thus adding to our knowledge of how they develop over time or respond to various environmental stimuli. these techniques are especially useful in gaining an integrated view of how multiple genes are expressed in a coordinated manner. these dna chips have broad commercial applications and are now used in many areas of basic and clinical research including the detection of drug resistance mutations in infectious organisms, direct dna sequence comparison of large segments of the human genome, the monitoring of multiple human genes for disease associated mutations, the quantitative and parallel measurement of mrna expression for thousands of human genes, and the physical and genetic mapping of genomes. however the initial technologies, or more accurately the algorithms used to extract information, were far from robust and reproducible. the erstwhile serial entrepreneur, al zaffaroni (the rebel who in 1968 founded alza when syntex ignored his interest in developing new ways to deliver drugs) founded yet another company, affymetrix, under the stewardship of stephen fodor, which was subject to much abuse for providing final extracted data and not allowing access to raw data. 
as with other personalities of this high-throughput era, seattle-bred steve fodor was also somewhat of a polymath, having contributed to two major technologies, microarrays and combinatorial chemistry; the former has delivered on its promise while the latter, like gene therapy, is still in a somewhat extended gestation. and despite the limitations of being an industrial scientist he has had a rather prolific portfolio of publications. his seminal manuscripts describing this work have been published in all the journals of note, science, nature and pnas, and he was recognized in 1992 by the aaas with the newcomb-cleveland award for an outstanding paper published in science. fodor began his industrial career in yet another zaffaroni firm. in 1989 he was recruited to the affymax research institute in palo alto where he spearheaded the effort to develop high-density arrays of biological compounds. his initial interest was in the broad area of what came to be called combinatorial chemistry. of the techniques developed, one approach permitted high resolution chemical synthesis in a light-directed, spatially-defined format. in the days before positive selection vectors, a researcher might have screened thousands of clones by hand with an oligonucleotide probe just to find one elusive insert. fodor's (and his successors') dna array technology reverses that approach. instead of screening an array of unknowns with a defined probe - a cloned gene, pcr product, or synthetic oligonucleotide - each position or "probe cell" in the array is occupied by a defined dna fragment, and the array is probed with the unknown sample. fodor used his chemistry and biophysics background to develop very dense arrays of these biomolecules by combining photolithographic methods with traditional chemical techniques. the typical array may contain all possible combinations of all possible oligonucleotides (8-mers, for example) that occur as a "window" which is tracked along a dna sequence. 
it might contain longer oligonucleotides designed from all the open reading frames identified from a complete genome sequence. or it might contain cdnas -of known or unknown sequence -or pcr products. of course it is one thing to produce data it is quite another to extract it in a meaningful manner. fodor's group also developed techniques to read these arrays, employing fluorescent labeling methods and confocal laser scanning to measure each individual binding event on the surface of the chip with extraordinary sensitivity and precision. this general platform of microarray based analysis coupled to confocal laser scanning has become the standard in industry and academia for large-scale genomics studies. in 1993, fodor co-founded affymetrix where the chip technology has been used to synthesize many varieties of high density oligonucleotide arrays containing hundreds of thousands of dna probes. in 2001, steve fodor founded perlegen, inc., a new venture that applied the chip technology towards uncovering the basic patterns of human diversity. his company's stated goals are to analyze more than one million genetic variations in clinical trial participants to explain and predict the efficacy and adverse effect profiles of prescription drugs. in addition, perlegen also applies this expertise to discovering genetic variants associated with disease in order to pave the way for new therapeutics and diagnostics. fodor's former company diversified into plant applications by developing a chip of the archetypal model of plant systems arabidopsis and supplied pioneer hi bred with custom dna chips for monitoring maize gene expression. they (affymetrix) have established programs where academic scientists can use company facilities at a reduced price and set up 'user centers' at selected universities. a related but less complex technology called 'spotted' dna chips involves precisely spotting very small droplets of genomic or cdna clones or pcr samples on a microscope slide. 
the process uses a robotic device with a print head bearing fine "repeatograph" tips that work like fountain pens to draw up dna samples from a 96-well plate and spot tiny amounts on a slide. up to 10,000 individual clones can be spotted in a dense array within one square centimeter on a glass slide. after hybridization with a fluorescent target mrna, signals are detected by a custom scanner. this is the basis of the systems used by molecular dynamics and incyte (which acquired this technology when it took over synteni). in 1997, incyte was looking to gather more data for its library and perform experiments for corporate subscribers. the company considered buying affymetrix genechips but opted instead to purchase the smaller synteni, which had sprung out of pat brown's stanford array effort. synteni's contact printing technology resulted in dense - and cheaper - arrays. though incyte used the chips only internally, affymetrix sued, claiming synteni/incyte was infringing on its chip density patents. the suit argued that dense biochips - regardless of whether they use photolithography - cannot be made without a license from affymetrix! and in a litigious conga line endemic to this hi-tech era, incyte countersued and for good measure also filed against genetic database competitor gene logic for infringing incyte's patents on database building. meanwhile, hyseq sued affymetrix, claiming infringement of nucleotide hybridization patents obtained by its cso. affymetrix, in turn, filed a countersuit, claiming hyseq infringed the spotted array patents. hyseq then reached back and found an additional hybridization patent it claimed that affymetrix had infringed. and so on into the next millennium! in part to avoid all of this, another california company, nanogen, inc., took a different approach to single nucleotide polymorphism discrimination technology. 
in an article in the april 2000 edition of nature biotechnology, entitled "single nucleotide polymorphic discrimination by an electronic dot blot assay on semiconductor microchips," nanogen describes the use of microchips to identify variants of the mannose binding protein gene that differ from one another by only a single dna base. the mannose binding protein (mbp) is a key component of the innate immune system in children who have not yet developed immunity to a variety of pathogens. to date, four distinct variants (alleles) of this gene have been identified, all differing by only a single nucleotide of dna. mbp was selected for this study because of its potential clinical relevance and its genetic complexity. the samples were assembled at the nci laboratory in conjunction with the national institutes of health and transferred to nanogen for analysis. however, from a high throughput perspective there is a question mark over microarrays. mark benjamin, senior director of business development at rosetta inpharmatics (kirkland, wa), is skeptical about the long-term prospects for standard dna arrays in high-throughput screening, as the first steps require exposing cells and then isolating rna, which is something that is very hard to do in a high-throughput format. another drawback is that most of the useful targets are likely to be unknown (particularly in the agricultural sciences where genome sequencing is still in its infancy), and dna arrays that are currently available test only for previously sequenced genes. indeed, some argue that current dna arrays may not be sufficiently sensitive to detect the low expression levels of genes encoding targets of particular interest. and the added complication of the companies' reluctance to provide "raw data" means that derived data sets may be created with less than optimum algorithms, thereby irretrievably losing potentially valuable information from the starting material. 
reverse engineering is a possible approach but this is laborious and time consuming and being prohibited by many contracts may arouse the interest of the ever-vigilant corporate lawyers. over the course of the nineties, outgrowths of functional genomics have been termed proteomics and metabolomics, which are the global studies of gene expression at the protein and metabolite levels respectively. the study of the integration of information flow within an organism is emerging as the field of systems biology. in the area of proteomics, the methods for global analysis of protein profiles and cataloging protein-protein interactions on a genome-wide scale are technically more difficult but improving rapidly, especially for microbes. these approaches generate vast amounts of quantitative data. the amount of expression data becoming available in the public and private sectors is already increasing exponentially. gene and protein expression data rapidly dwarfed the dna sequence data and is considerably more difficult to manage and exploit. in microbes, the small sizes of the genomes and the ease of handling microbial cultures, will enable high throughput, targeted deletion of every gene in a genome, individually and in combinations. this is already available on a moderate throughput scale in model microbes such as e. coli and yeast. combining targeted gene deletions and modifications with genome-wide assay of mrna and protein levels will enable intricate inter-dependencies among genes to be unraveled. simultaneous measurement of many metabolites, particularly in microbes, is beginning to allow the comprehensive modeling and regulation of fluxes through interdependent pathways. metabolomics can be defined as the quantitative measurement of all low molecular weight metabolites in an organism's cells at a specified time under specific environmental conditions. 
combining information from metabolomics, proteomics and genomics will help us to obtain an integrated understanding of cell biology. the next hierarchical level of phenotype considers how the proteome within and among cells cooperates to produce the biochemistry and physiology of individual cells and organisms. several authors have tentatively offered "physiomics" as a descriptor for this approach. the final hierarchical levels of phenotype include anatomy and function for cells and whole organisms. the term "phenomics" has been applied to this level of study and unquestionably the more well known omics namely economics, has application across all those fields. and, coming slightly out of left field this time, the spectre of eugenics needless to say was raised in the omics era. in the year 1992 american and british scientists unveiled a technique which has come to be known as pre-implantation genetic diagnosis (pid) for testing embryos in vitro for genetic abnormalities such as cystic fibrosis, hemophilia, and down's syndrome (wald, 1992) . this might be seen by most as a step forward, but it led ethicist david s. king (1999) to decry pid as a technology that could exacerbate the eugenic features of prenatal testing and make possible an expanded form of free-market eugenics. he further argues that due to social pressures and eugenic attitudes held by clinical geneticists in most countries, it results in eugenic outcomes even though no state coercion is involved and that, as abortion is not involved, and multiple embryos are available, pid is radically more effective as a tool of genetic selection. the first regulatory approval of a recombinant dna technology in the u.s. food supply was not a plant but an industrial enzyme that has become the hallmark of food biotechnology success. enzymes were important agents in food production long before modern biotechnology was developed. 
they were used, for instance, in the clotting of milk to prepare cheese, the production of bread and the production of alcoholic beverages. nowadays, enzymes are indispensable to modern food processing technology and have a great variety of functions. they are used in almost all areas of food production including grain processing, milk products, beer, juices, wine, sugar and meat. chymosin, known also as rennin, is a proteolytic enzyme whose role in digestion is to curdle or coagulate milk in the stomach, efficiently converting liquid milk to a semisolid like cottage cheese, allowing it to be retained for longer periods in a neonate's stomach. the dairy industry takes advantage of this property to conduct the first step in cheese production. chy-max™, an artificially produced form of the chymosin enzyme for cheese-making, was approved in 1990. in some instances, recombinant enzymes such as chymosin replace less acceptable "older" technology. unlike crops, industrial enzymes have had relatively easy passage to acceptance for a number of reasons. as noted, they are part of the processing system and theoretically do not appear in the final product. today about 90% of the hard cheese in the us and uk is made using chymosin from genetically modified microbes. it is easier to purify, more active (95% as compared to 5%) and less expensive to produce (microbes are more prolific, more productive and cheaper to keep than calves). like all enzymes it is required only in very small quantities and because it is a relatively unstable protein it breaks down as the cheese matures. indeed, if the enzyme remained active for too long it would adversely affect the development of the cheese, as it would degrade the milk proteins to too great a degree. such enzymes have gained the support of vegetarian organizations and of some religious authorities. 
for plants the nineties was the era of the first widespread commercialization of what came to be known, in often deprecating and literally inaccurate terms, as gmos (genetically modified organisms). when the nineties dawned, dicotyledonous plants were relatively easily transformed with agrobacterium tumefaciens, but many economically important plants, including the cereals, remained inaccessible to genetic manipulation because of a lack of effective transformation techniques. in 1990 this changed with a technology that overcame this limitation. michael fromm, a molecular biologist at the plant gene expression center, reported the stable transformation of corn using a high-speed gene gun. the method, known as biolistics, uses a "particle gun" to shoot metal particles coated with dna into cells. initially a gunpowder charge, subsequently replaced by helium gas, was used to accelerate the particles in the gun. there is minimal disruption of tissue and the success rate has been extremely high for applications in several plant species. the technology rights are now owned by dupont. in 1990 some of the first field trials of the crops that would dominate the second half of the nineties began, including bt corn (with the bacillus thuringiensis cry protein discussed in chapter three). in 1992 the fda declared that genetically engineered foods are "not inherently dangerous" and do not require special regulation. since 1992, researchers have pinpointed and cloned several of the genes that make selected plants resistant to certain bacterial and fungal infections; some of these genes have been successfully inserted into crop plants that lack them. many more infection-resistant crops are expected in the near future, as scientists find more plant genes in nature that make plants resistant to pests. plant genes, however, are just a portion of the arsenal; microorganisms other than bt also are being mined for genes that could help plants fend off invaders that cause crop damage.
the major milestone of the decade in crop biotechnology was the approval of the first bioengineered crop plant in 1994. it represented a double first: not just the first approved food crop but also the first commercial validation of a technology which was to be surpassed later in the decade. that technology, antisense technology, works because nucleic acids have a natural affinity for each other. when a gene coding for the target in the genome is introduced in the opposite orientation, the reverse rna strand anneals and effectively blocks expression of the enzyme. this technology was patented by calgene for plant applications and was the technology behind the famous flavr savr tomatoes. the first success for antisense in medicine was in 1998, when the u.s. food and drug administration gave the go-ahead to the cytomegalovirus (cmv) inhibitor fomivirsen, a phosphorothioate antiviral for the aids-related condition cmv retinitis, making it the first drug belonging to isis, and the first antisense drug ever, to be approved. another technology, although this was not apparent at the time, was behind the second approval and also behind the first, and to date only, successful commercial biotech application in a tree fruit. the former was a virus-resistant squash, the latter the papaya ringspot virus-resistant papaya. both owed their existence as much to historic experience as to modern technology. genetically engineered virus-resistant strains of squash and cantaloupe, for example, would never have made it to farmers' fields if plant breeders in the 1930s had not noticed that plants infected with a mild strain of a virus do not succumb to more destructive strains of the same virus. that finding led plant pathologist roger beachy, then at washington university in saint louis, to wonder exactly how such "cross-protection" worked: did part of the virus prompt it? in collaboration with researchers at monsanto, beachy used an a.
tumefaciens vector to insert into tomato plants a gene that produces one of the proteins that makes up the protein coat of the tobacco mosaic virus. he then inoculated these plants with the virus and was pleased to discover, as reported in 1986, that the vast majority of plants did not succumb to the virus. eight years later, in 1994, virus-resistant squash seeds created with beachy's method reached the market, to be followed soon by bioengineered virus-resistant seeds for cantaloupes, potatoes, and papayas. (breeders had already created virus-resistant tomato seeds by using traditional techniques.) the mechanism of protection still remained a mystery when the first approvals were given in 1994 and 1996. gene silencing was perceived initially as an unpredictable and inconvenient side effect of introducing transgenes into plants. it now seems that it is the consequence of accidentally triggering the plant's adaptive defense mechanism against viruses and transposable elements. this recently discovered mechanism, although mechanistically different, has a number of parallels with the immune system of mammals. how this system worked was not elucidated until later in the decade, by a researcher who was seeking a very different holy grail: the black rose! rick jorgensen, at that time at dna plant technologies in oakland, ca, and subsequently of the university of california, davis, attempted to overexpress the chalcone synthase gene by introducing a modified copy under a strong promoter. surprisingly, he obtained white flowers, and many strange variegated purple and white variations in between. this was the first demonstration of what has come to be known as post-transcriptional gene silencing (ptgs). while initially it was considered a strange phenomenon limited to petunias and a few other plant species, it is now one of the hottest topics in molecular biology.
rna interference (rnai) in animals and basal eukaryotes, quelling in fungi, and ptgs in plants are examples of a broad family of phenomena collectively called rna silencing (hannon 2002; plasterk 2002). in addition to its occurrence in these species, it has roles in viral defense (as demonstrated by beachy) and transposon silencing mechanisms, among other things. perhaps most exciting, however, is the emerging use of ptgs and, in particular, rnai (ptgs initiated by the introduction of double-stranded rna, dsrna) as a tool to knock out expression of specific genes in a variety of organisms. nineteen ninety-one also heralded yet another first. the february 1, 1991 issue of science reported the patenting of "molecular scissors": the nobel-prize winning discovery of enzymatic rna, or "ribozymes," by thomas cech of the university of colorado. it was noted that the u.s. patent and trademark office had awarded an "unusually broad" patent for ribozymes. the patent is u.s. patent no. 4,987,071, claim 1 of which reads as follows: "an enzymatic rna molecule not naturally occurring in nature having an endonuclease activity independent of any protein, said endonuclease activity being specific for a nucleotide sequence defining a cleavage site comprising single-stranded rna in a separate rna molecule, and causing cleavage at said cleavage site by a transesterification reaction." although enzymes made of protein are the dominant form of biocatalyst in modern cells, there are at least eight natural rna enzymes, or ribozymes, that catalyze fundamental biological processes. one of these was yet another discovery by plant virologists: the hairpin ribozyme, discovered by george bruening at uc davis. the self-cleavage structure was originally called a paperclip by the bruening laboratory, which discovered the reactions. as mentioned in chapter 3, it is believed that these ribozymes might be the remnants of an ancient form of life that was guided entirely by rna.
since a ribozyme is a catalytic rna molecule capable of cleaving itself and other target rnas, it can be useful as a control system for turning off genes or targeting viruses. the possibility of designing ribozymes to cleave any specific target rna has rendered them valuable tools in both basic research and therapeutic applications. in the therapeutics area, they have been exploited to target viral rnas in infectious diseases, dominant oncogenes in cancers and specific somatic mutations in genetic disorders. most notably, several ribozyme gene therapy protocols for hiv patients are already in phase 1 trials. more recently, ribozymes have been used for transgenic animal research, gene target validation and pathway elucidation. however, targeting ribozymes to the cellular compartment containing their target rnas has proved a challenge. at the other bookend of the decade, in 2000, samarsky et al. reported that a family of small rnas in the nucleolus (snornas) can readily transport ribozymes into this subcellular organelle. in addition to the already extensive panoply of rna entities, yet another has potential for mischief. viroids are small, single-stranded, circular rnas containing 246-463 nucleotides arranged in a rod-like secondary structure and are the smallest pathogenic agents yet described. the smallest viroid-like rna characterized to date, at 220 nucleotides, is a satellite rna associated with rice yellow mottle sobemovirus (rymv). in comparison, the genome of the smallest known viruses capable of causing an infection by themselves, the single-stranded circular dna of circoviruses, is around 2 kilobases in size. the first viroid to be identified was the potato spindle tuber viroid (pstvd). some 33 species have been identified to date. unlike the many satellite or defective interfering rnas associated with plant viruses, viroids replicate autonomously on inoculation of a susceptible host.
the absence of a protein capsid and of detectable messenger rna activity implies that the information necessary for replication and pathogenesis resides within the unusual structure of the viroid genome. the replication mechanism actually involves interaction with rna polymerase ii, an enzyme normally associated with synthesis of messenger rna, and "rolling circle" synthesis of new rna. some viroids have ribozyme activity which allows self-cleavage and ligation of unit-size genomes from larger replication intermediates. it has been proposed that viroids are "escaped introns". viroids are usually transmitted by seed or pollen. infected plants can show distorted growth. from its earliest years, biotechnology attracted interest outside scientific circles. initially the main focus of public interest was on the safety of recombinant dna technology, and on the possible risks of creating uncontrollable and harmful novel organisms (berg, 1975). the debate on the deliberate release of genetically modified organisms, and on consumer products containing or comprising them, followed some years later (nas, 1987). it is interesting to note that within the broad ethical tableau of potential issues within the science and products of biotechnology, the seemingly innocuous field of plant modification has been one of the major players of the 1990s. the success of agricultural biotechnology is heavily dependent on its acceptance by the public, and the regulatory framework in which the industry operates is also influenced by public opinion. as the focus for molecular biology research shifted from the basic pursuit of knowledge to the pursuit of lucrative applications, once again, as in the previous two decades, the specter of risk arose as the potential of new products and applications had to be evaluated outside the confines of a laboratory.
however, the specter now became far more global as the implications of commercial applications brought not just worker safety into the loop but also the environment, agricultural and industrial products and the safety and well-being of all living things. beyond "deliberate" release, the rac guidelines were not designed to address these issues, so the matter moved into the realm of the federal agencies that had regulatory authority which could be interpreted to oversee biotechnology issues. this adaptation of oversight is very much a dynamic process as the various agencies wrestle with the task of applying existing regulations and developing new ones for oversight of this technology in transition. as the decade progressed, focus shifted from basic biotic stress resistance to more complex modifications. the next generation of plants will focus on value-added traits in which valuable genes and metabolites will be identified and isolated, with some of the latter compounds being produced in mass quantities for niche markets. two of the more promising markets are nutraceuticals, or so-called "functional foods", and plants developed as bioreactors for the production of valuable proteins and compounds, a field known as plant molecular farming. developing plants with improved quality traits involves overcoming a variety of technical challenges inherent to metabolic engineering programs. both traditional plant breeding and biotechnology techniques are needed to produce plants carrying the desired quality traits. continuing improvements in molecular and genomic technologies are contributing to the acceleration of product development in this space. by the end of the decade, in 1999, applying nutritional genomics, della penna (1999) isolated a gene which converts the lower-activity precursors to the highest-activity vitamin e compound, alpha-tocopherol.
with this technology, the vitamin e content of arabidopsis seed oil has been increased nearly 10-fold and progress has been made to move the technology to crops such as soybean, maize, and canola. this has also been done for folates in rice. omega-3 fatty acids play a significant role in human health; eicosapentaenoic acid (epa) and docosahexaenoic acid (dha), which are present in the retina of the eye and cerebral cortex of the brain, respectively, are some of the best documented from a clinical perspective. it is believed that epa and dha play an important role in the regulation of inflammatory immune reactions and blood pressure, treatment of conditions such as cardiovascular disease and cystic fibrosis, brain development in utero, and, in early postnatal life, the development of cognitive function. they are mainly found in fish oil and the supply is limited. by the end of the decade ursin (2000) had succeeded in engineering canola to produce these fatty acids. from a global perspective, another value-added development had far greater impact both technologically and socio-economically. a team led by ingo potrykus (1999) engineered rice to produce pro-vitamin a, which is an essential micronutrient. widespread dietary deficiency of this vitamin in rice-eating asian countries, which predisposes children to diseases such as blindness and measles, has tragic consequences. improved vitamin a nutrition would alleviate serious health problems and, according to unicef, could also prevent up to two million infant deaths due to vitamin a deficiency. adoption of the next stage of gm crops may proceed more slowly, as the market confronts issues of how to determine price, share value, and adjust marketing and handling to accommodate specialized end-use characteristics. furthermore, competition from existing products will not evaporate.
challenges that have accompanied gm crops with improved agronomic traits, such as the stalled regulatory processes in europe, will also affect adoption of nutritionally improved gm products. beyond all of this, credible scientific research is still needed to confirm the benefits of any particular food or component. for functional foods to deliver their potential public health benefits, consumers must have a clear understanding of, and a strong confidence level in, the scientific criteria that are used to document health effects and claims. because these decisions will require an understanding of plant biochemistry, mammalian physiology, and food chemistry, strong interdisciplinary collaborations will be needed among plant scientists, nutritionists, and food scientists to ensure a safe and healthful food supply. in addition to being a source of nutrition, plants have been a valuable wellspring of therapeutics for centuries. during the nineties, however, intensive research focused on expanding this source through rdna biotechnology and essentially using plants and animals as living factories for the commercial production of vaccines, therapeutics and other valuable products such as industrial enzymes and biosynthetic feedstocks. possibilities in the medical field include a wide variety of compounds, ranging from edible vaccine antigens against hepatitis b and norwalk viruses (arntzen, 1997) and pseudomonas aeruginosa and staphylococcus aureus to vaccines against cancer and diabetes, enzymes, hormones, cytokines, interleukins, plasma proteins, and human alpha-1-antitrypsin. thus, plant cells are capable of expressing a large variety of recombinant proteins and protein complexes. therapeutics produced in this way are termed plant-made pharmaceuticals (pmps), and non-therapeutics are termed plant-made industrial products (pmips) (newell-mcgloughlin, 2006).
the first reported results of successful human clinical trials with their transgenic plant-derived pharmaceuticals were published in 1998. they were an edible vaccine against e. coli-induced diarrhea and a secretory monoclonal antibody directed against streptococcus mutans, for preventative immunotherapy to reduce incidence of dental caries. haq et al. (1995) reported the expression in potato plants of a vaccine against e. coli enterotoxin (etec) that provided an immune response against the toxin in mice. human clinical trials suggest that oral vaccination against either of the closely related enterotoxins of vibrio cholerae and e. coli induces production of antibodies that can neutralize the respective toxins by preventing them from binding to gut cells. similar results were found for norwalk virus oral vaccines in potatoes. for developing countries, the intention is to deliver them in bananas or tomatoes (newell-mcgloughlin, 2006) . plants are also faster, cheaper, more convenient and more efficient than the principal eukaryotic production system, namely chinese hamster ovary (cho) cells for the production of pharmaceuticals. hundreds of acres of protein-containing seeds could inexpensively double the production of a cho bioreactor factory. in addition, proteins can be expressed at the highest levels in the harvestable seed and plant-made proteins and enzymes formulated in seeds have been found to be extremely stable, reducing storage and shipping costs. pharming may also enable research on drugs that cannot currently be produced. for example, croptech in blacksburg, va., is investigating a protein that seems to be a very effective anticancer agent. the problem is that this protein is difficult to produce in mammalian cell culture systems as it inhibits cell growth. this should not be a problem in plants. furthermore, production size is flexible and easily adjustable to the needs of changing markets. 
making pharmaceuticals from plants is also a sustainable process, because the plants and crops used as raw materials are renewable. the system also has the potential to address problems associated with provision of vaccines to people in developing countries. products from these alternative sources do not require a so-called "cold chain" for refrigerated transport and storage. those being developed for oral delivery obviate the need for needles and aseptic conditions, which often are a problem in those areas. apart from those specific applications where the plant system is optimum, there are many other advantages to using plant production. many new pharmaceuticals based on recombinant proteins will receive regulatory approval from the united states food and drug administration (fda) in the next few years. as these therapeutics make their way through clinical trials and evaluation, the pharmaceutical industry faces a production capacity challenge. pharmaceutical discovery companies are exploring plant-based production to overcome capacity limitations, enable production of complex therapeutic proteins, and fully realize the commercial potential of their biopharmaceuticals (newell-mcgloughlin, 2006). nineteen ninety also marked a major milestone in the animal biotech world when herman made his appearance on the world's stage. since palmiter's mouse, transgenic technology has been applied to several species, including agricultural species such as sheep, cattle, goats, pigs, rabbits, poultry, and fish. herman was the first transgenic bovine, created at the early embryo stage by genpharm international, inc., in a laboratory in the netherlands. scientists microinjected recently fertilized eggs with the gene coding for human lactoferrin. the scientists then cultured the cells in vitro to the embryo stage and transferred them to recipient cattle. lactoferrin, an iron-containing anti-bacterial protein, is essential for infant growth.
since cow's milk doesn't contain lactoferrin, infants must be fed from other sources that are rich in iron, either formula or mother's milk (newell-mcgloughlin, 2001). as herman was male, he could not himself provide the source; that would require the production of daughters, which was not necessarily a straightforward process. the dutch parliament's permission was required. in 1992 it finally approved a measure that permitted the world's first genetically engineered bull to reproduce. the leiden-based gene pharming proceeded to artificially inseminate 60 cows with herman's sperm, with a promise that the protein, lactoferrin, would be the first in a new generation of inexpensive, high-tech drugs derived from cows' milk to treat complex diseases like aids and cancer. herman became the father of at least eight female calves in 1994, and each one inherited the gene for lactoferrin production. while their birth was initially greeted as a scientific advancement that could have far-reaching effects for children in developing nations, the levels of expression were too low to be commercially viable. by 2002, herman, who likes to listen to rap music to relax, had sired 55 calves and outlived them all. his offspring were all killed and destroyed after the end of the experiment, in line with dutch health legislation. herman was also slated for the abattoir, but the dutch public, proud of making history with herman, rose up in protest, especially after a television program screened footage showing the amiable bull licking a kitten. herman won a bill of clemency from parliament. however, instead of retirement on a comfortable bed of straw, listening to rap music, herman was pressed into service again. he now stars at a permanent biotech exhibit in naturalis, a natural history museum in the dutch city of leiden. after his death, he will be stuffed and remain in the museum in perpetuity (a fate similar to what awaited an even more famous mammalian first born later in the decade).
the applications for transgenic animal research fall broadly into two distinct areas, namely medical and agricultural applications. the recent focus on developing animals as bioreactors to produce valuable proteins in their milk can be catalogued under both areas. underlying each of these, of course, is a more fundamental application, that is, the use of those techniques as tools to ascertain the molecular and physiological bases of gene expression and animal development. this understanding can then lead to the creation of techniques to modify developmental pathways. in 1992 a european decision with rather more far-reaching implications than herman's sex life was made. the first european patent on a transgenic animal was issued for a transgenic mouse sensitive to carcinogens, harvard's "oncomouse". the oncomouse patent application was refused in europe in 1989, due primarily to an established ban on animal patenting. the application was revised to make narrower claims, and the patent was granted in 1992. this has since been repeatedly challenged, primarily by groups objecting to the judgement that benefits to humans outweigh the suffering of the animal. currently, the patent applicant is awaiting protestors' responses to a series of possible modifications to the application. predictions are that agreement is unlikely to be forthcoming and that the legal wrangling will continue into the future. bringing animals into the field of controversy starting to swirl around gmos, and preceding the latter's commercialization, was the approval by the fda of bovine somatotropin (bst) for increased milk production in dairy cows. the fda's center for veterinary medicine (cvm) regulates the manufacture and distribution of food additives and drugs that will be given to animals. biotechnology products are a growing proportion of the animal health products and feed components regulated by the cvm.
the center requires that food products from treated animals be shown to be safe for human consumption. applicants must show that the drug is effective and safe for the animal and that its manufacture will not affect the environment. they must also conduct geographically dispersed clinical trials under an investigational new animal drug application with the fda, through which the agency controls the use of the unapproved compound in food animals. unlike within the eu, possible economic and social issues cannot be taken into consideration by the fda in the premarket drug approval process. under these considerations the safety and efficacy of rbst was determined. it was also determined that special labeling for milk derived from cows that had been treated with rbst is not required under fda food labeling laws, because the use of rbst does not affect the quality or the composition of the milk. work with fish proceeded apace throughout the decade. gene transfer techniques have been applied to a large number of aquatic organisms, both vertebrates and invertebrates. gene transfer experiments have targeted a wide variety of applications, including the study of gene structure and function, aquaculture production, and use in fisheries management programs. because fish have high fecundity and large eggs, and do not require reimplantation of embryos, transgenic fish prove attractive model systems in which to study gene expression. transgenic zebrafish have found utility in studies of embryogenesis, with expression of transgenes marking cell lineages or providing the basis for study of promoter or structural gene function. although not as widely used as zebrafish, transgenic medaka and goldfish have been used for studies of promoter function. this body of research indicates that transgenic fish provide useful models of gene expression, reliably modeling that in "higher" vertebrates.
perhaps the largest number of gene transfer experiments address the goal of genetic improvement for aquaculture production purposes. the principal area of research has focused on growth performance, and initial transgenic growth hormone (gh) fish models have demonstrated accelerated and beneficial phenotypes. dna microinjection methods have propelled the many studies reported and have been most effective due to the relative ease of working with fish embryos. bob devlin's group in vancouver has demonstrated extraordinary growth rates in coho salmon which were transformed with a growth hormone gene from sockeye salmon. the transgenics achieve up to eleven times the size of their littermates within six months, reaching maturity in about half the time. interestingly, this dramatic effect is only observed in feeding pens, where the transgenics' ferocious appetites demand constant feeding. if the fish are left to their own devices and must forage for themselves, they appear to be out-competed by their smarter siblings. however, most studies, such as those involving transgenic atlantic salmon and channel catfish, report growth rate enhancement on the order of 30-60%. in addition to the species mentioned, gh genes also have been transferred into striped bass, tilapia, rainbow trout, gilthead sea bream, common carp, bluntnose bream, loach, and other fishes. shellfish also are subject to gene transfer toward the goal of intensifying aquaculture production. growth of abalone expressing an introduced gh gene is being evaluated; accelerated growth would prove a boon for culture of the slow-growing mollusk. a marker gene was introduced successfully into giant prawn, demonstrating feasibility of gene transfer in crustaceans, and opening the possibility of work involving genes affecting economically important traits. in the ornamental fish sector of aquaculture, ongoing work addresses the development of fish with unique coloring or patterning.
a number of companies have been founded to pursue commercialization of transgenics for aquaculture. as most aquaculture species mature at 2-3 years of age, most transgenic lines are still in development and have yet to be tested for performance under culture conditions. extending earlier research that identified methyl farnesoate (mf) as a juvenile hormone in crustaceans and determined its role in reproduction, researchers at the university of connecticut have developed technology to synchronize shrimp egg production and to increase the number and quality of eggs produced. females injected with mf are stimulated to produce eggs ready for fertilization. the procedure produces 180 percent more eggs than the traditional crude method of removing the eyestalk gland. this will increase aquaculture efficiency. a number of experiments utilize gene transfer to develop genetic lines of potential utility in fisheries management. transfer of gh genes into northern pike, walleye, and largemouth bass is aimed at improving the growth rate of sport fishes. gene transfer has been proposed as an option for reducing losses of rainbow trout to whirling disease, although suitable candidate genes have yet to be identified. richard winn of the university of georgia is developing transgenic killifish and medaka, which carry the bacteriophage phi x 174 as a target for mutation detection, as biomonitors for environmental mutagens. development of transgenic lines for fisheries management applications generally is at an early stage, often at the founder or f1 generation. broad application of transgenic aquatic organisms in aquaculture and fisheries management will depend on showing that particular gmos can be used in the environment both effectively and safely. although our base of knowledge for assessing ecological and genetic safety of aquatic gmos currently is limited, some early studies supported by the usda biotechnology risk assessment program have yielded results.
data from outdoor pond-based studies on transgenic catfish reported by rex dunham of auburn university show that transgenic and non-transgenic individuals interbreed freely, that survival and growth of transgenics in unfed ponds was equal to or less than that of non-transgenics, and that predator avoidance is not affected by expression of the transgene. however, unquestionably the seminal event for animal biotech in the nineties was ian wilmut's landmark work using nuclear transfer technology to generate the lambs morag and megan, reported in 1996 (from embryonic cell nuclei), and the truly ground-breaking work of creating dolly from an adult somatic cell nucleus, reported in february 1997 (wilmut, 1997). wilmut and his colleagues at the roslin institute demonstrated for the first time, with the birth of dolly the sheep, that the nucleus of an adult somatic cell can be transferred to an enucleated egg to create cloned offspring. it had been assumed for some time that only embryonic cells could be used as the cellular source for nuclear transfer. this assumption was shattered with the birth of dolly. this example of cloning an animal using the nucleus of an adult cell was significant because it demonstrated the ability of egg cell cytoplasm to "reprogram" an adult nucleus. when cells differentiate, that is, develop from primitive embryonic cells to functionally defined adult cells, they lose the ability to express most genes and can only express those genes necessary for the cell's differentiated function. for example, skin cells only express genes necessary for skin function, and brain cells only express genes necessary for brain function. the procedure that produced dolly demonstrated that egg cytoplasm is capable of reprogramming an adult differentiated cell (which is only expressing genes related to the function of that cell type).
this reprogramming enables the differentiated cell nucleus to once again express all the genes required for the full embryonic development of the adult animal. since dolly was cloned, similar techniques have been used to clone a veritable zoo of vertebrates including mice, cattle, rabbits, mules, horses, fish, cats and dogs from donor cells obtained from adult animals. these spectacular examples of cloning normal animals from fully differentiated adult cells demonstrate the universality of nuclear reprogramming, although the next decade called some of these assumptions into question. this technology supports the production of genetically identical and genetically modified animals. thus, the successful "cloning" of dolly has captured the imagination of researchers around the world. this technological breakthrough should play a significant role in the development of new procedures for genetic engineering in a number of mammalian species. it should be noted that nuclear cloning, with nuclei obtained from either mammalian stem cells or differentiated "adult" cells, is an especially important development for transgenic animal research. as the decade reached its end, the clones began arriving rapidly, with specific advances made by a japanese group who used cumulus cells rather than fibroblasts to clone calves. they found that the percentage of cultured, reconstructed eggs that developed into blastocysts was 49% for cumulus cells and 23% for oviductal cells. these rates are higher than the 12% previously reported for transfer of nuclei from bovine fetal fibroblasts. following on the heels of dolly, polly and molly became the first genetically engineered transgenic sheep produced through nuclear transfer technology. polly and molly were engineered to produce human factor ix (for hemophiliacs) by transfer of nuclei from transfected fetal fibroblasts.
until then germline competent transgenics had only been produced in mammalian species, other than mice, using dna microinjection. researchers at the university of massachusetts and advanced cell technology (worcester, ma) teamed up to produce genetically identical calves utilizing a strategy similar to that used to produce transgenic sheep. in contrast to the sheep cloning experiment, the bovine experiment involved the transfer of nuclei from an actively dividing population of cells. previous results from the sheep experiments suggested that induction of quiescence by serum starvation was required to reprogram the donor nuclei for successful nuclear transfer. the current bovine experiments indicate that this step may not be necessary. typically about 500 embryos needed to be microinjected to obtain one transgenic cow, whereas nuclear transfer produced three transgenic calves from 276 reconstructed embryos. this efficiency is comparable to the previous sheep research where six transgenic lambs were produced from 425 reconstructed embryos. the ability to select for genetically modified cells in culture prior to nuclear transfer opens up the possibility of applying the powerful gene targeting techniques that have been developed for mice. one of the limitations of using primary cells, however, is their limited lifespan in culture. primary cell cultures such as the fetal fibroblasts can only undergo about 30 population doublings before they senesce. this limited lifespan would preclude the ability to perform multiple rounds of selection. to overcome this problem of cell senescence, these researchers showed that fibroblast lifespan could be prolonged by nuclear transfer. a fetus, which was developed by nuclear transfer from genetically modified cells, could in turn be used to establish a second generation of fetal fibroblasts. 
these fetal cells would then be capable of undergoing another 30 population doublings, which would provide sufficient time for selection of a second genetic modification. as noted, there is still some uncertainty over whether quiescent cells are required for successful nuclear transfer. induction into quiescence was originally thought to be necessary for successful nuclear reprogramming of the donor nucleus. however, cloned calves have been previously produced using non-quiescent fetal cells. furthermore, transfer of nuclei from sertoli and neuronal cells, which do not normally divide in adults, did not produce a liveborn mouse, whereas nuclei transferred from actively dividing cumulus cells did produce cloned mice. the fetuses used for establishing fetal cell lines in a tufts goat study were generated by mating nontransgenic females to a transgenic male containing a human antithrombin (at) iii transgene. this at transgene directs high-level expression of human at into the milk of lactating transgenic females. as expected, all three offspring derived from female fetal cells were females. one of these cloned goats was hormonally induced to lactate. this goat secreted 3.7-5.8 grams per liter of at in her milk. this level of at expression was comparable to that detected in the milk of transgenic goats from the same line obtained by natural breeding. the successful secretion of at in milk was a key result because it showed that a cloned animal could still synthesize and secrete a foreign protein at the expected level. it will be interesting to see if all three cloned goats secrete human at at an identical level. if so, then the goal of creating a herd of identical transgenic animals, which secrete identical levels of an important pharmaceutical, would become a reality. no longer would variable production levels exist in subsequent generations due to genetically similar but not identical animals.
this homogeneity would greatly aid in the production and processing of a uniform product. as nuclear transfer technology continues to be refined and applied to other species, it may eventually replace microinjection as the method of choice for generating transgenic livestock. nuclear transfer has a number of advantages: 1) nuclear transfer is more efficient than microinjection at producing a transgenic animal, 2) the fate of the integrated foreign dna can be examined prior to production of the transgenic animal, 3) the sex of the transgenic animal can be predetermined, and 4) the problem of mosaicism in first generation transgenic animals can be eliminated. dna microinjection has not been a very efficient mechanism to produce transgenic mammals. however, in november 1998, a team of wisconsin researchers reported a nearly 100% efficient method for generating transgenic cattle. the established method of cattle transgenesis involves injecting dna into the pronuclei of a fertilized egg or zygote. in contrast, the wisconsin team injected a replication-defective retroviral vector into the perivitelline space of an unfertilized oocyte. the perivitelline space is the region between the oocyte membrane and the protective coating surrounding the oocyte known as the zona pellucida. in addition to es (embryonic stem) cells, other sources of donor nuclei for nuclear transfer, such as embryonic cell lines, primordial germ cells, or spermatogonia, might be used to produce offspring. the utility of es cells or related methodologies to provide efficient and targeted in vivo genetic manipulations offers the prospect of profoundly useful animal models for biomedical, biological and agricultural applications. the road to such success has been most challenging, but recent developments in this field are extremely encouraging. when geron announced in may 1999 that it was buying out ian wilmut's company roslin biomed, the two companies declared it the dawn of a new era in biomedical research.
geron's technologies for deriving transplantable cells from human pluripotent stem cells (hpscs) and extending their replicative capacity with telomerase were combined with the roslin institute nuclear transfer technology, the technology that produced dolly the cloned sheep. the goal was to produce transplantable, tissue-matched cells that provide extended therapeutic benefits without triggering immune rejection. such cells could be used to treat numerous major chronic degenerative diseases and conditions such as heart disease, stroke, parkinson's disease, alzheimer's disease, spinal cord injury, diabetes, osteoarthritis, bone marrow failure and burns. the stem cell is a unique and essential cell type found in animals. many kinds of stem cells are found in the body, with some more differentiated, or committed, to a particular function than others. in other words, when stem cells divide, some of the progeny mature into cells of a specific type (heart, muscle, blood, or brain cells), while others remain stem cells, ready to repair some of the everyday wear and tear undergone by our bodies. these stem cells are capable of continually reproducing themselves and serve to renew tissue throughout an individual's life. for example, they continually regenerate the lining of the gut, revitalize skin, and produce a whole range of blood cells. although the term "stem cell" commonly is used to refer to the cells within the adult organism that renew tissue (e.g., hematopoietic stem cells, a type of cell found in the blood), the most fundamental and extraordinary of the stem cells are found in the early-stage embryo. these embryonic stem (es) cells, unlike the more differentiated adult stem cells or other cell types, retain the special ability to develop into nearly any cell type. embryonic germ (eg) cells, which originate from the primordial reproductive cells of the developing fetus, have properties similar to es cells.
it is the potentially unique versatility of the es and eg cells derived, respectively, from the early-stage embryo and cadaveric fetal tissue that presents such unusual scientific and therapeutic promise. indeed, scientists have long recognized the possibility of using such cells to generate more specialized cells or tissue, which could allow the generation of new cells to be used to treat injuries or diseases, such as alzheimer's disease, parkinson's disease, heart disease, and kidney failure. likewise, scientists regard these cells as an important (perhaps essential) means for understanding the earliest stages of human development and as an important tool in the development of life-saving drugs and cell-replacement therapies to treat disorders caused by early cell death or impairment. geron corporation and its collaborators at the university of wisconsin-madison (dr. james a. thomson) and johns hopkins university (dr. john d. gearhart) announced in november 1998 the first successful derivation of hpscs from two sources: (i) human embryonic stem (hes) cells derived from in vitro fertilized blastocysts (thomson 1998) and (ii) human embryonic germ (heg) cells derived from fetal material obtained from medically terminated pregnancies (shamblott et al. 1998). although derived from different sources by different laboratory processes, these two cell types share certain characteristics and are referred to collectively as human pluripotent stem cells (hpscs). because hes cells have been more thoroughly studied, the characteristics of hpscs most closely describe the known properties of hes cells. stem cells represent a tremendous scientific advancement in two ways: first, as a tool to study developmental and cell biology; and second, as the starting point for therapies to develop medications to treat some of the most deadly diseases. the derivation of stem cells is fundamental to scientific research in understanding basic cellular and embryonic development.
observing the development of stem cells as they differentiate into a number of cell types will enable scientists to better understand cellular processes and ways to repair cells when they malfunction. it also holds great potential to yield revolutionary treatments by transplanting new tissue to treat heart disease, atherosclerosis, blood disorders, diabetes, parkinson's, alzheimer's, stroke, spinal cord injuries, rheumatoid arthritis, and many other diseases. by using stem cells, scientists may be able to grow human skin cells to treat wounds and burns. and, it will aid the understanding of fertility disorders. many patient and scientific organizations recognize the vast potential of stem cell research. another possible therapeutic technique is the generation of "customized" stem cells. a researcher or doctor might need to develop a special cell line that contains the dna of a person living with a disease. by using a technique called "somatic cell nuclear transfer" the researcher can transfer a nucleus from the patient into an enucleated human egg cell. this reformed cell can then be activated to form a blastocyst from which customized stem cell lines can be derived to treat the individual from whom the nucleus was extracted. by using the individual's own dna, the stem cell line would be fully compatible and not be rejected by the person when the stem cells are transferred back to that person for the treatment. preliminary research is occurring on other approaches to produce pluripotent human es cells without the need to use human oocytes. human oocytes may not be available in quantities that would meet the needs of millions of potential patients. however, no peer-reviewed papers have yet appeared from which to judge whether animal oocytes could be used to manufacture "customized" human es cells and whether they can be developed on a realistic timescale. 
additional approaches under consideration include early experimental studies on the use of cytoplasmic-like media that might allow a viable approach in laboratory cultures. on a much longer timeline, it may be possible to use sophisticated genetic modification techniques to eliminate the major histocompatibility complexes and other cell-surface antigens from foreign cells to prepare master stem cell lines with less likelihood of rejection. this could lead to the development of a bank of universal donor cells or multiple types of compatible donor cells of invaluable benefit to treat all patients. however, the human immune system is sensitive to many minor histocompatibility complexes and immunosuppressive therapy carries life-threatening complications. stem cells also show great potential to aid research and development of new drugs and biologics. now, stem cells can serve as a source for normal human differentiated cells to be used for drug screening and testing, drug toxicology studies and to identify new drug targets. the ability to evaluate drug toxicity in human cell lines grown from stem cells could significantly reduce the need to test a drug's safety in animal models. there are other sources of stem cells, including stem cells that are found in blood. recent reports note the possible isolation of stem cells for the brain from the lining of the spinal cord. other reports indicate that some stem cells that were thought to have differentiated into one type of cell can also become other types of cells, in particular brain stem cells with the potential to become blood cells. however, since these reports reflect very early cellular research about which little is known, we should continue to pursue basic research on all types of stem cells. some religious leaders will advocate that researchers should only use certain types of stem cells. 
however, because human embryonic stem cells hold the potential to differentiate into any type of cell in the human body, no avenue of research should be foreclosed. rather, we must find ways to facilitate the pursuit of all research using stem cells while addressing the ethical concerns that may be raised. another seminal and intimately related event at the end of the nineties occurred in madison, wisconsin. up until november of 1998, isolating es cells in mammals other than mice had proved elusive, but in a milestone paper in the november 5, 1998 issue of science, james a. thomson (1998), a developmental biologist at uw-madison, reported the first successful isolation, derivation and maintenance of a culture of human embryonic stem cells (hes cells). it is interesting to note that this leap was made from mouse to man. as thomson himself put it, these cells are different from all other human stem cells isolated to date and, as the source of all cell types, they hold great promise for use in transplantation medicine, drug discovery and development, and the study of human developmental biology. the new century is rapidly exploiting this vision. when steve fodor was asked in 2003 "how do you really take the human genome sequence and transform it into knowledge?" he answered that, from affymetrix's perspective, it is a technology development task. he sees the colloquially named affychips as being the equivalent of a cd-rom of the genome. they take information from the genome and write it down. the company has come a long way from the early days of venter's ests and the less than robust algorithms described earlier. one surprising fact unearthed by the newer, more sophisticated generation of chips is that 30 to 35 percent of the non-repetitive dna is being expressed, whereas accepted knowledge was that only 1.5 to 2 percent of the genome would be expressed. since much of that sequence has no protein-coding capacity, it is most likely coding for regulatory functions.
in a parallel to astrophysics, this is often referred to in common parlance as the "dark matter of the genome" and, like dark matter, for many it is the most exciting and challenging aspect of uncovering the occult genome. it could be, and most probably is, involved in regulatory functions, networks, or development. and like physical dark matter it may change our whole concept of what exactly a gene is or is not! since beadle and tatum's circumspect view of the protein world no longer holds true, it adds a layer of complexity to organizing chip design. depending on which sequences are present in a particular transcript, you can, theoretically, design a set of probes to uniquely distinguish that variant. at the dna level itself there is much potential for looking at variants, either expressed or not, at a very basic level as a diagnostic system, but ultimately the real paydirt is the information that can be gained from looking at the consequence of non-coding sequence variation on the transcriptome itself. fine-tuning when this matters and when it is irrelevant as a predictive model is the province of the affymetrix spin-off perlegen. perlegen came into being in late 2000 to accelerate the development of high-resolution, whole genome scanning, and they have stuck to that purity of purpose. to paraphrase dragnet's sergeant joe friday, they focus on the facts of dna, just the dna. perlegen owes its true genesis to the desire of one of its cofounders to use dna chips to help understand the dynamics underlying genetic diseases. brad margus' two sons have the rare disease "ataxia telangiectasia" (a-t). a-t is a progressive, neurodegenerative childhood disease that affects the brain and other body systems. the first signs of the disease, which include delayed development of motor skills, poor balance, and slurred speech, usually occur during the first decade of life.
telangiectasias (tiny, red "spider" veins), which appear in the corners of the eyes or on the surface of the ears and cheeks, are characteristic of the disease, but are not always present. many individuals with a-t have a weakened immune system, making them susceptible to recurrent respiratory infections. about 20% of those with a-t develop cancer, most frequently acute lymphocytic leukemia or lymphoma, suggesting that the sentinel competence of the immune system is compromised. having a focus so close to home is a powerful driver for any scientist. his co-founder david cox is a polymath pediatrician whose training in the latter informs his application of the former in the development of patient-centered tools. from that perspective, perlegen's stated mission is to collaborate with partners to rescue or improve drugs and to uncover the genetic bases of diseases. they have created a whole genome association approach that enables them to genotype millions of unique snps in thousands of cases and controls in a timeframe of months rather than years. as mentioned previously, snp (single nucleotide polymorphism) markers are preferred over microsatellite markers for association studies because of their abundance along the human genome, their low mutation rate, and their accessibility to high-throughput genotyping. since most diseases, and indeed responses to drug interventions, are the products of multiple genetic and environmental factors, it is a challenge to develop discriminating diagnostics and, even more so, targeted therapeutics. because mutations involved in complex diseases act probabilistically (that is, the clinical outcome depends on many factors in addition to variation in the sequence of a single gene), the effect of any specific mutation is smaller. thus, such effects can only be revealed by searching for variants that differ in frequency among large numbers of patients and controls drawn from the general population.
analysis of these snp patterns provides a powerful tool to help achieve this goal. although most bi-allelic snps are rare, it has been estimated that just over 5 million common snps, each with a frequency of between 10 and 50%, account for the bulk of the dna sequence difference between humans. such snps are present in the human genome once every 600 base pairs or so. as is to be expected from linkage disequilibrium studies, alleles making up blocks of such snps in close physical proximity are often correlated, resulting in reduced genetic variability and defining a limited number of "snp haplotypes," each of which reflects descent from a single, ancient ancestral chromosome. in 2001 cox's group, using high level scanning with some old-fashioned somatic cell genetics, constructed the snp map of chromosome 21. the surprising findings were blocks of limited haplotype diversity in which more than 80% of a global human sample can typically be characterized by only three common haplotypes (interestingly enough, the prevalence of each haplotype in the examined population was in the ratio 50:25:12.5). from this the conclusion could be drawn that by comparing the frequency of genetic variants in unrelated cases and controls, genetic association studies could potentially identify specific haplotypes in the human genome that play important roles in disease, without need of knowledge of the history or source of the underlying sequence, a hypothesis they subsequently went on to prove. following cox et al.'s pioneering work on "blocking" chromosome 21 into characteristic haplotypes, tien chen came to visit him from the university of southern california, and following the visit his group developed discriminating algorithms which took advantage of the fact that the haplotype block structure can be decomposed into large blocks with high linkage disequilibrium and relatively limited haplotype diversity, separated by short regions of low disequilibrium.
one of the practical implications of this observation is, as suggested by cox, that only a small fraction of all the snps, which they refer to as "tag" snps, need be chosen for mapping genes responsible for complex human diseases, which can significantly reduce genotyping effort without much loss of power. they developed algorithms to partition haplotypes into blocks with the minimum number of tag snps for an entire chromosome. in 2005 they reported that they had developed an optimized suite of programs to analyze these block linkage disequilibrium patterns and to select the corresponding tag snps, picking the minimum number of tags for the given criteria. in addition, the updated suite allows haplotype data and genotype data from unrelated individuals and from general pedigrees to be analyzed. using an approach similar to richard michelmore's bulk segregant analysis in plants of more than a decade previously, perlegen subsequently made use of these snp haplotype and statistical probability tools to estimate total genetic variability of a particular complex trait coded for by many genes, with any single gene accounting for no more than a few percent of the overall variability of the trait. cox's group have determined that fewer than 1000 total individuals provide adequate power to identify genes accounting for only a few percent of the overall genetic variability of a complex trait, even using the very stringent significance levels required when testing large numbers of dna variants. from this it is possible to identify the set of major genetic risk factors contributing to the variability of a complex disease and/or treatment response.
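the tag snp idea described above can be illustrated with a small sketch. the code below is not the algorithm developed by cox's or chen's groups; it is a simple greedy heuristic, assumed here purely for illustration, that keeps choosing snps until every snp is in high linkage disequilibrium (r^2 at or above a chosen threshold) with at least one chosen tag. the function names, the toy haplotype matrix and the 0.8 threshold are all hypothetical.

```python
from itertools import combinations


def r_squared(haplotypes, i, j):
    """Squared correlation (r^2) between bi-allelic SNPs i and j.

    `haplotypes` is a list of equal-length 0/1 lists, one per chromosome.
    """
    n = len(haplotypes)
    p_i = sum(h[i] for h in haplotypes) / n          # frequency of allele 1 at SNP i
    p_j = sum(h[j] for h in haplotypes) / n          # frequency of allele 1 at SNP j
    p_ij = sum(h[i] * h[j] for h in haplotypes) / n  # frequency of the 1-1 haplotype
    denom = p_i * (1 - p_i) * p_j * (1 - p_j)
    if denom == 0:                                   # monomorphic SNP: treat as uncorrelated
        return 0.0
    d = p_ij - p_i * p_j                             # disequilibrium coefficient D
    return d * d / denom


def greedy_tag_snps(haplotypes, threshold=0.8):
    """Greedy (not guaranteed minimal) choice of tag SNPs covering all SNPs."""
    n_snps = len(haplotypes[0])
    # covers[i] = SNPs that SNP i can stand in for (always includes i itself)
    covers = {i: {i} for i in range(n_snps)}
    for i, j in combinations(range(n_snps), 2):
        if r_squared(haplotypes, i, j) >= threshold:
            covers[i].add(j)
            covers[j].add(i)
    untagged = set(range(n_snps))
    tags = []
    while untagged:
        # pick the SNP covering the most untagged SNPs; ties go to the lowest index
        best = max(range(n_snps), key=lambda s: (len(covers[s] & untagged), -s))
        tags.append(best)
        untagged -= covers[best]
    return tags


# hypothetical toy data: SNPs 0-2 form one block, SNPs 3-4 another
example_haplotypes = [
    [0, 0, 0, 0, 0],
    [1, 1, 1, 0, 0],
    [0, 0, 0, 1, 1],
    [1, 1, 1, 1, 1],
]
print(greedy_tag_snps(example_haplotypes))  # prints [0, 3]
```

in this toy example five snps collapse to two tags, one per block, which is the kind of reduction in genotyping effort the text describes. a greedy choice is easy to follow but does not guarantee the true minimum number of tags.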
so, while a single genetic risk factor is not a good predictor of treatment outcome, the sum of a large fraction of risk factors contributing to a treatment response or common disease can be used to optimize personalized treatments without requiring knowledge of the underlying mechanisms of the disease. they feel that a saturating level of coverage is required to produce repeatable prediction of response to medication or predisposition to disease and that taking shortcuts will for the most part lead to incomplete, clinically-irrelevant results. in 2005 hinds et al., in science, described even more dramatic progress. they describe a publicly available, genome-wide data set of 1.58 million common single-nucleotide polymorphisms (snps) that have been accurately genotyped in each of 71 people from three population samples. a second public data set of more than 1 million snps typed in each of 270 people has been generated by the international haplotype map (hapmap) project. these two public data sets, combined with multiple new technologies for rapid and inexpensive snp genotyping, are paving the way for comprehensive association studies involving common human genetic variations. perlegen basically is taking to the next level fodor's stated reason for the creation of affymetrix, the belief that understanding the correlation between genetic variability and its role in health and disease would be the next step in the genomics revolution. the other interesting aspect of this level of coverage is, of course, that the notion of discrete identifiable groups based on ethnicity, centers of origin and such breaks down and a spectrum of variation arises across all populations, which makes the perlegen chip, at one level, a true unifier of humanity but at another adds a whole layer of complexity for hmos!
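the "very stringent significance levels" needed when testing large numbers of dna variants can be made concrete with a short sketch. it assumes a bonferroni correction and a simple 2x2 allelic chi-square test, neither of which the text attributes to perlegen; the function names and the allele counts are invented for illustration, and only the figure of 1.58 million snps comes from the hinds et al. data set described above.

```python
import math


def allelic_chi_square(case_a, case_b, ctrl_a, ctrl_b):
    """Pearson chi-square (1 d.f.) on a 2x2 table of allele counts.

    Returns (statistic, p_value). Counts are alleles, i.e. two per person.
    """
    n = case_a + case_b + ctrl_a + ctrl_b
    row_case, row_ctrl = case_a + case_b, ctrl_a + ctrl_b
    col_a, col_b = case_a + ctrl_a, case_b + ctrl_b
    if 0 in (row_case, row_ctrl, col_a, col_b):
        raise ValueError("every row and column needs at least one allele")
    stat = n * (case_a * ctrl_b - case_b * ctrl_a) ** 2 / (
        row_case * row_ctrl * col_a * col_b
    )
    # survival function of a chi-square variable with 1 degree of freedom
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance level that keeps the family-wise error rate at alpha."""
    return alpha / n_tests


# hypothetical study: 500 cases and 500 controls (1000 alleles in each group)
stat, p = allelic_chi_square(case_a=400, case_b=600, ctrl_a=300, ctrl_b=700)
threshold = bonferroni_threshold(0.05, 1_580_000)
print(f"chi-square = {stat:.2f}, p = {p:.2e}, threshold = {threshold:.2e}")
print("genome-wide significant:", p < threshold)
```

with these made-up counts the allele frequency differs by ten percentage points between cases and controls, yet the p-value still falls short of the corrected threshold. this shows why sample size and marker selection matter so much in whole genome association studies.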
at the turn of the century, this personalized chip approach to medicine received some validation at a simpler level, in a disease area closely related to the one to which one fifth of a-t patients ultimately succumb, when researchers at the whitehead institute used dna chips to distinguish different forms of leukemia based on patterns of gene expression in different populations of cells. moving cancer diagnosis away from visually based systems to such molecular based systems is a major goal of the national cancer institute. in the study, scientists used a dna chip to examine gene activity in bone marrow samples from patients with two different types of acute leukemia: acute myeloid leukemia (aml) and acute lymphoblastic leukemia (all). then, using an algorithm developed at the whitehead, they identified signature patterns that could distinguish the two types. when they cross-checked the diagnoses made by the chip against known differences in the two types of leukemia, they found that the chip method could automatically make the distinction between aml and all without previous knowledge of these classes. taking it to a level beyond where perlegen is initially aiming, eric lander, leader of the study, said that mapping not only what is in the genome, but also what the things in the genome do, is the real secret to comprehending and ultimately curing cancer and other diseases. chips gained recognition on the world stage in 2003 when they played a key role in the search for the cause of severe acute respiratory syndrome (sars) and probably won a macarthur genius award for their creator.
ucsf assistant professor joseph derisi, already famous in the scientific community as the wunderkind originator of the online diy chip maker in pat brown's lab at stanford, built a gene microarray containing all known completely sequenced viruses (12,000 of them) and, using a robot arm that he also customized, in a three day period used it to classify a pathogen isolated from sars patients as a novel coronavirus. when a whole galaxy of dots lit up across the spectrum of known vertebrate coronaviruses, derisi knew this was a new variant. interestingly, the sequence had the hottest signal with avian infectious bronchitis virus. his work subsequently led epidemiologists to target the masked palm civet, a tree-dwelling animal with a weasel-like face and a catlike body, as the probable primary host. the role that derisi's team at ucsf played in identifying a coronavirus as a suspected cause of sars came to the attention of the national media when cdc director dr. julie gerberding recognized joe in a march 24, 2003 press conference and in 2004 when joe was honored with one of the coveted macarthur genius awards. this and other tools arising from information gathered from the human genome sequence and complementary discoveries in cell and molecular biology, including new tools such as gene-expression profiling and proteomics analysis, are converging to finally show that rapid robust diagnostics and "rational" drug design have a future in disease research. another virus, one that puts sars deaths in perspective, benefitted from rational drug design at the turn of the century. influenza, or flu, is an acute respiratory infection caused by a variety of influenza viruses. each year, up to 40 million americans develop the flu, with an average of about 150,000 being hospitalized and 10,000 to 40,000 people dying from influenza and its complications.
the use of current influenza treatments has been limited due to a lack of activity against all influenza strains, adverse side effects, and rapid development of viral resistance. influenza costs the united states an annual $14.6 billion in physician visits, lost productivity and lost wages. and lest we still dismiss it as a nuisance, we do well to remember that the "spanish" influenza pandemic killed over 20 million people in 1918 and 1919, making it the worst infectious pandemic in history, beating out even the more notorious black death of the middle ages. this fear has been rekindled as the dreaded h5n1 (h for haemagglutinin and n for neuraminidase, as described below) strain of bird flu has the potential to mutate and recognise homo sapiens as a desirable host. since rna viruses are notoriously faulty in their replication, this accelerated evolutionary process gives them a distinct advantage when adapting to new environments and therefore finding more amenable hosts. although inactivated influenza vaccines are available, their efficacy is suboptimal, partly because of their limited ability to elicit local iga and cytotoxic t cell responses. the choices of treatments and preventions for influenza hold much more promise in this millennium. clinical trials of cold-adapted live influenza vaccines now under way suggest that such vaccines are optimally attenuated, so that they will not cause influenza symptoms but will still induce protective immunity. aviron (mountain view, ca), biochem pharma (laval, quebec, canada), merck (whitehouse station, nj), chiron (emeryville, ca), and cortecs (london) all had influenza vaccines in the clinic at the turn of the century, with some of them given intra-nasally or orally. meanwhile, the team of gilead sciences (foster city, ca) and hoffmann-la roche (basel, switzerland), and also glaxowellcome (london), in 2000 put on the market neuraminidase inhibitors that block the replication of the influenza virus.
gilead was one of the first biotechnology companies to come out with an anti-flu therapeutic. tamiflu™ (oseltamivir phosphate) was the first flu pill from this new class of drugs called neuraminidase inhibitors (ni) that are designed to be active against all common strains of the influenza virus. neuraminidase inhibitors block viral replication by targeting a site on one of the two main surface structures of the influenza virus, preventing the virus from infecting new cells. neuraminidase is found protruding from the surface of the two main types of influenza virus, type a and type b. it enables newly formed viral particles to travel from one cell to another in the body. tamiflu is designed to prevent all common strains of the influenza virus from replicating. the replication process is what contributes to the worsening of symptoms in a person infected with the influenza virus. by inactivating neuraminidase, viral replication is stopped, halting the influenza virus in its tracks. in marked contrast to the usual protracted process of clinical trials for new therapeutics, the road from conception to application for tamiflu was remarkably expeditious. in 1996, gilead and hoffmann-la roche entered into a collaborative agreement to develop and market therapies that treat and prevent viral influenza. in 1999, as gilead's worldwide development and marketing partner, roche led the final development of tamiflu. in april 1999, 26 months after the first patient was dosed in clinical trials, roche and gilead announced the submission of a new drug application to the u.s. food and drug administration (fda) for the treatment of influenza. additionally, roche filed a marketing authorisation application (maa) in the european union under the centralized procedure in early may 1999. six months later, in october 1999, gilead and roche announced that the fda had approved tamiflu for the treatment of influenza a and b in adults. these accelerated efforts allowed tamiflu to reach the u.s.
market in time for the 1999-2000 flu season. one of gilead's studies showed an increase in efficacy from 60% when the vaccine was used alone to 92% when the vaccine was used in conjunction with a neuraminidase inhibitor. outside of the u.s., tamiflu also has been approved for the treatment of influenza a and b in argentina, brazil, canada, mexico, peru and switzerland. regulatory review of the tamiflu maa by european authorities is ongoing. with the h5n1 bird flu strain's relentless march (or rather flight) across asia and, in 2006, through eastern europe to a french farmyard, an unwelcome stowaway on a winged migration, and with no vaccine in sight, tamiflu, although untested against h5n1, is seen as the last line of defense; it is now being hoarded and its patented production rights fought over like an alchemist's formula. tamiflu's main competitor, zanamivir, marketed as relenza™, was one of a group of molecules developed by glaxowellcome and academic collaborators using structure-based drug design methods targeted, like tamiflu, at a region of the neuraminidase surface glycoprotein of influenza viruses that is highly conserved from strain to strain. glaxo filed for marketing approval for relenza in europe and canada. the food and drug administration's accelerated drug-approval timetable began to show results by 2001: its evaluation of novartis's gleevec took just three months, compared with the standard 10-12 months. another factor in improving biotherapeutic fortunes in the new century was the staggering profits of early successes. in 2003, $1.9 billion of the $3.3 billion in revenue collected by genentech in south san francisco came from oncology products, mostly the monoclonal antibody-based drugs rituxan, used to treat non-hodgkin's lymphoma, and herceptin for breast cancer. 
in fact, two of the first cancer drugs to use the new tools for 'rational' design, herceptin and gleevec (a small-molecule chemotherapeutic for some forms of leukemia), are proving successful, and others such as avastin (an anti-vascular endothelial growth factor) for colon cancer and erbitux are already following in their footsteps. gleevec led the way in exploiting signal-transduction pathways to treat cancer, as it blocks a mutant form of tyrosine kinase (the product of the philadelphia translocation, recognized in the 1960s) that can help to trigger out-of-control cell division. about 25% of biotech companies raising venture capital during the third quarter of 2003 listed cancer as their primary focus, according to online newsletter venturereporter. by 2002, according to the pharmaceutical research and manufacturers of america, 402 medicines were in development for cancer, up from 215 in 1996. another new avenue in cancer research is to combine drugs. wyeth's mylotarg, for instance, links an antibody to a chemotherapeutic, and homes in on cd33 receptors on acute myeloid leukemia cells. expertise in biochemistry, cell biology and immunology is required to develop such a drug. this trend has created some bright spots in cancer research and development, even though drug discovery in general has been adversely affected by mergers, a few high-profile failures and a shaky us economy in the early 2000s. as the millennium approached, observers as diverse as microsoft's bill gates and president bill clinton predicted the 21st century would be the "biology century". by 1999 the many programs and initiatives underway at major research institutions and leading companies were already giving shape to this assertion. these initiatives have ushered in a new era of biological research anticipated to generate technological changes of the magnitude associated with the industrial revolution and the computer-based information revolution. 
key: cord-325985-xfzhn1n1 authors: jabado, omar j.; liu, yang; conlan, sean; quan, p. lan; hegyi, hédi; lussier, yves; briese, thomas; palacios, gustavo; lipkin, w. i. title: comprehensive viral oligonucleotide probe design using conserved protein regions date: 2007-12-13 journal: nucleic acids res doi: 10.1093/nar/gkm1106 sha: doc_id: 325985 cord_uid: xfzhn1n1 oligonucleotide microarrays have been applied to microbial surveillance and discovery where highly multiplexed assays are required to address a wide range of genetic targets. 
although printing density continues to increase, the design of comprehensive microbial probe sets remains a daunting challenge, particularly in virology where rapid sequence evolution and database expansion confound static solutions. here, we present a strategy for probe design based on protein sequences that is responsive to the unique problems posed in virus detection and discovery. the method uses the protein families database (pfam) and motif finding algorithms to identify oligonucleotide probes in conserved amino acid regions and untranslated sequences. in silico testing using an experimentally derived thermodynamic model indicated near complete coverage of the viral sequence database. the capacity of dna microarrays to simultaneously screen for hundreds of viral agents makes them an attractive supplement to traditional methods in microbiology. their utility has been demonstrated through detection of papilloma virus in cervical lesions (1) , sars coronavirus in tissue culture (2) , parainfluenza virus 4 in nasopharyngeal aspirates (3) , influenza from nasal wash and throat swabs (4, 5) , gammaretrovirus in prostate tumors (6) , coronaviruses and rhinoviruses from nasal lavage (7) , metapneumovirus from bronchoalveolar lavage (8) , filoviruses and malarial parasites in blood in hemorrhagic fever (9) , and a wide variety of respiratory pathogens in nasal swabs and lung tissue (10) . viral microarrays have increased in density and strain coverage as fabrication technologies have improved. cdna pathogen arrays derived from reference strain nucleic acids (11, 12) have been replaced by oligonucleotide arrays due to their increased flexibility. oligonucleotide design strategies have focused on pairwise sequence comparisons to identify conserved regions within a variety of viral pathogens (13) (14) (15) . multiple alignments have been used to design probes for clinically important virus genera, e.g. rotaviruses (16) , orthopoxviruses (17) or influenzaviruses (18) . 
viral resequencing arrays have recently been introduced that allow single nucleotide resolution (4, 19-21). although such tiling arrays enable accurate typing, the number of probes required to build a resequencing array for all viral sequences exceeds current art. a comprehensive viral microarray should address the entire viral sequence database. pairwise nucleic acid comparisons, while rapid, do not scale well with sequence number and ignore valuable coding information. nonoverlapping segments, heterogeneous sizes and the large number of sequences preclude automated multiple alignments of nucleic acids for probe design. protein-protein comparisons are more sensitive for detecting conserved regions due to the power of substitution matrices (22); however, at the time of writing, no reported oligonucleotide design algorithm leverages this information. the protein families database (pfam) (23) is a repository of hand curated protein multiple alignments and hidden markov models (hmms) across all phylogenetic kingdoms. hmms are probabilistic representations of protein alignments that are well suited to identifying homologies (24, 25). beginning with the pfam database as a foundation, we established a tiered method for creating viral probes that uses all sequence information available for viruses. our method for probe design employs protein alignment information, discovered protein motifs, nucleic acid motifs and finally, sliding windows to ensure near complete coverage of the database. we pursued experiments to determine the effects of probe-target mismatch and background nucleic acid concentration on array sensitivity and specificity; results were used to derive parameters for probe design. west nile virus rna (wnv, strain new york 1999, af202541) was used as template in hybridization experiments on an agilent oligonucleotide array with 1131 complementary probes of length 60 nucleotides (nt). 
approximately one third of the probes had between 1 and 20 randomly introduced mismatches. the plus and minus (reverse complement) strands of each sequence were deposited, in duplicate. in addition to the flaviviral specific probes, the array contained nearly 36 500 probes for other viral families, negative and positive controls. a volume containing 10^6 copies of wnv and 200 ng of background nucleic acid (human lung tissue rna) was amplified using random primers and hybridized in four replicate experiments as previously described (10). hybridizing a wnv isolate of known sequence allowed prediction of probe-viral hybrid strength and correlation to fluorescence data. to predict hybrids with high accuracy, smith-waterman alignments of the virus sequence against microarray probes were generated using the emboss bioinformatics suite (26). the number of mismatches was calculated for each expected probe-target pair. the change in gibbs free energy at 65 °C (hybridization temperature) was calculated using pairfold version 1.7 (27) as a separate measure of probe-template binding strength. pairfold employs a dynamic programming algorithm to compute the minimum free energy structure (excluding pseudo-knots); the standard free energy model is used (28) with empirical nearest neighbor energies (29). the arrays were visualized with an agilent slide scanner, then processed with the quantile normalization technique (30). spss version 14 was used for statistics and data plots (http://www.spss.com/); fluorescence data are available as supplementary material. the embl nucleotide sequence database [july 2007, release 91; 461,353 nucleic acid sequences (31)] was chosen as the reference for this study because it is tightly integrated with the pfam protein family database (23, 32). taxon growth was estimated using a standard least squares method, with the spss statistical package. 
a non-redundant database comprising 74 044 sequences was generated with cd-hit (33), using a similarity cutoff of 98% to define sequences as identical. bacteriophages were not included in the analysis; however, data were retained to allow probe design using the embl phage database. the pfam database is comprised of hand curated seed protein alignments that are converted to a probabilistic representation using hmms. these hmms are used to search the protein database for homologues that can be added to the seed to create a comprehensive alignment (23, 24) . pfam domains were analyzed to identify short, conserved protein regions and corresponding nucleic acid sequences. in the first step, the log-odds score for each position of the hmm built from the seed alignment was summed; lower scores were considered to indicate conservancy. the most conserved, non-overlapping 20 amino acid (aa) regions were identified. in the second step, protein alignments of all pfam-a families were extracted and mapped to their underlying nucleotide sequences by cross reference to the embl records. hmm parsing modules from the bioperl package were used. in the third step, the underlying nucleotide sequences were extracted and stored. in cases where the region contained gaps, flanking nucleotides were brought together to yield sequences of length 60. these sequences formed the basis for downstream probe design. domain alignments in the pfam-b were not used in probe design because they are of lower quality; also, as domain quality improves these alignments will be integrated into pfam-a (23). all coding nucleic acid sequences that were not part of a pfam-a alignment were extracted. in this step, the most conserved regions within homologous genes were identified for probe design. sequences were clustered at the protein level with cd-hit, using a similarity threshold of 80%. 
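the first step above (summing a per-position score over every 20 aa window and keeping the lowest-scoring, non-overlapping windows) can be illustrated with a minimal python sketch. the function name `conserved_windows`, the plain list of per-position scores and the default of three windows are assumptions made for illustration; the source does not state how many pfam windows were kept or how its hmm parsing was implemented. following the text, lower summed scores are treated as more conserved.

```python
def conserved_windows(scores, width=20, top=3):
    """Return start indices of the `top` lowest-scoring, non-overlapping windows.

    `scores` is a hypothetical list of per-position scores (one per HMM or
    alignment column); lower window sums are treated as more conserved,
    as stated in the text.
    """
    if len(scores) < width:
        return []
    # Running sum over each window of `width` consecutive positions.
    window_sum = sum(scores[:width])
    sums = [(window_sum, 0)]
    for start in range(1, len(scores) - width + 1):
        window_sum += scores[start + width - 1] - scores[start - 1]
        sums.append((window_sum, start))
    chosen = []
    # Visit windows from lowest to highest sum; skip any that overlap a pick.
    for _, start in sorted(sums):
        if all(abs(start - other) >= width for other in chosen):
            chosen.append(start)
        if len(chosen) == top:
            break
    return sorted(chosen)


if __name__ == "__main__":
    toy_scores = [5] * 10 + [1] * 20 + [5] * 10
    print(conserved_windows(toy_scores, width=20, top=1))
```

each selected 20 aa window corresponds to 60 nt of underlying coding sequence, which is the probe length used throughout the paper.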
all sequence clusters were subjected to a meme motif search (34) using the following parameters: motif width of 20, zero or one motif allowed per sequence, a minimum of two sequences per motif. three motifs were selected for each sequence cluster. the underlying nucleic acid sequence extracted for each protein motif was used for probe design. a sliding window approach was used for highly divergent sequences that did not share any motifs. using the pam250 matrix (35), a summed log-odds score for every 20 aa subsequence in the protein was calculated; the three least likely to vary (lowest log-odds score) were selected as regions for probe design. viruses often have highly conserved non-coding regions at the termini of their genomes or genome segments that serve critical roles in replication, transcription, and packaging. we reasoned that probes based in these regions may be useful in microarray design. we identified conserved probes across homologous regions in sequences annotated as 5′ utr, 3′ utr, ltr, and those without annotation. sequences were first clustered at the 80% threshold. clustered sequences were then subjected to a motif search using the same parameters employed for proteins, except that a length of 60 nt per motif was specified. we addressed sequences that did not contain a shared motif separately; three non-overlapping 60 nt subsequences were chosen as probes.
probe selection and minimization with set cover algorithm
an algorithm was designed to automate identification of the minimum set of probes required to address a repertoire of potential viral targets (36). in the first step of analysis, the number of mismatches between a probe and its viral target was computed; the algorithm considered a probe to be 'covering' if it had ≤5 mismatches to the template. coverage data were converted to a matrix of binary values. a greedy algorithm was implemented to choose a probe combination from the matrix, minimizing the number of required probes. 
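the probe minimization step described above can be sketched as a standard greedy set cover. this is a minimal illustration, not the authors' implementation: the names `count_mismatches`, `coverage_sets` and `greedy_probe_cover` are hypothetical, and the sketch assumes each probe has already been aligned to an equal-length, ungapped site in each target, whereas the paper derives mismatches from smith-waterman alignments.

```python
def count_mismatches(probe, site):
    """Count differing positions between a probe and a pre-aligned target site.

    Assumes both strings have the same length and contain no gaps.
    """
    return sum(1 for a, b in zip(probe, site) if a != b)


def coverage_sets(probes, targets, max_mismatches=5):
    """Binary coverage: map each probe id to the set of target ids it covers."""
    return {
        probe_id: {
            target_id
            for target_id, site in targets.items()
            if count_mismatches(probe_seq, site) <= max_mismatches
        }
        for probe_id, probe_seq in probes.items()
    }


def greedy_probe_cover(covers):
    """Greedy set cover: repeatedly pick the probe covering most uncovered targets."""
    uncovered = set().union(*covers.values())
    chosen = []
    while uncovered:
        # Sorting the ids makes tie-breaking deterministic.
        best = max(sorted(covers), key=lambda p: len(covers[p] & uncovered))
        gained = covers[best] & uncovered
        if not gained:
            break
        chosen.append(best)
        uncovered -= gained
    return chosen


if __name__ == "__main__":
    toy = {"p1": {"a", "b", "c"}, "p2": {"c", "d"}, "p3": {"d"}, "p4": {"a"}}
    print(greedy_probe_cover(toy))
```

the greedy choice does not guarantee the true minimum probe set, but it is the usual practical approximation for set cover and matches the "greedy algorithm" named in the text.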
candidate probes were further screened to ensure a Tm >60 °C, no repeats exceeding a length of 10 nt, no hairpins with stem lengths exceeding 11 nt, and <33% overall sequence identity to non-viral genomes. because it is not feasible to test all probes with all known viruses, we tested probe validity using a gibbs free energy model of hybridization. all probe sequences were compared to the non-redundant set of viral sequences by blastn (37). probe-target pairs were aligned by smith-waterman to ensure accuracy; mismatches and change in gibbs free energy at 65 °C (hybridization temperature) were then calculated. to gauge the performance of our probe selection algorithm, another comprehensive method was devised that used only nucleic acid sequence. sequences in the reference sequence viral genomes project (38) are evenly distributed among viral families; therefore, we reasoned that probes derived from these sequences would provide broad coverage. to contrast with our method, we selected 60 nt oligonucleotides end-to-end along all viral genomic refseq sequences (1701 viruses). this resulted in a tiling probe-set where the length of a sequence was proportional to the number of interrogating probes.
the viral sequence database is dominated by gene fragments
we queried the embl viral database to determine the frequencies of coding sequences and full genomic sequences. the majority of viral sequences were <1 kb [figure 1: x-axis in years, 1982-1998]. a commonly used method to reduce sequence complexity is generation of a non-redundant sequence set by clustering (33). we grouped sequences at the 98% identity level and selected the longest sequences as unique representatives of each group. this method was used to assess the growth of sequence diversity between january 2000 and the current release of july 2007. the database grew 600% in the 7-year period, doubling roughly every three years. 
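the doubling time quoted above can be checked with a short calculation. the sketch below assumes steady exponential growth, which the source does not state, and evaluates both readings of "grew 600%" (a 6-fold or a 7-fold final size); `doubling_time` is a hypothetical helper name.

```python
import math


def doubling_time(years, fold_increase):
    """Doubling time in years, assuming steady exponential growth."""
    if years <= 0 or fold_increase <= 1:
        raise ValueError("years must be positive and fold_increase greater than 1")
    return years * math.log(2) / math.log(fold_increase)


if __name__ == "__main__":
    for fold in (6.0, 7.0):
        print(f"{fold:.0f}-fold over 7 years: "
              f"doubling every {doubling_time(7, fold):.2f} years")
```

under either reading the doubling time comes out between about 2.5 and 2.7 years, consistent with the rounded figure of three years given in the text.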
unique sequences decreased as a proportion of the database, from 42% to 27%; overall growth of unique sequences was 378% (figure 1b) . the current database comprised 74 044 unique sequence representatives at the 98% similarity level. thus, the growth in the number of sequences in the viral database has been rapid, while growth in diversity has been more modest. one hypothesis to explain this slower growth of sequence diversity is that many of the existing viruses infecting humans have already been discovered and new isolates deposited are variants of well studied viruses. we charted the growth of viral taxonomic groups as a function of time to visualize trends in viral discovery (figure 1c ). the number of families and genera has remained stable since 1996; however, the number of sequences that have been classified as a new species has steadily risen. a least squares fit of this growth indicates that the steady increase in new species characterization is likely to continue, while the discovery of new viral families will be less common. a tiered, protein-motif-based approach to probe design addresses all viral sequences nucleotide sequences were divided into four subtypes: (i) coding sequences that corresponded to pfam-a alignments (cpf), (ii) coding sequences not in the pfam-a (cnpf), (iii) sequences that were annotated as untranslated regions (utr) or long terminal repeats (ltr) and (iv) sequences that were unannotated (ua). we sought to match the quality of pfam-a alignments in the non-pfam coding sequences by clustering them into groups of related sequences, approximating homologous genes. these were then subjected to a protein motif finding program to identify the conserved regions within each cluster. the untranslated and unannotated sequences were subjected to a similar clustering analysis, but at the nucleotide level. 
all four subtypes were subjected to the same three step design method: identification of conserved regions, extraction of nucleotide probe sequences, and minimization of covering probes. by allowing a limited number of mismatches to cognate templates, the number of probes required can be reduced. the mismatch threshold was determined based on experiments with west nile virus (strain new york 1999, af202541) that indicated high, homogenous fluorescence signal was observed if probes had five or fewer mismatches to the viral template ( figure 2 ). the probe minimization technique serves to lower microarray printing costs and simplify analysis while maintaining sequence coverage. a flowchart of the design method is depicted in figure 3 . the most recent pfam-a release (version 22) comprised 9318 families, of which 1540 had viral members. of 405 543 annotated protein sequences with length >20 aa, 278 119 (68.6%) belonged to a pfam-a family, while 127 424 (31.4%) did not. three probes were chosen for each gene, yielding a total of 104 467 cpf and 133 513 cnpf probes. of sequences not contained in pfam-a, only 5.6% (6956) were found in pfam-b alignments. thus, due to the lower quality of alignments (23) and poor viral representation, the pfam-b was not used for probe design. the 12 428 untranslated regions processed yielded 4616 probes. for the 24 841 unannotated sequences processed, 13 740 probes were designed. sequences that were not covered due to high/low gc%, low complexity, repetitive sequence or a preponderance of ambiguous nucleotides (4244) were processed with a sliding window strategy; 14 530 probes were designed. overall, the number of probes required to address all viral sequences was 270 866. sequence counts and probe counts for the most recent embl/pfam release are detailed in figure 4 . an example of typical probe distribution is shown with respect to the dengue virus 1 genome (nc_001477; figure 5 ). 
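the probe counts reported above are internally consistent, which a short tally confirms. the dictionary keys are shorthand labels introduced here for the categories named in the text; the numbers are taken directly from the paragraph.

```python
# Probe counts per design tier, as reported in the text.
probe_counts = {
    "cpf": 104_467,             # coding sequences in pfam-a alignments
    "cnpf": 133_513,            # coding sequences not in pfam-a
    "utr": 4_616,               # untranslated regions
    "ua": 13_740,               # unannotated sequences
    "sliding_window": 14_530,   # sequences rescued with the sliding window
}
total_probes = sum(probe_counts.values())

# Share of annotated proteins (>20 aa) that fall in a pfam-a family.
annotated_proteins = 405_543
in_pfam_a = 278_119
pfam_a_percent = round(100 * in_pfam_a / annotated_proteins, 1)

if __name__ == "__main__":
    print(total_probes, pfam_a_percent)
```

the five tiers sum to the stated overall total of 270 866 probes, and the pfam-a share rounds to the stated 68.6%.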
probe sequence composition is a major determinant of hybridization signal and is responsible for much of the variance between probes that target the same nucleic acid strand. probe-target thermodynamics have been successfully modeled to predict fluorescence (41, 42), control for variance (43) and even estimate concentrations of target detected in samples (44). observing that some probes with more than five mismatches to their targets showed strong fluorescent signal, we concluded that sequence composition is a major factor in our array platform. we sought to validate the probe design method by generating a simple thermodynamic model to predict hybridization signal based on sequence composition. we computed the change in gibbs free energy (ΔG) for all expected probe-viral nucleic acid pairs in the west nile virus hybridization experiments described above. the calculation method employed finds the most thermodynamically stable structure (minimum free energy) (28) based on empirically established nearest neighbor energies (29). strong signal was observed from probe-virus hybrids with ΔG of −32.5 kJ or less. thus, this value was chosen as the threshold to classify a probe as likely to generate high signal when the cognate viral target is present (figure 6).
probes will be designed in the area of short motifs of 20 aa or 60 nt
figure 3. comprehensive motif-based probe design. the embl viral database is clustered with a threshold of 98% nucleotide identity to create a non-redundant sequence database. coding sequences are subjected to an amino acid motif search, and then probes are made from the underlying nucleic acid sequences. similarly, nucleic acid motifs are found in non-coding sequences and used to make probes. database coverage is checked; supplementary probes for highly divergent sequences are designed as necessary. 
acronyms: pfam-protein families database, meme-multiple expectation maximization for motif elicitation, utr-untranslated region, ltr-long terminal repeat. motif-based probe design provides higher coverage than virus genome tiling use of motif finding and set cover minimization markedly increases the computational resources needed to generate probe sets. to determine whether increased complexity results in a more comprehensive probe set, we compared our method to a genome tiling strategy. probes of 60 nt were designed end-to-end along the entire genome for all 1701 reference sequence viral strains available as of may 2007. the tiling probe set served as a contrast to our design method since it was based on nucleic acid sequence, had more probes per gene, required less computation, and included viruses from all genera. in comparison of the methods, the following rules were used to compute database coverage: sequences >400 nt in length were considered covered if six or more probes met hybridization criteria; sequences <400 nt in length were considered covered if two probes met hybridization criteria; sequences <200 nt in length were considered covered with a single probe meeting hybridization criteria. coverage of the entire database was gauged by computing probe-template ág for all 74 044 unique sequence representatives. database coverage using the tiling method was 47.8% and required 850 136 probes; coverage using the motif-based method was 99.7% and required 270 866 probes (table 1) . whereas probe design in motif-based arrays can exploit partial genome sequences, probes in tiling arrays are based on full length genome sequences. complete reference sequence genomes represent 1% of embl sequence entries. although at least one full length genome sequence is described for all viral genera, only 49% (1701 of 3441) of viral species have a fully sequenced representative genome. 
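the coverage rules used to compare the two methods can be written down as a small sketch. this is an illustration under stated assumptions: the function names are hypothetical, a probe counts as meeting hybridization criteria when its predicted free energy is −32.5 kJ or less (the threshold given in the text), and a sequence of exactly 400 nt is treated as long, a boundary the source leaves unspecified.

```python
DELTA_G_THRESHOLD = -32.5  # kJ; hybrids at or below this value count as hits


def probes_required(length_nt):
    """Number of qualifying probes needed for a sequence of the given length."""
    if length_nt < 200:
        return 1
    if length_nt < 400:
        return 2
    return 6  # assumption: exactly 400 nt is grouped with longer sequences


def is_covered(length_nt, probe_delta_gs):
    """True if enough probes meet the free energy criterion for this sequence."""
    hits = sum(1 for dg in probe_delta_gs if dg <= DELTA_G_THRESHOLD)
    return hits >= probes_required(length_nt)


def database_coverage(entries):
    """Fraction of (length, [delta_g, ...]) entries that are covered."""
    if not entries:
        return 0.0
    return sum(is_covered(length, dgs) for length, dgs in entries) / len(entries)


if __name__ == "__main__":
    toy_database = [(150, [-40.0]), (300, [-40.0, -10.0]), (1000, [-33.0] * 6)]
    print(database_coverage(toy_database))
```

applied to all 74 044 unique sequence representatives, a calculation of this shape yields the coverage percentages reported for the tiling and motif-based probe sets.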
the impact of differences in the motif and tiling-based strategies for probe design is reflected in differences in coverage. for the tiling-based probe-set, 40 of 44 families with <80% sequence coverage included species lacking representative genomes. coverage with motif-based probe-sets for these same species was ≥93%. there is an increasing appreciation for the power of microarray technology in clinical microbiology, public health and environmental surveillance. viral microarray probe design poses unique challenges due to the rapid increase in sequence data and the high propensity for sequence divergence within viral taxa. to ensure coverage of the newest isolates it is essential to consider partial as well as complete genomic sequences in probe design. probe design based on multiple alignments or pairwise comparisons of nucleic acids for all known sequences is computationally intensive and scales poorly with database size. protein sequence comparisons are rapid and incorporate rich evolutionary models, but require a cumbersome mapping step to extract underlying nucleic acid sequence. we have described a method that capitalizes on the pfam protein alignment database and a motif finding algorithm to automate the extraction of nucleic acid sequence for probes from conserved protein regions. the protein motif-centric method has several advantages: (i) the majority of viral nucleic acid sequences encode proteins; thus, using this information leverages knowledge about function; (ii) protein sequence comparison and the resulting probe-sets are independent of viral taxonomy; this may enable incorporation of misclassified sequences; (iii) the pfam is a well established and highly annotated database that will provide a basis for future design efforts; and (iv) probes designed in conserved regions may be able to detect novel viruses. a second application of this design method is viral expression profiling. 
insights into the replication cycle, host evasion and virulence factors may be obtained by monitoring viral transcript levels during infection. to this end, arrays could be synthesized that combine probes for a single viral family and all host genes. because the viral probe sets generated by our method account for known variants across all genes, a variety of strains could be profiled with a single array. this would provide a unique experimental platform for investigating virus biology, while minimizing fabrication cost and simplifying analysis.
figure 6. gibbs free energy model of hybridization signal. the change in gibbs free energy of probe-west nile virus hybrids was computed. aliquots of west nile virus (new york 1999 strain rna) at 10^6 copies were spiked into 200 ng of human lung (background) rna. the fluorescent signal values of replicate arrays were log2 transformed, normalized, and converted to z-scores. 95% confidence intervals of the mean for fluorescence versus gibbs energy are plotted. probe-virus hybrids with free energy ≤ −32.5 kJ had high fluorescence; this value was chosen as the threshold for considering a probe likely to generate a strong signal when the target virus is present (dotted line).
the thresholds used to design and validate probes were experimentally determined for the agilent technologies array platform and the types of clinical samples our laboratory encounters. probe length can be selected to emphasize efficient coverage of higher order taxa or speciation. the goal of this project is to cover all known viral sequences and optimize potential for detecting related viral sequences. thus, we designed 60 nt probes because they can better tolerate mismatched templates than 25 nt oligonucleotide probes (45). using an empirical approach, appropriate thresholds can be determined for other array platforms, hybridization conditions, and probe lengths. 
the method of probe design and setcover minimization is flexible and agnostic of platform; application to bead, solution, or surface-based hybridization technology should be straightforward. although the growth of the public sequence databases has been rapid, sequence diversity has not grown as quickly. if this trend continues, we anticipate that only incremental updates to a core set of probes will be needed to maintain array integrity. an update strategy would require periodic testing of probe sets against newly deposited sequences and fresh design only in the cases of high sequence divergence. supplementary data are available at nar online. correlation of cervical carcinoma and precancerous lesions with human papillomavirus (hpv) genotypes detected with the hpv dna chip microarray method viral discovery and sequence recovery using dna microarrays microarray detection of human parainfluenzavirus 4 infection associated with respiratory failure in an immunocompetent adult broad-spectrum respiratory tract pathogen identification using resequencing dna microarrays experimental evaluation of the fluchip diagnostic microarray for influenza virus surveillance identification of a novel gammaretrovirus in prostate tumors of patients homozygous for r462q rnasel variant pan-viral screening of respiratory tract infections in adults with and without asthma reveals unexpected human coronavirus and human rhinovirus diversity diagnosis of a critical respiratory illness caused by human metapneumovirus by use of a pan-virus microarray panmicrobial oligonucleotide array for diagnosis of infectious diseases detection of respiratory viruses and subtype identification of influenza a viruses by greenechipresp oligonucleotide microarray dna microarrays for virus detection in cases of central nervous system infection detection of potato viruses using microarray technology: towards a generic method for plant viral disease diagnosis microarray-based detection and genotyping of viral 
pathogens database to dynamically aid probe design for virus identification design of microarray probes for virus identification and detection of emerging viruses at the genus level detection and genotyping of human group a rotaviruses by oligonucleotide microarray hybridization detection and discrimination of orthopoxviruses using microarrays of immobilized oligonucleotides robust sequence selection method used to develop the fluchip diagnostic microarray for influenza virus sequence-specific identification of 18 pathogenic microorganisms using microarray technology tracking the evolution of the sars coronavirus using highthroughput, high-density resequencing arrays genechip resequencing of the smallpox virus genome can identify novel strains: a biodefense application amino acid substitution matrices from an information theoretic perspective pfam: a comprehensive database of protein domain families based on seed alignments profile hidden markov models sequence comparison and protein structure prediction emboss: the european molecular biology open software suite rnasoft: a suite of rna secondary structure prediction and design software tools calculating nucleic acid secondary structure a unified view of polymer, dumbbell, and oligonucleotide dna nearest-neighbor thermodynamics a comparison of normalization methods for high density oligonucleotide array data based on variance and bias embl nucleotide sequence database: developments in 2005 pfam: clans, web tools and services cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences an artificial intelligence approach to motif discovery in protein sequences: application to steriod dehydrogenases atlas of protein sequence and structure greene scprimer: a rapid comprehensive tool for designing degenerate primers from multiple sequence alignments gapped blast and psi-blast: a new generation of protein database search programs national center for biotechnology information viral 
genomes project rationale and uses of a public hiv drugresistance database global epidemiology of hiv modeling of dna microarray data by using physical properties of hybridization thermodynamic calculations and statistical correlations for oligo-probes design improving comparability between microarray probe signals by thermodynamic intensity correction absolute mrna concentrations from sequence-specific calibration of oligonucleotide arrays expression profiling using microarrays fabricated by an ink-jet oligonucleotide synthesizer the work presented here was supported by national institutes of health awards (ai070411, northeast biodefense center u54-ai057158-lipkin, ai056118, hl083850 ey017404 and t32gm008224). we thank carolyn morrison for excellent technical assistance. funding to pay the open access publication charges for this article was provided by nih u54-ai057158-lipkin.conflict of interest statement. none declared. key: cord-300149-djclli8n authors: ruan, yijun; wei, chia lin; ling, ai ee; vega, vinsensius b; thoreau, herve; se thoe, su yun; chia, jer-ming; ng, patrick; chiu, kuo ping; lim, landri; zhang, tao; chan, kwai peng; lin ean, lynette oon; ng, mah lee; leo, sin yee; ng, lisa fp; ren, ee chee; stanton, lawrence w; long, philip m; liu, edison t title: comparative full-length genome sequence analysis of 14 sars coronavirus isolates and common mutations associated with putative origins of infection date: 2003-05-24 journal: lancet doi: 10.1016/s0140-6736(03)13414-9 sha: doc_id: 300149 cord_uid: djclli8n background: the cause of severe acute respiratory syndrome (sars) has been identified as a new coronavirus. whole genome sequence analysis of various isolates might provide an indication of potential strain differences of this new virus. moreover, mutation analysis will help to develop effective vaccines. 
methods: we sequenced the entire sars viral genome of cultured isolates from the index case (sin2500) presenting in singapore, from three primary contacts (sin2774, sin2748, and sin2677), and one secondary contact (sin2679). these sequences were compared with the isolates from canada (tor2), hong kong (cuhk-w1 and hku39849), hanoi (urbani), guangzhou (gz01), and beijing (bj01, bj02, bj03, bj04). findings: we identified 129 sequence variations among the 14 isolates, with 16 recurrent variant sequences. common variant sequences at four loci define two distinct genotypes of the sars virus. one genotype was linked with infections originating in hotel m in hong kong, the second contained isolates from hong kong, guangzhou, and beijing with no association with hotel m (p<0.0001). moreover, other common sequence variants further distinguished the geographical origins of the isolates, especially between singapore and beijing. interpretation: despite the recent onset of the sars epidemic, genetic signatures are emerging that partition the worldwide sars viral isolates into groups on the basis of contact source history and geography. these signatures can be used to trace sources of infection. in addition, a common variant associated with a non-conservative aminoacid change in the s1 region of the spike protein, suggests that immunological pressures might be starting to influence the evolution of the sars virus in human populations. published online may 9, 2003 http://image.thelancet.com/extras/03art4454web.pdf the first cases of severe acute respiratory syndrome (sars) were identified in november, 2002, in guangdong province, china. by april, 2003, the epidemic had spread worldwide, affecting 3547 individuals resulting in 182 deaths. 1 in march, 2003, the putative cause of sars was identified as a new coronavirus. 
2, 3 oropharyngeal specimens from patients with sars induced a cytopathic effect on vero e6 tissue culture cells and revealed the presence of coronavirus-like particles. reverse transcriptase-pcr analysis with random or broadly-specific coronavirus primers amplified a dna fragment that resembled, but was distinct from, known coronavirus genomes. when tested, these diagnostic pcr methods detected the sars virus in many clinical samples taken from affected patients. in addition, serological evidence has shown the presence of antibodies specific to the new coronavirus in the serum of patients with sars. 4 collectively, these data strongly implicate the new coronavirus as the cause of sars, which we designate as sars-cov. sars-cov is a member of the coronaviridae family of enveloped, positive-stranded rna viruses, which have a broad host range. some coronavirus infections in people, cattle, and birds cause respiratory disease, whereas other coronavirus infections in rodents, cats, pigs, and cattle lead to enteric disease. the 27-32 kb genomes of coronaviruses, the largest of rna viruses, encode 23 putative proteins, including four major structural proteins: nucleocapsid (n), spike (s), membrane (m), and small envelope (e). the spike protein, a glycoprotein projection on the viral surface, is crucial for viral attachment and entry into the host cell. in addition, variations of s protein among strains of coronavirus are responsible for host range and tissue tropism. 5 differences in virulence of mouse coronaviruses have also been linked to genetic variance in the s protein, 6, 7 and the serological response in the host is typically raised against the s protein. 8 however, the s, m, and n mature proteins all contribute to generating the host immune response as seen in transmissible gastroenteritis coronavirus, 9 infectious bronchitis virus, 10, 11 pig respiratory coronavirus, 12 and mouse hepatitis virus.
13 a characteristic of rna viruses is the high rate of genetic mutation, which leads to evolution of new viral strains and is a mechanism by which viruses escape host defenses. therefore, from a public-health perspective, understanding the mutation rate of the sars virus as it spreads through the population is important. moreover, the genetic mutability of sars-cov, especially in the segments encoding the major antigenic proteins, would also have an effect on development of broadly effective vaccines. we aimed to determine the complete genome sequence for five sars-cov related isolates from a single sars index case, three associated primary contact cases, and one secondary contact and compare them with nine other sars-cov isolates available in public-domain databases. we obtained five positive isolates for coronavirus from the index patient of the outbreak (sin2500), three primary contacts of this patient (sin2774, sin2748, sin2677), and one secondary contact (sin2679) who was related to the index patient, but contracted sars from another primary contact not included in this study. all patients fitted the who case definition for probable sars: 14 fever of 38ºc or higher, respiratory symptoms (eg, cough, shortness of breath, difficulty in breathing), hypoxia and chest radiograph changes suggestive of pneumonia, and history of close contact with another patient with sars or travel to a region with documented community transmission of sars within 10 days of onset of symptoms. the index patient had a history of travel to hong kong and had stayed in hotel m. 15 the viral sources were all from respiratory samples: two endotracheal tube aspirates, two nasopharyngeal aspirates, and one throat swab obtained from the patients between 0 and 11 days after onset of symptoms.
the virus-containing samples were inoculated into various cell lines including vero cells, which showed a cytopathic effect characterised by generalised rounding against a granular background seen 5-11 days after inoculation. we maintained the cells at 33ºc and repassaged them after 7 days of incubation. we spotted washed cell pellets from harvested cells showing cytopathic effects onto glass slides and overlaid them with acute and convalescent serum samples from the patients from whom we obtained the respiratory samples. all five cell lines showed reactivity by immunofluorescence (methods available from authors) with convalescent serum samples from the respective patients. we confirmed the presence of the sars coronavirus by pcr on the cell supernatants using sars-specific primers. 2, 4 when tested, two of the infected vero cell lines also had coronavirus-like particles on electron microscopy. we isolated the viral rna template from the supernatants of vero cells that showed cytopathic effects by centrifugation at 23 000 relative centrifugal force for 2·5 h to pellet the viral particles and extracted rna with the qiaamp viral rna mini kit (qiagen, valencia, ca, usa). to sequence the viral genome we used both shot-gun and specific priming approaches. from the rna templates, we synthesised double-stranded cdna with the superscript cdna system (invitrogen, carlsbad, ca, usa). the cdna was pcr amplified (20 cycles) with random 16-mers using platinum taq polymerase. the amplicons were then cloned into pcr2.1-topo vector (invitrogen). we selected random clones for single pass sequencing analysis on abi3730 sequencers (applied biosystems, foster city, ca, usa).
glossary
blast: the basic local alignment search tool is a system for searching similar sequences against all available sequence databases irrespective of whether the query is dna or protein sequences. the blast programs consist of blastn, blastp, tblast, tblastn, tblastx, psi-blast, megablast, and so on. the most improved point of this software is its high searching speed.
clustalw: clustalw is a multiple sequence alignment tool that is commonly used in the bioinformatics community. it produces global multiple sequence alignments through three major phases: pairwise alignment, guide tree construction, and multiple alignment. the guide tree generated by clustalw is an estimate of relations between sequences much like a phylogenetic tree.
paup*: the phylogenetic analysis using parsimony (paup*) is a well established package for phylogenetic tree construction. it uses various methods, including parsimony, maximum likelihood, and distance methods to estimate phylogenetic relations.
phred/phrap/consed: bioinformatics tools to assemble shotgun sequences, which are available from university of washington. phred=base-calling program; phrap=assembly program for shotgun sequences; consed=unix-based graphic editor for phrap sequence assemblies.
positive-stranded rna viruses: some viruses, such as coronaviruses, carry their genetic material as rna rather than the more typical dna-based genomes. positive-stranded rna (also called plus-stranded) indicates that the single stranded rna genome is of the same sense as coding messenger rna.
we compared sequence data generated from the library with human, mouse, and viral genome databases managed at the us national center for biotechnology accession number 16 we designed sequence-specific primers on the basis of cdna sequence data to fill the gaps of 1-2 kb distance.
after we completed the first rough genome sequence for the sample sin2500, we designed 30 primer pairs incorporating sequence information from the newly available tor2 sars virus 17 to cover the whole viral genome with each primer pair amplifying about 1200 base pairs. we then used this sequence-specific priming approach to generate reverse transcriptase-pcr fragments for the five whole viral genomes. each pcr fragment was directly sequenced from both directions inward and outward, in duplicate. therefore, for any region of the genome there was six to eight-fold coverage. we used the phred/phrap/consed package (university of washington, seattle, wa, usa; http://www.phred.org) to process all the raw sequence reads for base calling, assembly, and editing. nucleotide differences in the assembled genome sequences were also checked manually for accuracy. sequence regions that were poor quality were resequenced either from purified pcr fragments or cloned plasmid. all genetic variations of singapore isolates identified when compared with available sars-cov genome sequences were further confirmed by primer extension genotyping technology (sequenom, san diego, ca, usa). the panel shows the sars-cov isolates we accessed from genbank. we determined the locations of point mutations by aligning the 14 sars sequences with clustalw. 18 we calculated putative coding sequences by completing the multialignment of the 14 sars sequences with the tor2 nucleotide sequence annotation as the reference. the annotations of the proteins are taken from the corresponding entries in the ncbi entrez site. associations between the members of the coronaviridae family to the sars virus were assessed by comparing overlapping fragments of the sin2500 genomic sequence against a database of coronavirus sequences. 
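The comparison of overlapping genomic fragments can be pictured as a simple tiling step, sketched below under the assumption of 200 base pair windows taken every 100 base pairs (the values given in the methods). Only the fragment generation is shown; the database search against coronavirus sequences is not reproduced, and the function name is hypothetical.

```python
def tiled_windows(sequence, window=200, step=100):
    """Yield (start, fragment) pairs for overlapping windows across a sequence.

    start is a 0-based offset. Trailing bases that do not fill a complete
    window are not covered; a sequence shorter than one window is returned
    as a single shorter fragment.
    """
    last_start = max(len(sequence) - window, 0)
    for start in range(0, last_start + 1, step):
        yield start, sequence[start:start + window]
```

Each fragment would then be submitted as a separate query, and the resulting match scores arranged by window start to build a heat map along the genome.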
to calculate regional homologies with sister coronaviruses we used sequences within sliding 200 base pair windows sampled in a tiled fashion every 100 base pairs across the viral genomes using the blastn program from wu-blast (washington university, st louis, mo, usa). 19 in the heat map that was generated, the sars genomic fragments are plotted along the horizontal axis in the order they appear in the genome, and the corresponding fragments from other coronaviruses are placed vertically. the brightness of a pixel corresponds to the strength of the match between a sars fragment and a coronavirus genome; the smaller the p value, the brighter the pixel. at p=1 it is black, and the brightness is proportional to log (1÷p) until p is less than 10^-11, when it is white. the panel shows the accession numbers of the coronavirus sequences used.
in the analysis of the common sequence variations, the probability of co-occurrence of multiple polymorphisms or mutations in an isolate was used as a measure of significance. we used 13 of the samples (we excluded sample bj04 because of substantial missing sequence information) and restricted our attention to 26 140 loci at which nucleotides were determined in all 13 samples. the null hypothesis was that the nucleotides at these loci were obtained by mutating a single consensus sequence independently at random at each position of each sequence. the mutation rate was estimated from the data by the fraction of positions in the various genomes that differed from the consensus sequence obtained by taking the most frequent nucleotide at each position. we also tested a weaker null hypothesis, in which the mutations taking place at any given locus in different samples are independent, but arbitrary dependence between the loci is possible subject to this constraint. details of the analytical approach are described in webappendix 1 (http://image.thelancet.com/extras/03art4454webappendix1.pdf) and on our website (http://www.gis.astar.edu.sg/homepage/toolssup.jsp).
phylogenetic analysis of the sars viral genomes was done with paup* 20 with the maximum probability criterion. we used the default variable settings, with one exception: the substitution rates were estimated from the data. a separate phylogenetic analysis done with clustalw 18 gave the same structure.
the sponsor of the study had no role in study design, data collection, data analysis, data interpretation, or in the writing of the report.
we have sequenced the complete genomes of sars-cov from five singapore isolates derived from one index case (sin2500), three primary cases (sin2677, sin2748, and sin2774), and one secondary case (sin2679). these sequences showed that the genomes of sars-cov isolated in singapore are comprised of 29 711 bases, with the exception of a five-nucleotide deletion in strain sin2748 and a six-nucleotide deletion in sin2677. initial blast analysis suggested that the singapore sars virus is similar to, but distinct from, the group 2 coronaviruses in the coronaviridae family of enveloped and positive-stranded rna viruses. as in the recently sequenced sars-cov (hku39849, cuhk-w1, tor2, and urbani), the singapore sars virus contains 11 predicted open reading frames that encode 23 putative mature proteins with known and unknown functions. most of the non-structural proteins seem to be encoded in the first half of the genome including nsp1 and nsp2, with putative proteinase function and nsp9 rna-dependent rna polymerase, whereas most of the structural proteins such as spike, membrane, envelope, and nucleocapsid are located in the second half of the genome (figure 1, webfigure: http://image.thelancet.com/extras/03art4454webfigure.pdf). the haemagglutinin esterase, which is common in the group 2 coronavirus, is missing in the sars-cov genome, suggesting that some of the non-structural genes are dispensable in coronaviruses.
although the genome organisation of sars-cov is similar to that of other coronaviruses, sars-cov is only distantly related to any coronavirus member, irrespective of species specificity, at both nucleotide and aminoacid levels (figure 2). in assessing the homology with coronaviridae genomes from other species with the sars-cov, sequence conservation seems to be restricted to the middle part of the genome, between bases 14 000 and 21 000, where the rna-dependent polymerase and several uncharacterised proteins (eg, orf1ab:nsp 10-13) are located. the remainder of the genome, especially at the 5′ and 3′ regions, diverges from other strains at the nucleotide and aminoacid level.
we aligned the five complete genome sequences of singapore sars-cov isolates and the nine sars-cov genomes from outside of singapore, which were sequenced by others, and investigated the genetic variations between these 14 genomes (webappendix 1). in total, there were 127 single nucleotide sequence variations, one deletion of six nucleotides (nt 27782-27787) in strain sin2677, and one deletion of five nucleotides (nt 27810-27814) in strain sin2748 (webappendix 2, http://image.thelancet.com/extras/03art4454webappendix2.pdf). both these deletion sites were in the noncoding sequences between an uncharacterised open reading frame and the nucleocapsid protein. of the 127 base substitutions, 94 changed the aminoacid sequence (webappendix 2). the mutations were in the following open reading frames: orf1a polyprotein, orf1a rna-polymerase, orf1ab: nsp10 to nsp 13, spike glycoprotein, membrane, nucleocapsid, and several uncharacterised putative proteins. to eliminate mutational noise induced in culture from real strain differences, we reanalysed the data using a probabilistic approach. mutations that might have been artifacts of cell culture would occur only once in our survey, whereas sequence variants associated with common ancestry should be seen in multiple isolates.
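The idea of separating single-occurrence variants from recurrent ones can be sketched as follows. This is a simplified illustration, not the authors' pipeline: it assumes the genomes are already aligned to equal length, takes the consensus as the most frequent nucleotide in each column, and does not handle gaps or missing data. Function names are hypothetical.

```python
from collections import Counter

def consensus(aligned):
    """Most frequent nucleotide at each column of equal-length aligned sequences."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*aligned))

def recurrent_variant_loci(aligned, min_isolates=2):
    """Map 1-based positions to the number of isolates that differ from the
    consensus, keeping only positions where at least min_isolates differ."""
    cons = consensus(aligned)
    loci = {}
    for pos, col in enumerate(zip(*aligned), start=1):
        differing = sum(1 for base in col if base != cons[pos - 1])
        if differing >= min_isolates:
            loci[pos] = differing
    return loci
```

Variants seen in a single isolate are dropped by the default `min_isolates=2`; raising the value gives a more stringent filter.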
of the 127 sequence variations in the 14 isolates, 16 variant loci were identified in two or more isolates, and eight were seen in three or more isolates (figure 3). with the more stringent criterion, four loci recurred five or more times in the 14 sars-covs analysed: c/t polymorphisms at position 9404 resulting in a valine to alanine change in orf1a (nsp1); position 19 084 leading to an isoleucine to threonine change in orf1ab (nsp11); position 22 222 changing an isoleucine residue to threonine in the s1 portion of the spike protein, and position 27 827 which is in a non-coding region (figure 3). in addition, a t/g polymorphism is noted at position 17 564 changing orf1ab (nsp10, helicase domain) from an aspartic acid to a glutamic acid. sequence variants at these four loci segregate together as a specific genotype. assuming that all base substitutions were random events propagated in the vero cells, the probability of four specific nucleotide changes occurring concurrently is very low. the significance is at p<10^-60 when the null hypothesis is that each locus in each sample mutates independently, and p<10^-15 when dependence is allowed among the loci. the c:g:c:c and t:t:t:t genotypes are, therefore, very unlikely to have emerged by chance, and might be evidence for the first genetic signature of strain differences in the sars virus. all isolates with the t:t:t:t genotype were linked to infection acquired at the hotel m in hong kong, 18 whereas none of the c:g:c:c genotype isolates had this association (figure 3). phylogenetic analysis based on the common variant sequences (defined as present in two or more isolates) confirmed that the cases associated with exposure in the hotel m formed a cluster that was distinct from the other isolates (figure 4). on the basis of the molecular and contact history, we have reconstructed a probable lineage map of the sars-cov infections investigated here (figure 5).
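To give a feel for why co-occurring recurrent variants are so improbable under independent random mutation, the following toy calculation uses a binomial tail. It is only an illustration of the strong null hypothesis; the actual analysis is described in webappendix 1 and is not reproduced here, so this sketch should not be expected to return the reported p values. The function names and the `rate` argument (a per-site, per-sample mutation probability estimated from data) are assumptions.

```python
import math

def binomial_tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def log10_joint_probability(n_samples, k_recurrent, n_loci, rate):
    """log10 of the chance that each of n_loci given loci varies in at least
    k_recurrent of n_samples samples, assuming every locus in every sample
    mutates independently with probability rate."""
    return n_loci * math.log10(binomial_tail(n_samples, k_recurrent, rate))
```

Because the per-locus tail probabilities multiply, even a modest number of loci that each recur in several samples drives the joint probability down by many orders of magnitude when `rate` is small.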
in addition to this four-locus genotype, four other common variant sequences (all occurring three to four times in the 14 isolates) seem to further define subgroups geographically (figure 3). the variant sequences at position 19 084 seem to distinguish the singaporean isolates from all others. although all nine isolates from outside singapore showed a c at this position, four of the five isolates from singapore had a t. the only reversion from t to a c in sin2679 (a secondary contact case) might be the result of a backmutational event potentially occurring during the passage of the virus. taking into account missing data, the polymorphisms at nucleotide positions 9854, 19 838, and 27 243 all segregate with isolates identified specifically in beijing. thus, these common sequence polymorphisms might be useful in identifying the differential source of a sars viral infection. although the genome organisation of sars-cov is similar to that of other coronaviruses, the sars-cov sequence is only distantly related to any coronavirus member. results of earlier reports 4 suggested that sars-cov more closely resembled the cow coronavirus and the mouse hepatitis virus by comparing a conserved 215 aminoacid segment of the polymerase protein product. however, when taking into account the entire genomic sequence, the strength of the associations was reduced, confirming the reports of others that the sars coronavirus is a completely new pathogenic strain that does not arise from a simple recombination of known existing strains. 2, 21 since the s1 subunit of the spike protein is the major antigenic moiety for coronaviruses and is not an essential structural protein, it is prone to high mutation rates as the virus evolves in host populations. that the s1 region did not seem to have excessive numbers of base substitutions suggests that the viral isolates have not been subject to immunological selection. 
22, 23 however, because all samples were from viral cultures propagated in vero cells, some, if not most, of these 129 mutations might have occurred during in-vitro expansion and not because of host pressures. 24 in addition, some of the available nucleotide substitutions might have been the result of sequencing errors since several of the sequences were submitted in draft form. to reduce the effects of these technical artifacts, we restricted our analysis to the 16 loci with recurrent mutations among the 14 isolates. these loci are the sequence variants most likely to have been resident in human populations. of special interest are the nucleotide changes in four of these loci (positions 9404, 19 084, 22 222, and 27 827) that recurred five or more times. the base substitutions at these locations are highly restricted and segregate together as specific genotypes (c:g:c:c vs t:t:t:t). thus, it is highly unlikely that the c:g:c:c and t:t:t:t genotypes emerged by chance. rather, we believe this to be evidence for the first genetic signature of strain differences in the sars virus. all isolates with the t:t:t:t genotype were linked to infection acquired at the hotel m in hong kong, 15 whereas none of the c:g:c:c genotype isolates had this association. the index case from singapore (from which the singaporean infections described herein were derived) acquired the sars-cov infection while staying at hotel m. the tor2 virus, cultured in canada, and the hku39849 isolate from hong kong, were from patients who became infected through contact at hotel m, although perhaps not directly. the urbani sars-cov isolate was from a physician infected by a patient who contracted sars while staying at hotel m. isolates cuhk-w1, gz01, bj01, bj02, bj03, and bj04, however, came from patients with no known linkage with hotel m and, on the whole, were derived later than the hotel m linked set.
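Reading a multi-locus genotype from a genome and mapping it to the two signatures can be sketched as below. The positions are deliberately left as an argument: the text lists several candidate loci, and which four make up the c:g:c:c versus t:t:t:t string, and in what order, is an assumption the caller has to supply, as is the coordinate system of the aligned genome. Function names and labels are hypothetical.

```python
def locus_genotype(genome, positions):
    """Return the nucleotides at the given 1-based positions as a
    colon-separated, upper-case string such as 'T:T:T:T'."""
    return ":".join(genome[p - 1].upper() for p in positions)

def classify_genotype(genotype):
    """Map a four-locus genotype string to the association reported in the text."""
    if genotype == "T:T:T:T":
        return "hotel m linked"
    if genotype == "C:G:C:C":
        return "not hotel m linked"
    return "unclassified"
```

Genotypes that match neither signature are returned as unclassified rather than forced into a group, which matters when sequence is missing at one of the loci.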
our results showed that the cases associated with exposure in the hotel m formed a cluster that was distinct from the other isolates. in addition to this four locus genotype, the variant sequences at position 19 084 distinguished the singaporean isolates from all others. there also seems to be a signature for the north china isolates at positions 9854, 19 838, and 27 243. thus, the common sequence polymorphisms might be useful in identifying the differential source of a sars viral infection. whether any of these common polymorphisms will result in biological and clinical differences remains to be determined. however, the common mutation in position 22 222 changing an isoleucine residue to threonine in the important antigenic region of the spike protein might be relevant. mutations in this region of the sars-cov genome can arise because of selective pressure from host immune responses. that an isoleucine is present in all hotel m linked isolates whereas a threonine at the same position in the major antigenic protein is found in all other geographically distinct isolates suggests that such non-conservative aminoacid changes have occurred to evade immunological pressures. the sars viral epidemic has placed a substantial strain on the health and economic status of nations. understanding the nature of this virus and deriving methods to control the epidemic are very important. our results show several molecular facets of the sars coronavirus pertinent to public-health management of this epidemic. its novelty as a human pathogen suggests that most populations might be immunologically naive to its infection. the discovery of genotypes linked to geographic and temporal clusters of infectious contacts suggests that molecular signatures can be used to refine contact histories. a e ling organised and directed the in-vitro investigations of the biology of the sars virus. 
k p chan provided viral cultures, and l l oon and s y sethoe did the molecular diagnostic analysis and extraction of the viral nucleic acids. n m lee did the electron microscopy of the initial viral samples. y s leo was the chief infectious disease clinician caring for the patients. sequence determination and analysis was done by y ruan, c wei, h thoreau, p ng, k chiu, l lim, t zhang, e e c ren, l stanton, p long, and e t liu. cumulative number of reported probable cases of severe acute respiratory syndrome (sars) identification of a novel coronavirus in patients with severe acute respiratory syndrome a novel coronavirus associated with severe acute respiratory syndrome coronavirus as a possible cause of severe acute respiratory syndrome retargeting of coronavirus by substitution of the spike glycoprotein ectodomain: crossing the host cell species barrier targeted recombination demonstrates that the spike gene of transmissible gastroenteritis coronavirus is a determinant of its enteric tropism and virulence pathogenesis of chimeric mhv4/mhv-a59 recombinant viruses: the murine coronavirus spike protein is a major determinant of neurovirulence the jhm strain of mouse hepatitis virus induces a spike protein-specific db-restricted cytotoxic t cell response expression of immunogenic glycoprotein s polypeptides from transmissible gastroenteritis coronavirus in transgenic plants production and immunogenicity of multiple antigenic peptide (map) constructs derived from the s1 glycoprotein of infectious bronchitis virus (ibv) recombinant nucleocapsid protein is potentially an inexpensive, effective serodiagnostic reagent for ibv an adenovirus recombinant expression the spike glycoprotein of porcine respiratory coronavirus is immunogenic in swine nucleotide sequence comparison of the membrane protein genes of three enterotropic strains of mouse hepatitis virus severe acute respiratory syndrome (sars) update: outbreak of severe acute respiratory syndrome, worldwide 
clustal-w-improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice phylogenetic analysis using parsimony (and other methods). version 4 the genome sequence of the sars-associated coronavirus generation of coronavirus spike deletion variants by highfrequency recombination at regions of predicted rna secondary structure origin and evolution of georgia 98 (ga98), a new serotype of avian infectious bronchitis virus effects of passage history and sampling bias on phylogenetic reconstruction of human influenza a evolution we thank prasanna kolatkar, marie wong, liu jianjun, and joshy george for their help in this project. the authors are members of the gis and are therefore funded by the agency for science technology and research of singapore. none declared. key: cord-287658-c2lljdi7 authors: lopez-rincon, alejandro; tonda, alberto; mendoza-maldonado, lucero; mulders, daphne g.j.c.; molenkamp, richard; perez-romero, carmina a.; claassen, eric; garssen, johan; kraneveld, aletta d. title: classification and specific primer design for accurate detection of sars-cov-2 using deep learning date: 2020-09-10 journal: biorxiv doi: 10.1101/2020.03.13.990242 sha: doc_id: 287658 cord_uid: c2lljdi7 in this paper, deep learning is coupled with explainable artificial intelligence techniques for the discovery of representative genomic sequences in sars-cov-2. a convolutional neural network classifier is first trained on 553 sequences from available repositories, separating the genome of different virus strains from the coronavirus family with considerable accuracy. the network’s behavior is then analyzed, to discover sequences used by the model to identify sars-cov-2, ultimately uncovering sequences exclusive to it. the discovered sequences are first validated on samples from other repositories, and proven able to separate sars-cov-2 from different virus strains with near-perfect accuracy. 
next, one of the sequences is selected to generate a primer set, and tested against other state-of-the-art primer sets on existing datasets, obtaining competitive results. finally, the primer is synthesized and tested on patient samples (n=6 previously tested positive), delivering a sensitivity similar to routine diagnostic methods, and 100% specificity. in this paper, deep learning is coupled with explainable artificial intelligence techniques for the discovery of representative genomic sequences in sars-cov-2. a convolutional neural network classifier is first trained on 553 sequences from ngdc, separating the genome of different virus strains from the coronavirus family with accuracy 98.73%. the network’s behavior is then analyzed, to discover sequences used by the model to identify sars-cov-2, ultimately uncovering sequences exclusive to it. the discovered sequences are validated on samples from ncbi and gisaid, and proven able to separate sars-cov-2 from different virus strains with near-perfect accuracy. next, one of the sequences is selected to generate a primer set, and tested against other state-of-the-art primer sets, obtaining competitive results. finally, the primer is synthesized and tested on patient samples (n=6 previously tested positive), delivering a sensitivity similar to routine diagnostic methods, and 100% specificity. the proposed methodology has a substantial added value over existing methods, as it is able to both identify promising primer sets for a virus from a limited amount of data, and deliver effective results in a minimal amount of time. considering the possibility of future pandemics, these characteristics are invaluable to promptly create specific detection methods for diagnostics. the coronaviridae family presents a positive-sense, single-stranded rna genome. these viruses have been identified in avian and mammal hosts, including humans. 
coronaviruses have genomes from 26.4 kilo base-pairs (kbps) to 31.7 kbps, with g + c contents varying from 32% to 43%; human-infecting coronaviruses belonging to this family include sars-cov, mers-cov, hcov-oc43, hcov-229e, hcov-nl63 and hcov-hku1 1 . in december 2019, sars-cov-2, a novel, human-infecting coronavirus, was identified in wuhan, china, using next generation sequencing (ngs) 2 . as of 12 august 2020, sars-cov-2 had caused 20,162,474 confirmed cases across almost all countries, with 3,641,603 cases in the european region 3 . in addition, sars-cov-2 has an estimated mortality rate of 3-4%, and it is spreading faster than sars-cov and mers-cov 4 . as is typical of rna viruses, new mutations appear in every replication cycle of a coronavirus, and its average evolutionary rate is roughly 10^-4 nucleotide substitutions per site per year 2 . in the specific case of sars-cov-2, rt-qpcr testing using primers in the orf1ab and n genes has been used to identify the infection in humans 5 . this method has come into question; yang et al., in a study of 866 respiratory specimens, showed that for 0-7 days after onset of illness, the sputum samples had a negative rate of 11.1% in severe and 17.8% in mild cases, followed by 26.7% and 27.0% in nasal swabs and finally 40% and 38.7% for throat swabs 6 . zhao et al. report that 35.2% of 173 patients did not test positive by rt-pcr 7 , which has been further explored by arevalo et al. 8 and woloshin et al. 9 . these problems could be the result of the variation of viral rna sequences within virus species, and the viral load in different anatomic sites 10 . it has been noted that the population mutation frequency of site 8,872, located in the orf1ab gene, and site 28,144, located in the orf8 gene, gradually increased from 0 to 29% as the epidemic progressed 11 . 
apart from the false negative test problems, sars-cov-2 assays can yield a small portion of false positives through nonspecific detection of other coronaviruses, as the virus is closely related to other coronavirus organisms 12 . in addition, sars-cov-2 may be present with other respiratory infections, hindering its identification 13, 14 . thus, it is fundamental to improve existing diagnostic tools to contain the spread. for example, diagnostic tools combining computed tomography (ct) scans with deep learning have been proposed, achieving an improved detection accuracy of 82.9% 15 . another solution being used for studying sars-cov-2 is sequencing of the viral complementary dna (cdna). for example, sequencing data from cdna obtained by pcr of the original viral rna (e.g. real-time pcr amplicons) can be used to identify sars-cov-2 16 . classification using viral sequencing techniques is mainly based on alignment methods such as fasta 17 and blast 18 . these methods rely on the assumption that cdna sequences share common features, and that their order prevails among different sequences 19, 20 . however, these methods have the drawback of requiring base sequences for the detection 21 . nevertheless, it is necessary to develop innovative, improved diagnostic tools that target the genome to improve the identification of pathogenic variants, as sometimes several tests are needed to reach an accurate diagnosis. therefore, as an alternative, deep learning methods have been suggested for classification of dna sequences. the advantage of these methods is that they do not need pre-selected features to identify or classify dna sequences. deep learning has been efficiently used for classification of dna sequences, using one-hot label encoding and convolutional neural networks (cnn) 22, 23 , although the examples in the literature feature dna sequences of length up to 500 bps only. 
in particular, for the case of viruses, ngs genomic samples might not be identified by blast, because viruses have a high mutation frequency and no reference sequences are valid for all genomes 24 . alternative solutions based on deep learning have been proposed to classify viruses, by dividing sequences into pieces of fixed length, ranging from 300 bps 24 to 3,000 bps 25 . however, this approach has the negative effect of potentially ignoring part of the information contained in the input sequence, which is disregarded if it cannot completely fill a piece of fixed size. the global impact of sars-cov-2 prompted researchers to apply effective alignment-free methods to the classification of the virus: for example, in 26 the authors propose the use of machine learning with digital signal processing for separating the virus from similar strains, with remarkable accuracy. nevertheless, no human-readable information can be extracted from their black-box procedure, so the biological insight provided by their approach is limited. given the impact of the world-wide outbreak, international efforts have been made to simplify the access to viral genomic data and metadata through international repositories, such as the national genomics data center (ngdc) repository 11 , the national center for biotechnology information (ncbi) repository 27 and the global initiative on sharing all influenza data (gisaid) repository 28 , in the expectation that easier access to information would make it possible to develop medical countermeasures to control the disease worldwide, as happened in similar cases earlier [29] [30] [31] . thus, taking advantage of the information available from international resources without any political and/or economic borders, we propose an innovative system based on viral gene sequencing. 
using a cnn to separate coronaviruses belonging to different strains 32 , including sars-cov-2, we apply techniques inspired by explainable ai in computer vision to discover representative cdna sequences that the network uses to classify sars-cov-2. we then validate the discovered sequences on datasets not used during the training of the cnn, and show how to exploit them to create a novel, highly informative set of sequence features (e.g. viral sequences). such sequences can be later inspected and analyzed by human experts. experimental results show that the new set of sequence features allows traditional, simple classifiers to correctly identify sars-cov-2 with remarkable accuracy (> 99%). a few of the discovered sequences also possess the correct characteristics for potentially becoming primers, as just checking for their presence in samples is enough to specifically identify sars-cov-2 (fig. 1) . laboratory testing on the most promising sequences identified showed that the primers found by our approach can be a viable alternative to the commonly adopted primers at the time of writing. figure 1 . overall procedure to find the specific sars-cov-2 21-bps rna sequences to create a primer set. summarizing the results of experiments 1-4 (fig. 3) , we discovered 12 meaningful 21-bps sequences that best characterize sars-cov-2. for all the analyzed data, these sequences appear only in sars-cov-2 samples and not in any other viruses, as summarized in table 1 . remarkably, our results outperform earlier publications using machine learning for identifying sars-cov-2 (see for example 26 ) , with the added benefit of producing human-readable results instead of a plain black-box classifier. we calculated the frequency of appearance of the sequences of different primer sets used in sars-cov-2 rt-pcr tests developed by who referral laboratories and compared it to that of our primer design in the dataset from the gisaid repository (table 2) . 
all of the sequences have a frequency of appearance of > 99%, with the exception of china-cdc-n-f at 68.52%. this is consistent with the percentage of genomes with mutation in the primer region in the latest gisaid update summary of august 11th 28 . in the in silico analysis of specificity, we compared the sequences of all the primer sets with the ncbi-b and ngdc datasets. the results show that hku-n-f, hku-n-r, charite-e-f, charite-e-r and us-cdc-n2-f are not specific to sars-cov-2, as they detect sars-cov-1 too. the rest of the sequences, including our design, only appear in sars-cov-2. thus, in summary, of 8 different primer sets, 3 are not specific to sars-cov-2, and of the remaining 5, considering frequency of appearance only, our design is the 3rd best option, calculated with the lower limit. to validate the data obtained in silico by laboratory methods, a conventional pcr was performed on cdna obtained from rna from sars-cov-2 and other human coronaviruses. in addition, rnas from nasopharyngeal swabs from six patients previously diagnosed with sars-cov-2 infection and four patients negative for sars-cov-2 by routine diagnostic method 5 were analyzed with the same conventional pcr (fig. 2) . different dilutions of sars-cov-2 rna were detected with similar sensitivity compared to the diagnostic reference assay (fig. 2, lanes 1-8) . our candidate primer set exclusively detected sars-cov-2 and did not amplify rna from other human coronaviruses (figure 9, lanes 9-14) . the candidate primer set was able to detect sars-cov-2 rna from patient samples previously found positive for sars-cov-2, but not in patients previously found negative (fig. 2, lanes 15-24) . although further validation will be required to develop this candidate primer set into a diagnostic assay, our results clearly demonstrate the power of our method to select potential sequences for further validation. 
being able to reliably identify sars-cov-2 and distinguish it from other similar pathogens is important to contain its spread. the time needed to process samples and the availability of reliable diagnostic tests are challenges during an outbreak. developing innovative diagnostic tools that target the genome to improve the identification of pathogens can help reduce health costs and the time to identify the infection, instead of using unsuitable treatments or testing. moreover, it is necessary to perform an accurate classification to identify the different species of coronavirus, the genetic variants that could appear in the future, and the co-infections with other pathogens. given the high transmissibility of sars-cov-2, the proper diagnosis of the disease is urgent, to stop the virus from spreading further. considering the false negatives given by the standard rt-qpcr detection, better implementations such as using deep learning are necessary in order to properly detect the virus. while the accuracy of current rt-qpcr testing is around 70%, and ct scans with deep learning reach 83%, we believe that the use of the sequences detected by a cnn-based methodology has the potential to improve the accuracy of the diagnosis. our results show that by targeting one out of the 12 selected 21-bps specific sequences, we are able to distinguish sars-cov-2 from any other virus (> 99%). further testing is necessary to confirm these promising results, so it is essential to create multidisciplinary groups that work to stop the outbreak. finally, as an interesting remark, by comparing the discovered sequences against other hosts, we noticed that of the 12 sequences exclusive to sars-cov-2, one appears in 13 of 17 samples from manis javanica. in contrast, 5 of the sequences of sars-cov-2 appear in the only sample available from rhinolophus affinis and 11 out of 12 in 2 canine samples (table 1) . this is consistent with the findings of zhang et al. 
33, 34 , and could point to the zoonotic origin of the virus. nevertheless, more data is necessary. as a result of high-density populations and ever-growing interaction between people, it is possible that other pandemics may occur. we believe that our methodology has a substantial added value over traditional methods, because it is a fast method and only a limited set of viral sequencing data is needed. moreover, this procedure led to a primer set with a very high specificity for sars-cov-2 and at least the same accuracy as the best primer sets in the world developed by who referral laboratories. thus, thinking forward, our methodology can be applied in future viral pandemics to speed up the development of accurate detection methods for diagnosis and thereby contribute to limiting the spread of a virus. the cnn used during all the experiments is composed of one convolutional layer with 12 different filters or weights (each with window size 21) with maxpooling (with pool size and stride 148), a fully connected layer (196 rectified linear units with dropout probability 0.5), and a final softmax layer with 5 units, to differentiate the different classes of coronavirus strains. the optimizer used is adaptive moment estimation (adam) 35 , with learning rate 10^-5 and a batch size of 50 samples, run for 1,000 epochs 32 . the convolutional layer of the network, in simple terms, analyzes subsequences of 21 base pairs that can appear at different points of the virus genome. we selected 21 because primers designed for rt-pcr tests normally have a length of 18-22 bps. the pool size of the maxpooling represents the interval in which a specific 21-bps sequence can be recognized (in this case, 148 positions). through the training process, the convolutional layer is de facto learning new features to characterize the problem, directly from the data. in this specific case, the new features are 21-bps sequences that can more easily separate different virus strains. 
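the convolution and max pooling steps described above can be sketched in plain python. this is only an illustrative sketch under stated assumptions, not the authors' implementation: the function names, the random toy genome and the random filter weights are hypothetical, and a sigmoid activation is assumed only because the text later states that filter outputs lie in (0, 1). the window size (21) and pool size (148) are taken from the text.

```python
import math
import random

BASES = "ACGT"
WINDOW = 21   # filter width in base pairs (from the text)
POOL = 148    # max-pooling size and stride (from the text)

def one_hot(seq):
    """Encode a sequence as rows of 4 values; unknown symbols (e.g. N) become all zeros."""
    return [[1.0 if symbol == base else 0.0 for base in BASES] for symbol in seq.upper()]

def conv_filter(encoded, weights, bias=0.0):
    """Slide one WINDOW x 4 filter over the sequence; sigmoid activation is an assumption."""
    activations = []
    for start in range(len(encoded) - WINDOW + 1):
        z = bias
        for k in range(WINDOW):
            z += sum(w * x for w, x in zip(weights[k], encoded[start + k]))
        activations.append(1.0 / (1.0 + math.exp(-z)))
    return activations

def max_pool(activations, pool=POOL):
    """Return (max value, position of the max) for each non-overlapping block."""
    pooled = []
    for start in range(0, len(activations), pool):
        block = activations[start:start + pool]
        offset = max(range(len(block)), key=block.__getitem__)
        pooled.append((block[offset], start + offset))
    return pooled

# toy data: a random 1,000-base "genome" and one random (untrained) filter
random.seed(0)
toy_genome = "".join(random.choice(BASES) for _ in range(1000))
toy_filter = [[random.uniform(-1, 1) for _ in BASES] for _ in range(WINDOW)]
activations = conv_filter(one_hot(toy_genome), toy_filter)
for value, position in max_pool(activations):
    # the 21-bps subsequence that produced the highest response in this block
    print(position, toy_genome[position:position + WINDOW], round(value, 3))
```

keeping the position of each maximum is what makes it possible to read back the 21-bps subsequence behind every pooled value, which is the human-readable part of the procedure.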
by analyzing the result of each filter in a convolutional layer, and how its output interacts with the corresponding max pooling, it is possible to detect human-readable sequences of base pairs that might provide domain experts with relevant information. it is important to notice that these sequences are not bound to specific locations of the genome; thanks to its structure, the cnn is able to detect them and recognize their importance even if their position is displaced in different samples. we downloaded 583 sequences (*.fasta files) from the ngdc on march 15th, 2020 (table 3) . we left out 30 sars-cov-2 sequences and then divided the rest of the data into 80% training, 10% validation and 10% testing. the trained cnn described above obtained a mean accuracy of 98.73% in a 10-fold cross-validation. once the network is trained, in a first analysis, we plot the inputs and outputs of the convolutional layer, to visually inspect for patterns. as an example, in fig. 4a we report the visualization of the first 1,250 bps of each of the 553 samples from the ngdc repository 11 . each filter slides a 21-bps window over the input, and for each step produces a single value. the output of a filter is thus a sequence of values in (0, 1). the output of the max pooling for each of the 12 filters is then further inspected for patterns. it is noticeable how samples belonging to different classes can already be visually distinguished. at this step, we identify filter 0 as the most promising, as it seems to focus on a few relevant points in the genome that could correspond to meaningful cdna sequences. given this data, it is now possible to identify the 21-bps sequences that obtained the highest output values in the max pooling layer of filter 0, in a section of 148 positions. 
this process results in 210 (31,029 divided by 148, rounded up) max pooling features, each one identifying the 21-bps sequence that obtained the highest value from the convolutional filter, in a specific 148-position interval of the original genome: the first max pooling feature will cover positions 1-148, the second will cover positions 149-296, and so on. we graph the whole set of max pooling features for the complete data (4,410 values, 210*21) in fig. 4b . the cnn architecture, and the visualization of the filter and max pooling, are available in the supplementary material section 1. analyzing the different sequence values appearing in the max pooling feature space, we find a total of 3,827 unique 21-bps cdna sequences that can potentially be very informative for identifying different virus strains. for example, sequence agg taa caa acc aac caa ctt is only found inside the class of sars-cov-2, in 59 out of 66 available samples. sequence cac gag taa ctc gtc tat ctt is present again only in sars-cov-2, in 63 out of the 66 samples. the combination of the convolutional and max pooling layer allows the cnn to identify sequences even if they are slightly displaced in the genome (by up to 148 positions). thus, we create a table of feature appearance for each of the sequences selected in the previous step. this results in a single set of features to differentiate sars-cov-2 from other viruses. the experiments presented in the following subsections to validate our method have different objectives and make use of different datasets. a summary of all the experiments and datasets used is shown in figure 3 . table 3 . organism, assigned label, and number of samples in the unique sequences for the ngdc repository (left) and query: gene="orf1ab" and host="homo sapiens" and "complete genome" in the ncbi repository (right). we use the ncbi organism naming convention 36 . we downloaded the dataset from the ngdc repository 11 on march 15th 2020. 
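the arithmetic behind the max pooling features described above (210 features, intervals 1-148, 149-296, and so on) can be checked with a short sketch. the helper names below are hypothetical and the rounding up of 31,029 / 148 is an assumption inferred from the reported value of 210; only the constants come from the text.

```python
import math

GENOME_LEN = 31029  # input length used in the text
POOL = 148          # max-pooling size and stride
WINDOW = 21         # length of each candidate sequence

def n_pool_features(length=GENOME_LEN, pool=POOL):
    """Number of pooling intervals needed to cover the input (last one may be partial)."""
    return math.ceil(length / pool)

def feature_interval(index, pool=POOL):
    """1-based inclusive genome positions covered by pooling feature `index` (0-based)."""
    return index * pool + 1, (index + 1) * pool

def kmer_at(seq, position, window=WINDOW):
    """Subsequence starting at a 0-based position, or None if it runs past the end."""
    kmer = seq[position:position + window]
    return kmer if len(kmer) == window else None

print(n_pool_features())           # 210
print(feature_interval(0))         # (1, 148)
print(feature_interval(1))         # (149, 296)
print(n_pool_features() * WINDOW)  # 4410
```

the last value matches the 4,410 (210*21) values plotted per sample for one filter.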
we removed repeated sequences and applied the procedure to translate the data into the sequence feature space. this leaves us with a frequency table of 3,827 features (21-bps sequences) with 583 samples (table 3 (left) ). next, we ran a state-of-the-art feature selection algorithm 37, 38 , to reduce the sequences needed to identify different virus strains to the bare minimum. remarkably, we are then able to correctly differentiate all the coronavirus (mers-cov, sars-cov-2, sars-cov-1, etc.) samples using only 53 of the original 3,827 sequences, obtaining a 100% accuracy in a 10-fold cross-validation with a simpler and more traditional classifier, such as logistic regression. the list of the 53 features is available in the supplementary material section 2. we downloaded data from ncbi 27 on march 15th 2020, with the following query: gene="orf1ab" and host="homo sapiens" and "complete genome". the query resulted in 407 non-repeated sequences (table 3 (right)). we call this dataset ncbi-a, where 68 sequences belong to sars-cov-2. then, we applied the procedure to translate the data into the set of sequence features, and we ran the same state-of-the-art feature selection algorithm 37 . the result is a list of 10 different sequences (table 4) , for which just checking for their presence is enough to differentiate between sars-cov-2 and other viruses in the dataset, with a 100% accuracy. each of the sequences, in fact, only appears in sars-cov-2 samples. , for a total of more than 900 viruses. then, we applied the procedure to translate the data into the sequence feature space and ran the feature reduction algorithm 37 . this results in 2 extra sequences of 21 bps: just by checking for their presence, we are able to separate sars-cov-2 from the rest of the samples with a 100% accuracy. the sequences are: aat aga aga att att cta ttc and cga taa caa ctt ctg tgg ccc. 
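the presence check described above can be illustrated with a minimal sketch. the two marker sequences are the ones listed in the text; everything else (the function names, the exact-substring matching rule, and the random toy genomes standing in for real samples) is a hypothetical simplification and not the authors' pipeline.

```python
import random

# the two 21-bps sequences reported in the text, with spacing removed
MARKERS = [s.replace(" ", "").upper() for s in (
    "aat aga aga att att cta ttc",
    "cga taa caa ctt ctg tgg ccc",
)]

def has_marker(genome, markers=MARKERS):
    """True if any marker occurs as an exact substring of the genome."""
    genome = genome.upper()
    return any(marker in genome for marker in markers)

def accuracy(samples):
    """samples: list of (genome, is_target) pairs; fraction classified correctly."""
    correct = sum(has_marker(genome) == label for genome, label in samples)
    return correct / len(samples)

def random_dna(n):
    return "".join(random.choice("ACGT") for _ in range(n))

# toy data only: random backgrounds, with a marker inserted into the "positive" samples
random.seed(1)
toy_samples = [
    (random_dna(300) + MARKERS[0] + random_dna(300), True),
    (random_dna(300) + MARKERS[1] + random_dna(300), True),
    (random_dna(621), False),
    (random_dna(621), False),
]
print(accuracy(toy_samples))
```

an exact-match rule of this kind is easy to audit by hand, which is the practical benefit of reducing the classifier to a handful of sequences.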
from the gisaid repository 28 , we downloaded the 53,183 sars-cov-2 sequences from different countries available on august 10th, of which 52,645 have < 1% ns, high coverage and host="homo sapiens". then, we calculated the frequency table of the 21-bps sequences obtained from experiments 2 and 3, to verify which sequences remain and could be used for detection. the appearance frequency of the target sequences among the samples in the gisaid dataset is reported in table 1 , second column. in addition, we downloaded 26 sequences from the gisaid repository from other hosts (manis javanica, rhinolophus affinis, canine and felis catus) to compare them against the sequences from experiments 2 and 3. experiment 5: design of the candidate primer set. after the analysis carried out on the deep learning model, we ran an analysis with primer3plus 39 , to see which of the sequences could be used as a forward primer, using sample ncbi nc045512.2 as the reference sars-cov-2 sequence. we uncover the sequence tag cac tct cca agg gtg ttc, which shows a frequency of appearance of 99.57% in viral genomes available from different countries in gisaid 28 and 100.0% in the ncbi 27 datasets. using the reference sars-cov-2 sequence, we identify that this discovered sequence is located between nucleotides 25,604 and 25,624 in the orf3a gene. in sars-cov, this gene encodes a protein of 274 aa that is related to necrotic cell death 40 , chemokine production like interleukin 8 (il-8) and rantes/ccl5, nfκb activation resulting in an inflammatory response 41 and may play an important role in the virus life cycle 42 . we design a specific primer set for detection of sars-cov-2 using primer3plus 39 . we use tag cac tct cca agg gtg ttc as forward primer and gca aag cca aag cct cat ta as reverse primer, obtaining an amplicon size of 179 bps. 
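the two in silico quantities used above, frequency of appearance and amplicon size, can be sketched as follows. this is a simplified illustration and not a substitute for primer3plus or fastpcr: it assumes exact matching with no mismatches, ignores melting temperatures, and uses a synthetic template built so that the primer sites are 179 bps apart. only the two primer sequences come from the text; all function names are hypothetical.

```python
import random

FORWARD = "TAGCACTCTCCAAGGGTGTTC"  # forward primer from the text (21 nt)
REVERSE = "GCAAAGCCAAAGCCTCATTA"   # reverse primer from the text (20 nt)

_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    return seq.upper().translate(_COMPLEMENT)[::-1]

def appearance_frequency(genomes, seq):
    """Percentage of genomes containing `seq` as an exact substring."""
    hits = sum(seq in genome.upper() for genome in genomes)
    return 100.0 * hits / len(genomes)

def amplicon_size(template, forward=FORWARD, reverse=REVERSE):
    """Length from the start of the forward primer to the end of the reverse primer site.

    The reverse primer anneals to the opposite strand, so its site on the
    template strand is its reverse complement. Returns None if either site is missing.
    """
    template = template.upper()
    start = template.find(forward)
    if start == -1:
        return None
    site = reverse_complement(reverse)
    site_start = template.find(site, start + len(forward))
    if site_start == -1:
        return None
    return site_start + len(site) - start

def random_dna(n):
    return "".join(random.choice("ACGT") for _ in range(n))

# synthetic template: 21 + 138 + 20 = 179 bps between the outer primer ends
random.seed(2)
toy_template = (random_dna(50) + FORWARD + random_dna(138)
                + reverse_complement(REVERSE) + random_dna(50))
print(amplicon_size(toy_template))
print(appearance_frequency([toy_template, random_dna(300)], FORWARD))
```

on a real reference genome the same search would be run against the downloaded sequence instead of the synthetic template.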
then, we ran an in-silico pcr test using fastpcr 6.7 43 with default parameters, with nc045512.2 used as a reference sars-cov-2 sequence. this yields t m = 56.2 °c for the forward primer, t m = 53.1 °c for the reverse primer and ta = 58 °c. in addition, we calculated the frequency of appearance of the sequences of different primer sets used in sars-cov-2 rt-qpcr tests developed by who referral laboratories and compared it to that of our primer design sequences in 52,645 sequences from the gisaid repository and the 583 samples of different coronaviruses from the ngdc dataset from experiment 1. the primer sets used were developed by the university of hong kong (hku-n); charite, berlin, germany (charite-e); us-cdc, united states (us-cdc-n1, us-cdc-n2, us-cdc-n3) and china cdc, china (china-cdc-orf1ab, china-cdc-n) (table 9) . we selected these primers as they are the most commonly used, as stated in the gisaid status update of august 11, 2020. we do not consider degenerate primer sets. coronavirus genomics and bioinformatics analysis. 
viruses genomic characterisation and epidemiology of 2019 novel coronavirus: implications for virus origins and receptor binding who report coronavirus disease 2019 (covid-19) (world health organization combination of rt-qpcr testing and clinical features for diagnosis of covid-19 facilitates management of sars-cov-2 outbreak detection of 2019 novel coronavirus (2019-ncov) by real-time rt-pcr evaluating the accuracy of different respiratory specimens in the laboratory diagnosis and monitoring the viral shedding of 2019-ncov infections antibody responses to sars-cov-2 in patients of novel coronavirus disease 2019 false-negative results of initial rt-pcr assays for covid-19: a systematic review false negative tests for sars-cov-2 infection-challenges and implications next generation sequencing of viral rna genomes correlation of chest ct and rt-pcr testing in coronavirus disease 2019 (covid-19) in china: a report of 1014 cases co-infections in people with covid-19: a systematic review and meta-analysis clinical diagnosis of 8274 samples with 2019-novel coronavirus in wuhan a deep learning algorithm using ct images to screen for corona virus disease (covid-19). 
medrxiv the first case of 2019 novel coronavirus pneumonia imported into korea from wuhan, china: implication for infection prevention and control measures rapid and sensitive sequence comparison with fastp and fasta basic local alignment search tool applications of alignment-free methods in epigenomics alignment-free sequence comparison-a review phylogenetically diverse tt virus viremia among pregnant women dna sequence classification by convolutional neural network a deep learning approach to dna sequence classification viraminer: deep learning on raw dna sequences for identifying viral genomes in human samples identifying viruses from metagenomic data by deep learning machine learning using intrinsic genomic signatures for rapid classification of novel pathogens: covid-19 case study dbsnp: the ncbi database of genetic variation global initiative on sharing all influenza data-from vision to reality how ownership rights over microorganisms affect infectious disease control and innovation: a root-cause analysis of barriers to data sharing as experienced by key stakeholders managing severe acute respiratory syndrome (sars) intellectual property rights: the possible role of patent pooling threats to timely sharing of pathogen sequence data accurate identification of sars-cov-2 from viral genome sequences using deep learning a genomic perspective on the origin and emergence of sars-cov-2 extreme genomic cpg deficiency in sars-cov-2 and evasion of host antiviral defense a method for stochastic optimization genbank: the nucleotide sequence database. 
the ncbi handb automatic discovery of 100-mirna signature for cancer classification using ensemble feature selection machine learning-based ensemble recursive feature selection of circulating mirnas for cancer tumor classification primer3plus, an enhanced web interface to primer3 sars-coronavirus open reading frame-8b triggers intracellular stress pathways and activates nlrp3 inflammasomes augmentation of chemokine production by severe acute respiratory syndrome coronavirus 3a/x1 and 7a/x4 proteins through nf-κb activation severe acute respiratory syndrome coronavirus orf3a protein interacts with caveolin fastpcr software for pcr primer and probe design and repeat search viral rna was isolated from cell-cultured sars-cov-2, sars-1, mers-cov, hcov-nl63, hcov-oc43, hcov-229e, and from nasopharyngeal swabs from n = 10 patients by magna pure lc (roche diagnostics, the netherlands) using the total nucleic acid isolation kit. the rna was converted into cdna using superscriptiii (thermo-fisher scientific, usa) and random hexamers. subsequently, conventional pcr was performed on the cdna using hotstar taq dna polymerase (qiagen, the netherlands) with 400 nm forward primer (5'-ag cac tct cca agg gtg ttc-3') and 400 nm reverse primer (5'-gca aag cca aag cct cat ta-3') and the following cycling conditions: 15 min at 95 °c, followed by 40 cycles of 1 min at 95 °c, 1 min. at 5 • cc and 1 min at 72 °c. the pcr products were visualized by electrophoresis. the same rna was used in a diagnostics reference assay by corman et al. 5 and the cycle threshold values from this reference assay were used for estimating sensitivity. the study was approved by the medical ethical commission of the erasmus mc (mec-2015-306). lmm and cap performed the biological analysis and primer design. alr and at performed the programming, data collection and in silico experiments. dm and rm performed the pcr validation. ec, adk and jg designed the experiments and the study. all the authors contributed to the writing. 
key: cord-254942-g51mjj2b authors: touati, rabeb; tajouri, asma; mesaoudi, imen; oueslati, afef elloumi; lachiri, zied; kharrat, maher title: new methodology for repetitive sequences identification in human x and y chromosomes date: 2020-10-19 journal: biomed signal process control doi: 10.1016/j.bspc.2020.102207 sha: doc_id: 254942 cord_uid: g51mjj2b repetitive dna sequences occupy the major proportion of dna in the human genome and even in the other species’ genomes. the importance of each repetitive dna type depends on many factors: structural and functional roles, positions, lengths and numbers of these repetitions are clear examples. conserving such dna sequences or not in different locations in the chromosome remains a challenge for researchers in biology. detecting their location despite their great variability and finding novel repetitive sequences remains a challenging task. to side-step this problem, we developed a new method based on signal and image processing tools. in fact, using this method we could find repetitive patterns in dna images regardless of the repetition length. this new technique seems to be more efficient in detecting new repetitive sequences than bioinformatics tools. in fact, the classical tools present limited performances especially in case of mutations (insertion or deletion). however, modifying one or a few numbers of pixels in the image doesn’t affect the global form of the repetitive pattern. as a consequence, we generated a new repetitive patterns database which contains tandem and dispersed repeated sequences. the highly repetitive sequences, we have identified in x and y chromosomes, are shown to be located in other human chromosomes or in other genomes. the data we have generated is then taken as input to a convolutional neural network classifier in order to classify them. the system we have constructed is efficient and gives an average of 94.4% as recognition score. 
repetitive dnas are sequences with multiple copies in the genome. they are rarely associated with clearly defined biological functions. some of the moderately repetitive sequences may be involved in gene expression regulation. other mobile dna can be constituted by transposable genetic elements (tes) that are involved in the genome evolution process. the transposition mechanism and the structure of these tes are the keys to dividing this dna into classes. retrotransposons are an example of a te class that moves via an rna intermediate. this rna is transcribed from the dna and subsequently copied back into dna. repetitive dna can take the form of tandem repeats or scattered repeated sequences. these repetitive dna sequences can be classified into two types: highly repetitive or moderately repetitive sequences [1, 2] . the major repetitive sequences in all eukaryotic cells are classified into five types according to the sequence's length. in this classification, the microsatellite sequences (short tandem repeat: str) are the smallest. they are characterized by a periodicity between 2 and 4 nucleotides per unit. the second class is constituted by the minisatellites, with a length varying between 10 and 60 base pairs (bp). the third class is composed of the satellites which can contain up to 100 nucleotides (100-200 base pairs) [3] [4] [5] . the retrotransposons like sine and line are part of the fourth class, which is characterized by a length varying from 50 bp to 6 kb. the final class consists of the ribosomal rna gene repeat (rdna), which is the longest, with a length between 9 and 45 kb. in the human genome, rare fragile sites are chromosomal dna regions especially characterized by repetitive sequences. in fact, in these regions, dna damage occurs more frequently than in other locations. due to chromosome structure, the common fragile sites can be sensitive to replication stress, and they are often rearranged in cancer. 
in the mammalian centromeres and telomeres, the presence of repetitive sequences is necessary in order to protect chromosomes from damage. for example, alphoid dna is a kind of dna satellite having a length of 173 bp. this dna is located in the middle of a chromosome and makes up the larger part of the human centromeres region [6] . moreover, telomeres regions located at the chromosome extremities are made up of repeat sequences of 5-7 bp. these elements are called telomere repeats [7] . the repetitive sequence 'ttaggg' is one example. the chromosome integrity is protected by telomere repeats [8, 9] . in fact, telomeres hinder the chromosomes' fusion and protect them against degradation by exonucleases [10] . these repetitive functional elements are not susceptible to become fragile sites because they are hidden in heterochromatin. this heterochromatin prevents unusual dna structures occurrence leading to recombination by not yet identified mechanisms [11] . repetitive sequences are abundant in various genomes, from bacteria to mammals, and they cover nearly half of the human genome [5] . finding new common repetitive sequences within and between different chromosomes and genomes is an important theme of research in biology. in fact, the detection of all repetitive sequences in dna could serve in elucidating important biological phenomena. to identify the repetitive sequences, different bioinformatics tools were used [12, 13] . their principle is based on comparison between dna consensus sequences and repeats candidates. the mreps [14] , misa [13] , sputnik [15] , emboss (etandem and equitandem) [16] , trf [17] and repeat-masker [18] are obvious examples. in the comparison step, these tools used different approaches such as regular expression [18] , hamming distance [12] , recursive match and penalty scores [17] . localizing new repetitive sequences presents always technical challenges. 
this is due to the ambiguities that such repeats can create in alignment and assembly programs [19] . in this work, we have developed a new algorithm to detect repetitive patterns that correspond to new repetitive sequences. for this purpose, we used a combination of coding techniques, signals, and image processing techniques. as a result, we have constructed a repetitive sequence database which we subdivided into two sub-databases. the first one contains the existing and validated repetitive sequences. the second dna repetitive database regroups the newly detected sequences. in this context, we called "new repetitive sequence", a sequence that was not detected by all current bioinformatics systems as well as alignment programs. in this research, we converted all of the dna sequences into a synthetic image representation. after that, we extracted all patterns that correspond to the repeat dna sequences. the second part of this work consists in classifying the obtained data. a deep learning model is chosen for this purpose: convolutional neural network (cnn). this paper is divided into four sections. after the introduction, we describe the materials and methods. in section 2, we first present the biological database subject of this study. we also introduce the coding technique we used to transform the biological data into a numerical one. after that, we describe how we convert the obtained signal into an image based on the wavelet analysis. further, we introduce the cnn architecture we establish for the repetitive dna classification. the final parts of this section consist of the employed detection steps and the adopted evaluation system. in section 3, we provide and discuss the results in terms of repetitive dna sequences detection and classification. finally, section 4 concludes the paper. two-thirds of the human genome consists of repetitive dna sequences [20] ; which confers great importance to identification and localization of these elements. 
in this section, we present a novel approach for repetitive dna sequence identification. this method is effective in detecting dispersed or tandem repeats such as minisatellites and satellites. the detection system is composed of four main blocks. the first block extracts the human dna sequences from an existing database. the second block codes the dna into a numerical representation. the third block is the "find human repetitive sequences" (fhrs) method, which we propose for repetitive dna sequence detection; it applies wavelet analysis to reveal the repetitive patterns. the last block determines the repetitive sequences and builds the repetitive dna sequence database. fig. 1 shows the corresponding flowchart.
the human genome (homo sapiens) contains 22 pairs of autosomes and one pair of sex chromosomes (x and y), for a total of 46 chromosomes. each human cell carries one pair of sex chromosomes: two x chromosomes in females, and one x and one y chromosome in males. a detailed description of the human dna material is available in the ncbi database (national center for biotechnology information) [21] . the human dna data comprise a consensus sequence of 2.91 billion base pairs (bp) in the euchromatic portion [22] . given this huge amount of data, we based our work only on the x and y chromosomes. even for these two chromosomes, the mass of data is considerable. as an example, fig. 2 gives the number of occurrences of dinucleotides in both x and y chromosomes. our goal is to find repetitive dna on these chromosomes. it is important to mention that the more complex the genome is, the more difficult it is to find new repetitive sequences within it. therefore, the challenge addressed in this work is identifying new repetitive dna sequences in the human x and y chromosomes.
aiming to visualize repetitive patterns in the human genome, the dna sequences have to be transformed into numerical data. this transformation is called "dna coding". in this work, we opted for a coding technique called "order 2 frequency chaos game signal" (fcgs2) [23, 24] . the fcgs2 coding is a statistical representation of dna. in the proposed method, chromosomes are transformed based on the occurrence probability of the successive dinucleotide groups. this technique represents the frequency evolution of the dinucleotides along the chromosome. the transformation is given by eq. 1:
$P_{2nucleotide} = N_{2nucleotide} / length_{chr}$ (eq. 1)
where $N_{2nucleotide}$ is the occurrence number of the dinucleotide group in the whole chromosome and $length_{chr}$ is the chromosome's length. in this work, we coded the entire human chromosomes x and y. the sequence that represents chromosome x is a signal with a length of 156,040,895 bp. chromosome y gives a signal of 57,227,415 bp.
the identification of repetitive dna sequences is becoming increasingly important. many algorithms, drawing on various fields of knowledge, have been implemented for repetitive sequence localization. in this context, signal processing approaches were used to detect repetitive sequences according to their periodicity [25] [26] [27] [28] [29] . in this paper, we propose an efficient algorithm based on signal and image processing tools to localize repetitive dna sequences. this method has the advantage of being independent of prior knowledge about the repeated sequences. this section presents the new algorithm we designed to detect repetitive dna sequences after transforming them into numerical signals. this algorithm is called find human repetitive sequences (fhrs). it contains three steps:
- transformation of the dna signals into dna images (the scalogram representation);
- energy calculation for each scalogram image obtained by the wavelet analysis, followed by retention of the images whose energy amplitude exceeds a chosen threshold (equal to 10 here);
- identification of the reference repetitive sequence in each retained image, i.e. the longest repeated unit in the considered dna sequence.
the scalogram representation of a dna sequence is an image that we obtain by wavelet analysis and encode in the rgb space (three color channels: red, green, and blue). this time-frequency representation is efficient for visualizing and detecting repetitive patterns. here, the idea is to use this type of dna image to find repetitive patterns that correspond to periodic sequences. the motivation behind this choice is that changing a pixel in the image has no influence on the overall shape of the repetitive pattern. indeed, even if the repeated unit contains variations in nucleotide composition, this does not greatly affect the overall shape of the repetitive pattern in the dna image. furthermore, our choice of this method is reinforced by its performance in characterizing different classes of transposable elements [30, 31] .
for the wavelet analysis, we use the complex morlet wavelet, which is well suited to localizing repetitive dna in the time-frequency domain. the principle consists of applying the wavelet analysis to the signal obtained by the fcgs2 coding. this analysis decomposes a given dna signal into a sum of basis functions called wavelets. these wavelets are derived from the mother wavelet by two operations: dilation and translation. they take into account both time and frequency variations, which allows them to capture the different hidden frequencies in the signal [32] [33] [34] . unlike the mother wavelet ψ(t), which depends only on time, the daughter wavelet depends on scale and translation parameters (a and b respectively). it is generated following this equation:
$\psi_{a,b}(t) = \frac{1}{\sqrt{a}}\,\psi\left(\frac{t-b}{a}\right)$ (eq. 2)
throughout, * indicates the complex conjugate.
as we have chosen a gaussian-windowed complex sinusoid (complex morlet) as the analysis window, the analysing wavelet is written as:
$\psi(t) = \pi^{-1/4}\, e^{i\omega_0 t}\, e^{-t^2/2}$ (eq. 3)
here the number of oscillations ($\omega_0$) must be greater than 5 (admissibility condition). the continuous wavelet transform (cwt) coefficients of a dna signal x(t) form a matrix whose elements are calculated by the following formula:
$W(a,b) = \frac{1}{\sqrt{a}} \int x(t)\, \psi^{*}\left(\frac{t-b}{a}\right) dt$ (eq. 4)
the modulus of these coefficients, |w(a,b)|, provides the scalogram representation of the dna sequence. since chromosomes x and y are too long to be analysed at once, we decompose x(t), the corresponding fcgs2 signal, into a set of segments. each segment x_i(t) has a size of 1000 bp. after segmentation, we apply the cwt and calculate the corresponding energies. as a result, we obtain a new database of human dna representations. in total, we count 156,041 images of the x chromosome and 57,228 images of the y chromosome. the wavelet coefficient matrix contains the time-frequency information about a signal. to further explore this information, we calculate the scale-energy (e) at each nucleotide position according to eq. 5, for each segment i = 1, ..., length_chr/1000. here, the parameter a represents the scale in the wavelet analysis; it varies from 1 to 64. the index i is the image number. by applying eq. 5, we obtain a vector that contains the energy of the dna scalogram. peak values higher than 10 in this vector indicate the existence of repetitive patterns in the dna image. fig. 3 shows an example of the fcgs2 signal, the corresponding scalogram in a 3d representation and the wavelet energy of a sequence located in chromosome x of the human genome. this sequence corresponds to the portion [342,500 bp: 344,000 bp] in the ppp2r3b gene. as we can see, the magnitude of the wavelet energy indicates the presence of periodicities in the sequence. if we consider the frequency content, we can note that the repetitive sequence is characterized by a specific frequency band.
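the chain described above (fcgs2 coding, complex morlet scalogram, per-position energy) can be sketched in a few lines. this is a minimal illustration rather than the authors' implementation: the function names are hypothetical, the probability is taken as count divided by sequence length following the symbols defined in the text, the wavelet is sampled on a truncated window, and the energy is assumed to be the sum of squared scalogram magnitudes over the scales 1 to 64. because of these assumptions, the threshold of 10 quoted in the text may not transfer directly to the values produced here.

```python
import numpy as np
from collections import Counter


def fcgs2_signal(seq):
    """Replace each dinucleotide of `seq` by its occurrence probability."""
    seq = seq.lower()
    dinucs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    counts = Counter(d for d in dinucs if set(d) <= set("acgt"))
    # assumption: probability = count / sequence length (symbols as defined in the text)
    return np.array([counts[d] / len(seq) for d in dinucs])


def morlet_scalogram(x, scales, w0=6.0):
    """Return |W(a, b)| for a complex Morlet wavelet, one row per scale."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()  # remove the constant offset so it does not dominate
    n = x.size
    out = np.empty((len(scales), n))
    for k, a in enumerate(scales):
        # sample the daughter wavelet on roughly +/- 4 standard deviations
        half = int(min(np.ceil(4 * a), (n - 1) // 2))
        t = np.arange(-half, half + 1) / a
        psi = np.pi ** -0.25 * np.exp(1j * w0 * t) * np.exp(-t ** 2 / 2)
        # correlation with conj(psi) equals convolution with its time reversal
        kernel = np.conj(psi)[::-1] / np.sqrt(a)
        out[k] = np.abs(np.convolve(x, kernel, mode="same"))
    return out


def segment_energy(x, scales=range(1, 65)):
    """Assumed energy: sum over scales of squared scalogram magnitude, per position."""
    return (morlet_scalogram(x, list(scales)) ** 2).sum(axis=0)


# usage: a synthetic 1000-bp segment built from a repeated 8-bp unit
segment = "acgtttga" * 125
energy = segment_energy(fcgs2_signal(segment))
```

a segment would then be retained when `energy.max()` exceeds a threshold chosen for this normalisation.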
the limits of this frequency band correspond to the repetitive dna portion in the analyzed sequence. as for the 3d representation, it contains repetitive patterns of a particular shape that are related to the dna repetitions. following this method, we have constructed our database of repetitive dna images. the patterned images were selected according to the wavelet-energy peaks. the generated database was named "repeat-data". for each dna image in the repeat-data database, we aim to identify a dna reference sequence, which corresponds to the repetitive pattern present in the scalogram. this dna reference sequence is the longest subsequence in terms of size and repetition number. after this step, we have built a database that contains the location and the repetition number of all the localized reference sequences. as we focus on detecting new repetitive sequences in the human genome, we verified whether each reference repetitive sequence is available in the public databases. for this, we checked whether the sequence is annotated in the dfam and ncbi databases. if the repetitive sequence is not listed in these public databases, we added it to our new database. this new database is called "new-repeat-data". after collecting the new repetitive sequences using the fhrs algorithm, we move on to extracting the repeat patterns using image processing tools. fig. 4 summarizes the proposed methodology for extracting tandem repeat patterns from the dna images. it illustrates the results obtained for the "trseq1" sequence. this sequence is 261 base pairs long; it extends from 28,076,765 bp to 28,077,025 bp along the human x chromosome. as in this example, the data we are treating here is the set of scalogram images that we stored in the database "new-repeat-data". the main goal of this part of the work is to detect and localize the repetitive patterns in the scalogram representations.
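the reference-sequence step described above can be illustrated on the sequence string itself. the sketch below is a simplified stand-in, not the fhrs procedure: it only finds exact back-to-back repeats, whereas the image-based method is meant to tolerate mutations. the function name, the parameters and the tie-breaking rule (most bases covered, then the shorter unit) are assumptions made for the illustration.

```python
def best_tandem_repeat(seq, min_unit=2, max_unit=100, min_copies=2):
    """Return (start, unit, copies) for the exact tandem repeat covering the most bases.

    Ties are broken in favour of the shorter unit, then the leftmost start.
    Returns None when no unit is repeated at least `min_copies` times.
    """
    seq = seq.lower()
    n = len(seq)
    best = None  # (bases covered, -unit length, -start)
    for p in range(min_unit, min(max_unit, n // min_copies) + 1):
        run = 0  # consecutive positions i with seq[i] == seq[i + p]
        for i in range(n - p + 1):
            if i < n - p and seq[i] == seq[i + p]:
                run += 1
                continue
            # the run stops at i, so seq[i - run : i + p] has period p
            copies = (run + p) // p
            if copies >= min_copies:
                cand = (copies * p, -p, -(i - run))
                if best is None or cand > best:
                    best = cand
            run = 0
    if best is None:
        return None
    covered, neg_p, neg_start = best
    start, p = -neg_start, -neg_p
    return start, seq[start:start + p], covered // p
```

for a region made of nine exact copies of a 29-bp unit, the function returns the start of the array, the unit and the copy count.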
that is why we based our work on a segmentation algorithm. our method first decomposes the dna image into three color channels (red, green and blue) and selects the blue one. this choice was made after testing all the color bands: the best segmentation result corresponds to the blue channel, since it offers the best contrast. then, for binarization, a simple thresholding is applied to keep only the pixels having an intensity value less than or equal to 26. next, to keep only the region of interest, we use an edge detection technique. the canny edge detector provides good detection and localization relative to other operators [35, 36] . it is a multi-stage algorithm that detects brightness discontinuities and is used to detect a wide range of edges in images [37, 38] . the canny operator uses two thresholds, high and low: the high threshold retains the important and significant information, such as lines and contours in the image, while the low threshold ensures that no details are missed. the canny edge detector is widely used to locate sharp intensity changes and to find object boundaries in an image, especially in computer vision. a pixel is classified as an edge by computing its gradient magnitude; the result is then compared with that of its neighbors along the direction in which the intensity varies the most. finally, we fill the holes in the areas of interest using morphological operators [39] . the result is an image that contains only the repetitive patterns. with this method, we can extract and isolate the particular regions of repetitive dna patterns.
table 1. position of "rseq1" on both x and y chromosomes of the human genome.
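the first stage of this segmentation (blue channel, threshold at 26) is easy to sketch. the example below is only a partial illustration under stated assumptions: it omits the canny edge detection and the morphological hole filling, and instead reports which image columns contain retained pixels, as a rough way to read off where a pattern begins and ends. the function name and the `min_width` parameter are hypothetical.

```python
import numpy as np


def pattern_column_spans(rgb, max_intensity=26, min_width=1):
    """Return [(first_col, last_col), ...] for runs of columns holding dark blue-channel pixels."""
    blue = np.asarray(rgb)[:, :, 2]
    mask = blue <= max_intensity  # binarisation described in the text
    cols = mask.any(axis=0).astype(int)
    # rising and falling edges of the 0/1 column profile
    edges = np.diff(np.concatenate(([0], cols, [0])))
    starts = np.flatnonzero(edges == 1)
    ends = np.flatnonzero(edges == -1) - 1
    return [(int(s), int(e)) for s, e in zip(starts, ends) if e - s + 1 >= min_width]
```

mapping column indices back to base-pair positions would additionally require the known start coordinate and width of the imaged segment.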
after finding the repetitive dna sequences in the human x and y chromosomes (which can be tandem or scattered repeated sequences), we verified their existence in other chromosomes and in other genomes. to achieve this goal, we used two public bioinformatics tools: blat [40] and dfam [41] . each new repetitive sequence we detected was searched for in the whole human genome and in the other genomes using the blat platform. as an example, we consider the new scattered repeated sequence "rseq1". rseq1="ctttagagtctgcattgggcctaggtctcattgaggacagatagagagcagactgtgcaac". it is a 61 base pair (bp) long sequence with a repetition number equal to 12 in the whole human genome. the corresponding positions on both x and y chromosomes are given in table 1. after localizing "rseq1" in the x and y chromosomes, we searched for this sequence in other regions. fig. 5 shows the result of checking the existence of "rseq1" in other species. as we can see, "rseq1" exists in several genomes, such as human, gorilla, chimpanzee, green monkey and bonobo. after establishing the existence of the newly discovered repetitive sequence in other genomes, we examined whether this sequence is located in genes. in particular, we searched for it in exonic regions and in other families of dna. if the sequence exists nowhere in these dna types, we classified it as a new repetitive dna sequence type. in addition, we verified the uniqueness of these new sequences using our fhrs approach, by comparing the repetitive patterns in the scalogram representations. to make this work as useful as possible, we established a classification system for these new datasets (new repetitive dna sequences). for this purpose, we took the scalogram representation (2d image) as input data to the system.
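the idea of locating a detected sequence elsewhere in a genome can be illustrated with an exact search on both strands. this is a deliberately simple stand-in for blat, which performs approximate, index-based alignment; the function names are hypothetical and positions are 0-based.

```python
_COMPLEMENT = str.maketrans("acgt", "tgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string (lower case)."""
    return seq.lower().translate(_COMPLEMENT)[::-1]


def find_occurrences(genome, query):
    """Return sorted (position, strand) pairs of exact, possibly overlapping matches."""
    genome, query = genome.lower(), query.lower()
    strands = [("+", query)]
    rc = reverse_complement(query)
    if rc != query:  # a palindromic query would otherwise be reported twice
        strands.append(("-", rc))
    hits = []
    for strand, pattern in strands:
        pos = genome.find(pattern)
        while pos != -1:
            hits.append((pos, strand))
            pos = genome.find(pattern, pos + 1)
    return sorted(hits)
```

the repetition number of a sequence in a chromosome would then be `len(find_occurrences(chromosome, sequence))`, counting exact copies only.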
as for the classifier, we have chosen cnns, as they are efficient for image classification (fig. 6). a cnn is a special type of neural network that works on data having a grid topology [42] . the cnn classification technique was developed by lecun et al. (in 1998) with the aim of recognizing handwritten characters on bank checks. the cnn is a deep learning model inspired by the visual mechanism of living organisms. it uses convolutional layers to extract features from the input data. in the cnn model, the neurons of a convolutional layer are able to extract higher-level abstract features from the features extracted at the previous layer. cnns have been applied with success in dna studies [43] [44] [45] [46] , breast cancer cell segmentation [47, 48] , medical diagnosis [49, 50] , character recognition [51] and other areas of application. in this work, we used a cnn to build a system for recognizing new repetitive dna sequences in the human x and y chromosomes. for this, we took the rgb scalogram representations of dna, with a size of 75 × 100, as the input of the classification system. the dna images are then passed through a stack of convolutional layers, where we used filters with a very small receptive field (3 × 3). these filters act as scanners, as they capture motifs in different orientations (up/down, center, left/right). each neuron output in a convolutional layer is the result of a convolution operation between the kernel matrix and the neuron input. max-pooling is performed over a 2 × 2 pixel window. each convolutional layer is followed by a second layer, a global max-pooling layer, which outputs only the maximum value of the outputs of its convolutional layer. this second layer acts as a sample-based discretization process, whose goal is to down-sample the input and reduce its dimensionality.
to put the image into a suitable form for the multi-layer perceptron, it is flattened into a column vector. the flattened output is fed to a feed-forward neural network, and back-propagation is applied at every training iteration. a fully-connected layer was added to learn non-linear combinations of the high-level features (which are represented by the output of the flatten layer); this layer learns a possibly non-linear function in that space. over a series of epochs, using the softmax classification technique, our model becomes able to distinguish between dominating features and certain low-level features in the images, and it can classify the repetitive dna classes.
sex chromosomes in particular provide opportunities to understand the evolution mechanisms from one species to another. these mechanisms can depend on the accumulation of repetitive sequences [2] . in this work, we first applied the fhrs technique to detect new repetitive sequences within the human sex chromosomes (x and y). after that, we fed these sequences to a cnn-based classification system in order to recognize them.
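to make the classifier described in the methods concrete, here is a minimal forward pass written with numpy only. it is a sketch under several assumptions and not the authors' network: two convolution blocks, 8 and 16 filters, two output classes and random untrained weights are all hypothetical choices; only the 75 × 100 rgb input, the 3 × 3 filters, the 2 × 2 max-pooling, the flatten step and the softmax output come from the text. training by back-propagation is not shown.

```python
import numpy as np


def conv3x3_relu(x, w, b):
    """Valid 3x3 cross-correlation followed by ReLU.

    x: (H, W, C_in), w: (3, 3, C_in, C_out), b: (C_out,)
    """
    h, wd = x.shape[0] - 2, x.shape[1] - 2
    out = np.zeros((h, wd, w.shape[3]))
    for di in range(3):
        for dj in range(3):
            out += x[di:di + h, dj:dj + wd, :] @ w[di, dj]
    return np.maximum(out + b, 0.0)


def max_pool2x2(x):
    """Non-overlapping 2x2 max-pooling; odd trailing rows or columns are dropped."""
    h, w = x.shape[0] // 2 * 2, x.shape[1] // 2 * 2
    return x[:h, :w].reshape(h // 2, 2, w // 2, 2, -1).max(axis=(1, 3))


def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()


def init_params(n_classes=2, seed=0):
    """Random, untrained weights; filter counts and class count are assumptions."""
    rng = np.random.default_rng(seed)
    return {
        "w1": rng.normal(0, 0.1, (3, 3, 3, 8)), "b1": np.zeros(8),
        "w2": rng.normal(0, 0.1, (3, 3, 8, 16)), "b2": np.zeros(16),
        "w_fc": rng.normal(0, 0.01, (17 * 23 * 16, n_classes)),
        "b_fc": np.zeros(n_classes),
    }


def forward(image, params):
    """Class probabilities for one (75, 100, 3) scalogram image."""
    x = conv3x3_relu(image, params["w1"], params["b1"])  # (73, 98, 8)
    x = max_pool2x2(x)                                   # (36, 49, 8)
    x = conv3x3_relu(x, params["w2"], params["b2"])      # (34, 47, 16)
    x = max_pool2x2(x)                                   # (17, 23, 16)
    x = x.reshape(-1)                                    # flatten to 6256 values
    return softmax(x @ params["w_fc"] + params["b_fc"])
```

the shape comments show how each 3 × 3 convolution trims two pixels per dimension and each pooling step halves the feature map, which is the dimensionality reduction the text refers to.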
in this work, we used the fhrs approach (find human repetitive sequences), which combines wavelet analysis and a specific coding technique, to represent repetitive patterns in the form of an image. this method has the advantage of identifying new repetitive sequences without using any prior knowledge about the input dna sequence. on this basis, we have discovered various new repetitive dna sequences within the sex chromosomes, both tandem and interspersed. after that, we looked for these sequences in all the human chromosomes and in other genomes. afterward, we checked whether these sequences exist in genes. finally, we classified these repetitive sequences according to their location relative to heterochromatin, telomere, and centromere. as a result, we have constructed a database comprising two sub-databases. the first one contains newly discovered repetitive sequences of the satellite and minisatellite types. the second one contains existing repetitive sequences. the new repetitive sequence database provides the composition of the new highly repetitive dna sequences and the corresponding locations. the repetitive sequences are of different sizes and are classified into two types: tandem repeat sequences and interspersed repeat sequences. we called this new database "new-repeat-data". with our approach, highly conserved repetitive dna sequences having no annotations in the dna libraries (ncbi or dfam) have been found in the human genome. in the telomeres of the x and y chromosomes, we have found very short or long repetitive sequences. the sequence "rseq2" (rseq2=ctttagagtctg) is an example of a short minisatellite of 12 base pairs. its repetition number is 312, extending from 26,304 bp to 249,544 bp.
fig. 7. telomere image signature of homologous regions corresponding to the minisatellite "rseq2" (ctttagagtctg)n within x and y chromosomes.
in addition, the sequence (ccctaa)n, which is annotated in the ncbi database, has been well localized by our algorithm. among the long repetitive minisatellite sequences, we have discovered a new sequence "rseq1" of 61 base pairs with a repetition number of 12. these repetitive sequences exist at the same location within large portions of chromosome y. fig. 7 shows an example of the global signature of a new telomeric repetitive sequence with a size of 71,000 bp. on the other hand, a highly repetitive sequence "rseq3" (rseq3='tttaaagat', 9 bp in size) appears as a new repetitive sequence in the human genome. this short repetitive dna sequence was also found in many species such as chimpanzee and bonobo, and even in the sars-cov-2 (covid-19) coronavirus genome, with a repetition number of 2. table 2 shows the location of this microsatellite in some chromosomes of the human genome. other sequences are found to be very highly repetitive in the human genome, like the sequence "rseq4" (rseq4='gtataca'), which appears 1375 times in the x chromosome. this sequence also exists in the covid-19 coronavirus. furthermore, we have found a new minisatellite with a size of 61 bp in human. using the blat algorithm, this sequence was also found in the x chromosome of gorilla (gorgor4), at position 15,499 bp to 15,559 bp. fig. 8 shows the method adopted to localize this repetitive sequence in other regions. fig. 8 is divided into two result blocks. the first one shows the scalogram corresponding to the new repetitive dna sequence. the second one contains the result of locating the sequence in all the other genomes using the blat algorithm. in the first result block, we provide the scalogram representation of the dna sequence we have located on the x chromosome of the human genome (xp22.33, position: [321,001:322,000 bp]). the scalogram representation makes it possible to see all the specific repetitive patterns.
after that, we extracted the reference sequence, which is the most repeated sequence of maximum size in the dna sequence. we then found two new repetitive sequences that were not referenced by the current bioinformatics systems or sequence alignment programs. the locations of these two new repetitive sequences in both x and y chromosomes are given in table 3. the repetitive patterns in the scalograms show the presence of two microsatellites: rseq5, whose size is 61 bp, and rseq6, whose size is 28 bp. these sequences are:
after localizing these two repetitive dna sequences (rseq5 and rseq6), we used the blat alignment tool to see whether these sequences have other locations in the other human chromosomes or in other genomes. indeed, repetitive sequences that migrate to different regions of the genome are of great importance, and they have been classified as conservative mobile dna sequences. their importance is higher if these conservative regions are localized in genes. as a result, we have found the rep2 sequence at position 321,267 bp to 321,447 bp in the intronic region of the non-protein coding rna 685 (linc00685) gene, in both x and y chromosomes [52] . in sub-figure b of fig. 8 (second result), we show that the new repetitive sequence rep2 is located not only within other chromosomes (1, 5, 15, x and y) of the human genome, but also in other genomes such as chimpanzee and bonobo. the results shown in table 4 indicate that rep2 is located in intronic regions of different chromosomes of the human genome: 1, 5, 15, x and y. in fact, the sequence "rseq6" presents a special intronic conservative region located not only in different chromosomes but also in different genomes. the rseq6 sequence, which has a size of 29 bp, has been localized in two genes of the chimpanzee genome.
it is located at the po… in addition, we present another example of a special new repetitive sequence, "rseq7", which has been found using our approach. fig. 9 shows the time-frequency representation of the loc652608 gene, which has a size of 2532 bp. the gene is found at position 1172583-1175114 bp in the x chromosome of the human genome. this pseudo-gene is a 60s ribosomal protein l6-like. the dna image shown in fig. 9 reveals three exonic regions and two intronic regions. we can clearly see that the second intronic region is composed of a specific tandem sequence, which we called "rseq7". the corresponding modified version has the same size as "rseq7", which is equal to 208 bp. this particular repetitive sequence starts in the intronic zone (intron 2) and extends into the exonic zone (exon 3), with a modification of 11 nucleotides. intron 2 is a noncoding sequence (208 bp) which is composed of multiple repetitions of "rseq7". rseq7='tgatggttttcctgaagcagctggctagtggcttgttactcgtaactggacctctggtcctcaatcgagtccctccacgaagaacgcaccagaaatttgtcattgccacctcaaccaaaatcggtatcagcaatgtaaaaatctcaaaacatcttagtgatgctgacttgaagaagaagaagctgtggaagcccagacaccaggagag'. then, we searched for this new tandem repeat "rseq7" in the other chromosomes. as a result, we found that this sequence exists in 7 chromosomes with some nucleotide modifications. moreover, we have located this modified intronic sequence in gene regions of other chromosomes of the human genome. fig. 10 shows two reference sequences and the modified version. the first exonic sequence example corresponds to the loc652608 gene located in the x chromosome (fig. 10a) .
fig. 9. the loc652608 gene in the x chromosome contains a tandem repeat sequence: rseq7 starts in an intronic region (intron 2) and extends into an exonic region (exon 3).
fig. 10. two examples of conserved intronic repetitive sequences (satellites) and noncoding sequence located in coding region such as senescence [53] .
the second exonic sequence corresponds to the rpl6p22 gene, which is located in chromosome 7 (fig. 10b) . for these two examples, the number of nucleotide variations between the intronic sequence "rseq7" and the exonic sequence is equal to 11 base pairs, but at different locations. in addition, we have chosen to use image processing techniques to extract the repetitive sequences. the idea consists in segmenting the scalogram image in order to extract the repetitive patterns. for this purpose, we developed a new segmentation algorithm applied to the dna scalograms. fig. 11 illustrates the results obtained by our segmentation algorithm with a thresholding value equal to 26. it shows the location of the "rseq7" repetitive sequences and the corresponding modified versions. here, we can see in the first subfigure (scalogram) that the repetitive pattern is located at 1173583-1175114 bp in the x chromosome of the human genome. the second subfigure presents the segmented image. the repetitive patterns correspond to the repetitive sequences which start in intronic sequences and end in an exonic region, with some nucleotide modifications (11 nucleotides) at the beginning and at the end (fig. 11 ). after localizing the repetitive sequences, we checked whether these sequences are located in other regions of the human genome and even in the genomes of other species. table 5 shows the location of the repetitive sequence "rseq7" and its modified repetitive sequences in different gene regions of different chromosomes in the human genome.
fig. 11. example of dna image segmentation, from which we can obtain the beginning and the end of the repetitive patterns located in the intronic region (intron 2), and the corresponding modified sequences (especially in the exonic region) with the modification region.
table 5. location of the repetitive intronic satellite sequence "rseq7" and the corresponding exonic modified sequences in different chromosomes of the human genome.
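the comparison reported above between "rseq7" and its modified copies amounts to counting the positions at which two sequences of equal length differ. a minimal sketch follows, assuming substitutions only (the text states that the modified version has the same 208 bp size); insertions or deletions would require an alignment step instead. the function name is hypothetical and the example strings are toy data, not sequences from the paper.

```python
def mismatch_positions(a, b):
    """Return the 0-based positions where two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return [i for i, (x, y) in enumerate(zip(a.lower(), b.lower())) if x != y]


# toy example: two substitutions, at positions 2 and 7
diffs = mismatch_positions("acgtacgt", "acctacga")
```

`len(diffs)` gives the number of modified nucleotides, and the list itself shows where they fall, which is what distinguishes two copies that differ by the same count at different locations.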
we can note that this new repetitive sequence characterizes a ribosomal protein (rps) region in the human genome. the ribosomal rna gene repeat (rdna) is the largest repetitive region in the eukaryotic genome. the genome stability depends on the stability of the rdna, which in turn affects cellular functions.
the next example, in fig. 12, shows highly repetitive patterns in the x chromosome at position 2277000-2282500 bp (xp22.33 region) in the human genome. this region contains tandem repeat sequences and interspersed repeat sequences. in addition, the localization results have shown that these specific patterns are localized in the intronic region of the dhrsx gene ([2,219,506 bp: 2,500,974 bp]) in the x chromosome and even in other genes located in other chromosomes. the dhrsx gene is a new gene discovered in 2014 at xp22.33 and yp11.2 in the human genome. it has been shown that the protein encoded by this gene is implicated in the positive regulation of starvation-induced autophagy [54] . the scalogram represented in fig. 12 indicates the presence of repetitive patterns in intronic regions. the reference sequence corresponding to the tandem repeat sequence "rseq8" has a size equal to 89 bp and a repetition number of 14. other repetitive sequences are localized in these intronic regions:
- "rseq9" with a size of 42 bp and a repetition number of 26
- "rseq10" with a size of 19 bp and a repetition number of 63
- "rseq11" with a size of 6 bp and a repetition number of 123.
all these repetitive sequences are of the minisatellite type. in the ncbi database, these regions are defined as a low complexity g-rich repetition, and no further information is given.
• rseq8="agggagagagagggagggcaaacgagagggagagagaaggaggaggaggaaatgggggaaagagagagaaagagagatggagagggaac"
• rseq9="agagagatggagagggaacagggagagagagggagggcaaac"
• rseq10="agagagatggagagggaac"
• rseq11="agagagaa"
these repetitive sequences are also located at the same position in the intronic region of the dhrsx gene in the y chromosome of the human genome. table 6 details the location of the new repetitive sequence "rseq8" inside the x and y chromosomes. furthermore, this repetitive sequence is located inside the intronic region of the dhrsx gene in both tandem repeat and dispersed repeat forms. fig. 13 shows an example of another tandem repeat pattern found in the x chromosome at position 27210460-27211308 bp in the human genome.
fig. 12. scalogram corresponding to a dna sequence in the x chromosome that contains repetitive sequences in the intronic region.
location of the intronic repetitive sequence "rseq8" in the x and y chromosomes of the human genome.
table 7 provides the locations of "rseq9" in the x chromosome of other genomes.
position of "rseq9" in the x chromosome of other genomes.
• rseq12="atatatgatatatactatatatgtcatatatacatatacac"
the short repetitive sequence "tacata" (6 bp) appears 22 times in this dna sequence and has a repetition number of 69,710 in the x chromosome. after searching for this tandem repeat sequence using our algorithm, we found 9 repetitions of another new short repetitive sequence occurring as a tandem repeat sequence (trs). we called this sequence of 29 base pairs "rseq13".
• rseq13="ctgtataacctaaataatataggttatat"
fig. 14 shows the scalogram of a new repetitive dna sequence that we called "rseq13". the sequence has a size of 261 bp and is localized at 28076765-28077025 bp in the x chromosome. it is a tandem repeat sequence with patterns of 29 bp length: "rseq13". the ncbi and dfam databases do not indicate the existence of such a repetitive sequence ("rseq13").
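the tandem repeats reported above (a 29 bp motif repeated 9 times, giving 261 bp) can be checked with a simple string scan. the following is a minimal sketch and not the scalogram-based detection algorithm used in this work; the function name `longest_tandem_run` and the synthetic test sequence are hypothetical, and positions are 0-based.

```python
def longest_tandem_run(sequence: str, motif: str) -> tuple[int, int]:
    """Return (start, copies) for the longest run of back-to-back motif copies.

    start is a 0-based index into sequence; (-1, 0) means the motif is absent.
    This is a plain exact-match scan (an illustrative assumption), so it does
    not tolerate the nucleotide modifications discussed in the text.
    """
    best_start, best_copies = -1, 0
    m = len(motif)
    for i in range(len(sequence) - m + 1):
        copies, j = 0, i
        # Count consecutive exact copies of the motif starting at i.
        while sequence.startswith(motif, j):
            copies += 1
            j += m
        if copies > best_copies:
            best_start, best_copies = i, copies
    return best_start, best_copies


# Synthetic example: 9 copies of the 29 bp "rseq13" motif with short flanks.
rseq13 = "ctgtataacctaaataatataggttatat"
example = "gg" + rseq13 * 9 + "tt"
start, copies = longest_tandem_run(example, rseq13)
repeat_length = copies * len(rseq13)  # 9 * 29 = 261 bp, as reported
```

an exact-match scan of this kind needs the motif in advance, which is the prior knowledge that the scalogram approach is designed to avoid.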
with our approach we succeeded in detecting this tandem repeat without any prior knowledge about its existence. the repetitive sequence "rseq13" is located not only in the x chromosome of the human genome but also in other genomes, such as the x chromosomes of bonobo (at [28, fig. 15 shows the scalogram of a new dna sequence "trseq2" with a size of 261 bp. the sequence is positioned at 156029111-156029371 bp in the x chromosome. as we can see, the scalogram contains a repetitive pattern corresponding to a tandem repeat sequence: "rseq14". this subsequence ("tctctgcgcctgcgccggcgcggcgcgcc") has a size of 29 bp and a repetition number of 9. rseq14 is not annotated as a tandem repeat in the ncbi or dfam databases, but it is defined as a tar1 of the telomeric satellite family [55]. in table 8, we provide the localization results of "rseq14" in the whole human genome and in other genomes. fig. 16 shows the scalogram of another new dna sequence, "trseq3", with a size of 500 bp, extending from 2845001 bp to 2845500 bp in the x chromosome of the human genome. the sequence contains a tandem repeat sequence, "rseq15" (cgtgtgtatgtatatttatataca), whose size is 24 bp and whose repetition number is 18. this sequence is not annotated as a tandem repeat sequence in the ncbi database or in the dfam database. our "new-repeat-data" database of all newly discovered repetitive sequences is presented in the "supplementary material" file.
fig. 15. scalogram image corresponding to the dna sequence "trseq2", confirming the existence of the "rseq14" tandem repeat sequence (tctctgcgcctgcgccggcgcggcgcgcc)n, annotated in [45] as a minisatellite sequence with a repetition number equal to 9.
to conclude, we succeeded in implementing an efficient algorithm for the detection of repetitive sequences. the sequences we detected are of two types: satellites and minisatellites. moreover, we obtained better results than those of the bioinformatics tools.
the main advantage of this work is that it is independent of any prior knowledge about the searched repeat. in this section, we present the results of using a cnn model to classify the dna scalograms obtained in the first part of this work. our goal is to identify the different classes of the new repetitive sequences we discovered and stored in the "new-repeat-data" database. as data, we randomly took 200 non-repetitive sequences (nonrep) and 780 repetitive sequences (rep). the repetitive sequence data consist of 780 sequences divided into 4 classes depending on their repetitive pattern length (table 9). these classes are: rep1 (with a size >100), rep2 (with a size between 60 and 100), rep3 (with a size between 30 and 60) and rep4 (with a size <30). overall, our constructed database contains five classes: four contain scalograms of repetitive sequences and one contains scalograms without repetitive sequences. for classification, the whole dataset (980 scalogram images) was split into 80% for training (784 images) and 20% for testing (196 images). with such a classification system we can discover images that contain similar repetitive patterns. we can also differentiate these images from others that do not contain repetitions. fig. 17 presents the classification results of the four repetitive dna classes (images with repetitive patterns) against one class of non-repetitive dna (images with no repetitive patterns).
scalogram image corresponding to the dna sequence "trseq3", which contains "rseq15" as a tandem repeat motif.
description of the input data to the cnn classification system (columns: repetitive pattern with size x, number).
with the cnn model, we distinguished different specific types of dna images. the score ranges from 89% to 100%. the obtained results yield an average score of 94.4%.
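the 80/20 split described above can be reproduced with a few lines of code. this is only a sketch of the arithmetic: the function name `split_indices`, the fixed seed and the plain shuffled (non-stratified) split are assumptions, since the text does not say how the images were assigned to the two subsets.

```python
import random


def split_indices(n_items: int, train_fraction: float = 0.8, seed: int = 0):
    """Shuffle item indices and split them into training and test subsets."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility
    n_train = round(n_items * train_fraction)
    return indices[:n_train], indices[n_train:]


# 200 non-repetitive plus 780 repetitive scalogram images, as in the text.
labels = ["nonrep"] * 200 + ["rep"] * 780
train_idx, test_idx = split_indices(len(labels))
```

with 980 images, this yields 784 training and 196 test indices, matching the counts given in the text.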
the confusion matrix of the classification rates confirms that our system is efficient in distinguishing between small repetitive patterns (rep4) and non-repetitive dna sequences (nonrep), with a score of 100%. this result is expected, since the scalogram images of these two classes are very different. table 10 contains three evaluation measurements, precision, recall and f1-score, which we used to evaluate our classification system. overall, our system gives good results in recognizing the four new repetitive dna sequence classes, with an average of 95% in precision, recall and f1-score. improving genetic knowledge of the human genome is a complex and continuous research process. to contribute to this process, bioinformatics and signal and image processing tools have been applied to reveal hidden spectral features of dna sequences. although repetitive dna sequences occupy 40% of the human genome, the localization of these sequences remains insufficient, as it is a very difficult task. in this paper, we proposed a new algorithm based on signal and image processing tools to extract the repetitive patterns from dna images that correspond to the repetitive dna sequences. the main goal of this work is to create a new database that contains the locations of all the newly discovered repetitive sequences. as an example of the obtained results, we found a new modified repetitive sequence that can characterize the 60s ribosomal protein: "rseq7". deeper studies that may give a biological interpretation of these results would therefore be welcome. in this article, we also proposed a novel and highly effective method for dna image prediction based on a cnn model. in our prediction system, the accuracy scores obtained over 100-fold cross-validation ranged from 89% to 100%, with an overall score of 94.4%. on behalf of all authors, the corresponding author states that there is no conflict of interest.
the authors declare that there is no conflict of interest and that there are no competing interests regarding the publication of this paper. afef elloumi oueslati: phd in electrical engineering from the national engineering school of tunisia (enit). she is an associate professor at the national school of engineers of carthage (enicarthage). her research interests include issues related to signal and image processing applied in the biomedical and genomic fields. zied lachiri: phd in electrical engineering from the national engineering school of tunisia (enit). he is a professor and research director in the signal, image and information technology laboratory (lr-siti, enit). his research interests include pattern recognition, and signal and image processing in biomedical, multimedia, and man-machine communication.
the sequence of the human genome
early stages of xy sex chromosomes differentiation in the fish hoplias malabaricus (characiformes, erythrinidae) revealed by dna repeats accumulation
mini- and microsatellites
repetitive dna and next-generation sequencing: computational challenges and solutions
characterization of human centromeric regions of specific chromosomes by means of alphoid dna sequences
a tandemly repeated sequence at the termini of the extrachromosomal ribosomal rna genes in tetrahymena
maintaining the end: roles of telomere proteins in end-protection, telomere replication and length regulation
a highly conserved repetitive dna sequence, (ttaggg)n, present at the telomeres of human chromosomes
structure and function of telomeres
epigenetic regulation of heterochromatic dna stability
review of tandem repeat search tools: a systematic approach to evaluating algorithmic performance
exploiting est databases for the development and characterization of gene-derived ssr-markers in barley (hordeum vulgare l.)
mreps: efficient and flexible detection of tandem repeats in dna
sputnik - dna microsatellite repeat search utility
wemboss: a web interface for emboss
tandem repeats finder: a program to analyze dna sequences
using repeatmasker to identify repetitive elements in genomic sequences
sense from sequence reads: methods for alignment and assembly
repetitive elements may comprise over two-thirds of the human genome
the sequence of the human genome
comparative genomic signature representations of the emerging covid-19 coronavirus and other coronaviruses: high identity and possible recombination between bat and pangolin coronaviruses
the helitron family classification using svm based on fourier transform features applied on an unbalanced dataset
detection and visualization of tandem repeats in dna sequences
identification of short exons disunited by a short intron in eukaryotic dna regions
search of hidden periodicities in dna sequences
spectral repeat finder (srf): identification of repetitive sequences using fourier transformation
helitron's periodicities identification in c. elegans based on the smoothed spectral analysis and the frequency chaos game signal coding
a combined support vector machine-fcgs classification based on the wavelet transform for helitrons recognition in c. elegans
distinguishing between intragenomic helitron families using time-frequency features and random forest approaches
decomposition of hardy functions into square integrable wavelets of constant shape
wavelet theory and applications: a literature study, dct rapporten 2005
the continuous wavelet transform and variable resolution time-frequency analysis
algorithm and technique on various edge detection: a survey
breast cancer detection using image processing techniques
a computational approach to edge detection
canny edge detection enhancement by scale multiplication
morphological image analysis: principles and applications
blat-the blast-like alignment tool
dfam: a database of repetitive dna based on profile hidden markov models
gradient-based learning applied to document recognition
bacterial classification with convolutional neural networks based on different data reduction layers
convolutional neural network architectures for predicting dna-protein binding
cnn-mgp: convolutional neural networks for metagenomics gene prediction
lightweight convolutional neural network for breast cancer classification using rna-seq gene expression data
weakly supervised 3d deep learning for breast cancer classification and localization of the lesions in mr images
cervical cancer classification using convolutional neural networks and extreme learning machines
a convolutional neural network approach to detect congestive heart failure
an experimental study on upper limb position invariant emg signal classification based on deep neural network
p300 based character recognition using convolutional neural network and support vector machine
cancer specific long noncoding rnas show differential expression patterns and competing endogenous rna potential in hepatocellular carcinoma
genome instability of repetitive sequence: lesson from the ribosomal rna gene repeat
dhrsx, a novel nonclassical secretory protein associated with starvation induced autophagy
structure and polymorphism of human telomere-associated dna
this study was funded by the ministry of higher education and research, lr99es10 human genetics laboratory. supplementary material related to this article can be found, in the online version, at https://doi.org/10.1016/j.bspc.2020.102207.
key: cord-311839-61djk4bs authors: wei, dan; jiang, qingshan; wei, yanjie; wang, shengrui title: a novel hierarchical clustering algorithm for gene sequences date: 2012-07-23 journal: bmc bioinformatics doi: 10.1186/1471-2105-13-174 sha: doc_id: 311839 cord_uid: 61djk4bs
background: clustering dna sequences into functional groups is an important problem in bioinformatics. we propose a new alignment-free algorithm, mbkm, based on a new distance measure, dmk, for clustering gene sequences. this method transforms dna sequences into feature vectors which contain the occurrence, location and order relation of k-tuples in the dna sequence. afterwards, a hierarchical procedure is applied to cluster the dna sequences based on the feature vectors. results: the proposed distance measure and clustering method are evaluated by clustering functionally related genes and by phylogenetic analysis. this method is also compared with blastclust, cd-hit-est and some others. the experimental results show that our method is effective in classifying dna sequences with similar biological characteristics and in discovering the underlying relationship among the sequences. conclusions: we introduced a novel clustering algorithm which is based on a new sequence similarity measure. it is effective in classifying dna sequences with similar biological characteristics and in discovering the relationship among the sequences. with the development of advanced biotechnology, more and more biological sequence information has been generated. the amount of genetic data is growing faster than the rate at which it can be analyzed. clustering techniques provide a viable solution for handling and analyzing such rapidly growing genetic data.
clustering algorithms partition sequences into different biologically meaningful groups, thereby facilitating the prediction of gene functions [1]. when a new gene is assigned to a cluster, the biological function of this cluster can be attributed to this gene with high confidence. on the other hand, clustering gene sequences into groups may also help with analyzing evolutionary relationships among the sequences in a cluster [2]. clustering of gene sequences requires calculating the similarity between sequences. there are two clustering approaches, distinguished by the similarity measure used in the clustering method. one is based on sequence alignment: the similarity between two gene sequences is measured by the scores obtained from an alignment algorithm such as blast [3] or fasta [4]. although sequence alignment gives good solutions, it is relatively difficult to cluster a large number of sequences because of its computational complexity. moreover, if the sequences in the set vary in length, a satisfactory alignment is hard to achieve, resulting in a low clustering accuracy. the other approach is to use alignment-free similarity measures [5] [6] [7] [8] [9] [10]. in recent years, several alignment-free measures have been proposed. the word-based measure is one of the most widely used methods [11] [12] [13] [14]. this method chooses a short word length k, maps each sequence onto an n-dimensional vector according to its k-length tuple (also called k-tuple or k-word) properties, and then assesses the similarity of any two vectors by measures such as euclidean distance [15], mahalanobis distance [16], kullback-leibler discrepancy [17], cosine distance [18] or pearson's correlation coefficient [19]. in recent years, several novel alignment-free measures [20, 21] have been designed for dna sequence analysis. yang et al. [22] extended the k-tuple distance, which is based on the difference in tuple frequencies, to clustering gene sequences.
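the word-based measure described above can be illustrated with a short sketch that maps a sequence onto a $4^k$-dimensional vector of k-tuple frequencies and compares two vectors by euclidean distance. the function names and the choice of relative (rather than absolute) frequencies are assumptions made for illustration; the cited methods may normalize differently.

```python
from collections import Counter
from itertools import product
from math import sqrt


def ktuple_frequencies(seq: str, k: int) -> list[float]:
    """Relative frequency of every possible k-tuple over {a, c, g, t}."""
    seq = seq.lower()
    n_windows = max(len(seq) - k + 1, 0)
    # Slide a window of length k over the sequence, one position at a time.
    counts = Counter(seq[i:i + k] for i in range(n_windows))
    total = max(n_windows, 1)  # avoid division by zero for short inputs
    return [counts["".join(w)] / total for w in product("acgt", repeat=k)]


def euclidean(u: list[float], v: list[float]) -> float:
    """Euclidean distance between two equally long feature vectors."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


s1, s2 = "agcacaca", "acacagta"
f1, f2 = ktuple_frequencies(s1, 2), ktuple_frequencies(s2, 2)
distance = euclidean(f1, f2)
```

a frequency vector of this kind records how often each k-tuple occurs but not where it occurs in the sequence.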
their tuple-based method determines the similarity of sequences by considering only tuple frequencies and ignoring the positional information within a sequence. the major algorithms used in gene sequence clustering can be divided into two categories according to the result format: hierarchical clustering algorithms and partitional clustering algorithms [23]. hierarchical clustering is widely used for detecting clusters in genomic data. it generates a set of partitions forming a cluster hierarchy. according to the linkage criterion, there are three hierarchical clustering methods: single-linkage clustering (sl), complete-linkage clustering (cl) and average-linkage clustering (al) [24]. with sl, clusters may be merged because single sequences are close to each other, even though many of the sequences in each cluster may be very distant from each other [25]. cl tends to find compact clusters of approximately equal diameters [25]. with cl, all objects in a cluster are similar to each other. al can be seen as an intermediate between single- and complete-linkage clustering, resulting in more homogeneous clusters than those obtained by the single-linkage method [26]. for instance, blastclust [27] and generage [28] employ the single-linkage clustering approach; swords [29] uses word frequencies as profiles to merge clusters hierarchically; and uchiyama [30] uses an average-linkage clustering algorithm to classify genes. hierarchical approaches may yield fairly good results, but they require the similarity of all pairs of sequences and quickly arrive at a bottleneck in terms of computational time and memory usage for large-scale data sets [31]. partitioning algorithms have also been used. partitional clustering obtains a partition of data objects by optimizing some clustering criterion. partitional clustering algorithms are simple and well-suited for clustering large datasets [32].
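the three linkage criteria mentioned above differ only in how the distance between two clusters is derived from pairwise distances between their members. a minimal sketch follows; the function name `linkage_distance` and the one-dimensional toy data are illustrative assumptions.

```python
def linkage_distance(cluster_a, cluster_b, dist, method="average"):
    """Distance between two clusters under a given linkage criterion.

    dist is any pairwise distance function between cluster members.
    """
    pairwise = [dist(a, b) for a in cluster_a for b in cluster_b]
    if method == "single":      # closest pair of members
        return min(pairwise)
    if method == "complete":    # farthest pair of members
        return max(pairwise)
    if method == "average":     # mean over all cross-cluster pairs
        return sum(pairwise) / len(pairwise)
    raise ValueError(f"unknown linkage method: {method}")


# Toy example with points on a line and absolute difference as distance.
def absolute_difference(x, y):
    return abs(x - y)


a, b = [0, 1], [3, 5]
single = linkage_distance(a, b, absolute_difference, "single")
complete = linkage_distance(a, b, absolute_difference, "complete")
average = linkage_distance(a, b, absolute_difference, "average")
```

all three criteria need every cross-cluster pairwise distance, which is the source of the time and memory bottleneck noted above for large data sets.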
k-means (km) [33, 34] is a commonly used partitional clustering method. km has a lower order of computational complexity and demands less physical memory than the hierarchical methods. it is suitable for clustering large gene data sets. some km-based algorithms, such as those introduced by wan et al. [33], kelarev et al. [34], tseng et al. [35] and ashlock et al. [36], have been developed to group dna sequences. the major drawback of km compared to hierarchical clustering algorithms is the lack of hierarchical relationships in its results. to remedy this problem, bisecting k-means (bkm), a hierarchical variation of km, was proposed to build a tree of clusters in a top-down fashion by splitting the least homogeneous cluster into two more homogeneous ones. bkm can produce either a flat clustering or a hierarchical clustering by recursively applying km. it has a linear complexity and is relatively efficient and scalable. a recent study [37] concluded that bkm outperforms km and performs as well as or better than hierarchical methods when it partitions the dataset based on a homogeneity criterion. the bisecting approach is very attractive for genomic studies [38]. hierarchical clustering produces a nested series of partitions, whose results are usually depicted as a dendrogram, while partitional clustering produces a flat partition. blastclust [27] is a hierarchical clustering method that uses blast scores as the measure of sequence similarity. blastclust computes the pairwise similarity of all sequences by blast alignment and then clusters the sequences by the single-linkage clustering method, which produces clusters of linear topology. the performance of blastclust is limited by the size of the input data. cd-hit-est [39], a partitional approach, is also widely used to cluster dna sequences.
cd-hit-est uses an incremental clustering process and avoids the unnecessary alignments by a short word filtering mechanism, which detects similar sequences by counting the number of identical short words between them. the purpose of filters is to decide whether the identity between two sequences is above or below a threshold without aligning them, therefore speeding up the clustering process. though cd-hit-est is based on alignment, it can avoid too many pairwise alignments by using a filter, thus it is faster than blastclust, and can handle larger datasets. recent studies reveal also that blastclust is less effective for clustering divergent sequences [40] , and its performance strongly depends on the choice of optimal blast parameters including similarity threshold, percent identity, and alignment length [41] . cd-hit-est, on the other hand, does not provide hierarchical relationships between clusters of sequences. in many situations both cd-hit-est and blastclust yield clusters with only one sequence [41] . all the traditional clustering methods based on sequence alignment encounter computational difficulties in dealing with large biological databases. the approach presented in this paper involves a new alignment-free distance measure based on k-tuples, dmk (distance measure based on k-tuples) [42] , and a modified bisecting k-means clustering algorithm, mbkm (modified bisecting k-means algorithm). mbkm aims to speed up the clustering process by using the alignment-free similarity measure, and is able to produce either a hierarchical clustering or a partition clustering result. we have applied mbkm with dmk in clustering gene sequences and performing phylogenetic analysis. dmk shows better performance than the k-tuple distance in our experiments, and mbkm outperforms sl, cl, al, bkm and km when tested on public gene sequence datasets. furthermore, the proposed method also outperforms alignment-based methods such as blastclust and cd-hit-est. 
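the idea behind a short word filter can be sketched as follows: count the short words two sequences share and compare the count with the minimum that a given number of mismatches would still allow. this is an illustrative sketch only; the function names are hypothetical, the bound assumes substitutions only (each mismatch destroys at most k overlapping words), and the exact rule used by cd-hit-est is not given in the text.

```python
from collections import Counter


def shared_words(x: str, y: str, k: int) -> int:
    """Number of k-words common to x and y, counted with multiplicity."""
    words_x = Counter(x[i:i + k] for i in range(len(x) - k + 1))
    words_y = Counter(y[i:i + k] for i in range(len(y) - k + 1))
    return sum((words_x & words_y).values())  # multiset intersection


def min_shared_words(length: int, k: int, max_mismatches: int) -> int:
    """Lower bound on shared k-words for two sequences of the given length.

    Assumption: the sequences differ by substitutions only, and each
    mismatch can destroy at most k of the length - k + 1 words.
    """
    return max(0, length - k + 1 - k * max_mismatches)


s1, s2 = "agcacaca", "acacagta"
common = shared_words(s1, s2, 2)
needed = min_shared_words(len(s1), 2, max_mismatches=1)
passes_filter = common >= needed  # only such pairs would be aligned
```

a pair that shares fewer words than the bound cannot reach the required identity, so its alignment can be skipped.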
a gene is a stretch of dna that codes for a single polypeptide chain [43]. a gene sequence is a succession of four symbols {a, c, g, t}. because the similarity between the genes of two species indicates their evolutionary relationship, it is used in many clustering algorithms. the goal of sequence clustering is to partition biological sequences into meaningful/functional groups according to the similarity information, which is calculated using either an alignment-based method or an alignment-free method. the traditional approach for clustering dna sequences requires all-by-all comparisons from alignment [44] [45] [46]. given two sequences, $s_1$ = agcacaca and $s_2$ = acacagta, $s_1^p$ and $s_2^p$ are used to represent the p-th characters in $s_1$ and $s_2$, respectively. the alignment score [45] for ($s_1$, $s_2$) is given by where e is the cost of an alignment operation: deletion, substitution, or insertion. however, this distance measure relies on sequence alignment. since sequence alignment is computationally expensive for large biological databases, clustering methods relying on sequence alignment have difficulties in dealing with large gene data sets. an alignment-free similarity measure avoids the computational complexity of multiple sequence alignment for similarity computation. in this paper we propose a new alignment-free similarity measure, dmk, based on which we developed mbkm to cluster gene sequences. in the following, we first present dmk and then describe the mbkm algorithm. in this section, we introduce a new similarity measure which takes into account the occurrence, location and order relation of k-tuples in a dna sequence. sequences are numerically transformed into feature vectors that can be processed by data mining algorithms. let σ be the alphabet of nucleotides (σ = {a, c, g, t}). a sequence $S$ of length $s$ is defined as a linear succession of $s$ symbols from σ.
a segment of k consecutive symbols in sequence $S$ (k ≤ s) is designated as a k-tuple. there is a set of $4^k$ possible k-tuples, $W_k$. the number of occurrences of a k-tuple $w$, $n_w$, is counted by moving a sliding window of length k over the sequence, with consecutive windows overlapping by k - 1 bp. to explore the correlation properties of dna, nair et al. [47] provided a representation of genomic data using the inter-nucleotide distance sequence. based on a similar idea, we utilize the gaps between the locations where a k-tuple occurs in the sequence to explore the sequence structure. for a dna sequence $S$, $p_r$ is the location of the r-th occurrence of the k-tuple $w$, where $p_0 = 0$, and $\alpha_r$ is given as
$$\alpha_r = \frac{1}{p_r - p_{r-1}}, \quad r = 1, 2, \ldots, m,$$
in which $m$ stands for the number of occurrences of $w$. $\alpha_r$ reflects the density of $w$ and is closely related to the location where $w$ occurs in the sequence. the first $w$ begins at position $1/\alpha_1$, and $\{\alpha_1, \alpha_2, \ldots, \alpha_m\}$ forms an array whose r-th element indicates the relative position of two neighboring occurrences of $w$ in the sequence. this array allows us to find all subsequent repeats of $w$. to characterize the order of the $\alpha_r$, we define $\beta_j$ as a partial sum of $\{\alpha_r\}$. $\beta_j$ is calculated by the following formula:
$$\beta_j = \sum_{r=1}^{j} \alpha_r, \quad j = 1, 2, \ldots, m.$$
$\{\alpha_r\}$ is a list of non-negative real numbers, and the $\beta_j$ are totally ordered by ≤, so $\beta_1, \beta_2, \ldots, \beta_m$ is also an ordered set. $\{\alpha_1, \alpha_2, \ldots, \alpha_m\}$ and $\{\beta_1, \beta_2, \ldots, \beta_m\}$ determine each other uniquely. $\beta_j$ depends only on the number and positions of $w$ and is independent of the other k-tuples. given the set $\{\beta_1, \beta_2, \ldots, \beta_m\}$, one can recover where $w$ occurs and how many times $w$ occurs in the sequence. shannon's entropy [48], which expresses the average total information of a source, is a measure of order/disorder. according to [49], when the totally ordered set $\{\beta_1, \beta_2, \ldots, \beta_m\}$ is used to calculate the probabilities, the shannon entropy reflects the degree of importance of position in a sequence.
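the quantities $p_r$, $\alpha_r$ and $\beta_j$ can be made concrete with a small worked example. this sketch assumes 1-based positions with $p_0 = 0$ and $\alpha_r = 1/(p_r - p_{r-1})$, which is consistent with the statement that the first occurrence of $w$ begins at position $1/\alpha_1$; the function names are hypothetical, and exact fractions are used only to keep the arithmetic readable.

```python
from fractions import Fraction


def tuple_positions(seq: str, w: str) -> list[int]:
    """1-based start positions of every (overlapping) occurrence of w."""
    k = len(w)
    return [i + 1 for i in range(len(seq) - k + 1) if seq[i:i + k] == w]


def alpha_beta(positions: list[int]):
    """alpha_r = 1 / (p_r - p_{r-1}) with p_0 = 0; beta_j = partial sums."""
    alphas, betas = [], []
    previous, running = 0, Fraction(0)
    for p in positions:
        alpha = Fraction(1, p - previous)  # reciprocal of the gap
        running += alpha
        alphas.append(alpha)
        betas.append(running)
        previous = p
    return alphas, betas


positions = tuple_positions("agcacaca", "ca")
alphas, betas = alpha_beta(positions)
# positions: 3, 5, 7; alphas: 1/3, 1/2, 1/2; betas: 1/3, 5/6, 4/3
```

because the gaps are the reciprocals of the $\alpha_r$, the positions can be recovered from either list, which illustrates why the two sets determine each other uniquely.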
we construct a discrete probability distribution, with probabilities $q_1, q_2, \ldots, q_m$, from the ordered set $\{\beta_1, \beta_2, \ldots, \beta_m\}$. the shannon entropy of the discrete probability distribution is calculated by
$$H = -\sum_{j=1}^{m} q_j \log q_j.$$
for each k-tuple $w$ in the sequence, not only the information on tuple numbers but also the information on tuple positions is involved in the definition of $H$. we take $H$ as the feature of $w$ in the sequence, and then construct a vector consisting of the $H$ values of all possible k-tuples in the given sequence. for a fixed k, there are $4^k$ distinct k-tuples to be considered. these k-tuples give a fixed $4^k$-dimensional feature vector, denoted by $(H_1, H_2, \ldots, H_{4^k})$, where $H_i$ is the feature representation of the i-th k-tuple. this feature vector based on $H$ can be regarded as an index for its corresponding sequence. cluster analysis algorithms partition objects into groups based on the distances between objects. the euclidean distance is the square root of the sum of the squared differences between corresponding components of two objects. the k-tuple distance is the sum of the differences in frequency over all possible k-tuples; in contrast, we use the euclidean distance between the shannon entropies of the k-tuples in the sequences to measure the similarity. this distance measure is referred to as dmk. for any two sequences x and y, dmk can be calculated as:
$$d(x, y) = \sqrt{\sum_{i=1}^{4^k} \left(H^{x}_{w_i} - H^{y}_{w_i}\right)^2}$$
where $H^{x}_{w_i}$ and $H^{y}_{w_i}$ represent the shannon entropy values of the i-th k-tuple in sequences x and y, respectively. dmk can be calculated with the following algorithm: algorithm name: dmk for similarity measure. input: sequences $\{s_1, s_2, \ldots, s_n\}$. output: similarity matrix, $(d(x, y))_{n \times n}$. steps: 1. for each sequence, search and locate each k-tuple;
a new clustering algorithm: mbkm
km can be used to obtain a hierarchical clustering solution using a repeated bisecting approach [50, 51]. bkm is such an algorithm, and it can produce either a partitional or a hierarchical clustering. bkm has a linear time complexity in each bisecting step.
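the dmk measure defined earlier in this section (an entropy feature per k-tuple, then a euclidean distance between feature vectors) can be sketched end to end as below. several details are assumptions made only for illustration and are not confirmed by the text: the probabilities are taken as $q_j = \beta_j / \sum_j \beta_j$, the logarithm is base 2, and a k-tuple that does not occur gets the feature value 0. all function names are hypothetical.

```python
from itertools import product
from math import log2, sqrt


def entropy_feature(seq: str, w: str) -> float:
    """Shannon entropy of the (assumed) normalized beta values of tuple w."""
    k = len(w)
    positions = [i + 1 for i in range(len(seq) - k + 1) if seq[i:i + k] == w]
    betas, previous, running = [], 0, 0.0
    for p in positions:
        running += 1.0 / (p - previous)  # alpha_r, added to the partial sum
        betas.append(running)
        previous = p
    total = sum(betas)
    if total == 0:  # assumption: absent k-tuples contribute 0
        return 0.0
    return -sum((b / total) * log2(b / total) for b in betas)


def feature_vector(seq: str, k: int) -> list[float]:
    """4^k-dimensional vector of entropy features, one per possible k-tuple."""
    return [entropy_feature(seq, "".join(w)) for w in product("acgt", repeat=k)]


def dmk(x: str, y: str, k: int = 3) -> float:
    """Euclidean distance between the entropy feature vectors of x and y."""
    hx, hy = feature_vector(x, k), feature_vector(y, k)
    return sqrt(sum((a - b) ** 2 for a, b in zip(hx, hy)))


s1, s2 = "agcacaca", "acacagta"
d12 = dmk(s1, s2)
```

under these assumptions a k-tuple that occurs once and one that never occurs both get the value 0, so other normalizations may be preferable in practice.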
a recent study [51] concludes that bkm outperforms km as well as the agglomerative approach in terms of accuracy and efficiency. consequently, the bisecting approach is very attractive in many applications for clustering and genomic data analysis. bkm initially regards the whole data set as one cluster, and splits one cluster into two subclusters at each bisecting step using km, until singleton clusters are obtained at the leaves or until k clusters are obtained. the outcome is structured as a binary tree. there are two key steps in a typical bkm. the first one is the selection of the initial centroids. generally, the initial centroids are chosen randomly in bkm. the second key step is the rule, ζ, for selecting an existing cluster to be split in each bisecting step. ζ is typically given by one of the following three approaches [50]: 1) choosing the cluster with the largest size; 2) selecting the cluster by its overall similarity, which is either minimized or maximized depending on the definition of d(s, s'), where c is a cluster; 3) using a criterion based on both size and overall similarity. because the differences between these methods are small in terms of the final clustering result, splitting the largest remaining cluster is recommended [50]. there are two problems in the bkm algorithm: 1. randomly choosing the initial centroids in bkm may result in the selection of elements that are too close to each other. if the initial centroids are too close, the algorithm will reach a local optimum. moreover, different sets of initial cluster centroids can lead to different final clustering results. 2. the rule for choosing one existing cluster to split in each bisecting step usually selects the cluster with the largest size. although this leads to reasonably good and balanced clustering solutions, it does not work well for datasets where the natural clusters are of different sizes, as it tends to partition larger clusters first.
in real biological data, the number of elements in every cluster may not always be similar. to address the above two problems and obtain more natural hierarchical solutions, we develop a modified bisecting k-means, mbkm, which chooses the initial centroids by the maximum and minimum principle and selects the cluster to split based on the compactness of clusters. 1) selecting the initial centroids: in order to achieve stable and reliable clustering results, we use the maximum distance, which avoids obtaining adjacent elements, to select the initial centroids. for a set of sequences $\{s_1, s_2, \ldots, s_n\}$, let $d(s_i, s_j)$ ($i, j = 1, 2, \ldots, n$) be the distance between any two sequences in the dataset. we choose the sequences $s_{c_1}$ and $s_{c_2}$ as the cluster centroids according to the following rule:
$$d(s_{c_1}, s_{c_2}) = \max_{1 \le i, j \le n} d(s_i, s_j) \quad (6)$$
2) selecting the cluster to split: the bkm algorithm usually partitions the largest cluster into two smaller ones and yields clusters of similar size. however, a cluster with a large number of members is not always a loose one. if an existing cluster is loose, i.e., its members are not closely related to each other, that cluster should be selected to be split. variance is a measure of how far a set of numbers are spread out from each other, and it can measure the compactness of the clusters. we therefore select the cluster to split on the basis of the compactness of clusters measured by variance. the variance of cluster $c_j$ is defined as follows:
$$\mathrm{var}(c_j) = \frac{1}{n_j} \sum_{s_i \in c_j} d(s_i, \mu_j)^2 \quad (7)$$
where $\mu_j$ is the centroid of the sequences in $c_j$, $d(s_i, \mu_j)$ is the distance between $s_i$ and $\mu_j$, and $n_j$ is the number of sequences in the cluster. a small variance indicates that the members of the cluster tend to be close to the mean. in other words, the smaller the variance is, the more compact the cluster is, and vice versa. based on the above idea, we outline the mbkm algorithm as follows. algorithm name: mbkm for clustering sequences. input: sequences $\{s_1, s_2, \ldots, s_n\}$, a distance function d between sequences, the number of clusters k.
output: set of k clusters.
steps:
1. initialization: regard the whole dataset $\{s_1, s_2, \ldots, s_n\}$ as a single cluster.
2. pick a cluster to split.
3. find two sub-clusters:
3.1 select two initial centroids using equation (6) (7) and take the split that produces the clustering result with the highest variance.
5. repeat steps 2, 3 and 4 until the desired number k is reached.

this algorithm outputs a binary tree of sequences, where each leaf represents a sequence and each node represents a sequence collection. the proposed method is evaluated by clustering functionally related gene sequences and by phylogenetic analysis. we present our evaluation results in two parts. the first aims at testing the efficiency of our similarity measure, dmk. the second illustrates the efficiency of the proposed clustering method, mbkm. to measure the quality of the clustering results, our experiments adopt the f-measure [52]. for cluster j and class i, f(i, j) is defined as:

$$F(i, j) = \frac{2 \cdot \mathrm{precision}(i, j) \cdot \mathrm{recall}(i, j)}{\mathrm{precision}(i, j) + \mathrm{recall}(i, j)} \qquad (8)$$

where $i = 1, 2, \ldots, e$, $j = 1, 2, \ldots, f$, $\mathrm{precision}(i, j) = n_{ij}/n_j$, $\mathrm{recall}(i, j) = n_{ij}/n_i$, e is the number of classes, and f is the number of clusters. $n_{ij}$ is the number of sequences of class i in cluster j, $n_i$ is the number of sequences of class i, and $n_j$ is the number of sequences of cluster j. the f-measure of the whole clustering result is defined as:

$$F = \sum_{i=1}^{e} \frac{n_i}{n} \max_{1 \le j \le f} F(i, j) \qquad (9)$$

where n is the total number of sequences in the data set. clearly, the f-measure has a value between 0 and 1. the larger the f-measure is, the better the clustering result is. to evaluate the proposed similarity measure, we test dmk on gene sequence data sets and compare it with the k-tuple distance. we also verify the effectiveness of dmk by assessing how well it performs on phylogenetic analysis. genes of the same family usually share similar sequences, functional domains, and even interacting partners.
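the f-measure described above can be computed directly from two label lists. the following is a minimal sketch, assuming the usual aggregation in which each class contributes the f(i, j) of its best-matching cluster, weighted by n_i / n; the function name and the label encoding are illustrative and not taken from the paper.

```python
from collections import Counter


def clustering_f_measure(classes, clusters):
    """Overall F-measure of a clustering against known classes.

    classes[x]  -- true class label of sequence x
    clusters[x] -- cluster label assigned to sequence x
    Assumed aggregation: sum over classes i of (n_i / n) * max_j F(i, j).
    """
    n = len(classes)
    n_i = Counter(classes)                   # size of each class
    n_j = Counter(clusters)                  # size of each cluster
    n_ij = Counter(zip(classes, clusters))   # class-i sequences found in cluster j

    total = 0.0
    for i, size_i in n_i.items():
        best = 0.0
        for j, size_j in n_j.items():
            overlap = n_ij.get((i, j), 0)
            if overlap == 0:
                continue                     # F(i, j) is 0 when nothing overlaps
            precision = overlap / size_j
            recall = overlap / size_i
            best = max(best, 2 * precision * recall / (precision + recall))
        total += size_i / n * best
    return total
```

a perfect clustering gives 1.0 under this definition, while merging two equal-sized classes into one cluster lowers precision for both and gives 2/3.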
when a new gene is assigned to a cluster, the biological function of this cluster can be attributed to this gene with high confidence. four data sets are extracted from different gene repositories as shown in table 1. the sequences of ds1 are downloaded from ncbi (http://www.ncbi.nlm.nih.gov). the other three datasets, ds2, ds3 and ds4, are taken from pbil (http://pbil.univ-lyon1.fr/). ds2 is taken from hovergen of pbil, a database of homologous vertebrate genes. ds3 is taken from hogenom, which contains homologous gene families from microbial organisms. ds4 is randomly selected from homolens, a database of homologous genes from ensembl organisms and ensembl families. four widely used clustering algorithms, including km, single-linkage clustering (sl), complete-linkage clustering (cl) and average-linkage clustering (al), have been chosen in the experiments. for comparison, we perform the clustering tests on all data sets using the k-tuple distance and the dmk distance. in this paper, we set the k value to 3. for protein coding genes, a tuple size of 3 is a good choice according to reference [22]. we also tested the clustering performance with different k values, and the result confirms that a small k value is preferred (see additional file 1: table s1). for larger k values, there are more tuples with zero frequencies and less information is captured by the algorithm. the km algorithm yields different results over multiple executions because its initialization is stochastic. we run km ten times and report the average performance. the al, cl and sl hierarchical algorithms each generate a single solution. we obtain the result of the hierarchical clustering algorithms by cutting the hierarchical tree using the expected number of clusters as the input parameter.
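the remark about larger k values can be illustrated with a short sketch. it assumes that the k-tuple distance is the squared euclidean distance between relative k-tuple frequency vectors, which is one common definition and may differ from the exact form used in the experiments; the function names are illustrative only.

```python
from itertools import product


def ktuple_frequencies(seq, k):
    """Relative frequency of each of the 4**k tuples over the alphabet a, c, g, t."""
    words = ["".join(p) for p in product("acgt", repeat=k)]
    counts = dict.fromkeys(words, 0)
    windows = 0
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if word in counts:          # skip windows containing other symbols (e.g. n)
            counts[word] += 1
            windows += 1
    return [counts[w] / windows if windows else 0.0 for w in words]


def zero_fraction(seq, k):
    """Share of k-tuples that never occur in seq."""
    freqs = ktuple_frequencies(seq, k)
    return sum(1 for f in freqs if f == 0.0) / len(freqs)


def ktuple_distance(a, b, k):
    """Assumed form: squared Euclidean distance between the frequency vectors."""
    fa, fb = ktuple_frequencies(a, k), ktuple_frequencies(b, k)
    return sum((x - y) ** 2 for x, y in zip(fa, fb))
```

because the vector has 4^k entries while a sequence of length l offers only l - k + 1 windows, the share of empty entries returned by `zero_fraction` grows quickly with k for a fixed sequence.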
according to table 2, the f-measure values for each of the data sets using dmk are clearly higher than those obtained with the k-tuple distance. in our experiments, on average, the value of the f-measure given by dmk is 18% better than that of the k-tuple distance (p = 0.0165, one-sided paired t-test) in km, 49.7% better in sl (p = 0.0028), 24.9% better in cl (p = 0.016), and 35.8% better in al (p = 0.01885). clearly, dmk provides a significant improvement in clustering sequences. on the four data sets, the f-measure of dmk is more than 20% higher than that of the k-tuple distance under the same clustering process in most cases. dmk outperforms the k-tuple distance in the experiments. this is because dmk considers the occurrence, location and order relation of tuples in a sequence and can capture more information in the sequence, while the k-tuple distance considers frequency alone and ignores the position of tuples in a sequence. in addition, we have tested the dmk and k-tuple measures on protein sequences with a k value of 2, and the results indicate that dmk performs better than the k-tuple distance (data not shown). thus, in practice, the dmk measure can also be applied to clustering protein sequences after tuning the current algorithm. in this experiment, the proposed similarity measure dmk is further tested by phylogenetic analysis. in order to evaluate the similarity measures, we use upgma in the phylip package, a widely used clustering algorithm in phylogenetic analysis. the tree is drawn by the tree-view program [53]. the selected data set includes the full β-globin gene sequences of 10 species reported by feng et al. [54], which are downloaded from ncbi (http://www.ncbi.nlm.nih.gov). their names, accession numbers, locations and lengths are listed in additional file 1: table s2. the similarity/dissimilarity matrix for the full sequences of the β-globin gene of the 10 species using dmk is shown in table 3.
the smaller the distance is, the more similar the two sequences are. in table 3, the most similar species pairs are human-gorilla, human-chimpanzee and gorilla-chimpanzee, which is expected from their evolutionary relationship. a slightly less similar species pair is goat-bovine. on the other hand, gallus is separated from the rest; this coincides with the fact that gallus is the only nonmammalian species among these 10 species. we can also find that opossum is far away from the remaining mammals. these results are consistent with biological morphology. the quality of the constructed tree reflects the quality of the distance matrix and of the method of abstracting information from dna sequences. in figure 1(b), we show the phylogenetic tree of the 10 β-globin gene sequences based on dmk, generated by upgma. for comparison, the phylogenetic tree of the k-tuple distance is shown in figure 1(a). the tree in figure 1(a) has some consistencies with biological morphology. although it supports the separation of gallus relative to other species, its obvious drawback is that it fails to separate (mouse, rat) and (goat, bovine) from opossum. in figure 1(b), gallus is separated from the rest and opossum is far away from the other species. this topology is in good agreement with that presented by feng et al. [54] and cao et al. [55] except for the relative position of rodents. dmk measures the similarity between dna sequences more effectively than the k-tuple distance. this is because dmk measures the distance between dna sequences based on sequence structure and composition. through evaluation on gene families and the construction of phylogenetic trees of full gene sequences of 10 species, we find that dmk gives more competitive results than the k-tuple distance. to evaluate the effectiveness of the proposed clustering algorithm, mbkm, we apply mbkm to clustering gene sequences and compare it with several clustering algorithms.
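for reference, the upgma step that turns a distance matrix into a tree can be sketched as follows. this is a minimal illustration of the general technique, not the phylip implementation, and the function name and tree encoding are assumptions made for the example.

```python
def upgma(names, dist):
    """Build a UPGMA tree from a symmetric distance matrix.

    names -- list of leaf labels
    dist  -- dist[i][j] is the distance between names[i] and names[j]
    Returns nested tuples (left, right, height); a leaf is its label.
    """
    n = len(names)
    clusters = {i: (names[i], 1) for i in range(n)}        # id -> (tree, size)
    d = {(i, j): dist[i][j] for i in range(n) for j in range(i + 1, n)}
    next_id = n
    while len(clusters) > 1:
        # merge the closest pair of current clusters
        (a, b), d_ab = min(d.items(), key=lambda kv: kv[1])
        (tree_a, size_a) = clusters.pop(a)
        (tree_b, size_b) = clusters.pop(b)
        # distance from the merged cluster to every other one:
        # size-weighted average of the two old distances
        merged = {}
        for c in clusters:
            d_ac = d[(min(a, c), max(a, c))]
            d_bc = d[(min(b, c), max(b, c))]
            merged[c] = (size_a * d_ac + size_b * d_bc) / (size_a + size_b)
        d = {pair: v for pair, v in d.items() if a not in pair and b not in pair}
        for c, v in merged.items():
            d[(c, next_id)] = v                            # next_id is always the larger id
        clusters[next_id] = ((tree_a, tree_b, d_ab / 2), size_a + size_b)
        next_id += 1
    tree, _ = next(iter(clusters.values()))
    return tree
```

a call such as `upgma(["human", "chimp", "mouse", "gallus"], m)`, where `m` is a 4 x 4 symmetric matrix of made-up toy distances (not the values of table 3), joins the closest pair first and attaches the most distant label last, which is the behaviour the text relies on when gallus ends up separated from the mammals.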
moreover, we use our method, mbkm with the similarity measure dmk, in phylogenetic analysis to show how well the genes are grouped together and how well the resulting trees agree with existing phylogenies. in order to illustrate the efficiency of mbkm in gene sequence clustering, we ran mbkm with the k-tuple distance and dmk on the real data sets listed in table 1. the clustering results are compared with those of the km, sl, cl, al and bkm algorithms. for bkm, the number of iterations for each bisecting step is set to 5. we ran bkm 10 times to obtain the average f-measure. by combining the six clustering algorithms with the two similarity measures, we have 12 combinations for performance assessment. the combinations are km with k-tuple, sl with k-tuple, cl with k-tuple, al with k-tuple, bkm with k-tuple, mbkm with k-tuple, km with dmk, sl with dmk, cl with dmk, al with dmk, bkm with dmk and mbkm with dmk. the clustering performance of different clustering methods is the result of a combination of factors, including the type of sequence distance used for clustering and the choice of clustering algorithm. table 2 shows the clustering performance on the data sets for all 12 clustering methods. for each data set, we set the number of clusters to the real number of classes during the clustering run. for example, the real number of clusters is 8 in ds1 and 6 in ds2. from table 2, we observe that mbkm using dmk achieves the best result and clearly outperforms the other methods on the four data sets. the average f-measure of mbkm with k-tuple is about 2.2% higher than km with k-tuple (p = 0.036), 45% higher than sl with k-tuple (p = 0.00195), 11.4% higher than cl with k-tuple (p = 0.0424), 19% higher than al with k-tuple (p = 0.08615) and 2.3% higher than bkm (p = 0.0141). for mbkm with dmk, the f-measures for ds1, ds2, ds3, and ds4 are 0.808, 0.9645, 0.9143, and 0.9587 respectively.
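for orientation, the mbkm procedure evaluated in these experiments can be sketched as below. this is an illustrative sketch under stated assumptions, not the authors' java implementation: sequences are represented by numeric feature vectors compared with euclidean distance (a stand-in for the dmk vectors), the two initial centroids are the farthest pair in the cluster, the cluster with the largest variance is the one split, and the function returns only the final flat clusters rather than the binary tree. all function names are hypothetical.

```python
import math


def centroid(points):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))


def variance(points):
    """Mean squared distance of the members to their centroid (compactness)."""
    mu = centroid(points)
    return sum(math.dist(p, mu) ** 2 for p in points) / len(points)


def farthest_pair(points):
    """Two members at maximum distance, used as initial centroids."""
    _, i, j = max(
        (math.dist(points[i], points[j]), i, j)
        for i in range(len(points))
        for j in range(i + 1, len(points))
    )
    return points[i], points[j]


def bisect(points, max_iter=20):
    """Split one cluster in two with 2-means seeded by the farthest pair."""
    c1, c2 = farthest_pair(points)
    left, right = [], []
    for _ in range(max_iter):
        new_left = [p for p in points if math.dist(p, c1) <= math.dist(p, c2)]
        new_right = [p for p in points if math.dist(p, c1) > math.dist(p, c2)]
        if not new_left or not new_right or (new_left, new_right) == (left, right):
            break                       # degenerate or converged assignment
        left, right = new_left, new_right
        c1, c2 = centroid(left), centroid(right)
    return left, right


def mbkm(points, k):
    """Repeatedly bisect the least compact cluster until k clusters exist."""
    clusters = [list(points)]
    while len(clusters) < k:
        candidates = [c for c in clusters if variance(c) > 0]
        if not candidates:
            break                       # nothing left that can be split
        target = max(candidates, key=variance)
        clusters = [c for c in clusters if c is not target] + list(bisect(target))
    return clusters
```

seeding from the farthest pair makes the split deterministic, which matches the motivation given for mbkm: repeated runs on the same data return the same clusters, whereas km and bkm have to be averaged over several runs.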
on average, the value of the f-measure given by mbkm is 14.2% better than km (p = 0.00025), 21.3% better than sl (p = 0.0105), 15.4% better than cl (p = 0.02835), 10.1% better than al (p = 0.0686), and 2.3% higher than bkm (p = 0.0015) respectively. these results show that our method, combining mbkm with dmk, is able to achieve high quality results on all the data sets. because the clustering methods listed in table 2 use the number of clusters as an input parameter, we analyze the effects of varying the number of clusters on the clustering performance. this analysis is applied to the ds1, ds2, ds3 and ds4 datasets and all 12 combinations. figures 2 and 3 show the results of these runs based on the k-tuple distance and dmk, respectively. the data used for generating these figures are included in additional file 1: tables s3-s10. figure 2 illustrates the results of the six clustering algorithms with the k-tuple distance. from figure 2 and additional file 1: tables s3-s6, mbkm achieves better f-measures than the other five clustering algorithms at the real number of clusters on all the data sets. although the other clustering algorithms give slightly better results in terms of f-measure in some cases, mbkm performs better than the other clustering algorithms in terms of the average of the f-measure values (average values are shown in additional file 1: tables s3-s6). this result shows that on average, mbkm performs better than the other clustering algorithms for a range of cluster numbers in the vicinity of the real number of clusters. it also implies that varying the number of clusters as input for these clustering algorithms does not affect the performance. figure 3 shows the results of the clustering algorithms with dmk. mbkm obtains the highest f-measure values among the six clustering algorithms at the real number of clusters. on average, mbkm achieves better results than the other clustering algorithms for ds2, ds3, and ds4.
for ds1, the average value of mbkm is very close to that of al and higher than those of the other clustering algorithms. overall, mbkm produces consistently high quality clusters in the neighborhood of the real number of clusters (data shown in additional file 1: tables s7-s10). the f-measures given by mbkm are higher than those of the other clustering methods at the corresponding number of clusters in most cases. from figures 2 and 3, we can see that dmk achieves better cluster quality than the k-tuple distance in terms of f-measure. using the same clustering algorithm on the same data set, dmk achieves a higher average of the f-measure values than the k-tuple distance, and dmk also obtains higher f-measures at the corresponding number of clusters (data shown in additional file 1: tables s3-s10). from both figures, we find that the f-measure changes as the number of clusters changes. as is known, the f-measure is a balanced measure of precision and recall. the ideal condition is when the number of clusters is equal to the real number. when the number of clusters is greater than or less than the real number, the f-measure will be affected. with regard to clustering algorithms, sl performs poorly in many cases; this may be because sl uses the nearest pair of sequences and may lead to bad splits of one cluster if two or more clusters show different pattern densities. for km and bkm, the results of many runs are lower than those of mbkm. on the whole, mbkm achieves better results than the other clustering algorithms, and mbkm combined with dmk achieves the best results among these clustering methods in our experiments. the task of sequence clustering is to group given sequences into clusters. the similarity measure, dmk, measures the similarity between dna sequences based solely on the k-tuple. it is more effective than the k-tuple distance, which is one of the most widely used methods.
the clustering algorithm, mbkm, can obtain better clustering results and can reveal the relationships among clusters in a hierarchical manner. in the next experiments, we combine mbkm with dmk to cluster dna sequences. in order to further illustrate the efficiency of our method, combining mbkm and dmk, we compare mbkm with dmk to two other clustering programs: blastclust [27] and cd-hit-est [39]. blastclust is an alignment-dependent clustering algorithm from the ncbi blast package. blastclust accepts a number of parameters that can be used to control the clustering stringency, including thresholds for score density (−s parameter) and alignment length (−l parameter). cd-hit-est is a popular dna clustering program based on a greedy incremental clustering method. cd-hit-est groups dna sequences into clusters that meet a user-defined similarity threshold (−c parameter) and uses short-word filters to rapidly determine whether two sequences are similar, which reduces the number of full alignments necessary. we perform tests using blastclust and cd-hit-est on the data sets listed in table 1. in order to obtain the best possible performance of blastclust, we set -p to f (input type is nucleotide sequence) and vary the input parameters, -s and -l, to evaluate the results. the score density, -s parameter, varies between 10 and 90 with step size 10, and the alignment length, -l parameter, varies between 0.1 and 0.9 with step size 0.1. other parameters are kept at their defaults. for cd-hit-est, because the sequence identity threshold, -c parameter, should be greater than or equal to 0.8 in the program, we vary the -c parameter between 0.8 and 1 with step size 0.02, and set the word length to its default value. the best results from the different parameter combinations are recorded. for mbkm with dmk, we set the size of the k-tuple to 3 and use the real number of clusters as input.
as blastclust and cd-hit-est do not use the number of clusters as input, we choose the resulting class i, which has the max f(i, j) for cluster j, to calculate the f-measures. the results, which contain the corresponding f-measures and the execution time, are summarized in table 4. table 4 demonstrates that mbkm with dmk produces good results relative to each original cluster set in terms of f-measure. every f-measure of mbkm with dmk is higher than 0.8 and the highest is 0.9645. it is also seen in the table that mbkm with dmk outperforms blastclust and cd-hit-est on all the data sets. blastclust and cd-hit-est tend to give more clusters than the real number of classes; therefore, blastclust and cd-hit-est give high precision and low recall values, and neither of the two performs well in terms of f-measure. the execution times reported in table 4 for algorithm comparison show that mbkm with dmk is faster than blastclust and cd-hit-est. in cases where the real number of clusters is unknown, the performance of our algorithm will be affected. in order to compare with blastclust and cd-hit-est on relatively fair ground, we can vary the number of clusters and take the average of the f-measure values over the different numbers of clusters. for instance, we run mbkm with dmk with the number of clusters ranging from 3 to 20, and the average values of the f-measure are 0.7065, 0.8533, 0.8205 and 0.8429 for ds1, ds2, ds3 and ds4, respectively. as shown in additional file 1: tables s7-s10, these values are also higher than the corresponding f-measures of blastclust and cd-hit-est. in this experiment, we used mbkm with dmk to construct phylogenetic trees.

1) the clustering result of 10 species

we apply mbkm with dmk to the 10 dna sequences of the β-globin gene in table 4. the clustering result is shown in figure 4 (a).
using the same data set, we also build the phylogenetic tree using clustalw [56] and muscle [57] for alignment, and upgma and the maximum likelihood (ml) method (in the phylip package) for presenting the tree. figure 4 (b) and 4(c) show the trees built by clustalw with upgma and muscle with ml respectively. the trees built by muscle with upgma and clustalw with ml are provided in figure 1 of additional file 1. in figure 4 (a), human, gorilla, chimpanzee and lemur are closer to bovine and goat than to mouse and rat; this topology is in complete agreement with feng et al. [54] and cao et al. [55], confirming the outgroup status of rodents relative to ferungulates and primates. moreover, the tree in figure 4 (a) is identical to the tree in figure 4(

figure 4: the phylogenetic trees for 10 species using the full dna sequences of β-globin.

species. analysis of h1n1 is critical for preparing a strategy to prevent and to control influenza epidemics and pandemics. the h1n1 avian influenza is characterized by its continuous antigen variation, which is mainly caused by the ha and na proteins, of which the ha protein has the highest rate of mutation. the ha protein plays a critical role in identifying and adsorbing to the host cell receptor in the infection process, and it is the decisive factor of host specificity. we use our method to verify the phylogenetic relationships of h1n1, and the result is included in additional file 1. the clustering result using mbkm with dmk is shown in figure 5(a). as a comparison, we also use clustalw with upgma and muscle with ml to construct the phylogenetic trees, which are presented in figure 5 (b) and 5(c). as seen from figure 5 (a), the 60 h1n1 viruses are distinctly divided into four main groups using our method. the four groups include european swine older than 2009 (g1), avian older than 2009 (g2), american swine older than 2009 (g3) and the new 2009 viruses from human, swine and avian (g4).
the result shows that the new 2009 human h1n1 viruses have a closer relationship with old american swine than with old avian and european swine. this grouping result is generally consistent with the topology given by clustalw with upgma, which is shown in figure 5(b), and the one presented by muscle with upgma, which is provided in additional file 1, as well as the result suggested by zhao et al. [58]. figure 5 (c), built by muscle using the ml method, also shows that the new 2009 human h1n1 viruses have a close relationship with old american swine, except that the position of the group (old avian swine, european swine) is different from the positions in figure 5 (a) and 5(b). clustalw with ml (in additional file 1) also classifies the 60 h1n1 viruses into four groups except that swine/wisconsin/1961 and swine/wisconsin/1961 are not classified well. our method analyzed the 60 h1n1 viruses within 1 second, while, on the same data set, upgma with clustalw and muscle took 460 and 60.1 seconds to build the tree, and ml with clustalw and muscle took 571 and 188.1 seconds to build the tree, respectively. our method, mbkm with dmk, performs well when clustering 10 species and 60 h1n1 viruses. it obtains results similar to those of the alignment-based methods. furthermore, our method is much faster than the alignment-based methods.

figure 5: the phylogenetic trees for 60 h1n1 viruses.

in order to compare the speed of our method with the multiple sequence alignment based methods, clustalw and muscle, we performed the test on two sets of sequences. the first set consists of six datasets. all six datasets include 100 sequences. the lengths of the sequences in the six datasets are around 1000, 2000, 3000, 4000, 5000 and 6000 respectively. the second set also consists of six datasets. the number of sequences in each dataset is 20, 40, 60, 80, 100 and 120 respectively; the lengths of all the sequences are around 3000.
because the ml method is slower than upgma, we use upgma to build the phylogenetic tree from the results of clustalw and muscle and record the time used for each method. the results in figure 6 show that our method is much faster than the other two methods. the actual time differences are much larger than the visual differences in the figure since we use log(time) as the y-axis. for dmk, the time complexity of transforming the gene sequence $s_1 \cdots s_l$ to a vector is $O(l \cdot 4^k)$, thus the time complexity of generating the vectors for the whole sequence database is $O(n \cdot l \cdot 4^k)$, where l is the average length of the sequences and n is the number of sequences. setting k to 3 yields good results in our experiments, and we fix k to 3 as the size of the k-tuple. dmk has linear time complexity with respect to both l and n. the time consumed by mbkm is primarily determined by choosing the initial cluster centroids. for n sequences, this step has a time complexity of $O(n^2)$. the time complexity of the clustering step in mbkm is $O(n \log k)$. the following scalability test on our method, mbkm with dmk, confirms that our method has linear time complexity with respect to the average length of the sequences. the scalability test uses theoretical model sequences composed of the four symbols 'a', 'c', 'g' and 't'. the method is implemented in java and run on a computer with a 3.00 ghz cpu and 2 gb ram. figure 7 (a) illustrates the relationship between the runtime and the number of sequences (run on a computer with 8 gb ram). to test the scalability with respect to the number of sequences, we use five data sets which consist of 5000, 10000, 15000, 20000, 25000, 30000, 35000 and 40000 sequences.
each data set contains 10 clusters and all the sequences have the same length, 100. the curve in figure 7 (a) is primarily consistent with the time complexity of mbkm, $O(n^2)$. the scalability with respect to the length of sequences was tested on five datasets with five different sequence lengths: 10000, 20000, 30000, 40000 and 50000; each set consists of 4 clusters and 100 sequences. the sensitivity with respect to the length of the sequences is illustrated in figure 7 (b), from which we can see that the running time of our method increases linearly as the length of the sequences increases. in this paper, we presented a novel approach for dna sequence clustering, mbkm, based on a new sequence similarity measure, dmk, which is extracted from dna sequences based on the position and composition of oligonucleotide patterns. the experimental results show that the method of combining mbkm with dmk is effective in classifying dna sequences with similar biological characteristics and in discovering the underlying relationships among the sequences. in addition, dmk can achieve comparable or better accuracy than the frequency-based distance measure. our proposed method can be applied to study gene families and it can also help with the prediction of novel genes. furthermore, mbkm with dmk can generate cluster trees that are useful for understanding the processes governing gene evolution. in addition, our method may be extended to protein sequence analysis and to metagenomics, for identifying the source organisms of metagenomic data. our method has limitations too. for example, the method does not consider edge length, and does not address problems with long repeated sequences or long insertions. in future we will try to address these problems.

additional file 1: supplementary data.

the authors declare that there are no competing interests.

authors' contributions: dw designed the algorithm, conducted the experiments, and wrote the manuscript. qj supervised the project and proposed the data mining algorithm. yw guided the experiments, wrote the manuscript and analyzed the results. sw guided the experiment analysis, and proposed ideas for the sequence clustering algorithm. all authors read and approved the final manuscript.

[1] the evolution of mammalian gene families
[2] a novel clustering method via nucleotide-based fourier power spectrum analysis
[3] a basic local alignment search tool
[4] improved tools for biological sequence comparison
[5] alignment-free sequence comparison-a review
[6] alignment-free estimation of nucleotide diversity
[7] a novel feature-based method for whole genome phylogenetic analysis without alignment: application to hev genotyping and subtyping
[8] efficient estimation of pairwise distances between genomes
[9] alignment-free detection of local similarity among viral and bacterial genomes
[10] cluss: clustering of protein sequences based on a new similarity measure
[11] alignment-free sequence comparison (i): statistics and power
[12] numerical characteristics of word frequencies and their application to dissimilarity measure for sequence comparison
[13] an improved string composition method for sequence comparison
[14] a mathematical consideration of the word-composition vector method in comparison of biological sequences
[15] a measure of the similarity of sets of sequences not requiring sequence alignment
[16] a measure of dna sequence dissimilarity based on mahalanobis distance between frequencies of words
[17] statistical measures of dna dissimilarity under markov chain models of base composition
[18] integrated gene and species phylogenies from unaligned whole genome protein sequences
[19] statistical method for predicting protein coding regions in nucleic acid sequences
[20] wse, a new sequence distance measure based on word frequencies
[21] a poisson model of sequence comparison and its application to coronavirus phylogeny
[22] performance comparison of gene family clustering methods with expert curated gene family data set in arabidopsis thaliana
[23] classification, clustering, features and distances of sequence data. sequence data mining
[24] biometry: the principles and practice of statistics in biological research
[25] cluster analysis
[26] efficient algorithms for accurate hierarchical clustering of huge datasets: tackling the entire protein space
[27] documentation of the blastclust-algorithm
[28] generage: a robust algorithm for sequence clustering and domain detection
[29] swords: a statistical tool for analyzing large dna sequences
[30] hierarchical clustering algorithm for comprehensive orthologous-domain classification in multiple genomes
[31] clustering 16s rrna for otu prediction: a method of unsupervised bayesian clustering
[32] jacop: a simple and robust method for the automated classification of protein sequences with modular architecture
[33] interactive clustering for exploration of genomic data
[34] clustering algorithms for its sequence data with alignment metrics
[35] penalized and weighted k-means for clustering with scattered objects and prior information in high-throughput biological data
[36] classifying synthetic and biological dna sequences with side effect machines
[37] criterion functions for document clustering: experiments and analysis
[38] enhanced bisecting k-means clustering using intermediate cooperation
[39] cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences
[40] a methodology for comparative functional genomics
[41] easycluster: a fast and efficient gene-oriented clustering tool for large-scale transcriptome data
[42] a dna sequence distance measure approach for phylogenetic tree construction
[43] the gene is dead-long live the gene: conceptualizing genes the constructionist way
[44] alignment and clustering of phylogenetic markers-implications for microbial diversity studies
[45] introduction to computational biology: maps, sequences, and genomes. london: chapman and hall
[46] biological sequence analysis: probabilistic models of proteins and nucleic acids
[47] visualization of genomic data using internucleotide distance signals
[48] estimating the entropy of dna sequences
[49] relative entropy of dna and its application
[50] a comparison of document clustering techniques
[51] hierarchical clustering algorithms for document datasets
[52] fast and effective text mining using linear-time document clustering
[53] treeview: an application to display phylogenetic trees on personal computers
[54] new method for comparing dna primary sequences based on a discrimination measure
[55] conflict among individual mitochondrial proteins in resolving the phylogeny of eutherian orders
[56] multiple sequence alignment with the clustal series of programs
[57] muscle: multiple sequence alignment with high accuracy and high throughput
[58] a new distribution vector and its application in genome clustering

a novel hierarchical clustering algorithm for gene sequences

key: cord-268467-btfz6ye8 authors: schreiber, steven s.; kamahora, toshio; lai, michael m.c. title: sequence analysis of the nucleocapsid protein gene of human coronavirus 229e date: 1989-03-31 journal: virology doi: 10.1016/0042-6822(89)90050-0 sha: doc_id: 268467 cord_uid: btfz6ye8 abstract human coronaviruses are important human pathogens and have also been implicated in multiple sclerosis. to further understand the molecular biology of human coronavirus 229e (hcv-229e), molecular cloning and sequence analysis of the viral rna have been initiated. following established protocols, the 3′-terminal 1732 nucleotides of the genome were sequenced.
a large open reading frame encodes a 389 amino acid protein of 43,366 da, which is presumably the nucleocapsid protein. the predicted protein is similar in size, chemical properties, and amino acid sequence to the nucleocapsid proteins of other coronaviruses. this is especially evident when the sequence is compared with that of the antigenically related porcine transmissible gastroenteritis virus (tgev), with which a region of 46% amino acid sequence homology was found. hydropathy profiles revealed the existence of several conserved domains which could have functional significance. an intergenic consensus sequence precedes the 5′-end of the proposed nucleocapsid protein gene. the consensus sequence is present in other coronaviruses and has been proposed as the site of binding of the leader sequence for mrna transcriptional start. this region was also examined by primer extension analysis of mrnas, which identified a 60-nucleotide leader sequence. the 3′-noncoding region of the genome contains an 11-nucleotide sequence, which is relatively conserved throughout the coronavirus family and lends support to the theory that this region is important for the replication of negative-strand rna. human coronavirus 229e (hcv-229e) belongs to one of two major antigenic groups of human coronaviruses (macnaughton, 1981). it shares antigenic relationships with other coronaviruses, such as porcine transmissible gastroenteritis virus (tgev), feline infectious peritonitis virus (fipv), and canine coronavirus (ccv). the other well-characterized human coronavirus, hcv-oc43, is in a separate antigenic group which includes mouse hepatitis virus (mhv) and bovine coronavirus (bcv). both human coronaviruses are mainly respiratory pathogens and have been estimated to cause up to 25% of common colds (mcintosh et al., 1974; wege et al., 1982). they have also been implicated in gastrointestinal diseases (resta et al., 1985).
furthermore, the isolation of coronaviruses bearing an antigenic relationship to hcv-oc43 from the central nervous system of two patients with multiple sclerosis has suggested a possible etiologic relationship between human coronaviruses and multiple sclerosis (burks et al., 1980). this possibility is supported by the observation that neurotropic strains of mhv cause demyelination in the central nervous system of rodents (weiner and stohlman, 1978). thus, human coronaviruses are important human pathogens. the structural and biochemical properties of several coronaviruses, particularly mhv and avian infectious bronchitis virus (ibv), have been well characterized (lai et al., 1987; boursnell et al., 1987). [footnotes: sequence data from this article have been deposited with the embl/genbank data libraries under accession no. j04419. 1 to whom requests for reprints should be addressed. 2 present address: department of virology, tottori university, school of medicine, yonago 683, japan.] the virion contains a single-stranded, positive-sense rna molecule (molecular weight 6-8 x 10^6 da) (lai and stohlman, 1978) associated in a helical conformation with nucleocapsid proteins (n). the viral nucleocapsid is enclosed by an envelope, in which are embedded at least two types of viral proteins, the peplomer (e2) and matrix (e1) glycoproteins. coronavirus rna replication occurs in the cytoplasm of infected cells and is mediated by a virus-encoded rna-dependent rna polymerase (brayton et al., 1982). the virus-specific mrna in infected cells comprises a genomic-sized rna plus six subgenomic mrna species. these mrnas are arranged in a nested-set structure, which is characterized by rnas having common 3'-termini but extending for varying lengths in the 5'-direction (lai et al., 1981). only the 5'-proximal regions of each mrna are translated (rottier et al., 1981). a unique feature of the structure of coronavirus is the existence, at the 5'-end of each mrna, of an identical leader sequence. 
this sequence is derived from the 5'-end of the genomic rna and is of approximately 70 nucleotides in length (lai et al., 1983, 1984). recent evidence has supported a role for the leader sequence in mediating a novel type of discontinuous transcription of genomic rna (baric et al., 1985; makino et al., 1986). in contrast to other coronaviruses, the molecular biology of human coronaviruses is relatively poorly understood. the genomic rna of both hcv-229e and hcv-oc43 has a molecular weight of approximately 6 x 10^6 da (hierholzer et al., 1981). the six subgenomic rna species appear to have lower molecular weights than those of the corresponding mhv rnas (weiss and leibowitz, 1981). the structure of these mrnas is not yet known. analysis of purified hcv-229e virions has revealed three major polypeptides: a glycosylated protein with a molecular weight of 180 kda, a phosphorylated nucleocapsid protein of 50 kda, and a family of polypeptides with molecular weights of 25, 23, and 21 kda (kemp et al., 1984). in addition, several minor nonstructural polypeptides of 107, 92, and 39 kda have been identified (kemp et al., 1984). the functions of these proteins have not yet been characterized. to further understand the molecular biology of hcv-229e, we have initiated molecular cloning and sequence analysis of hcv-229e rna. in this paper we report the sequence analysis of the gene encoding the nucleocapsid protein of hcv-229e. in addition, the mrna leader sequence was also identified. the results are compared with sequences of other coronaviruses including mhv, bcv, ibv, and tgev. hcv-229e (obtained from dr. j. fleming, university of southern california) was propagated at low multiplicities of infection in human fetal lung cells l132 (kennedy and johnson-lussenberg, 1975/1976) using dulbecco's modified eagle's medium (dmem) supplemented with 10% fetal calf serum. 
virus purification and preparation of virion rna. following a virus adsorption period of 1 hr at 37°, hcv-229e-infected l132 monolayers were incubated at 37° for 24 to 48 hr, at which time the cell culture fluid was harvested. viruses were precipitated from 2 liters of culture fluid with 50% ammonium sulfate and centrifuged at 8000 rpm for 30 min. the pellet was resuspended in nte buffer (0.1 m nacl, 0.01 m tris-hydrochloride (ph 7.2), 1 mm edta) and then placed on a discontinuous sucrose gradient consisting of 60, 50, 30, and 20% (w/w) sucrose in nte buffer and centrifuged at 26,000 rpm for 13 hr at 4° in a beckman sw28.1 rotor. the virus band at the interface between 50 and 30% sucrose was collected and diluted threefold with nte buffer. the diluted virus suspension was centrifuged on a linear sucrose gradient at 26,000 rpm in an sw28.1 rotor for 4 hr at 4°. the virus band was collected and treated with proteinase k (0.2 mg/ml) for 20 min at 37°, followed by 1% sds for 30 min at 37°. genomic rna was extracted with phenol and then with phenol/chloroform, and precipitated with ethanol. monolayers of l132 cells grown in 100 x 20-mm culture dishes were infected with hcv-229e. cells were incubated in phosphate-free dmem containing 1% dialyzed fetal calf serum 4 hr prior to rna extraction. actinomycin d (1 µg/ml) (sigma) and [32p]orthophosphate (70 µci/ml) (icn radiochemicals) were added at 3 and 2 hr, respectively, prior to rna extraction at 15 hr postinfection (p.i.). cells were collected in cold phosphate-buffered saline and centrifuged at 5000 rpm for 3 min at 4°. the pellet was mixed with cold 0.5% nonidet-p40 in nte buffer, incubated for 10 min at 4°, and then centrifuged at 5000 rpm for 3 min. the supernatant was transferred to a fresh tube containing 1/10 vol of 10% sds at room temperature and vortexed briefly. intracellular rna was extracted with phenol and phenol/chloroform and precipitated with ethanol. 
poly(a)-containing rna was selected by oligo(dt)-cellulose chromatography as previously described (makino et al., 1984). to examine the kinetics of viral mrna synthesis, intracellular rna was extracted from virus-infected l132 monolayers in 60 x 15-mm culture dishes at 7, 21, 29, 46, and 58 hr postinfection. cdna cloning. cdna cloning was performed using a modified method of gubler and hoffman (1983). the poly(a)-containing rna extracted from 229e-infected l132 monolayers was precipitated, dried, and resuspended in 6.72 µl of autoclaved water. the rna was incubated with 10 mm methylmercuric hydroxide in an 8 µl total volume for 10 min at room temperature. first-strand cdna synthesis was carried out in a 50-µl reaction mixture containing 60 units rnasin (promega biotec), 10 mm mgcl2, 100 mm kcl, 50 mm tris-hcl (ph 8.3 at 42°), 10 mm dtt, 1.25 mm dntps, 40 µci [α-32p]datp (3000 ci/mmol), 28 mm β-mercaptoethanol, and 10 ng oligo(dt)12-18 primer. after 5 min at room temperature, 40 units of amv reverse transcriptase (life science) was added and the mixture was incubated for 1 hr at 42°. the reaction was stopped by adding 4.4 µl of 250 mm edta. the products were extracted with phenol/chloroform and precipitated with ethanol containing 0.3 m ammonium acetate. for second-strand synthesis, the 100-µl reaction mixture contained 5 mm mgcl2, 100 mm kcl, 20 mm tris-hcl (ph 7.5), 50 µg/ml bovine serum albumin (bsa), 10 mm ammonium sulfate, 0.15 mm β-nad, 100 µm dntps, 25 units of escherichia coli dna polymerase i, 2 units of e. coli dna ligase, and 0.8 units of rnase h. sequential incubations were for 1 hr at 12° and 1 hr at 22°. the reaction was stopped by the addition of 8.7 µl of 250 mm edta and the products were extracted with phenol/chloroform and precipitated with ethanol in the presence of 0.3 m ammonium acetate. 
homopolymeric tailing of double-stranded cdna with poly(c) was carried out in a 1 ~-pi reaction mixture containing 10 units of terminal transferase, 200 mm potassium cacodylate, 0.5 mm cocl2, 25 mm tris-hcl (ph 6.9), 2 mm dtt, 250 µg/ml bsa, and 50 µm dctp at 37° for 4 min. the dc-tailed double-stranded dna was annealed to 200 pg of dg-tailed psti-cut pbr322 plasmid in 20 µl of a buffer containing 10 mm tris-hcl (ph 7.4), 100 mm nacl, and 0.25 mm edta. the mixture was incubated for 5 min at 68° and then cooled slowly overnight. the annealed molecules were used to transform e. coli mc1061 as described (dagert and ehrlich, 1979). colonies grown on lb/tetracycline plates were incubated at 37° for 12 hr and transferred to colony/plaque screen disks (new england nuclear). bacterial lysis and dna fixation were carried out according to the methods previously described (grunstein and hogness, 1975). the disks were prehybridized in a solution containing 0.2% polyvinylpyrrolidone (mw 40,000), 0.2% ficoll (mw 400,000), 0.2% bsa, 0.05 m tris-hcl (ph 7.5), 1% sds, 1 m nacl, 10% dextran sulfate, and 100 µg/ml denatured salmon sperm dna at 65° for 6 hr. fragments derived from either the 5'- or 3'-ends of gene 7 were labeled with 32p by nick-translation and added to the solution. hybridization was carried out for 20 hr at 65°. the disks were then washed twice in 2x ssc (0.3 m nacl, 30 mm sodium citrate) at room temperature, twice in 2x ssc containing 1% sds for 30 min at 65°, and twice in 0.1x ssc at room temperature for 30 min. the disks were air-dried and exposed to x-ray film at -70°. intracellular rna from virus-infected cells was denatured by glyoxal treatment and separated by electrophoresis on a 1% agarose gel containing 10 mm sodium phosphate (ph 7.0) as described previously (mcmaster and carmichael, 1977). 
rna transfer to biodyne nylon filters (icn radiochemicals) and subsequent hybridization were performed according to the method described by thomas (1980). a synthetic oligodeoxyribonucleotide was 5'-end-labeled with [γ-32p]atp by polynucleotide kinase (pedersen and haseltine, 1980). the total amount of poly(a)-containing rna extracted from 229e-infected cell monolayers in three 150 x 20-mm culture dishes was incubated in 8 µl of distilled water containing 10 mm methylmercuric hydroxide for 10 min at room temperature. a further incubation was carried out in a 50-µl reaction volume containing 60 units of rnasin (promega), 10 mm mgcl2, 100 mm kcl, 50 mm tris-hcl (ph 8.3 at 42°), 10 mm dtt, 1.25 mm dntps, 28 mm β-mercaptoethanol, 5'-end-labeled synthetic oligodeoxyribonucleotides, and 20 units of amv reverse transcriptase (life science) for 1 hr at 42°. reaction products were extracted with phenol/chloroform, precipitated with ethanol, and then analyzed by electrophoresis on a 6% polyacrylamide gel containing 8.3 m urea. the primer-extended product was identified by autoradiography and eluted from the gel according to the published procedure (maxam and gilbert, 1977). sequencing was carried out by the dideoxyribonucleotide chain termination method (sanger et al., 1977) as well as the chemical modification procedure (maxam and gilbert, 1977). in the first method, fragments of cdna inserts generated by various restriction endonucleases were cloned into the m13 vectors mp18 and mp19 (messing and vieira, 1982). [α-35s]datp was used as a label. sequence data were also obtained by chemical modification (maxam and gilbert, 1977) of various cdna fragments subcloned into the pt7-3 vector (tabor and richardson, 1985). in the second method, cdna fragments were 3'-end-labeled with klenow fragment at internal restriction sites or, alternatively, at the polylinker cloning site of pt7-3. 
end-labeled cdna restriction fragments were separated by electrophoresis on preparative polyacrylamide gels (maxam and gilbert, 1980) and purified as described previously (hansen et al., 1980; hansen, 1981). sequencing of the primer-extended product of mrna7 was performed by the chemical modification procedure (maxam and gilbert, 1977). sequence analysis was performed by the intelligenetics and seqaid programs. hydropathy profiles were constructed using the pepplot program of the university of wisconsin computer genetics group, which employs both the kyte-doolittle (kd) and goldman, engelman, steitz (ges) algorithms. to determine the optimum time for extracting 229e-specific mrnas, we first studied the kinetics of virus-specific mrna synthesis. intracellular rna was extracted from infected l132 monolayers at specified times p.i. the rna was separated by agarose gel electrophoresis (fig. 1). as can be seen, viral mrna synthesis could be detected as early as 7 hr p.i. and reached maximum at 29 hr p.i. thereafter, total rna synthesis gradually declined. by 46 hr p.i. only the most abundant mrna species were evident. the number and size of these mrna species are comparable to those of mhv mrnas and are in agreement with previously published results (weiss and leibowitz, 1981). significantly, mrna 2a, which was previously found only in bcv-infected cells and proposed to encode hemagglutinins (king et al., 1985; keck et al., 1988), was not present. this is consistent with the finding that hcv-229e does not have hemagglutinating activity (hierholzer, 1976). the relative amounts of the mrna species were the same throughout the replication cycle. therefore, in all of our subsequent experiments, the virus-specific intracellular rnas were extracted at 15 hr p.i. molecular cloning of hcv-229e genomic rna and intracellular virus-specific mrnas. cdna cloning was initially performed using virion genomic rna as a template. 
the sizes of inserts in the resultant cdna clones ranged from 0.2 to 0.5 kb in length. one clone, a34, contained a 0.45-kb insert, which was subsequently characterized by restriction mapping and northern blot analysis. the 0.45-kb fragment was labeled with 32p by nick-translation and hybridized with intracellular rna from 229e-infected cells. the result, shown in fig. 2, revealed that the fragment hybridized to each of the mrna species. this result suggested that the hcv-229e subgenomic mrnas possess a nested-set structure similar to other coronaviruses and that a34 represented a cdna clone of either the 3'-end of the genomic rna or the leader sequence. cloning was subsequently carried out using intracellular rna from 229e-infected cells as a template. the resulting cdna clones were screened by colony hybridization using the 0.45-kb fragment from clone a34 as a nick-translated probe (fig. 3). several positive colonies were identified and characterized further. clone l8 contained a 3.6-kb insert but lacked a 3'-poly(a) tail. clone l37, which contained an insert of 1.7 kb, overlapped l8 but was 0.1 kb shorter at the 3'-end. this clone also lacked a poly(a) sequence (see below). therefore, additional cdna clones were isolated using a 0.24-kb bali-ecori fragment of l8 (fig. 3a) as a probe. these latter clones were further characterized by southern blot analysis. clone s10 contained an insert of 0.8 kb which overlapped the 3'-ends of the two previous clones and extended another 0.4 kb in that direction. figure 3b shows the orientation and sizes of clones l8, l37, s10, and a34 with reference to the viral genome. restriction enzyme sites used for sequencing are also shown. to determine the sequence of the 3'-end of hcv-229e genome, various restriction fragments of l8, l37, and s10 were subcloned into m13 vectors. for l8, only the 1.2-kb fragment extending from an internal psti site toward the 3'-end was sequenced. clone l37 was also sequenced in part. 
figure 3c shows the cdna fragments and strategy used in sequencing. each region was verified by dideoxy chain termination sequencing of both strands or by the chemical modification method. clone s10 was found to have a poly(a) stretch of 34 bases. figure 4 shows the complete dna sequence with a translation of the main open reading frame (orf) in one-letter amino acid code. [fig. 4 legend, partial: a primer extension study was carried out using a synthetic oligodeoxyribonucleotide complementary to an 18-mer sequence underlined near the 5'-end of the gene. the 3'-noncoding region contains a conserved sequence which is shown by the double line. the intergenic conserved sequence, tctaaact, is also shown (dotted line).] this orf extends from base 147 to base 1313 and predicts a 389 amino acid protein with a molecular weight of 43,366 da. this predicted molecular weight is slightly smaller than the measured molecular weight of the nucleocapsid protein of hcv-229e, which is 50 kda as determined by sds-polyacrylamide gel electrophoresis (macnaughton, 1980). the difference is probably due to phosphorylation or other modification of the protein. the predicted protein shares features with the nucleocapsid proteins of tgev, mhv, bcv, hcv-oc43, and ibv (kapke and brian, 1986; skinner and siddell, 1984; armstrong et al., 1983; lapps et al., 1987; kamahora et al., 1988; boursnell et al., 1985). namely, the protein is highly basic and rich in serine residues. sixty percent of the amino acid residues are basic and 12% are acidic. there are 39 serine residues (10% of total), which are presumed to be sites of phosphorylation (stohlman and lai, 1979). when compared to tgev, with which hcv-229e shares antigenic properties, both n proteins have identical amounts of basic and acidic amino acids and serine residues and similar molecular weights (kapke and brian, 1986). figure 5 shows a schematic diagram of the possible orfs obtained by translating the nucleotide sequence. 
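the orf map described above can be illustrated with a minimal sketch. it is not the software used in the study (the text names only the intelligenetics and seqaid programs); the function names `find_orfs` and `codons_in_span`, the atg-start rule, and the toy sequence in the usage note are assumptions made for illustration. the sketch scans the three forward reading frames for atg-initiated orfs of at least 30 codons (the cut-off mentioned for frame 1) and checks the arithmetic of the reported coordinates: bases 147 to 1313 span 1167 nucleotides, i.e. 389 codons, which agrees with the 389-residue protein if the coordinates exclude the stop codon.

```python
# hypothetical helper names; standard genetic code stop codons only
STOP_CODONS = {"taa", "tag", "tga"}


def codons_in_span(first_base, last_base):
    """Number of codons in a 1-based, inclusive nucleotide span."""
    length = last_base - first_base + 1
    if length % 3 != 0:
        raise ValueError("span length is not a multiple of three")
    return length // 3


def find_orfs(seq, min_codons=30):
    """Scan the three forward frames for ATG-initiated ORFs.

    Returns tuples (frame, first_base, last_base, n_codons), where the
    coordinates are 1-based, inclusive, and exclude the stop codon.
    """
    seq = seq.lower()
    orfs = []
    for frame in range(3):
        start = None  # 0-based index of the current candidate ATG
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "atg":
                start = i
            elif start is not None and codon in STOP_CODONS:
                n_codons = (i - start) // 3
                if n_codons >= min_codons:
                    orfs.append((frame + 1, start + 1, i, n_codons))
                start = None
    return orfs
```

for example, `codons_in_span(147, 1313)` gives 389 and `codons_in_span(322, 693)` gives 124, and on the toy sequence `"ccatgaaaaaaaaataagg"` the call `find_orfs(seq, min_codons=2)` reports a single four-codon orf in frame 3.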
the orf in frame 3 is likely the one which encodes the nucleocapsid protein. in frame 2, the 5'-flanking region probably contains part of the sequence of the matrix protein encoded by gene 6. this possibility is supported by the finding that reading frame 2 remains open at the extreme 5'-end. furthermore, the sequence tctaaact, which is found in the intergenic regions of several other coronaviruses (kapke and brian, 1986; skinner and siddell, 1984; armstrong et al., 1983; lapps et al., 1987; kamahora et al., 1988; budzilowicz et al., 1985), is also present between the presumed initiation codon of the main orf and the 3'-end of gene 6. this sequence is the proposed site of fusion of the leader sequence with the mrna coding region (makino et al., 1986; budzilowicz et al., 1985). the 3'-noncoding region contains the sequence tggaagagcca, 75 nucleotides from the 3'-end (fig. 4), which is relatively conserved among coronaviruses and is found at approximately the same location in all of these viral genomes (kapke and brian, 1986; skinner and siddell, 1984; armstrong et al., 1983; lapps et al., 1987; kamahora et al., 1988; boursnell et al., 1985) (table 1). there is only one nucleotide difference in this conserved sequence when it is compared with that of tgev, bcv, and hcv-oc43. two and three nucleotide differences are found in ibv and mhv, respectively. this conservation of sequence and location suggests that it may be important for viral rna replication. in frame 1, there are several additional orfs of at least 30 amino acids. some of these, including one found in the 3'-noncoding region, lack appropriate translation start sites. another long internal orf is found from base 322 through 693. 
this contains an appropriate initiation sequence and encodes a hypothetical protein of 13,974 da, which is rich in leucine residues (17%). the significance of this orf remains to be defined. the mrnas of coronaviruses contain a stretch of leader sequence which is derived from the 5'-end of the viral genome and exhibits homology with the intergenic consensus sequence (budzilowicz et al., 1985). since our cdna clones did not appear to contain leader sequences, we used primer extension studies to determine the sequence of the hcv-229e leader rna. a synthetic oligodeoxyribonucleotide which was complementary to an 18-mer sequence located near the 5'-end of the gene (fig. 4) was end-labeled and used in a primer extension study with poly(a)-selected intracellular mrna as a template. the reaction products, separated by agarose gel electrophoresis, revealed six bands (data not shown). since these bands were most likely to represent the primer-extended products of the individual mrna species, the smallest and most abundant band, corresponding to the primer-extended product of mrna7, was eluted and sequenced by the chemical modification method (maxam and gilbert, 1977). the sequence of the 3'-end of the primer-extended product was identical to the l8 sequence from nucleotides 129 to 171. at nucleotide 128, immediately 5' to the proposed leader mrna fusion site, the sequence diverged from the l8 sequence and revealed a putative 60-base leader sequence which is shown in fig. 6. the figure also shows a degree of homology with the leader sequence of ibv. considerably less homology exists between the leader sequence of hcv-229e and those of hcv-oc43 and mhv-jhm (data not shown). this report presented the primary sequence of the nucleocapsid gene and leader sequence of hcv-229e. 
when compared to the known sequences of other coronaviruses (kapke and brian, 1986; skinner and siddell, 1984; armstrong et al., 1983; lapps et al., 1987; kamahora et al., 1988; boursnell et al., 1985), common features of coronavirus nucleocapsid proteins emerged; namely, they are highly basic and have a high proportion of serine residues, which have been shown to be sites of phosphorylation (stohlman and lai, 1979). [fig. 6. hcv-229e mrna leader sequence compared to the leader sequence of ibv. hcv-229e: 5'-cttaag*taccttat*ctatcta*caaatagaaaag**ttgctttttagactttgtgtc*ta*cttc; ibv: 5'-acttaagatagatattaatatatatctattacactagccttgc**gctagatttttaa*cttaacaaa... the ibv leader extends for at least 16 nucleotides in the 3' direction.] the relationship between the nucleocapsid genes of hcv-229e and tgev is particularly interesting since the viruses are antigenically related (macnaughton, 1981). the predicted molecular weights of the n protein and the number of potential phosphorylation sites of both viruses are almost identical. although these two viruses have little nucleotide sequence homology between their nucleocapsid genes, the amino acid sequences are homologous within a limited region. amino acid sequence analysis revealed several structural features common to both viruses, which may have functional significance. for instance, there is a region of 46% homology within the amino-terminal one-third of the protein which extends from residues 29 to 134 in hcv-229e, and 41 to 146 in tgev. furthermore, approximately 10 amino acids downstream from the homologous region in both proteins lies an area which is abundant in serine residues, suggesting that this may be an important functional domain of the molecule. to further examine such functional homology between the two proteins, hydropathy profiles were constructed (fig. 7). the contour of these plots suggests that a certain degree of functional homology exists within the first and last one-third of each molecule, with an additional region around position 200. the peak around position 200 occurs just after the serine-rich region of the molecule. the relative conservation of these regions suggests a possible role in the interaction of the n protein with the viral genome. similar structural features exist among the n proteins of hcv-229e, ibv, mhv, hcv-oc43, and bcv (skinner and siddell, 1984; lapps et al., 1987; kamahora et al., 1988; boursnell et al., 1985). this is demonstrated by the hydropathy profiles of these proteins, which are also shown in fig. 7. further studies are required to reveal the functional significance of the conserved domains. another interesting finding is the open reading frame internal to the main coding region of the hcv-229e n gene. thus far, two other coronaviruses, bcv and mhv-jhm, have been found to contain internal orfs in gene 7 (skinner and siddell, 1984; lapps et al., 1987) which are preceded by optimum translation initiation signals according to kozak's consensus sequence (kozak, 1983). the predicted amino acid sequences could encode hypothetical proteins of molecular weights 13,973; 14,842; and 23,057 for hcv-229e, mhv-jhm, and bcv, respectively. interestingly, all three sequences are abundant in leucine residues (17 to 19%). hcv-oc43 also has two smaller internal orfs encoding potential leucine-rich proteins of 8830 and 16,297 molecular weights (kamahora et al., 1988). further studies to determine whether this hypothetical protein can be detected in 229e-infected cells or by in vitro translation of a full-length cdna clone (i.e., l8) are in progress. 
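the hydropathy comparison discussed above can be sketched as a sliding-window average over the kyte-doolittle scale. this is only an illustration of the general technique: the settings used by the pepplot program are not given in the text, so the window length of 9 residues and the function name `hydropathy_profile` are assumptions, and the ges scale is not covered.

```python
# kyte-doolittle hydropathy values per residue (one-letter code)
KD_SCALE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(protein, window=9):
    """Mean kyte-doolittle score in a window centred on each residue.

    The window length is an assumed parameter; it must be odd so that
    each window has a central residue. The result has
    len(protein) - window + 1 values (none if the protein is shorter).
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd number")
    scores = [KD_SCALE[aa] for aa in protein.upper()]
    half = window // 2
    return [
        sum(scores[i - half:i + half + 1]) / window
        for i in range(half, len(scores) - half)
    ]
```

positive values indicate hydrophobic stretches and negative values hydrophilic ones, so a basic, serine-rich protein such as the one described here would be expected to give a mostly negative profile; two profiles can then be compared position by position after alignment.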
finally, the 3'-noncoding conserved sequence of gene 7 lends additional support to a common ancestry for coronaviruses, regardless of antigenic subgroup. this sequence has been proposed as a recognition site for the virus-encoded rna-dependent rna polymerase prior to negative-strand synthesis (kapke and brian, 1986) . certainly future studies must focus on examining the role of this conserved region in the viral replication cycle. sequence of the nucleocapsid gene from murine coronavirus mhv-a59 characterization of leader-related small rnas in coronavirus-infected cells: further evidence for leader-primed mechanism of transcription sequences of the nucleocapsid genes from two strains of avian infectious bronchitis virus completion of the sequence of the genome of the coronavirus avian infectious bronchitis virus characterization of two rna polymerase activities induced by mouse hepatitis virus. 1. viral three intergenic regions of coronavirus mouse hepatitis virus strain a59 genome rna contain a common nucleotide sequence that is homologous to the 3'end of the viral mrna leader sequence two coronaviruses isolated from central nervous system tissue of two multiple sclerosis patients prolonged incubation in calcium chloride improves the competence of escherichia coli cells colony hybridization: a method for the isolation of cloned dnas that contain a specific gene simple and very efficient method for generating cdna libraries use of solubilizable acrylamide disulfide gels for isolation of dna fragments suitable for sequence analysis chemical and electrophoretic properties of solubilizable disulfide gels purification and biophysical properties of human coronavirus 229e the rna and proteins of human coronaviruses sequence analysis of nucleocapsid gene and leader rna of human coronavirus oc43 sequence analysis of the porcine transmissable gastroenteritis coronavirus nucleocapsid protein gene temporal regulation of bovine coronavirus rna synthesis characterization of 
viral proteins synthesized in 229e-infected cells and effect(s) of inhibition of glycosylation and glycoprotein transport /76). isolation and morphology of the internal component of human coronavirus, strain 229e bovine coronavirus hemagglutinin protein comparison of initiation of protein synthesis in procaryotes, eucaryotes, and organelles replication of coronavirus rna. /n "rnagenetits characterization of leader rna sequences on the virion and mrnas of mouse hepatitis virus, a cytoplasmic virus mouse hepatitis virus a59: messenger rna structure and genetic localization of the sequence divergence from the hepatotropic strain mhv-3 coronavirus: a jumping rna transcription presence of leader sequences in the mrna of mouse hepatitis virus the rnaof mouse hepatitis virus sequence analysis of the bovine coronavirus nucleocapsid and matrix protein genes the polypeptides of human and mouse coronaviruses structural and antigenic relationships between human, murine and avian coronaviruses leader sequences of murine coronavirus rna can be freely reassorted: evidence for the role of free leader rna in transcription analysis of genomic and intracellular viral rnas of small plaque mutants of mouse hepatitis virus a new method for sequencing dna sequencing end-labeled dna with base-specific chemical cleavages coronavirus infection in acute lower respiratory tract disease of infants analysis of singleand double-stranded nucleic acids on polyacrylamide and agarose gels by using glyoxal and acridine orange a new pair of ml3 vectors for selecting either dna strand of double-digest restriction fragments a micromethod for detailed characterization of high molecular weight rna antigenic relationship of the feline infectious peritonitis virus to coronaviruses of other species isolation and propagation of a human enteric coronavirus translation of three mouse hepatitis virus strain a59 subgenomic rnas in xenopuslaevisoocytes dna sequencing with chain-terminating inhibitors the 5'-end 
sequence of the murine coronavirus genome: implications for multiple fusion sites in leaderprimed transcription nucleotide sequencing of mouse hepatitis virus strain jhm messenger rna 7 phosphoproteins of murine hepatitis virus a bacteriophage t7 rna polymerase/promoter system for controlled exclusive expression of specific genes hybridization of denatured rna and small dna fragments transferred to nitrocellulose the biology and pathogenesis of coronaviruses viral models of demyelination comparison of the rnas of murine and human coronaviruses we thank carol flores for assistance in preparation of the manuscript. this work was supported by public health service research grants nsl8146 and all 9244 from the national institutes of health and grant 1449 from the national multiple sclerosis society. s.s.s. is supported by a postdoctoral training fellowship from the national institutes of health grant ns07149. key: cord-324216-ce3wa889 authors: wang, zheng; malanoski, anthony p; lin, baochuan; kidd, carolyn; long, nina c; blaney, kate m; thach, dzung c; tibbetts, clark; stenger, david a title: resequencing microarray probe design for typing genetically diverse viruses: human rhinoviruses and enteroviruses date: 2008-12-01 journal: bmc genomics doi: 10.1186/1471-2164-9-577 sha: doc_id: 324216 cord_uid: ce3wa889 background: febrile respiratory illness (fri) has a high impact on public health and global economics and poses a difficult challenge for differential diagnosis. a particular issue is the detection of genetically diverse pathogens, i.e. human rhinoviruses (hrv) and enteroviruses (hev) which are frequent causes of fri. resequencing pathogen microarray technology has demonstrated potential for differential diagnosis of several respiratory pathogens simultaneously, but a high confidence design method to select probes for genetically diverse viruses is lacking. 
results: using hrv and hev as test cases, we assess a general design strategy for detecting and serotyping genetically diverse viruses. a minimal number of probe sequences (26 for hrv and 13 for hev), which were potentially capable of detecting all serotypes of hrv and hev, were determined and implemented on the resequencing pathogen microarray rpm-flu v.30/31 (tessarae rpm-flu). the specificities of designed probes were validated using 34 hrv and 28 hev strains. all strains were successfully detected and identified at least to species level. 33 hrv strains and 16 hev strains could be further differentiated to serotype level. conclusion: this study provides a fundamental evaluation of simultaneous detection and differential identification of genetically diverse rna viruses with a minimal number of prototype sequences. the results demonstrated that the newly designed rpm-flu v.30/31 can provide comprehensive and specific analysis of hrv and hev samples, which implies that this design strategy will be applicable for other genetically diverse viruses. human febrile respiratory illness (fri) results in a significant annual health and economic burden worldwide, but the diversity and number of pathogens make differential diagnosis very challenging. thus, it represents a useful example where many organisms ranging from bacteria (haemophilus influenzae) to fairly conserved viruses (respiratory syncytial virus) to genetically diverse viruses, e.g. influenza a virus, human rhinoviruses (hrv), and human enteroviruses (hev), need to be detected for successful differential diagnosis. several technologies, mass-code™ multiplex rt-pcr system [1], electrospray ionization mass spectrometry analysis of pcr amplicons [2], luminex ® xmap™ [3], and various microarray-based approaches [4] [5] [6] [7] [8], are currently under development as diagnostic platforms to effectively and simultaneously detect and identify large numbers of diverse viral and bacterial respiratory pathogens. 
one high-density resequencing microarray platform, the respiratory pathogen microarray version 1 (rpm v.1), has been successfully demonstrated to identify a much broader range of pathogens (including bacteria and dna and rna viruses) in a single test at sensitivities and specificities that are similar to or improved over those of other technologies [9, 10]. in addition, the rpm v.1 platform has the demonstrated capability to discriminate among known and previously unknown strains and variants of targeted pathogens [11, 12]. while promising, the rpm v.1 platform was a proof-of-concept microarray for the detection of 26 common respiratory pathogens primarily encountered among military basic trainees. it did not provide comprehensive coverage of all potential respiratory pathogens and the design methodology used was not appropriate for genetically diverse viruses. the design methodology for the rpm v.1 microarray consisted of applying selection rules developed for long oligonucleotide microarrays. these rules were not optimal but worked for bacterial organisms and fairly conserved viruses since previous studies had shown a single sequence on a resequencing microarray could reliably detect and serotype strains with as much as 10 to 15% variation [8, 10-12]. their application to cover more diverse viral organisms was less successful. for example, the 5' untranslated region (5'utr) sequence chosen for hrv on the rpm v.1 only provided identification of the prototype hrv-89 and very little coverage of other hrv serotypes. the 5'utr sequences, which are relatively conserved among hrv and hev, have been used in pcr and de novo sequencing for tentative viral identification or serotype classification in lieu of the much more variable capsid proteins that actually determine serotypes [13, 14].
however, the 5'utr sequences still have ~5 to 30% nucleotide sequence variations among different serotypes so require more than one prototype sequence for proper identification and serotyping. serotyping hrv and hev is important to fri differential diagnosis because even though these "common cold" viruses generally only induce mild symptoms, they can cause a wide variety of other severe illnesses, such as aseptic meningitis [15] , bronchitis and asthma [16] . new resequencing pathogen microarray designs, versions 3.0 and 3.1 (rpm-flu v.30/31), have been constructed to address the shortcomings of the previous design. the use of 8 μm feature allows microarrays with greater coverage, currently 86, of common respiratory organisms and high human health risk zoonotic pathogens (bacteria and viruses). a new approach to select a minimal number of prototype sequences that can be used to detect all and correctly identify many of the relevant strains of genetically diverse viruses such as hrv and hev was developed. due to the great genetic diversity of hrv and hev, in order to ensure that designed probes (referred to as probe sequences) generated from selected database sequences (referred to as prototype regions) would detect and discriminate all serotypes of hrv and hev, a predictive model was used to assist the microarray design [17] . this in silico model developed for predicting resequencing microarray hybridization patterns shows good concordance in the overall percentage of base calls predicted versus experimental results. thus it is possible to use this model for evaluating the performance of database sequences as potential prototype regions. in this study, we report on results of this algorithm applied to the 5'utr sequences of hrv and hev and confirm that using ~15% of the rpm-flu v.30/31 microarray (17,335 hrv and hev nucleotides of total 117,254 nucleotides on array) is sufficient to detect and differentiate many hrv and hev serotypes. 
in silico modeling

figure 1 illustrates the procedures used for the selection of hrv and hev probe sequences. first, sequences that contain the specified target region (5'utr) and meet any selection criteria applied were downloaded from a database (currently genbank). downloaded sequences were trimmed to cover the same region using pair-wise sequence alignment. these sequences were treated as target sequences (what would be detected by the microarray) and also as prototype sequences (the potential probe sequences tiled on the microarray). each downloaded sequence was treated in turn as a prototype used to generate probe sequences (fig. 1, steps 1-2) and the remaining sequences were treated as target sequences (fig. 1, steps 3-4). sets of four 25-mer probes (one perfect match and three mismatches at the 13th position) were generated from a prototype sequence and correspond to what would actually appear on a resequencing microarray. each of the other sequences was treated as a target, one at a time, and used to generate overlapping fragments from 13 to 25 bases long with a nearest-neighbor δg energy less than -14.5. these fragments have been shown to have strong binding strength and produce unique base calls. the generated probes and sequence fragments were the input to the in silico model [17] for simulation, which compared the fragments to the probe sets and determined the base calls a target sequence would generate. the predicted base calls were assembled into a simulated resequencing microarray result. the simulated result of a target sequence for the current prototype sequence was then run through the previously developed cibsi analysis algorithm [8] with the following criterion. a sequence was considered detected by the current prototype sequence if at least one region of 50 or more contiguous nucleotides was predicted to consist of a, c, g, and t base calls and no ambiguous base calls (ns).
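the probe tiling and the detection criterion described above can be sketched in a few lines of python. this is a minimal illustration rather than the authors' implementation: the names `probe_sets` and `is_detected` are hypothetical, the δg filter on target fragments and the in silico hybridization model [17] are not reproduced, and the base-call string given to `is_detected` is assumed to come from that model or from the array itself.

```python
BASES = "ACGT"
PROBE_LEN = 25   # each probe is a 25-mer
CENTER = 12      # zero-based index of the 13th (interrogated) position


def probe_sets(prototype):
    """Yield (position, probes) for every 25-base window of the prototype.

    probes[0] is the perfect match; probes[1:] are the three probes that
    differ from it only at the 13th position.
    """
    prototype = prototype.upper()
    if set(prototype) - set(BASES):
        raise ValueError("prototype must contain only A, C, G and T")
    for start in range(len(prototype) - PROBE_LEN + 1):
        window = prototype[start:start + PROBE_LEN]
        mismatches = [
            window[:CENTER] + base + window[CENTER + 1:]
            for base in BASES
            if base != window[CENTER]
        ]
        yield start + CENTER, [window] + mismatches


def is_detected(base_calls, min_run=50):
    """Return True if base_calls holds a run of at least min_run
    contiguous A/C/G/T calls with no ambiguous call (N) in between."""
    run = 0
    for call in base_calls.upper():
        run = run + 1 if call in BASES else 0
        if run >= min_run:
            return True
    return False
```

a prototype of length l yields l - 24 probe sets, one per interrogated base, which is why the first and last 12 bases of a tiled region are not resequenced in this sketch.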
as shown in figure 1, a "yes" for "cibsi would identify" added the target to the list of sequences that could be detected by the current prototype sequence. this procedure was applied for every downloaded sequence. for example, if we collect sequences "a-z" from genbank, we will first use sequence "a" as a prototype sequence, then use sequences "b-z" each in turn to generate target sequences for in silico simulation. after the completion of the simulation and cibsi analysis, a list of sequences from the pool of sequences "b-z" that can be detected by prototype sequence "a" will be generated. then the cycle begins again with sequence "b" as prototype sequence, while sequences "a, c-z" each in turn are used as target sequences to generate the list for sequence "b". the cycle will continue until we generate the list for all downloaded sequences (sequences "a-z").

figure 1. schematic of algorithm representing the prototype sequences selection process. a collection of database sequences covering a specified region are processed together. each sequence is treated as probe sequence that the other sequences are tested against. the numbers of these sequences detected by the probe sequences are determined. a group of sequences that are predicted to detect all the sequences is then selected.

after this is completed, the second stage of the process will be undertaken. the sequences will be ranked by the number of other sequences that each (as prototype) is predicted to detect. the sequence that was predicted to detect the most other sequences (as targets) was selected as a probe sequence to be used in the microarray design. it was then removed from the list of sequences. all the target sequences detected by that prototype sequence were also removed from the list of sequences. this procedure was repeated until the list of sequences was empty.
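the second-stage selection amounts to a greedy covering procedure, which can be sketched as follows. the function name `select_prototypes` and the dictionary input are hypothetical, and two details are assumptions not stated in the text: coverage counts are recomputed over the sequences still on the list at each round, and ties are broken by a seeded random choice (the text goes on to note that ties were resolved randomly).

```python
import random


def select_prototypes(detects, seed=0):
    """Greedy selection of prototype sequences.

    detects maps each sequence id to the set of other sequence ids it is
    predicted to detect when used as a prototype (the first-stage lists).
    Returns the chosen prototype ids in order of selection.
    """
    rng = random.Random(seed)
    remaining = set(detects)
    chosen = []
    while remaining:
        # Count only targets that are still on the list (assumption).
        cover = {s: (detects[s] & remaining) - {s} for s in remaining}
        best = max(len(targets) for targets in cover.values())
        # Sort the tied candidates so a given seed is reproducible.
        ties = sorted(s for s in remaining if len(cover[s]) == best)
        pick = rng.choice(ties)
        chosen.append(pick)
        # Remove the prototype and every target it detects.
        remaining -= cover[pick] | {pick}
    return chosen
```

a sequence that no other prototype detects is eventually selected to cover itself, so the returned set always accounts for every input sequence. greedy covering of this kind is a heuristic and is not guaranteed to return the smallest possible set.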
when two or more prototype sequences were predicted to detect the same maximal number of target sequences, one was randomly selected hence the method was non-deterministic. the process was repeated with different random seeds and the number of required probe sequences did not vary significantly while the sequences used in the microarray design could change. application: hrv and hev the described design method could be applied to any group of sequences and a minimum set of prototype regions would be determined. the group of sequences used for hrv probe design was chosen using different criteria than those used to select the hev sequences in the hev probe design due to differences in the available sequences for each in genbank. at the time of this design, only eight hrv serotypes had complete genomes sequenced. these genome sequences and all complete and partial 5'utr sequences available for hrv in genbank were retrieved in april 2006 and a total of 150 sequences were used in the predictive modeling. a set of 26 sequences with lengths between 145 and 500 bp were predicted to provide detection of all those input sequences (additional file 1). because hev is better characterized with complete genome sequences of all 60 recognized serotypes, the design algorithm was applied to one complete genome sequence of each serotype. in addition, the design algorithm was applied to the 3d region. the design procedure generated 4 to 8 sequences for hev detection using 5'utr region, and 13 sequences were predicted to detect all hev 3d regions. it was decided to use corresponding 5'utr sequences of the 13 genomes that the 3d targets were selected from so that the same serotype was targeted by both target regions (additional file 1). these 13 5'utr regions were predicted to still provide complete and now redundant coverage. to assess the performance of rpm-flu v.30/31 chip design, 34 known hrv serotype strains obtained from atcc were tested. 
of the 34 strains, 18 had corresponding 5'utr sequences tiled on the microarray and were called prototype serotypes, which were used to verify the accuracy of the designed hrv probes. the remaining 16 strains, representing near neighbor serotypes, were selected from diverse clades based on phylogenetic classification of hrv serotypes [18, 19]. these strains were used to investigate the capability of the microarray to detect other hrv serotypes that did not have their sequences tiled on the microarray. overall, the selected strains covered every single clade of 101 hrv serotypes based on phylogenetic analysis of the p1-p2 regions of 5'utr sequences [18]. one metric of the hybridization in a reference region is the number of bases reported as a, c, g, or t divided by the total number of bases for that region (probe length). we refer to this as the base call rate; it proportionally reflects the hybridization strength or homology between the prototype and target sequences. a hybridization profile (fig. 2a) using the base call rates clearly showed a unique pattern for each serotype. the closely related serotypes with less nucleotide divergence had similar hybridization profiles across the tiled regions, so it is possible to assign species (hrva and hrvb) based only on the hybridization patterns. the brighter red spots (higher base call rates) along the diagonal indicated stronger hybridization between the tiled probes and 5'utrs from the prototype serotypes. it is also of note that hrv87 does not fall into either of two major clusters, which agrees with other findings that it should really be classified as a hev [20, 21]. to validate the accuracy of the array clustering, the 5'utr of each serotype was amplified by type-specific rt-pcr and subjected to conventional sequencing. the phylogenetic tree derived from the de novo 5'utr sequences (additional file 2) confirms the hrva and hrvb classification.
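the base call rate defined above is a simple ratio and can be computed directly from the string of calls reported for a tile. the sketch below is illustrative only; the names `base_call_rate` and `hybridization_profile` are hypothetical, and ambiguous calls are assumed to be written as n.

```python
def base_call_rate(calls):
    """Fraction of positions in a tiled region called as A, C, G or T."""
    calls = calls.upper()
    if not calls:
        raise ValueError("empty region")
    return sum(call in "ACGT" for call in calls) / len(calls)


def hybridization_profile(calls_by_tile):
    """Map each tile name to its base call rate for one sample."""
    return {tile: base_call_rate(calls) for tile, calls in calls_by_tile.items()}
```

one profile per sample, with one entry per tiled prototype region, gives the rows of the kind of matrix that is clustered in the hybridization profiles.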
figure 2. hybridization profiles of hrv and hev serotypes from rpm-flu v.30/31 microarrays. (a) 34 hrv serotypes were classified into two clusters corresponding to species hrva and hrvb; (b) 29 hev serotypes (including hrv87) were classified into two clusters corresponding to heva/b and hevc/d species. base call rates (number of base calls/probe length in each tile) generated from viral samples (rows) and prototype probes (columns) were calculated and clustered using dchip software. rows show standardized base call rates. positive hybridization was represented by red color. higher base call rates were shown as brighter red colors. negative hybridization (no base call) was represented by green color. the sample hrv87 is underlined.

pair-wise sequence comparisons also indicated that the average nucleotide divergence of 5 to database entries, this information could be integrated into the final identification. for these samples with high base call rates, the best hit would have tens to hundreds more matched bases than the next best hit. the remaining 16 atcc strains represent near neighbor serotypes, whose 5'utr sequences share >80% identity with those of the prototypes. in all but the case of hrv5, the information was sufficient to identify the correct serotypes. in these samples, the difference in the number of base calls matching in the best hit and the next best hit was more variable and on average was smaller. the confidence of the identification depends on the accuracy of the base call. since the best hit always had at least one base that matched the resequencing results and was a mismatch for the next best hit, it was possible to establish the lowest confidence level that the best hit was the correct identification. the resequencing microarray's accuracy for determining the base call has been established under a variety of conditions [23].
using this information, there is a probability of 0.000001^n that the next best hit, rather than the best hit, is the most similar sequence in the database because bases were misidentified by the resequencing microarray, where n is the number of positions at which the next best hit mismatches base calls that match the best hit. with 0.0001% (n = 1) being the largest level of uncertainty seen for these samples, it was deemed acceptable to treat the best hit as the correct identification. for hrv5, several database sequences representing different serotypes had the same score and since no further information was available it was only possible to determine that a hrv species b was present. the 5'utr sequences from the tested strains generated with de novo sequencing were subjected to in silico predictive modeling analysis and the result of this was used as input in the cibsi analysis program. for the case of hrv5, the in silico model predicted a larger fraction of base calls being made than were observed in the experiment. for all other samples there was a good correspondence on base call fractions between model and experiment as expected. this leads us to suspect there was a processing error or sample degradation leading to the less accurate identification.

table notes: prototype serotypes having strong hybridization signals were designated in bold characters. the strain not identified at serotype level by microarray was underlined. *serotype identities were made by searching 5'utr sequences of hrv isolates against genbank; () indicates the highest percentage of identity to the sequence in genbank.

a panel of 28 hev serotypes, including serotypes from all four hev species, was similarly used to validate the specificity of rpm-flu v.30/31 for hev detection and identification. these serotypes were originally typed based on vp1 sequences (personal communication, steve oberste) and the majority of them belonged to members of hevb.
the hybridization profile ( figure 2b) shows distinct clusters in a similar fashion to the hrv samples based on serotypes. in this case, heva and hevb make up one cluster, while hevc and hevd (including hrv87) comprise a second cluster. this finding is consistent with the previously described clusters for hev utr sequences [24, 25] . the redundancy of the targets that was a consequence of how they were selected is apparent in the more uniform response observed within each cluster to the various strains. analysis of sequences reported from rpm-flu 30/31 array analysis indicated that two levels of identifications were obtained from 28 strains ( table 2 ). serotype level identification was made for 11 of the 28 strains, in which 9 cases correlated with typing made by the vp1 genes. for example, hev71 and coxsackievirus a16 (cava16), known to cause hand-foot-mouth disease, were unambiguously recognized as hev71 and cava16 respectively using rpm-flu v.30/31 and analysis program. two strains were identified by rpm-flu v.30/31 as hev4 and hev5, results in agreement with the conventional sequencing of each 5'utr. however, the strains as provided by cdc were identified as hev6 and cavb3 respectively, based on the vp1 region. specific serotypes could not be identified for the remaining sixteen samples using the sequence read generated from the array. nevertheless these samples were easily categorized into the respective species. due to amplification problems only a subset of strains were successfully de novo sequenced, which showed 3 -18% variations in 5'utr sequences. the base call rates obtained by in silico predictions based on the de novo sequences were similar to the microarray results and the identifications agreed in all but one case. 
this study demonstrated the use of an algorithm for the design of probe sets based on an in silico predictive model [17], developed by our group, that minimized the probes needed for detection and identification of most serotypes of hrv and hev. the potential of using resequencing microarray for simultaneous detection and identification of highly diverse respiratory pathogens, such as hrv and hev, was also demonstrated. the conserved nature of the 5'utr regions of hrv and hev genomes and the capabilities of the resequencing microarray allow serotype level identification of near-neighbor serotypes of hrv and hev, when long (> 100 nucleotides) sequences are read from the array. identifications can still be made to the species level for shorter sequences, particularly when the array has one or more such sequences derived from different probes. the utility of the resequencing microarray depends on target selection, that is, on the optimized prototype sequences represented on the array. in the case of rpm-flu v.30/31, the selection of hrv targets has proved to be very robust. the 5'utr has been shown in this study to be a good choice for serotyping hrv on rpm, as it performed similarly well on other platforms [13, 26]. all hrv variants tested in this study could be detected and identified at least to the species level. because of the limited number of hrv sequences available in genbank at the time rpm-flu v.30/31 was designed, a few of the targets represented on rpm-flu v.30/31 are shorter than 200 bp. in the past year, complete genome sequences from 46 more serotypes and another two divergent hrv'x's have been reported [6, 19, 27]. it will be worthwhile to update the design for the next generation of the chip. in the case of hev, the rpm-flu v.30/31 assay identified only 11 of 28 strains tested at serotype level. the failure of several strains to produce serotype identifications might have been indicative of assay protocol issues or probe design.
the fact that the in silico model prediction was also not serotype specific indicates it was most likely a design issue. this was further confirmed by agreement in base calls made from the resequencing microarray and from conventional resequencing. although all the strains of hev have complete genome sequences, there are also many partial sequence submissions for each strain in genbank that were ignored for the hev design. a re-examination of the 5'utr regions showed up to 14% difference in sequences grouped in the same serotype. this indicates that a redesign of the hev prototype regions is needed where selection of a minimal set of prototype regions would be based on all available 5'utr sequence data (complete and partial) and not a subset of genome sequences. comparing the identifications made from de novo sequencing to the identifications made by cdc (sources of the samples) illustrated another shortcoming of using the 5'utr region for hev that did not occur for hrv. oberste et al. demonstrated that typing based upon hev vp1 capsid gene sequences showed excellent correlation with serotype determined by classical antigenic methods [28]. thus amplification and sequencing of the partial vp1 amino-terminal coding region has been accepted as a standard molecular typing method for hev but such is not the case for the 5'utr region [29-33]. our results show that the 5'utr region did not correlate as closely as vp1-based typing to antigenic type definitions for hev unlike how it performed for hrv. while the 5'utr region is sufficient to accurately identify the groupings, a design using vp1 as the probe region is needed to provide serotyping identifications that will match classical methods. the current rpm design can detect and identify a more comprehensive set of viral and bacterial respiratory pathogens in parallel, including detailed discrimination of certain serotypes of hrv and hev.
this study showed that most shortcomings in the design were a result of not including adequate reference sequences for the initial design. the selection of vp1 and 3d regions also showed that primer design considerations must be incorporated sooner in the design process than is currently done, to prevent the selection of regions that cannot be used. future development will address these limitations by reducing hev probe redundancy and lack of coverage, by updating or confirming the hrv probes to be derived from newly available hrv sequences, and by involving primer design earlier in the overall design process. a powerful feature of the expanded rpm-flu v.30/31 resequencing pathogen microarray is that the nucleotide sequences generated from hybridization of the sample rna/dna and array-bound probe sets, in conjunction with the previously developed sequence analysis algorithm cibsi, can be easily interpreted to make serotype or strain identifications. this feature and the platform's high resolution and high throughput aspects undoubtedly have great potential for use as a diagnostic tool, and therefore, efforts are currently underway to test the utility of this array on more clinical samples. the results presented also validated the usefulness of the design methodology and it is currently being applied to assist in a new microarray application associated with other genetically diverse viruses.

a panel of 27 cultured enterovirus (hev) prototype strains was purchased from the centers for disease control and prevention (cdc, atlanta, ga). the prototype strains of 34 rhinoviruses (hrv) and hev69 with known titers were purchased from the american type culture collection (atcc, manassas, va). total nucleic acids were extracted from 125 μl cultured samples by using the masterpure™ dna purification kit (epicentre technologies, madison, wi) and dissolved in 20 μl of nuclease-free water.
all 5'utr sequences of hrv and hev with approximately 750 bp sizes were downloaded from genbank. potential pcr primer pairs that are able to amplify 600-700 bp fragments from hrv and hev were automatically selected by a perl script primer search program developed by our group using the rules described in previous publications [10, 11]. the multiplex reverse transcription polymerase chain reaction (rt-pcr) protocols for rpm-flu v.30/31 were carried out as previously described [10] with the following modifications. for the rt step, primer ln was replaced by primer nln (a random 9mer with the unique linker sequence), and 1 pg each of two internal controls, nac1 and triosephosphate isomerase (tim), and 5 μl of the extracted viral nucleic acids were used. the 5 μl rt reaction product was subjected to the multiplex pcr reaction. platinum taq dna polymerase (invitrogen life technologies, carlsbad, ca) was replaced by gotaq® dna polymerase (promega corporation, madison, wi) in the pcr reaction. primer nl instead of primer l was used with 50-150 nm each of 5'utr primers in the multiplex pcr. the amplification reaction was carried out in a peltier thermal cycler, ptc240 dna engine tetrad 2 (mj research inc., reno, nv), with an initial incubation at 25°c for 10 min, then preliminary denaturation at 94°c for 2 min followed by 16 cycles of 94°c for 30 s, 45-60°c for 30 s (incremental increase of 1°c per cycle), and 72°c for 90 s, then 24 cycles of 94°c for 30 s and 60°c for 120 s. microarray hybridization and processing, and the image scanning were performed according to the manufacturer's recommended protocol (affymetrix inc., santa clara, ca) using a genechip resequencing assay kit (affymetrix) with modification as previously described [10]. after scanning, gcos software was used to reduce the raw image (.dat) file to a simplified file format (.cel file) with intensities assigned to each of the corresponding probe positions.
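the thermal cycling program given in the amplification protocol above can be written out step by step, which makes the ramped second step explicit: 16 cycles with a 1°c increase per cycle span exactly 45°c to 60°c. the sketch below only enumerates the stated temperatures and hold times; the function name `rpm_flu_pcr_program` and the step labels are hypothetical, and instrument ramp times are not modeled.

```python
def rpm_flu_pcr_program():
    """Return the program as a list of (label, temperature_C, seconds)."""
    steps = [
        ("initial incubation", 25, 10 * 60),
        ("preliminary denaturation", 94, 2 * 60),
    ]
    # 16 cycles; the second step rises by 1 degree C per cycle, 45 -> 60.
    for cycle in range(16):
        steps.append(("denaturation", 94, 30))
        steps.append(("ramped step", 45 + cycle, 30))
        steps.append(("72C step", 72, 90))
    # 24 two-step cycles.
    for _ in range(24):
        steps.append(("denaturation", 94, 30))
        steps.append(("60C step", 60, 120))
    return steps


def total_hold_seconds(steps):
    """Sum of programmed hold times, excluding temperature ramping."""
    return sum(seconds for _, _, seconds in steps)
```

summing the hold times for this program gives 6720 s, or 112 min, before any time spent ramping between temperatures.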
gdas software was then used to produce nucleotide reads (a, c, g and t) or base calls, comparing the respective intensities for the sense and antisense probe sets. the sequences from base calls made for each tiled region of the resequencing microarray were exported from gdas as fasta-formatted files. base call rate refers to the percentage of base calls generated over the full length of the probe in each tile. final pathogen identification for the rpm-flu v.30/31 assay was performed using computer-implemented biological sequence identifier (cibsi) version 2.0 software [22], an automatic pathogen identification algorithm based on nucleic acid sequence alignment, which was developed and tested in detail in previous studies [10, 11]. the ncbi blast and taxonomy databases used for cibsi analysis were downloaded in december 2007. the heat-map and clustering dendrogram were made with dchip 2005 (dna-chip analyzer, http://www.dchip.org). the rows of the imported data (base call rates) were standardized and clustered. clustering distance was 1 - correlation with average linkage, and gene ordering by cluster tightness. 5'utr sequences were amplified from hrv or hev cdna with specific primers. amplified products were purified and sent to macrogen usa (gaithersburg, md) for automated sanger/electrophoresis-based sequencing using corresponding specific primers. phylogenetic analysis of 5'utr sequences was performed by using the neighbor-joining method in mega software (http://www.megasoftware.net). all nucleotide sequences used in this study are available at genbank (accession nos. eu870449-eu870493).
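the clustering distance described above (1 - correlation, combined by average linkage) can be illustrated without dchip. this is a sketch under stated assumptions and not dchip's code: row standardization is assumed to mean centering each row to mean 0 and scaling to unit sample standard deviation, the correlation is assumed to be pearson's, and all function names are hypothetical.

```python
from math import sqrt


def standardize(row):
    """Center a row of base call rates to mean 0 and scale to unit SD."""
    n = len(row)
    mean = sum(row) / n
    sd = sqrt(sum((x - mean) ** 2 for x in row) / (n - 1))
    if sd == 0:
        return [0.0] * n
    return [(x - mean) / sd for x in row]


def correlation_distance(x, y):
    """1 - Pearson correlation between two equally long profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return 1 - sxy / sqrt(sxx * syy)


def average_linkage(cluster_a, cluster_b):
    """Mean pairwise distance between the members of two clusters."""
    total = sum(correlation_distance(a, b) for a in cluster_a for b in cluster_b)
    return total / (len(cluster_a) * len(cluster_b))
```

with this distance, two samples whose base call rates rise and fall together across the tiled regions are close (distance near 0) even if one hybridizes more weakly overall, while opposite patterns give a distance near 2.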
[1] diagnostic system for rapid and sensitive differential detection of pathogens
[2] rapid identification and strain-typing of respiratory pathogens for epidemic surveillance
[3] applications of luminex xmap technology for rapid, high-throughput multiplexed nucleic acid detection
[4] detection of respiratory viruses and subtype identification of influenza a viruses by greenechipresp oligonucleotide microarray
[5] microarray-based detection and genotyping of viral pathogens
[6] pan-viral screening of respiratory tract infections in adults with and without asthma reveals unexpected human coronavirus and human rhinovirus diversity
[7] microarray-based detection of genetic heterogeneity, antimicrobial resistance, and the viable but nonculturable state in human pathogenic vibrio spp
[8] broad-spectrum respiratory tract pathogen identification using resequencing dna microarrays
[9] use of resequencing oligonucleotide microarrays for identification of streptococcus pyogenes and associated antibiotic resistance determinants
[10] using a resequencing microarray as a multiple respiratory pathogen detection assay
[11] application of broad-spectrum, sequence-based pathogen identification in an urban population
[12] identifying influenza viruses with resequencing microarrays
[13] amplicon sequencing and improved detection of human rhinovirus in respiratory samples
[14] goossens h: improved detection of rhinoviruses by nucleic acid sequence-based amplification after nucleotide sequence determination of the 5' noncoding regions of additional rhinovirus strains
[15] enteroviruses: polioviruses, coxsackieviruses, echoviruses, and newer enteroviruses
[16] lower airways inflammation during rhinovirus colds in normal and in asthmatic subjects
[17] a model of base-call resolution on broad-spectrum pathogen detection resequencing dna microarrays
[18] a diverse group of previously unrecognized human rhinoviruses are common causes of respiratory illnesses in infants
[19] genome-wide diversity and selective pressure in the human rhinovirus
[20] human rhinovirus 87 and enterovirus 68 represent a unique serotype with rhinovirus and enterovirus features
[21] evidence for frequent recombination within species human enterovirus b based on complete genomic sequences of all thirty-seven serotypes
[22] automated identification of multiple micro-organisms from resequencing dna microarrays
[23] high-throughput variation detection and genotyping using microarrays
[24] classification of enteroviruses based on molecular and biological properties
[25] complete genome sequences of all members of the species human enterovirus a
[26] assessing unmodified 70-mer oligonucleotide probe performance on glass-slide microarrays
[27] new complete genome sequences of human rhinoviruses shed light on their phylogeny and genomic features
[28] molecular evolution of the human enteroviruses: correlation of serotype with vp1 sequence and application to picornavirus classification
[29] typing of human enteroviruses by partial sequencing of vp1
[30] improved molecular identification of enteroviruses by rt-pcr and amplicon sequencing
[31] molecular strategy for 'serotyping' of human enteroviruses
[32] molecular characterization of human enteroviruses in clinical samples: comparison between vp2, vp1, and rna polymerase regions using rt nested pcr assays and direct sequencing of products
[33] molecular identification and typing of enteroviruses isolated from clinical specimens

the funding for this research was provided in part by the office of naval research via the nrl base program. partial support from tessarae, llc (potomac falls, va) through a cooperative research and development agreement that helped make this research possible is also gratefully appreciated. the opinions and assertions contained herein are those of the authors and are not to be construed as those of the u.s. navy or military service at large.
zw conceived and designed the study, performed microarray experiments, analyzed data and wrote the manuscript; am designed microarray probes, analyzed data and wrote the manuscript; bl assisted in data analysis and preparing the manuscript; ck, nl and kb performed microarray experiments; dt helped to generate the heatmap; ct assisted in data analyses; ds initiated the project and helped to prepare the manuscript.

key: cord-264746-gfn312aa authors: muse, spencer title: genomics and bioinformatics date: 2012-03-29 journal: introduction to biomedical engineering doi: 10.1016/b978-0-12-238662-6.50015-x sha: doc_id: 264746 cord_uid: gfn312aa

this chapter discusses the basic principles of molecular biology regarding genome science and describes the major types of data involved in genome projects, including technologies for collecting them. genome science is heavily driven by new technological advances that allow for rapid and inexpensive collection of various types of data. the emergence of genomic science has not simply provided a rich set of tools and data for studying molecular biology. it has been the catalyst for an astounding burst of interdisciplinary research, and it has challenged long-established hierarchies found in most institutions of higher learning. the next generation of biologists needs to be as comfortable at a computer workstation as they are at the lab bench. recognizing this fact, many universities have already reorganized their departments and their curricula to accommodate the demands of genomic science. the chapter discusses practical applications and uses of genomic data. for example, gene therapies that can repair genetic defects are within the foreseeable future. at the conclusion of this chapter, the reader will be able to: use key bioinformatics databases and web resources. in april 2003, sequencing of all three billion nucleotides in the human genome was declared complete.
this landmark of modern science brought with it high hopes for the understanding and treatment of human genetic disorders. there is plenty of evidence to suggest that the hopes will become reality: 1631 human genetic diseases are now associated with known dna sequences, compared to the fewer than 100 that were known at the initiation of the human genome project (hgp) in 1990. the success of this project (it came in almost 3 years ahead of time and 10% under budget, while at the same time providing more data than originally planned) depended on innovations in a variety of areas: breakthroughs in basic molecular biology to allow manipulation of dna and other compounds; improved engineering and manufacturing technology to produce equipment for reading the sequences of dna; advances in robotics and laboratory automation; development of statistical methods to interpret data from sequencing projects; and the creation of specialized computing hardware and software systems to circumvent massive computational barriers that faced genome scientists. clearly, the hgp served as an incubator for interdisciplinary research at both the basic and applied levels. the human genome was not the only genome targeted during the genomic era. as of june 2004, the complete genomes were available for 1557 viruses, 165 microbes, and 26 eukaryotes ranging from the malaria parasite plasmodium falciparum to yeast, rice, and humans. continued advances in technology are necessary to accelerate the pace and to reduce the expense of data acquisition projects. improved computational and statistical methods are needed to interpret the mountains of data. the increase in the rate of data accumulation is outpacing the rate of increases in computer processor speed, stressing the importance of both applied and basic theoretical work in the mathematical and computational sciences.
in this chapter, the key technologies that are being used to collect data in the laboratory, as well as some of the important mathematical techniques that are being used to analyze the data, are surveyed. applications to medicine are used as examples when appropriate. understanding the applications of genomic technologies requires an understanding of three key sets of concepts: how genetic information is stored, how that information is processed, and how that information is transmitted from parent to offspring. in most organisms, the genetic information is stored in molecules of dna, deoxyribonucleic acid (fig. 13.1). some viruses maintain their genetic data in rna, but no emphasis will be placed on such exceptions. the size of genomes, measured in counts of nucleotides or base pairs, varies tremendously, and a curious observation is that genome size is only loosely associated with organismal complexity (table 13.1). most of the known functional units of genomes are called genes. for purposes of this chapter, a gene can be defined as a contiguous block of nucleotides operating for a single purpose. this definition is necessarily vague, for there are a number of types of genes, and even within a given type of gene, experts have difficulty agreeing on precisely where the beginning and ending boundaries of those genes lie. a structural gene is a gene that codes instructions for creating a protein (fig. 13.2). a second category of genes with many members is the collection of rna genes. an rna gene does not contain protein information; instead, its function is determined by its ability to fold into a specific three-dimensional configuration, at which point it is able to interact with other molecules and play a part in a biochemical process. a common rna gene found in most forms of life is the trna gene illustrated in figure 13.3.
structural genes are the entities most scientists envision when the word ''gene'' is mentioned, and from this point on, the term gene will be used to mean ''structural gene'' unless specified otherwise. the number and variety of genes in organisms is a current topic of importance for genome scientists. gene number in organisms ranges from tiny (470 in mycoplasma) to enormous (60,000 or more in plants). non-free-living organisms have even smaller gene numbers (the hiv virus contains only nine). the number of genes in a typical human genome has been estimated to be about 30,000, perhaps the single most surprising finding from the human genome project. this number was thought to be as large as 120,000 as recently as 1998. the confusion over this number arose in part because there is not a ''one gene, one protein'' rule in humans, or indeed, in many eukaryotic organisms. instead, a single gene region can contain the information needed to produce multiple proteins. to understand this fact, the series of steps involved in creating a functional protein from the underlying dna sequence instructions must be understood. the central dogma of molecular biology states that genetic information is stored in dna, copied to rna, and then interpreted from the rna copy to form a functional protein (fig. 13.2). the process of copying the genetic information in dna into an rna copy is known as transcription (see chapter 3). the process is thought by many to be a remnant of an early rna world, in which the earliest life forms were based on rna genomes. it is at the level of transcription that gene expression is regulated, determining where and when a particular gene is turned on or off. the transcription of a gene occurs when an enzyme known as rna polymerase binds to the beginning of a gene and proceeds to create a molecule of rna that matches the dna in the genome. it is this molecule of messenger rna (mrna) that will serve as a template for producing a protein.
however, it is necessary for organisms to regulate the expression of genes to avoid having all genes being produced in all cells at all times. transcription factors interact with either the genomic dna or the polymerase molecule to allow delicate control of the gene expression process. a feedback loop is created whereby an environmental stimulus such as a drug leads to the production of a transcription factor, which triggers the expression of a gene. in addition to this example of a positive control mechanism, negative control is also possible. an emerging theme is that sets of genes are often coregulated by a single or small group of transcription factors. these sets of genes often share a short upstream dna sequence that serves as a binding site for the transcription factor.

figure 13.3 the transfer rna (trna) is an example of a non-protein-coding gene. its function is the result of the specific two- and three-dimensional structures formed by the rna sequence itself.

one of the earliest surprises of the genomic era was the discovery that many eukaryotic gene sequences are not contiguous, but are instead interrupted by dna sequences known as introns. as shown in figure 13.2, introns are physically cut, or spliced, from the mrna sequence before the rna is converted into a protein. the presence of introns helps to explain the phenomenon that there are more proteins produced in an organism than there are genes present. the process of alternative splicing allows for exons to be assembled in a combinatoric fashion, resulting in a multitude of potential proteins. for example, consider a gene sequence with exons e1, e2, and e3 interrupted by introns i1 and i2. if both introns are spliced, the resulting protein would be encoded as e1-e2-e3. however, it is also possible to splice the gene in a way that produces protein e1-e3, skipping exon e2.
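as a minimal sketch of the combinatorics in the e1, e2, e3 example, the following python snippet enumerates the exon combinations that exon skipping could produce. it assumes, purely for illustration, that exon order is preserved, that the first and last exons are always retained, and that any internal exon may be skipped; real splicing is regulated and far less permissive. the function name `candidate_isoforms` is hypothetical.

```python
from itertools import combinations

def candidate_isoforms(exons):
    """Return exon-skipping isoforms that keep the first and last exon.

    Illustrative assumption: exon order is preserved and any subset of
    the internal exons may be skipped.
    """
    if len(exons) < 2:
        return [tuple(exons)]
    first, internal, last = exons[0], exons[1:-1], exons[-1]
    isoforms = []
    # Keep all internal exons first, then progressively fewer of them.
    for keep in range(len(internal), -1, -1):
        for kept in combinations(internal, keep):
            isoforms.append((first, *kept, last))
    return isoforms

for isoform in candidate_isoforms(["e1", "e2", "e3"]):
    print("-".join(isoform))
# prints e1-e2-e3, then e1-e3
```

under these assumptions a gene with k internal exons has 2^k candidate isoforms, which is one way to see how a modest number of genes can encode a much larger number of proteins.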
much as transcription factors regulate gene expression, there are factors that help to regulate alternative splicing. a common theme is to find a single gene that is spliced in different ways to produce isoforms that are expressed in specific tissues. the process of reading the template in an mrna molecule and using it to produce a protein is known as translation. conceptually, this process is much simpler than the transcription and splicing processes. a structure known as a ribosome binds to the mrna molecule. the ribosome then moves along the rna in units of 3 nucleotides. each of these triplets, or codons, encodes one of 20 amino acids. at each codon the ribosome interacts with trnas to interpret a codon and add the proper amino acid to the growing chain before moving along to the next codon in the sequence (see chapter 3). genome science is heavily driven by new technological advances that allow for the rapid and inexpensive collection of various types of data. it has been said that the field is data-driven rather than hypothesis-driven, a reflection of the tendency for researchers to collect large amounts of genomic data with the (realistic) expectation that subsequent data analyses, along with the experiments they suggest, will lead to better understanding of genetic processes.
although the list of important biotechnologies changes on an almost daily basis, there are three prominent data types in today's environment: (1) genome sequences provide the starting point that allows scientists to begin understanding the genetic underpinnings of an organism; (2) measurements of gene expression levels facilitate studies of gene regulation, which, among other things, help us to understand how an organism's genome interacts with its environment; and (3) genetic polymorphisms are variations from individual to individual within species, and understanding how these variations correlate with phenotypes such as disease susceptibility is a crucial element of modern biomedical research. the basic principles for obtaining dna sequences have remained rather stable over the past few decades, although the specific technologies have evolved dramatically. the most widely used sequencing techniques rely on attaching some sort of ''reporter'' to each nucleotide in a dna sequence, then measuring how quickly or how far the nucleotide migrates through a medium. the principles of sanger sequencing, originally developed in 1974, are illustrated in figure 13.4. dna sequences have an orientation. the 5' end of a sequence can be considered to be the left end, and the 3' end is on the right. sanger sequencing begins by creating all possible subsequences of the target sequence that begin at the same 5' nucleotide. a reporter, originally radioactive but now fluorescent, is attached to the final 3' nucleotide in each subsequence. by using a unique reporter for each of the four nucleotides, it is possible to identify the final 3' nucleotide in each of the subsequences. consider the task of sequencing the dna molecule aggt. there are four possible subsequences that begin with the 5' a: a, ag, agg, and aggt. the technology of sanger sequencing produces each of those four sequences and attaches the reporter to the final nucleotide.
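the aggt example can be mimicked in a few lines of python. this is only a toy model of the idea, not of the chemistry: it assumes error-free fragments, uses fragment length as a stand-in for migration rate, and the helper names `labeled_fragments` and `read_ladder` are hypothetical. the sketch also includes the sorting and read-off step that completes the procedure.

```python
import random

def labeled_fragments(target):
    """All prefixes sharing the 5' start, each tagged with its final (3') base."""
    return [(target[:k], target[k - 1]) for k in range(1, len(target) + 1)]

def read_ladder(fragments):
    """Sort fragments from shortest to longest and read off each reporter."""
    ladder = sorted(fragments, key=lambda fragment: len(fragment[0]))
    return "".join(reporter for _, reporter in ladder)

fragments = labeled_fragments("aggt")
random.shuffle(fragments)      # fragments are mixed together in the reaction
print(read_ladder(fragments))  # aggt
```

note that `read_ladder` never inspects the full fragment sequences; it uses only their lengths and the identity of the final base, which is all the instrument can observe.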
the subsequences are sorted from shortest to longest based on the rate at which they migrate through a medium. the shortest sequence would correspond to the subsequence a; its reporter tells us that the final nucleotide is an a. the second shortest subsequence is ag, with a final nucleotide of g. by arranging the subsequences in a ''ladder'' from shortest to longest, the sequence of the complete target sequence can be found simply by reading off the final nucleotide of each subsequence. a series of new advances has allowed sanger sequencing to be applied in a high-throughput way, paving the way for sequencing of entire genomes, including that of the human. radioactive reporters have been replaced with safer and cheaper fluorescent dyes, and automatic laser-based systems now read the sequence of fluorescent-labeled nucleotides directly as they migrate. early versions of sanger sequencing only allowed for reading a few hundred nucleotides at a time; modern sequencing devices can read sequences of 800 nucleotides or more. perhaps most important has been the replacement of ''slab gel'' systems with capillary sequencers. the older system required much labor and a steady hand; capillary systems, in conjunction with the development of necessary robotic devices for manipulating samples, have allowed almost completely automated sequencing pipelines to be developed. not to be ignored in the series of technological advances is the development of automated base-calling algorithms. a laser reads the intensities of each of the four fluorescent reporter dyes as each nucleotide passes it. the resulting graph of those intensities is a chromatogram. statistical algorithms, including the landmark program phred, are able to accept chromatograms as input and output dna sequences with very high levels of accuracy, reducing the need for laborious human intervention.
by assessing the relative levels of the four curves, the base-calling algorithms not only report the most likely nucleotide at each position, but they also provide an error probability for each site. a single state-of-the-art dna sequencing machine can currently produce upwards of one million nucleotides per day. large regions of dna are not sequenced in single pieces. instead, larger contigs of dna are fragmented into multiple, short, overlapping sequences. the emergence of shotgun sequencing (fig. 13.5), pioneered by dr. craig venter, has revolutionized approaches for obtaining complete genome sequences. the fundamental approach to shotgun sequencing of a genome is simple: (1) create many identical copies of a genome; (2) randomly cut the genomes into millions of fragments, each short enough to be sequenced individually; (3) align the overlapping fragments by identifying matching nucleotides at the ends of fragments; and finally, (4) read the complete genome sequence by following a gap-free path through the fragments. until venter's work, the idea of shotgun sequencing was considered unfeasible for a variety of reasons. perhaps most daunting was the computational task of aligning the millions of fragments generated in the shotgun process. specialized hardware systems and associated algorithms were developed to handle these problems. following in the footsteps of high-throughput genome sequencing came technology that allowed scientists to survey the relative abundance of thousands of individual gene products. these technologies are, in essence, a modern high-throughput replacement of the northern blot procedure. for each member in a collection of several thousand genes, the assays provide a quantitative estimate of the number of mrna copies of each gene being produced in a particular tissue.
two technologies, cdna and oligonucleotide microarrays, currently dominate the field, and they have opened the door to many exploratory analyses that were previously impossible. as a first example, consider taking two samples of cells from an individual cancer patient: one sample from a tumor and one from normal tissue. a microarray experiment makes it possible to identify the set of genes that are produced at different levels in the two tissue types. it is likely that this set of differentially expressed genes contains many genes involved in biological processes related to tumor formation and proliferation (fig. 13.6). a second common type of study is a time course experiment. microarray data is collected from the same tissues at periodic intervals over some length of time. for instance, gene expression levels may be measured in 6-hour increments following the administration of a drug. for each gene, the change in gene expression level can be plotted against time (fig. 13.7). groups of coregulated genes will be identified as having points in time where they all experience either an increase or decrease in expression levels. a likely cause for this behavior is that all genes in the coregulated set are governed by a single transcription factor.

figure 13.5 shotgun sequencing of genomes or other large fragments of dna proceeds by cutting the original dna into many smaller segments, sequencing the smaller fragments, and assembling the sequenced fragments by identifying overlapping ends.

a final important medical application of microarray technologies involves diagnosis. suppose that a physician obtains microarray data from tumor cells of a patient. the data, consisting of the relative levels of gene expression for a suite of many genes, can be compared to similar data collected from tumors of known types. if the patient's gene expression profile matches the profile of one of the reference samples, the patient can be diagnosed with that tumor type.
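one simple way to make the idea of a matching profile concrete is to pick the reference tumor type whose expression profile has the highest pearson correlation with the patient's profile. the sketch below does exactly that. the tumor type labels and expression values are invented for illustration, and correlation-based nearest-profile matching is an assumption made here, not a method the chapter prescribes.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def closest_profile(patient, references):
    """Return (label, correlation) of the best-matching reference profile."""
    scores = {label: pearson(patient, profile)
              for label, profile in references.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]

# Hypothetical log-expression values for five genes (made-up numbers).
references = {
    "tumor type a": [2.1, -0.5, 1.8, 0.2, -1.0],
    "tumor type b": [-1.2, 1.9, -0.4, 1.5, 0.3],
}
patient = [1.9, -0.3, 1.5, 0.0, -0.8]

label, r = closest_profile(patient, references)
print(label, round(r, 3))
```

in practice a classifier would be trained and validated on many reference samples per tumor type, and a match would be reported only when the similarity clears a threshold chosen in advance.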
the advent of microarray techniques has rapidly improved the accuracy of this type of diagnosis in a variety of cancers. cdna microarrays were the first, and are still the most widely used, form of high-throughput gene expression methods. the procedure begins by attaching the dna sequences of thousands of genes onto a microscope slide in a pattern of spots, with each spot containing only dna sequences of a single gene. a variety of technologies have emerged for creating such slides, ranging from simple pin spotting devices to technologies using laser jet printing techniques. the rna of expressed genes is next collected from the target cell population. through the process of reverse transcription, a cdna version of each rna is created. a cdna molecule is complementary to the genomic dna sequence in the sense that complementary base pairs will physically bind to one another. for example, a cdna reading gttac could physically bind to the genomic dna sequence caatg. during the process of creating the cdna collection, each cdna is labeled with a fluorescent dye. the collection of labeled cdnas is poured over the microscope slide and its set of attached dna molecules. the cdnas that match a dna on the slide physically bind to their mates, and unbound cdnas are washed from the slide. finally, the number of bound molecules at each spot (gene) can be read by measuring the fluorescence level at each spot. highly expressed genes will create more rna, which results in more labeled cdnas binding to those spots. a more common variant on the basic cdna approach is illustrated in figure 13.8. in this experiment, rna from two different tissues or individuals is collected, labeled with two different dyes, and competitively hybridized on a single slide. the relative abundance of the two dyes allows the scientist to state, for instance, that a particular gene is expressed fivefold more in one tissue than in the other.
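the relative abundance reading can be sketched as a per-spot ratio of the two dye intensities, commonly reported on a log2 scale so that over- and under-expression are symmetric around zero. the intensities and gene names below are invented, and the sketch ignores background subtraction and dye normalization, which a real analysis would need.

```python
from math import log2

def fold_changes(red, green):
    """Per-gene ratio and log2 ratio of two dye intensities from one slide."""
    result = {}
    for gene in red:
        ratio = red[gene] / green[gene]
        result[gene] = (ratio, log2(ratio))
    return result

# Hypothetical spot intensities for tissue 1 (red dye) and tissue 2 (green dye).
red = {"gene_a": 5000.0, "gene_b": 800.0, "gene_c": 1200.0}
green = {"gene_a": 1000.0, "gene_b": 800.0, "gene_c": 4800.0}

for gene, (ratio, log_ratio) in fold_changes(red, green).items():
    print(f"{gene}: ratio={ratio:.2f}, log2={log_ratio:+.2f}")
```

here gene_a is expressed fivefold more in tissue 1, gene_b is unchanged, and gene_c is expressed fourfold more in tissue 2 (a log2 ratio of -2).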
oligonucleotide arrays take a slightly different approach to assaying the relative abundance of rna sequences. instead of attaching full-length dnas to a slide, oligonucleotide systems make use of short oligonucleotides chosen to be specific to individual genes. for each gene included in the array, approximately 10 to 20 different oligonucleotides of length 20-25 nucleotides are designed and printed onto a chip. the use of multiple oligonucleotides for each gene helps to reduce the effects of a variety of potential errors. fluorescently labeled rna (rather than cdna) is collected from the target tissue and hybridized against the oligonucleotide array. one limitation of the oligonucleotide approach is that only a single sample can be assayed on a single chip; competitive hybridization is not possible.

figure 13.8 a cdna microarray slide is created by (1) attaching dna to spots on a glass slide, (2) collecting expressed rna sequences (expressed sequence tags, ests) from tissue samples, (3) converting the rna to dna and labeling the molecules with fluorescent dyes, (4) hybridizing the labeled dna molecules to the dna bound to the slide, and (5) extracting the quantity of each expressed sequence by measuring the fluorescence levels of the dyes.

although oligonucleotide and cdna approaches to assaying gene expression rely on the same basic principles, each has its own advantages and disadvantages. as already noted, competitive hybridization is currently only possible in cdna systems. the design of oligonucleotide arrays requires that the sequences of genes for the chip are already available. the design phase is very expensive, and oligonucleotide systems are only available for commercially important and model organisms. in contrast, cdna arrays can be developed fairly quickly even in organisms without sequenced genomes.
in their favor, oligonucleotide arrays allow for more genes to be spotted in a given area (thus allowing more measurements to be made on a single chip) and tend to offer higher repeatability of measurements. both of these facts reduce the overall experimental error rate in oligonucleotide arrays relative to cdna microarrays, although at a higher per-observation cost. because of the trade-off between obtaining many cheap noisy measurements versus a smaller number of more precise but expensive measurements, it is not clear that either technology has an obvious cost advantage. both techniques share the same major disadvantage: only measurements of rna levels are found. these measurements are used as surrogates for the much more desirable and useful quantities of the amount of protein produced for each gene. it appears that rna levels are correlated with protein levels, but the extent and strength of this relationship are not well understood. the near future promises a growing role for protein microarray systems, which are currently seeing limited use because of their very high costs. the ''final draft'' of the human genome was announced in april 2003. it included roughly 2.9 billion nucleotides, with some 30,000 to 40,000 genes spread across 23 pairs of chromosomes. the next phase of major data acquisition on the human genome is to discover how differences, both large and small, from individual to individual, result in variation at the phenotypic level. toward this end, a major effort has been made to find and document genetic polymorphisms. polymorphisms have long been important to studies of genetics. variations of the banding patterns in polytene chromosomes, for instance, have been studied for many decades. allozyme assays, based on differences in the overall charge of amino acid sequences, were popular in the 1960s. most modern studies of genetic polymorphisms, though, focus on identifying variation at the individual nucleotide level.
the international snp consortium (http://snp.cshl.org) is a collaboration of public and private organizations that discovered and characterized approximately 1.8 million single nucleotide polymorphisms (snps) in human populations. in medicine, the expectation is that knowledge of these individual nucleotide variants will accelerate the description of genetic diseases and the drug development process. pharmaceutical companies are optimistic that surveys of variation will be of use for selecting the proper drug for individual patients and for predicting likely side effects on an individual-to-individual basis. most snps (pronounced ''snips'') are the result of a mutation from one nucleotide to another, whereas a minority are insertions and deletions of individual nucleotides. surveys of snps have demonstrated that their frequencies vary from organism to organism and from region to region within organisms. in the human genome, a snp is found about every 1000 to 1500 nucleotides. however, the frequency of snps is much higher in noncoding regions of the genome than in coding regions, the result of natural selection eliminating deleterious alleles from the population. furthermore, synonymous or silent polymorphisms, which do not result in a change of the encoded amino acids, are more frequent than nonsynonymous or replacement polymorphisms. the fields of population genetics and molecular evolution provide many empirical surveys of snp variation, along with mathematical theory, for analyzing and predicting the frequencies of snps under a variety of biologically important settings. simple sequence repeats (ssrs) consist of a moderate (10-50) number of tandemly repeated copies of the same short sequence of 2 to 13 nucleotides. ssrs are an important class of polymorphisms because of their high mutation rates, which lead to ssr loci being highly variable in most populations. this high level of variability makes ssr markers ideal for work in individual identification. 
ssrs are the markers typically employed for dna fingerprinting in the forensics setting. in human populations, an ssr locus usually has 10 or more alleles and a per generation mutation rate of 0.001. the fbi uses a set of 13 tetranucleotide repeats for identification purposes, and experts claim that no two unrelated individuals have the exact same collection of alleles at all 13 of those loci. as the technology for collecting genomic data has improved, so has the need for new methods for management and analysis of the massive amounts of accumulated data. the term bioinformatics has evolved to include the mathematical, statistical, and computational analysis of genomic data. work in bioinformatics ranges from database design to systems engineering to artificial intelligence to applied mathematics and statistics, all with an underlying focus on genomic science. a variety of bioinformatics topics may be illustrated using the core technologies described in the preceding section. it is necessary to carry out sequence alignments in order to assemble sequence fragments. all of these sequences, along with the vital information about their sources, functions, and so on, must be stored in databases, which must be readily available to users in a variety of locations. once a sequence has been obtained, it is necessary to annotate its function. one of the most fundamental annotation tasks is that of computational gene finding, in which a genome or chromosome sequence is input to an algorithm that subsequently outputs the predicted location of genes. a gene sequence, whether predicted or experimentally determined, must have its function predicted, and many bioinformatics tools are available for this task. once microarray data are available, it is necessary to identify subsets of coregulated genes and to identify genes that are differentially expressed between two or more treatments or tissue types. 
polymorphism data from snps are used to search for correlations with, for example, the presence or absence of a disease in family pedigrees. these questions are all of fundamental importance and draw on many different fields. by necessity, bioinformatics is a highly multidisciplinary field. genome projects involve far-reaching collaborations among many researchers in many fields around the globe, and it is critical that the resulting data be easily available both to project members and to the general scientific community. in light of this requirement, a number of key central data repositories have emerged. in addition to providing storage and retrieval of gene sequences, several of these databases also offer advanced sequence analysis methods and powerful visualization tools. in the united states, the primary public genomics resource is that of the national center for biotechnology information (ncbi). the ncbi website (http://www.ncbi.nlm.nih.gov) provides a seemingly endless collection of data and data analysis tools. perhaps the most important element of the ncbi collection is the genbank database of dna and rna sequences. ncbi provides a variety of tools for searching genbank, and elements in genbank are linked to other databases, both within and outside of ncbi. figure 13.9 shows some results from a simple query of the genbank nucleotide database.

figure 13.9 the result of a simple query of the genbank database at ncbi. this query found 9007 entries in the genbank nucleotide database containing the term ''tyrosine kinase.'' each entry can be clicked to find additional information.

figure 13.10 a simple genbank file containing the dna sequence for a prion protein gene.

genbank data files contain a wealth of information. figure 13.10 shows a simple genbank file for a prion sequence from duck. the accession number, af283319, is the unique identifier for this entry.
the genbank file contains a dna sequence of 819 nucleotides, its predicted amino acid sequence, and a citation to the chinese laboratory that obtained the data. the ''links'' icon in the upper right provides access to related information found in other databases. it is essential for those working in genomics or bioinformatics to become familiar with genbank and the content of genbank files. ncbi is also the home of the blast database searching tool. blast uses algorithms for sequence alignment (described later in this chapter) to find sequences in genbank that are similar to a query sequence provided by the user. to illustrate the use of blast, consider a study by professor eske willerslev at the university of copenhagen. willerslev and his colleagues collected samples from siberian permafrost that included a variety of preserved plant and animal material estimated to be 300,000-400,000 years old. they were able to extract short dna sequences from the rbcl gene. these short sequences were used as input to the blast algorithm, which reported a list of similar sequences. it is likely that the most similar sequences come from close relatives of the organisms that provided the ancient dna. the european bioinformatics institute (ebi, http://www.ebi.ac.uk) is the european ''equivalent'' of ncbi. users who explore the ebi website will find much of the same type of functionality as provided by ncbi. of particular note is the ensembl project (http://www.ensembl.org), a joint venture between ebi and the sanger institute. ensembl has particularly nice tools for exploring genome project data through its genome browser. figure 13.11 shows a portion of the display for a region of human chromosome 7. ensembl provides comparisons to other completed genome sequences (rat, mouse, and chimpanzee), along with annotations of the locations of genes and other interesting features.
most of the items in the display are clickable and provide links to more detailed information on each display component. many other databases and web resources play important roles in the day-to-day working of genome scientists. table 13.2 includes a selection of these resources, along with short descriptions of their unique features. the most fundamental computational algorithm in bioinformatics is that of pairwise sequence alignment. not only is it of immediate practical value, but the underlying dynamic programming algorithm also serves as a conceptual framework for many other important bioinformatics techniques. the goal of sequence alignment is to accept as input two or more dna, rna, or amino acid sequences; identify the regions of the sequences that are similar to one another according to some measure; and output the sequences with the similar positions aligned in columns. an alignment of six sequences from hiv strains is shown in figure 13.12. sequence alignments have numerous uses. alignments of pairs of sequences help us to determine whether or not they have the same or similar functions. regions of alignments with little sequence variation likely correspond to important structural or functional regions of protein coding genes. by studying patterns of similarity in an alignment of genes from several species, it is possible to infer the evolutionary history of the species, and even to reconstruct dna or amino acid sequences that were present in the ancestral organisms. many methods for annotation, including assigning protein function and identifying transcription factor binding sites, rely on multiple sequence alignments as input. to illustrate the principles underlying sequence alignment, consider the special case of aligning two dna sequences. if the two sequences are similar, it is most likely because they have evolved from a common ancestral sequence at some time in the past.
as illustrated in figure 13.13, the sequences differ from the ancestral sequence and from each other because of past mutations. most mutations fall into one of two classes: nucleotide substitutions, which result in these two sequences being different at the location of the mutation (fig. 13.13a), and insertions or deletions of short sequences (fig. 13.13b). the term indel is often used to denote an insertion or deletion mutation. figure 13.13b shows that indels lead to one sequence having nucleotides present at certain positions, whereas the second sequence has no nucleotides at those positions. to align two sequences without error, it would be necessary to have knowledge of the entire collection of mutations in the history of the two sequences. since this information is not available, it is necessary to rely on computational algorithms for reconstructing the likely locations of the various mutation events. a score function is chosen to evaluate alignment quality, and the algorithms attempt to find the pairwise alignment that has the highest numerical score among all possible alignments. consider aligning the two short sequences cagg and cga. it can be shown that there are 129 possible ways to align these two sequences, several of which are shown in figure 13.14. how does one determine which of the 129 possibilities is best? alignments (a) and (b) each have two positions with matching nucleotides; however, alignment (b) includes three columns with indels, whereas (a) has only one. on the other hand, alignment (a) has one mismatch to (b)'s zero. there is no definitive answer to the question of which alignment is best; however, it makes sense that ''good'' alignments will tend to have more matches and fewer mismatches and indels.
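the count of 129 can be checked with a short recursion. every alignment ends in one of three kinds of column (a base over a base, a base over a gap, or a gap over a base), so the number of alignments f(m, n) of sequences of lengths m and n satisfies f(m, n) = f(m-1, n-1) + f(m-1, n) + f(m, n-1), with f(m, 0) = f(0, n) = 1. the sketch below assumes this counting convention, in which alignments that differ only in the order of adjacent gap columns are counted separately; the function name is illustrative.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def count_alignments(m, n):
    """Number of gapped alignments of two sequences with lengths m and n."""
    if m == 0 or n == 0:
        return 1  # only one way: pair every remaining base with a gap
    return (count_alignments(m - 1, n - 1)   # last column: base over base
            + count_alignments(m - 1, n)     # last column: base over gap
            + count_alignments(m, n - 1))    # last column: gap over base

print(count_alignments(len("cagg"), len("cga")))  # 129
```

the count depends only on the two lengths, not on the nucleotides themselves, and it grows rapidly as the sequences get longer.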
it is possible to quantify that intuition by invoking a scoring scheme in which each column $k$ receives a score, $s_k$, according to the formula

$$ s_k = \begin{cases} m, & \text{the bases at column } k \text{ match} \\ d, & \text{the bases at column } k \text{ do not match} \\ i, & \text{there is an indel at column } k \end{cases} $$

using this scheme with match score $m = 5$, mismatch score $d = -1$, and indel score $i = -3$, the alignment in figure 13.14a would receive a score of $5 - 3 + 5 - 1 = 6$. similarly, the alignment in figure 13.14b has a score of $5 - 3 - 3 + 5 - 3 = 1$. the remaining alignments in figure 13.14 have scores of $0$, $0$, $1$, $0$, $-5$, and $1$, respectively. alignment (a) is considered best under the standards of this scoring scheme, and, in fact, it has the best score of all 129 possible alignments. this example suggests an algorithm for finding the best scoring alignment of any two sequences: enumerate all possible alignments, calculate the score for each, and select the alignment with the highest score. unfortunately, it turns out that this approach is not practical for real data. it can be shown that the number of possible alignments of 2 sequences of length $n$ is approximately $2^{2n} / \sqrt{2 \pi n}$ when $n$ is large. even for a pair of short sequences of length 100, the number of alignments is $6 \times 10^{58}$, 35 orders of magnitude larger than avogadro's number! techniques such as the needleman-wunsch and smith-waterman algorithms, which allow for computationally efficient identifications of the optimal alignments, are important practical and theoretical components of bioinformatics. conceptually, the task of aligning three or more sequences is essentially the same as that of aligning pairs of sequences. the computational task, however, becomes enormously more complex, growing exponentially with the number of sequences to be aligned. no practically useful solutions have been found, and the problem has been shown to belong to a class of fundamentally hard computational problems said to be np-complete.
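the needleman-wunsch idea can be sketched in a few lines of python using the match, mismatch, and indel scores from the example above. this is a minimal illustration under stated assumptions, not a reference implementation: the function name needleman_wunsch and its parameters are chosen here for the example, gaps are scored with a single linear indel penalty, and when several alignments tie for the best score the traceback simply returns one of them.

```python
def needleman_wunsch(s, t, match=5, mismatch=-1, indel=-3):
    """Global alignment by dynamic programming; returns (score, row_s, row_t)."""
    rows, cols = len(s) + 1, len(t) + 1
    # F[i][j] = best score for aligning s[:i] with t[:j]
    F = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        F[i][0] = i * indel
    for j in range(1, cols):
        F[0][j] = j * indel
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + sub,   # align two bases
                          F[i - 1][j] + indel,     # base of s against a gap
                          F[i][j - 1] + indel)     # base of t against a gap

    # Trace back from the bottom-right cell to recover one optimal alignment.
    row_s, row_t = [], []
    i, j = len(s), len(t)
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if s[i - 1] == t[j - 1] else mismatch
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + sub:
            row_s.append(s[i - 1]); row_t.append(t[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] + indel:
            row_s.append(s[i - 1]); row_t.append("-"); i -= 1
        else:
            row_s.append("-"); row_t.append(t[j - 1]); j -= 1
    return F[-1][-1], "".join(reversed(row_s)), "".join(reversed(row_t))


score, top, bottom = needleman_wunsch("cagg", "cga")
print(score)   # 6
print(top)
print(bottom)
```

for cagg and cga the table has only 5 × 4 = 20 cells, yet it recovers the best score of 6 reported for alignment (a) without scoring all 129 alignments individually.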
in addition to the increased computation, there is one important new concept that arises when shifting from pairwise alignment to multiple alignment. scoring columns in the pairwise case was simple; that is not the case for multiple sequences. complications arise because the evolutionary tree relating the sequences to be aligned is typically unknown, which makes assigning biologically plausible scores difficult. this problem is often ignored, and columns are scored using a sum of pairs scoring scheme in which the score for a column is the sum of all possible pairwise scores. for example, the score for a column containing the three nucleotides cgg, again using the scores $m = 5$, $d = -1$, and $i = -3$, is $-1 - 1 + 5 = 3$. other algorithms, such as the popular clustalw program, use an approach known as progressive alignment to circumvent this issue. almost all widely used methods for finding sequence alignments rely on a scoring scheme similar to the one used in the preceding paragraphs. clearly, this formula has very little biological basis. furthermore, how does one select the scores for matches, mismatches, and indels? considerable work has addressed these issues with varying degrees of success. the most important improvement is the replacement of the simple match and mismatch scores with scoring matrices obtained from empirical collections of amino acid sequences. rather than assigning, for example, all mismatches a value of $-1$, the blosum and pam matrices provide a different penalty for each possible pair of amino acids. since these penalties are derived from actual data, mismatches between chemically similar amino acids such as leucine and isoleucine receive smaller penalties than mismatches between chemically different ones. a second area of improvement is in the assignment of indel penalties. the alignments in figures 13.14b and 13.14e each have a total of three sites with indels.
however, the indels at sites 2 and 3 of figure 13.14b could have been the result of a single insertion or deletion event. recognizing this fact, it is common to use separate open and extension penalties for indels. if the open penalty is $o = -5$ and the extension penalty is $e = -1$, then a series of three consecutive indels would receive a score of $-5 - 1 - 1 = -7$. the most common bioinformatics task is searching a molecular database such as genbank for sequences that are similar to a query sequence of interest. for example, the query sequence may be a gene sequence from a newly isolated viral outbreak, and the search task may be to find out if any known viral sequences are similar to this new one. it turns out that this type of database searching is a special case of pairwise sequence alignment. essentially, all sequences in the database are concatenated end to end, and this new ''supersequence'' is aligned to our query sequence. since the supersequence is many, many times longer than the query sequence, the resulting pairwise alignment would consist mostly of gaps and provide relatively little useful information. a more useful procedure is to ask if the supersequence contains a short subsequence that aligns well with the query sequence. this problem is known as local sequence alignment, and it can be solved with algorithms very similar to those for the basic alignment problem. the smith-waterman algorithm is guaranteed to find the best such local alignment. even though the smith-waterman algorithm provides a solution to the database search problem for many applications, it is still too slow for high-volume installations such as ncbi, where multiple query requests are handled every second. for these settings, a variety of heuristic searches have been developed. these tools, including blast and fasta, are not guaranteed to find the best local alignment, but they usually do and are, therefore, valuable research tools.
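local alignment as described above can be sketched with a small change to the global dynamic programming table: cell scores are never allowed to drop below zero, and the traceback starts at the highest-scoring cell rather than the corner. the code below is a hedged illustration of the smith-waterman idea, not the implementation used by any database service; the function name, the linear indel penalty, and the toy ''supersequence'' are assumptions made for the example.

```python
def smith_waterman(s, t, match=5, mismatch=-1, indel=-3):
    """Best local alignment of s and t; returns (score, piece_of_s, piece_of_t)."""
    H = [[0] * (len(t) + 1) for _ in range(len(s) + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, len(s) + 1):
        for j in range(1, len(t) + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            # The extra 0 lets a local alignment start anywhere.
            H[i][j] = max(0,
                          H[i - 1][j - 1] + sub,
                          H[i - 1][j] + indel,
                          H[i][j - 1] + indel)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until a zero-score cell is reached.
    piece_s, piece_t = [], []
    i, j = best_pos
    while i > 0 and j > 0 and H[i][j] > 0:
        sub = match if s[i - 1] == t[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + sub:
            piece_s.append(s[i - 1]); piece_t.append(t[j - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + indel:
            piece_s.append(s[i - 1]); piece_t.append("-"); i -= 1
        else:
            piece_s.append("-"); piece_t.append(t[j - 1]); j -= 1
    return best, "".join(reversed(piece_s)), "".join(reversed(piece_t))


# A short query located inside a longer, otherwise unrelated sequence.
print(smith_waterman("aaacaggttt", "cagg"))
```

the table still has one cell per pair of positions, which is why exact local alignment against a whole database becomes slow and heuristic tools are preferred at high query volumes.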
it is no exaggeration to claim that blast (http://www.ncbi.nlm.nih.gov/blast) is one of the most influential research tools of any field in the history of science. the algorithm has been cited in upwards of 30,000 studies to date. in addition to providing a fast and effective method for database searching, the use of blast spread rapidly because of the statistical theory developed to accompany it. when searching a very large database with a short sequence, it is very likely that one or more instances of the query sequence will be found in the database simply by chance alone. when blast reports a list of database matches, it sorts them according to an e-value, which is the number of matches of that quality expected to be found by chance. an e-value of 0.001 indicates a match that would only be found once every 1000 searches, and it suggests that the match is biologically interesting. on the other hand, an e-value of 2.0 implies that two matches of the observed quality would be found every search simply by chance, and therefore, the match is probably not of interest. consider an effort to identify the virus responsible for sars. the sequence of the protease gene, a ubiquitous viral protein, was isolated and stored under genbank accession number ay609081. if that sequence is submitted to the tblastx variant of blast at ncbi, the best matching non-sars entries in genbank (remember that the sars entries would not have been in the database at the time) all belong to coronaviruses, providing strong evidence that sars is caused by a coronavirus. this type of comparative genomic approach has become invaluable in the field of epidemiology. much work in genomic science and bioinformatics focuses on problems of identifying biologically important regions of very long dna sequences, such as chromosomes or genomes. many important regions such as genes or binding sites come in the form of relatively short contiguous blocks of dna. 
hidden markov models (hmms) are a class of mathematical tools that excel at identifying this type of feature. historically, hmms have been used in problems as diverse as finding sources of pollution in rivers, formal mathematical descriptions of written languages, and speech recognition, so there is a rich body of existing theory. predictably, many successful applications of hmms to new problems in genomic science have been seen in recent years. hmms have proven to be excellent tools for identifying genes in newly sequenced genomes, predicting the functional class of proteins, finding boundaries between introns and exons, and predicting the higher-order structure of protein and rna sequences. to introduce the concept of an hmm in the context of a dna sequence, consider the phenomenon of isochores, regions of dna with base frequencies unique from neighboring regions. data from the human genome demonstrate that regions of a million or more bases have g+c content varying from 20% to 70%, a much higher range than one would expect to see if base composition were homogeneous across the entire genome. a simple model of the genome assigns each nucleotide to one of three possible classes (fig. 13.15a) : a high g+c class (h), a low g+c class (l), or a normal g+c class (n). in the normal class, each of the four bases a, c, g, and t is used with equal frequency (25%). in the high g+c class, the frequencies of the four bases are 15% a, 35% c, 35% g, and 15% t, and in the low g+c class, the frequencies are 30% a, 20% c, 20% g, and 30% t. in the parlance of hmms, these three classes are called hidden states, since they are not observed directly. instead, the emitted characters a, c, g, and t are the observations. thus, this simple model of a genome consists of successive blocks of nucleotides from each of the three classes ( fig. 13.15b) . the formal mathematical details of hmms will not be discussed, but it is useful to understand the basic components of the models (fig. 
13.16). each hidden state in an hmm is able to emit characters, but the emission probabilities vary among hidden states. the model must also describe the pattern of hidden states, and the transition probabilities determine both the expected lengths of blocks of a single hidden state and the likelihood of one hidden state following another (e.g., is it likely for a block of high g+c to follow a block of low g+c?). the transition probabilities play important roles in applications such as gene finding. a typical question is: what is the chance of seeing a block of high g+c nucleotides shorter than 5000? these types of questions will be addressed in the examples discussed in the next section.

figure 13.16: an hmm for the states in fig. 13.15. transition probabilities govern the chance that one hidden state follows another. for example, an n state is followed by another n state 90% of the time, by an l state 5% of the time, and by an h state 5% of the time. emission probabilities control the frequency of the four nucleotides found at each type of hidden state. in the hidden state l, there is 30% a, 20% c, 20% g, and 30% t.

the task of gene prediction is conceptually simple to describe: given a very long sequence of dna, identify the locations of genes. unfortunately, the solution of the problem is not quite as simple. as a first pass, one might simply find all pairs of start (atg) and stop (tag, tga, taa) codons. blocks of sequence longer than, say, 300 nucleotides that are flanked by start and stop codons and that have lengths in multiples of three are likely to be protein coding genes. although this simple method will be likely to find many genes, it will probably have a high false positive rate (incorrectly predict that a sequence is a gene), and it will certainly have a high false negative rate (fail to predict real genes). for instance, the method fails to consider the possibility of introns, and it is unable to predict short genes.
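the first-pass method described above can be written as a short scan for open reading frames. the sketch below is illustrative only: it searches the forward strand in the three reading frames, pairs each atg with the first in-frame stop codon, and keeps blocks longer than a threshold. the function name find_orfs, the half-open coordinate convention, and the decision to ignore the reverse strand are assumptions made to keep the example small.

```python
STOP_CODONS = {"tag", "tga", "taa"}


def find_orfs(dna, min_length=300):
    """Return (start, end) pairs of candidate genes on the forward strand.

    Coordinates are 0-based and half-open; each block begins with atg,
    ends with a stop codon, and has a length that is a multiple of three.
    """
    dna = dna.lower()
    orfs = []
    for frame in range(3):
        start = None
        for pos in range(frame, len(dna) - 2, 3):
            codon = dna[pos:pos + 3]
            if start is None and codon == "atg":
                start = pos                      # first start codon in this frame
            elif start is not None and codon in STOP_CODONS:
                end = pos + 3
                if end - start > min_length:     # "longer than, say, 300"
                    orfs.append((start, end))
                start = None                     # look for the next start codon
    return orfs


# Toy sequence: two padding bases, atg, 100 gct codons, a stop codon, padding.
toy = "cc" + "atg" + "gct" * 100 + "taa" + "gg"
print(find_orfs(toy))
```

as the text notes, a scan of this kind knows nothing about introns and discards short genes by construction, which is why it serves only as a baseline.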
gene finding algorithms rely on a variety of additional information to make predictions, including the known structure of genes, the distribution of lengths of introns and exons in known genes, and the consensus sequences of known regulatory sequences. hmms turn out to be exceptionally well-suited for gene finding, and the basic structure of a simple gene finding hmm is shown in figure 13 .17. note that the hmm includes hidden states for promoter regions, the start and stop codons, exons, introns, and the noncoding dna falling between different genes. also note that not all hidden states are connected to one another. this fact reflects an understanding of gene and genome structure. the sequence of states start -intron -stop -exon -promoter is not biologically possible, whereas the series noncoding -promoter -start -exon -intron -exon -intron -exon -stop -noncoding is. good hmms incorporate this type of knowledge extensively. to put the hmm of figure 13 .17 to use, the model must first be trained. the training step involves taking existing sequences of known genes and estimating all of the transition and emission probabilities for each of the model's hidden states. for example, if a training data set included 150 introns, the observed frequency of c in those intron sequences could be used to come up with the emission probability for c in the hidden state intron. the average lengths of introns and exons would be used to estimate the transition probabilities to and from the exon and intron hidden states. once the training step is complete, the hmm machinery can be used to predict the locations of genes in a long sequence of dna, along with their intron/exon boundaries, promoter sites, and so forth. gene finding algorithms in actual use are much more complex than the one shown in figure 13 .17, but they retain the same basic structure. 
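the prediction step, in which a trained hmm assigns a hidden state to every position of a sequence, can be illustrated on the small three-state g+c model introduced earlier rather than on a full gene finder. one standard way to do this is the viterbi algorithm; the text does not say which decoding algorithm is intended, so that choice is an assumption. the emission probabilities and the transitions out of the n state are the ones given in the text and in the caption of figure 13.16, while the transitions out of h and l and the uniform starting distribution are made up here to complete the example.

```python
import math

STATES = ("h", "l", "n")

# Emission probabilities as stated in the text for the three classes.
EMIT = {
    "h": {"a": 0.15, "c": 0.35, "g": 0.35, "t": 0.15},
    "l": {"a": 0.30, "c": 0.20, "g": 0.20, "t": 0.30},
    "n": {"a": 0.25, "c": 0.25, "g": 0.25, "t": 0.25},
}
# The n row follows the figure caption; the h and l rows are ASSUMED.
TRANS = {
    "n": {"n": 0.90, "l": 0.05, "h": 0.05},
    "h": {"h": 0.90, "n": 0.05, "l": 0.05},
    "l": {"l": 0.90, "n": 0.05, "h": 0.05},
}
START = {state: 1.0 / 3.0 for state in STATES}  # ASSUMED uniform start


def viterbi(seq):
    """Most probable hidden-state path for a non-empty dna string."""
    seq = seq.lower()
    # scores[t][s] = log probability of the best path ending in state s at t
    scores = [{s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}]
    pointers = []
    for ch in seq[1:]:
        column, back = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: scores[-1][p] + math.log(TRANS[p][s]))
            column[s] = (scores[-1][prev] + math.log(TRANS[prev][s])
                         + math.log(EMIT[s][ch]))
            back[s] = prev
        scores.append(column)
        pointers.append(back)
    # Follow the back-pointers from the best final state.
    state = max(STATES, key=lambda s: scores[-1][s])
    path = [state]
    for back in reversed(pointers):
        state = back[state]
        path.append(state)
    return "".join(reversed(path))


print(viterbi("gc" * 30 + "at" * 30))
```

with these numbers a g+c-rich block followed by an a+t-rich block is labelled h and then l, with the switch placed at the boundary; a gene finding hmm applies the same decoding to states such as promoter, exon, and intron.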
the performance of gene finders continues to improve as more genomes are studied and the quality of the underlying hmms is improved. in bacteria, modern gene finding algorithms are rarely incorrect. upwards of 95% of the predicted genes are subsequently found to be actual genes, and only 1-2% of true genes are missed by the algorithms. the situation is not as rosy for eukaryotic gene prediction, however. eukaryotic genomes are much larger, and the gene structure is more complex (most notably, eukaryotic genes have introns). the effectiveness of gene finding algorithms is usually measured in terms of sensitivity and specificity. if these quantities are measured on a per nucleotide basis, an algorithm's sensitivity, $S_n$, is defined to be the percentage of nucleotides in real genes that are actually predicted to be in genes. the specificity, $S_p$, is the percentage of nucleotides predicted to be in genes that truly are in genes. good gene predictors have high sensitivity and high specificity. the best eukaryotic gene finders today have sensitivities and specificities around 90% at the level of individual nucleotides. if the quantities are measured at the level of entire exons (e.g., did the algorithm correctly predict the location of the entire exon or not?), the values drop to around 70%. an emerging and powerful approach for predicting the location of genes uses comparative genomics. the entire human genome sequence is now available, and the locations of tens of thousands of genes are known. suppose that a laboratory now sequences the genome of the cheetah. since humans and cheetahs are both mammals, they should have reasonably similar genomes. in particular, most of the gene sequences should be quite similar. gene prediction can proceed by doing a pairwise sequence alignment of the two genomes and then predicting that positions in the cheetah genome corresponding to locations of known human genes are also genes in the cheetah.
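the per-nucleotide sensitivity and specificity defined above can be computed directly once the true and predicted genic positions are known. the sketch below assumes both are supplied as sets of nucleotide positions; the function name and the toy coordinates are hypothetical. specificity is used here in the sense of the text, that is, the fraction of predicted genic nucleotides that truly lie in genes.

```python
def nucleotide_accuracy(true_genic, predicted_genic):
    """Return (sensitivity, specificity) measured per nucleotide.

    sensitivity = correctly predicted genic positions / all true genic positions
    specificity = correctly predicted genic positions / all predicted positions
    """
    correct = len(true_genic & predicted_genic)
    sensitivity = correct / len(true_genic) if true_genic else 0.0
    specificity = correct / len(predicted_genic) if predicted_genic else 0.0
    return sensitivity, specificity


# Toy example: a 100-nucleotide gene predicted with a 10-nucleotide shift.
true_positions = set(range(0, 100))
predicted_positions = set(range(10, 110))
print(nucleotide_accuracy(true_positions, predicted_positions))
```

in the toy case 90 of the 100 true positions are recovered and 90 of the 100 predicted positions are correct, so both measures are 90% even though neither boundary of the gene is placed exactly, which is the kind of gap between nucleotide-level and exon-level accuracy the text describes.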
this approach is remarkably effective, although it will obviously miss genes that are unique to one species or the other. the degree of relatedness of the two organisms also has a major impact on the utility of this approach. the human genome could be used to predict genes in the gorilla genome much better than it could be used to predict genes in the sunflower or paramecium genomes. in addition to hmms and comparative genomics approaches, a variety of other techniques are being used for gene prediction. neural networks and other artificial intelligence methods have been used effectively. perhaps most intriguing, as more and more genomes become available, are hybrid methods that integrate, for example, hmms with comparative genomic data from two or more genomes. once a genome is sequenced and its genes are found or predicted, the next step in the bioinformatics pipeline is to determine the biological function of the genes. ideally, molecular biological work would be carried out in the laboratory to study each gene's function, but clearly that approach is not feasible. two basic computational approaches will be described, one using comparative genomics and the other using hmms. comparative genomics approaches to assigning function to genes rely on a simple logical assumption: if a gene in species a is very similar to a gene in species b, then the two genes most likely have the same or related functions. this logic has long been applied at higher biological levels (e.g., the kidneys of different species have the same basic biological function even though the exact details may differ in the two species). at the level of genes, the inference is less accurate, especially if the species involved in the comparison are not closely related, but the approach is nonetheless useful and usually effective. simple database searches are the most straightforward comparative genomic approach to functional annotation. 
a newly discovered gene sequence that returns matches to cytochrome oxidase genes when input to blast is likely to be a cytochrome oxidase gene itself. complications arise when matches are to distantly related species, when the matching regions are very short, or when the sequence matches members of a multigene family. in the first case, the functions of the genes may have changed during the tens or hundreds of millions of years since the two organisms shared a common ancestor. however, if two or more such distantly related organisms have gene sequences that are nearly identical, a strong argument can be made that the gene is critical in both organisms and that the same function has been maintained throughout evolutionary history. short matches may arise simply as a result of elementary protein structure. for example, two sequences may have regions that match simply because they both encode alpha helical regions. such matches provide useful structural information, but the stronger inference of shared function is not justified. multigene families are the result of gene duplications followed by functional divergence. examples include the globin and amylase families of genes. at some point in the past, a single gene in one organism was completely duplicated in the genome. at that point, the duplicated copy was free to evolve a new, but often related, function. subsequent duplications allow for the growth and diversification of such families. because of their shared ancestry, all members of a gene family tend to have similar dna sequences. this fact makes it difficult to assign function with high accuracy when matches appear in database searches, but it often provides a general class of functions for the query sequence. efforts have been made to classify all known proteins into functional groups using comparative genomics. suppose that the genbank protein database is queried with protein sequence a and the result is that its closest match is protein sequence b. 
if the database is next queried using sequence b and the closest match for b is found to be sequence a, then these two proteins are said to be reciprocal best matches, and they are likely to have the same function. likewise, if the best match to sequence a is b, the best match to b is c, and the best match to c is a, then a, b, and c are likely to have the same function. this general principle has been used to create clusters of genes that are predicted to have similar or identical functions. the cogs (clusters of orthologous groups of proteins) database at ncbi (http://www.ncbi.nlm.nih.gov/cog) represents a comprehensive clustering of the entire genbank protein database using this type of scheme. there are many known examples of proteins or individual protein domains that have the same function or structure. the pfam (protein family) database (http:// www.sanger.ac.uk/software/pfam) includes multiple sequence alignments of almost 7500 such protein families. using the sequence data for each alignment, the pfam project members created a special type of hmm called a profile hmm. this database makes it possible to take a query sequence and, for each of the 7500 families and their associated profile hmms, ask the question, ''is the query sequence a member of this gene family?'' a query to the pfam results in a probability assigned to each of the included protein families, providing not only the best matches but also indications of the strength of the matches. currently, about 74% of the proteins in genbank have a match in pfam, indicating a fairly high likelihood of any newly discovered protein having a pfam match. pfam is of interest not only because of its effectiveness, but also because of its theoretical approach of combining comparative genomic and hmm components. 
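the reciprocal best match rule described at the start of this passage is easy to state as code. the sketch below assumes that the best database hit for every query has already been collected into a dictionary; the dictionary contents and the function name are invented for illustration and do not correspond to real database records or to the actual cogs construction procedure.

```python
def reciprocal_best_matches(best_match):
    """Return sorted pairs (a, b) where a's best match is b and b's is a."""
    pairs = set()
    for query, hit in best_match.items():
        # Skip self-hits; require the relationship to hold in both directions.
        if query != hit and best_match.get(hit) == query:
            pairs.add(tuple(sorted((query, hit))))
    return sorted(pairs)


# Hypothetical search results: protein name -> name of its closest match.
best_match = {
    "protein_a": "protein_b",
    "protein_b": "protein_a",
    "protein_c": "protein_a",   # one-directional, so not reciprocal
}
print(reciprocal_best_matches(best_match))
```

extending the idea to the three-way case in the text (a to b, b to c, c to a) amounts to looking for short cycles in the same dictionary.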
a common experiment is to use microarray or oligonucleotide array technology to measure the expression level for several thousand genes under two different ''treatments.'' it is often the case that one treatment is a control while the other is an environmental stimulus such as a drug, chemical, or change in a physical variable such as temperature or ph. other possibilities include comparisons between two tissue types (e.g., brain vs. heart), between diseased and undiseased tissues (e.g., tumor vs. normal), or between samples at two developmental phases (e.g., embryo vs. adult). one of the primary reasons to carry out such an experiment is to identify the genes that are differentially expressed between the two treatments. the basic format of the data from a simple two-treatment microarray experiment is the following: each spot on a microarray corresponds to a single gene, and in competitive hybridization experiments, a single spot usually provides measurements of gene expression under two different treatments. note that the first column has been intentionally labeled ''spot'' instead of ''gene.'' it is important that the same gene be used and measured multiple times; therefore, a number of different spots will typically correspond to the same gene. the final column of data is the most important for interpreting this experiment. the most extreme difference in relative expression levels is found at spot 4, where the gene is expressed almost fourfold higher under treatment 2. the question now becomes, ''how large (or small) must the ratio be to say that the expression levels are really different?'' this question is one of variability and of statistical significance. phrased differently, would a ratio near 0.27 for spot 4 be likely if the experiment were repeated? the data in the table do not provide the necessary information to answer this question, and this fact points out the importance of replication in experimental design. 
whenever quantitative measurements are to be compared, replication is needed in order to estimate the variance of the measurements. this fundamental tenet of experimental design was largely ignored during the early history of microarray studies. fortunately, recent work has included careful attention to experimental design and proper analysis using the analysis of variance (anova). typical experiments now include five or more replicate measurements of each gene. in order to detect very small treatment effects on levels of expression, even larger amounts of replication are needed. a second type of microarray experiment is designed not to find differentially expressed genes, but to identify sets of genes that respond to two or more treatments in the same manner. this type of study is best illustrated with a time course study in which expression levels are measured at a series of time intervals. examples of such studies might involve measuring expression levels in laboratory mice each hour following exposure to a toxic chemical, expression levels in a mother or fetus at each trimester of a pregnancy, or expression levels in patients each year following infection with hiv. if plots of expression levels (y axis) against time (x axis) for each gene are overlaid as shown in figure 13.18, it is possible to visually compare the expression profiles of genes. the desired pattern is a group of genes that tend to increase or decrease their expression levels in unison. in figure 13.18 it appears that genes 2 and 5 have very similar expression profiles, as do genes 1 and 4. the similarity between the expression profiles of two genes can be described using the correlation coefficient,

$$ r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2 \, \sum_i (y_i - \bar{y})^2}} $$

where $x_i$ and $y_i$ are the expression levels of genes x and y at time point $i$, and $\bar{x}$ and $\bar{y}$ are their means over the time points. values near 1 indicate that the two genes have very similar profiles, and values near $-1$ indicate profiles that are strongly related but move in opposite directions. when faced with thousands of profiles, the task becomes a bit more problematic.
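the correlation coefficient between two expression profiles can be computed in a few lines. the sketch below uses the standard pearson definition (the sum of products of deviations from the mean, divided by the square root of the product of the two sums of squared deviations); the function name and the toy profiles are made up for illustration and are not data from the text.

```python
import math


def correlation(x, y):
    """Pearson correlation of two equally long expression profiles."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("a constant profile has no defined correlation")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical profiles at four time points.
gene_1 = [1.0, 2.0, 3.0, 4.0]
gene_2 = [2.0, 4.0, 6.0, 8.0]   # same shape as gene_1, twice the level
gene_3 = [8.0, 6.0, 4.0, 2.0]   # falls whenever gene_1 rises
print(correlation(gene_1, gene_2), correlation(gene_1, gene_3))
```

because the coefficient depends on the shape of a profile and not on its absolute level, two genes expressed at different levels can still be perfectly correlated, as gene_1 and gene_2 are here.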
a common theme is to cluster genes on the basis of the similarity in their profiles, and many algorithms for carrying out the clustering have been published. all of these algorithms share the objective of assigning genes to clusters so that there is little variation among profiles within clusters, but considerable variation between clusters. top down clustering begins with all genes in a single cluster, then recursively partitions the genes into smaller and smaller clusters. bottom up methods start with each gene in its own cluster and progressively merge smaller clusters into larger ones. clustering algorithms may also be supervised, meaning that the user specifies ahead of time the final number of clusters, or unsupervised, in which case the algorithm determines the final number of clusters. the emergence of genomic science has not simply provided a rich set of tools and data for studying molecular biology. it has been the catalyst for an astounding burst of interdisciplinary research, and it has challenged long-established hierarchies found in most institutions of higher learning. the next generation of biologists will need to be as comfortable at a computer workstation as they are at the lab bench. recognizing this fact, many universities have already reorganized their departments and their curricula to accommodate the demands of genomic science. from a more practical point of view, the results of genomic research will begin to trickle into medicine. already, diagnostic procedures are changing rapidly as a result of genomics. the next phase of genomics will focus on relating genotypes to complex phenotypes, and as those connections are uncovered, new therapies and drugs will follow. consider, for example, a drug that is of significant benefit to 99% of users, but causes serious side effects in the remaining 1%. such drugs currently have difficulty remaining in the marketplace. 
however, the use of genetic screens to identify the patients likely to suffer side effects should make it possible for these drugs to be used safely and effectively. less imminent, but certainly in the foreseeable future, are gene therapies that will allow for repair of genetic defects. the continued interplay of biology, engineering, and the mathematical sciences will be responsible for exploration of these frontiers.

figure 13.18: overlaid expression profiles for 5 genes.

exercises
1. how many possible proteins could be formed by a gene region containing four exons?
2. in general, eukaryotes have introns, whereas prokaryotes do not. what are possible advantages and disadvantages of introns?
3. most amino acids are encoded by more than a single codon. if one of these synonymous codons is energetically more efficient for the organism to use, what effect would that have on the organism's genome content? how might this fact be used in gene finding algorithms?
4. what is the chance that a 20-nucleotide oligonucleotide matches a sequence other than the one it was designed to match? assume for simplicity that all nucleotides have frequency 25%. how many matches to that oligonucleotide would one expect to find in the human genome?
5. if each of the 13 ssr markers used by the fbi for identification purposes has 20 equally frequent alleles, what is the chance that two randomly chosen individuals have the same collection of alleles at those 13 markers?
6. how many mammalian genomes have been completely sequenced? what are they?
7. what is the size of the anopheles gambiae genome? how many chromosomes does it have? how many genes does it have?
8. what is the length of the drosophila melanogaster alcohol dehydrogenase gene?
9. consider the following two alignments for the sequences cggtca and cagca: c-ggtca c-ggtca ca-g-ca ca-gca. a.
find the score of each alignment using a match score of 5, mismatch penalty of −2, and gap penalty of −4. b. find the score of each if the gap penalty is −5 for opening and −1 for extending.
10. suppose a computer can calculate the scores for one million alignments per second. how long would it take to find the best alignment of two 1000 bp sequences by exhaustive search?
11. find an example of a zinc finger gene sequence using genbank. use blast to discover how many genbank sequences are similar to the sequence you found. what does the result tell you about zinc finger genes?
12. what are some additional features that might be added to the simple gene finding hmm of fig. 13.17?
draw a diagram of a simple gene finding hmm that might be useful for prokaryotes. the hmm should contain hidden states for exons and intergenic regions, and it should guarantee that exons have lengths that are multiples of three.
use the pfam website to give a brief description of the structure and function of members of the hamartin gene family.
in figure 13.18, gene 2 seems to be expressed at higher levels than gene 5. justify the claim that the two genes have similar profiles and might be coregulated.
compute the correlation coefficient for each pair of genes. do any of them have similar profiles?
17. the expression levels for two genes measured at four times are: implication of the correlation coefficient?
often, the gene sequences placed on microarray slides are of unknown function. suppose that an experiment identifies such a gene as being important for formation of a particular type of tumor.
when carrying out a database search using blast with a protein coding gene as the query sequence, there are two possible approaches. first, it is possible to query using the original dna sequence.
second, one could translate the coding dna and query using the amino acid sequence of the encoded protein.

key: cord-321715-bkfkmtld authors: redelings, benjamin d; suchard, marc a title: incorporating indel information into phylogeny estimation for rapidly emerging pathogens date: 2007-03-14 journal: bmc evol biol doi: 10.1186/1471-2148-7-40 sha: doc_id: 321715 cord_uid: bkfkmtld background: phylogenies of rapidly evolving
pathogens can be difficult to resolve because of the small number of substitutions that accumulate in the short times since divergence. to improve resolution of such phylogenies we propose using insertion and deletion (indel) information in addition to substitution information. we accomplish this through joint estimation of alignment and phylogeny in a bayesian framework, drawing inference using markov chain monte carlo. joint estimation of alignment and phylogeny sidesteps biases that stem from conditioning on a single alignment by taking into account the ensemble of near-optimal alignments. results: we introduce a novel markov chain transition kernel that improves computational efficiency by proposing non-local topology rearrangements and by block sampling alignment and topology parameters. in addition, we extend our previous indel model to increase biological realism by placing indels preferentially on longer branches. we demonstrate the ability of indel information to increase phylogenetic resolution in examples drawn from within-host viral sequence samples. we also demonstrate the importance of taking alignment uncertainty into account when using such information. finally, we show that codon-based substitution models can significantly affect alignment quality and phylogenetic inference by unrealistically forcing indels to begin and end between codons. conclusion: these results indicate that indel information can improve phylogenetic resolution of recently diverged pathogens and that alignment uncertainty should be considered in such analyses. reconstructing viral phylogenies is important for determining the parent stock of newly emerging strains [1] , as well as for understanding how viruses evolve over time, both within a single host and at the population level [2] . viral phylogenies are commonly inferred from aligned molecular sequence data, using the information available in substitutions shared by descent [3] [4] [5] [6] . 
short time-scales dominate in the development of rapidly emerging disease strains, such that the number of observed substitutions between sequences can be too low to yield well-resolved phylogenies. thus, to increase phylogenetic resolution for such disease strains we seek to make use of a wider class of phylogenetic information. insertions and deletions (indels) are a promising category of molecular sequence information that is largely ignored in phylogenetic reconstruction. researchers commonly remove gaps from molecular sequence alignments by coding them as missing data or by throwing out columns that contain gaps [3] [4] [5] [6]. indels may be useful to resolve deep branches in the tree of life that are difficult to resolve using information in shared substitutions [7, 8]. at the other extreme, on which we focus here, indels can help to resolve phylogenies in situations where the number of nucleotide substitutions is inadequate. for example, indels in non-coding chloroplast dna have been helpful in resolving the branching order of recent plant radiations [9, 10]. the rate of indel events in these regions approaches or surpasses the rate of substitution, making indels too important to ignore [10]. several species of viruses are also known to accumulate indels, sometimes at a high rate. cheynier et al. [11] note that indel rates are higher than substitution rates in hyper-variable regions of simian immunodeficiency virus (siv) and human immunodeficiency virus (hiv). other viruses also experience indels on short time-scales. hepatitis b virus (hbv) accumulates deletions in the core/pre-core region during the course of infection [12], while equine infectious anemia virus accumulates insertions [13]. three deletion variants of severe acute respiratory syndrome (sars) appeared during the beginning of the sars outbreak in china [14]. influenza b viruses accumulate indels over several decades [15].
we note that these viruses are all rna viruses, with the exception of hbv. although hbv is a dna virus, it reverse transcribes its dna genome from an rna intermediate. redelings and suchard (2005) describe a statistical method of incorporating indel information into phylogeny estimation. this method uses a joint reconstruction framework that simultaneously infers the alignment, tree, and insertion/deletion rates. estimation proceeds through markov chain monte carlo (mcmc) within a bayesian framework and naturally accounts for uncertainty in alignments, phylogenies, and other parameters through posterior probabilities. unlike sensitivity analysis [16, 17], this approach takes into account uncertainty resulting from the myriad of near-optimal alignments. this approach involves averaging over unobserved quantities such as the alignment and internal node states, which can lead to improved estimates [18]. this is different from other approaches which iteratively optimize a heuristically chosen cost function until no improvement is seen [19, 20]. joint estimation of alignment and phylogeny sidesteps bias that results from conditioning on a single alignment estimate [21, 18], bias which may be exaggerated when indel information is inappropriately used. this method is based on a probabilistic model of sequence evolution that contains insertion and deletion events as well as substitution events. heuristic "costs" for opening and extending gaps are replaced by the insertion/deletion rate and the mean indel length respectively, which are biologically interpretable parameters and can be estimated from the data without circularity [22, 23]. gaps are not treated as a fifth character state, since this overweights the evidence of shared indels by treating an indel of multiple residues as multiple shared indels [3]. instead, the indel process is separate and independent of the substitution process, and allows indels of several residues simultaneously.
in addition, because alignments represent positional homology, the indel process does not allow a newly inserted character to be aligned to a previously deleted character. we introduce a new indel model to remedy a shortcoming of the redelings and suchard (rs05) model. unlike the tkf1 [22] and tkf2 [23] indel models that are not reversible on pairwise alignments, the reversible rs05 model does not make use of branch length information in the indel process and therefore does not place indels preferentially on longer branches. in order to increase biological realism, we describe an extended indel model that is able to incorporate branch length information. in doing so we overcome a substantial theoretical difficulty in using reversible indel models during phylogenetic reconstruction. we further enhance the estimation method of redelings and suchard [24] by introducing a novel mcmc transition kernel to improve mixing among topologies. this transition kernel is based on the subtree-prune-and-regraft (spr) operator but is modified to partially sample the alignment along with the tree. block sampling improves mixing efficiency because topologies and alignments are highly inter-correlated. we introduce codon models [25] into joint estimation. codon models are often used in both bayesian and likelihood-based phylogeny estimation because they naturally allow different rates at the third codon position, but we are not aware of any work using codon models in joint estimation. we note that codon models implicitly alter the indel process as well as the substitution process by forcing indels to begin and end between codons. this constraint may not be biologically realistic and would result in misaligned nucleotides when indels are not in phase with the reading frame. such misalignment can artificially inflate the number of inferred substitutions. when the total number of substitutions is small, this may significantly alter the model fit or introduce bias.
we compare nucleotide and codon indel models to see if these effects are significant. we analyze data sets from siv and hiv. the siv data set consists of a short section of the envelope (env) gene from 9 within-host strains. to see if indel information improves phylogenetic resolution we compare the number of bi-partitions that are supported under the joint model and the traditional sequential approach, in which topology reconstruction assumes a previously determined alignment. we also assess the importance of alignment ambiguity by assessing the sensitivity of phylogeny estimation to fixed alignments under both the traditional and joint models. the hiv data set consists of about 600 nucleotides from the env gene from 27 within-host strains. we compare the number of bi-partitions supported under the sequential and joint models to assess the importance of indel information. we also compare nucleotide and codon models to see if the assumption of unbreakable codons significantly decreases model fit or influences phylogeny estimates. in summary, we seek to improve the power to infer clades in rapidly emerging taxa by making use of indel information in a statistically rigorous manner. we also seek to determine whether indels can actually resolve extremely short branches with few substitutions. to accomplish these goals, we introduce an improved statistical model of the insertion-deletion process to improve the accuracy of the inference, and describe a novel mcmc transition kernel to improve the speed of the inference. once our statistical framework is in place, we then demonstrate that indel information can help to detect previously undetected bi-partitions in two real data examples from rna viruses. while analyzing these data, we note that alignment ambiguity may significantly affect phylogeny inference. 
we note that codon-based alignments can unrealistically shift indels to avoid breaking codons, and we develop the necessary statistical machinery to demonstrate that this can substantially affect phylogeny estimates. we introduce a time-dependent reversible indel process to the probabilistic framework for joint estimation of alignment and phylogeny of redelings and suchard [24]. time-dependence enables us to place indels preferentially on longer branches of the tree, producing a more realistic description of the evolutionary process. further, we also introduce a novel mcmc transition kernel to increase topology mixing so that we can estimate phylogenies and alignments containing increasingly more taxa. we review the salient features of the rs05 model here and propose the necessary extensions for a time-dependent indel process. our model starts with data y, where y is a collection of unaligned molecular sequences y_i for i = 1, ..., n taxa. each molecular sequence y_i is a collection of letters of length |y_i|. we characterize the stochastic model that describes how the sequences in y diverged from a common ancestor in terms of a number of unknown but estimable parameters. these parameters include a multiple alignment a that specifies the positional homology between the sequences y, an evolutionary tree (τ, t) where τ is an unrooted bifurcating tree topology and t = (t_1, ..., t_{2n-3}) is a vector of branch lengths along the edges in τ, and vectors θ and λ are parameters that characterize the letter substitution and indel processes respectively. alignment a includes felsenstein wildcard sequences of random lengths at the internal nodes of τ. thus, a also depicts the complete indel history among the sequences in y. we scale branch lengths in terms of expected number of substitutions per site.
in contrast to traditional methods of phylogeny estimation that arbitrarily fix the alignment, we treat the alignment a as a random variable, leading to the probability expression p(y, a, τ, t, θ, λ) = p(y|a, τ, t, θ) p(a|τ, t, λ) p(τ, t) p(θ) p(λ). (1) the substitution likelihood p(y|a, τ, t, θ) and the priors p(τ, t) and p(θ) occur in traditional bayesian models that fix the alignment. however, the alignment prior p(a|τ, t, λ) and the prior on indel process parameters p(λ) are novel in the joint model, allowing for estimation and a natural way to handle uncertainty in a. to model the substitution process that specifies p(y|a, τ, t, θ), we assume that substitutions in each column of a occur independently and follow a continuous-time markov chain (ctmc) process [26]. under this process, letters at the root of the tree arise according to some distribution π. evolution then occurs independently along each branch of τ with rate matrix q. we restrict ourselves to reversible markov chains and use π as the equilibrium distribution of q. this makes the position of the root unidentifiable and so we use unrooted trees throughout this paper. ctmc models are in common usage for letters from nucleotide-, codon-, and amino acid-based alphabets. in contrast to nucleotide-based ctmc models, codon-based models group the three nucleotides in a codon into a single letter. given the small number of substitutions that occur during the emergence of rapidly evolving pathogens, codon-based models are preferred over amino acid-based models because they do not discard synonymous substitutions. codon-based models can also improve model efficiency over nucleotide-based models because the codon-based models can include non-independent nucleotide frequencies and rule out missense mutations [25]. codon-based models may also improve the accuracy of estimation by allowing the third-codon position to evolve at a higher rate.
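the reversible ctmc described above can be illustrated with a small numerical sketch. it assumes an hky-style rate matrix (the singlet model used later in the text), scaled to one expected substitution per unit branch length, and approximates the transition probabilities exp(qt) with a truncated taylor series plus repeated squaring. the function names, the frequencies and the value of κ are illustrative assumptions, not part of the original method description.

```python
NUCS = "acgt"
TRANSITIONS = {("a", "g"), ("g", "a"), ("c", "t"), ("t", "c")}


def hky_rate_matrix(pi, kappa):
    """HKY-style rate matrix, scaled to one expected substitution per unit time."""
    q = [[0.0] * 4 for _ in range(4)]
    for i, x in enumerate(NUCS):
        for j, y in enumerate(NUCS):
            if i != j:
                q[i][j] = pi[j] * (kappa if (x, y) in TRANSITIONS else 1.0)
        q[i][i] = -sum(q[i])  # rows of a rate matrix sum to zero
    rate = -sum(pi[i] * q[i][i] for i in range(4))
    return [[v / rate for v in row] for row in q]


def matmul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def transition_probs(q, t, squarings=10, terms=12):
    """Approximate P(t) = exp(Q t) by scaling and squaring a Taylor series."""
    n = len(q)
    scale = t / (2 ** squarings)
    m = [[q[i][j] * scale for j in range(n)] for i in range(n)]
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms + 1):
        term = [[v / k for v in row] for row in matmul(term, m)]
        result = [[result[i][j] + term[i][j] for j in range(n)]
                  for i in range(n)]
    for _ in range(squarings):
        result = matmul(result, result)
    return result
```

reversibility shows up as detailed balance, π_i p_ij(t) = π_j p_ji(t), which is why the root position cannot be identified from the substitution likelihood alone.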
however, when the number of observed substitutions is low it may not be possible to estimate the non-synonymous to synonymous rate ratio ω, requiring researchers to fix ω to a previously estimated value. importantly, we note that codon-based models also affect the indel process by forbidding frameshift mutations and also indels that begin or end within a codon. while the former constraint is realistic for biologically active viruses, the latter constraint may force incorrect alignments at the nucleotide level, causing up to two misaligned residues per indel. this may result in a significant bias when the total number of substitutions is small. redelings and suchard [24] make the simplifying assumption that the alignment prior p(a|τ, t, λ) = p(a|τ, λ) (2) is independent of branch lengths. while this assumption implies that indels are equally likely to occur on each branch regardless of length, it trivially enforces that sequence length distributions φ on all nodes in τ remain the same. this is a necessary condition for constructing a reversible evolutionary hidden markov model (hmm) from pair-hmms along the branches of τ. reversibility substantially decreases implementation complexity. the assumption further allows us to avoid fragment based pair-hmms that tend to separate indels by the average indel length, which is not necessarily biologically realistic. here we develop an alignment prior p(a|τ, t, λ) that explicitly depends on branch lengths but retains equivalent sequence length distributions on all nodes of the tree. we begin construction of the extended model by briefly summarizing how the original indel model is constructed from a pairwise alignment distribution ν. we modify this construction to build the new indel model from a parameterized distribution ν_t on pairwise alignments that corresponds to a divergence time t. we then describe a new pair-hmm which serves to generate ν_t. finally we describe how to calculate posterior probabilities under this model.
to describe our original multiple alignment model, we begin by noting that, given a topology τ, the multiple alignment a can be decomposed into a set of pairwise alignments a^(b) along each branch b of the topology. this decomposition is possible because of the inclusion of felsenstein wildcard sequences at the internal nodes of τ. imposing an arbitrary distribution ν on each pairwise alignment a^(b) independently yields a joint distribution over a. however, pairwise alignments on neighboring branches are not strictly independent because they both specify the length of the random sequence at the node they share. to handle this dependence, we first choose an arbitrary internal node in τ as the root; this imposes an orientation on each branch. we then label the sequence in each branch alignment a^(b) that is closest to the root as the ancestral sequence and the other sequence as the descendant sequence. we sample the sequence length at the root from a root length distribution and draw the pairwise alignment a^(b) for each branch b from ν conditional on the length of the ancestral sequence, proceeding down the tree from the root to the leaves. we note that the pairwise alignment distribution ν induces a sequence length distribution on each sequence in the pair it emits. to proceed, we require that the pairwise alignment distribution ν be symmetric under interchange of the two sequences in the pair. this implies that there is no preferred direction of evolution between the two sequences. it also implies that the sequence length distributions for the ancestral and descendant sequences are equal; we call this common distribution φ. if we set the root length distribution equal to φ, then we can write the multiple alignment prior as p(a|τ, λ) = ∏_b ν(a^(b)) / ∏_{n ∈ i} φ(|a_n|)^2, (3) where i represents the set of internal nodes in τ [24] and |a_n| is the length of the sequence at internal node n. note that in this expression the arbitrary root is not identifiable.
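the reason the arbitrary root drops out of the multiple alignment prior can be seen from a short derivation. this is a sketch that follows the construction described above; the symbols \ell_n (sequence length at node n), r (the arbitrary root) and anc(b) (the ancestral node of branch b) are notation introduced here for illustration.

```latex
\begin{align*}
P(A \mid \tau, \lambda)
  &= \phi(\ell_r) \prod_{b} \nu\!\left(A^{(b)} \,\middle|\, \ell_{\mathrm{anc}(b)}\right)
   = \phi(\ell_r) \prod_{b} \frac{\nu\!\left(A^{(b)}\right)}{\phi\!\left(\ell_{\mathrm{anc}(b)}\right)} \\
  &= \frac{\prod_{b} \nu\!\left(A^{(b)}\right)}{\prod_{n \in I} \phi(\ell_n)^{2}} .
\end{align*}
```

in an unrooted bifurcating tree every internal node touches three branches. the root is ancestral on all three, contributing \phi \cdot \phi^{-3} = \phi^{-2}; every other internal node is ancestral on two branches, contributing \phi^{-2}; leaves are never ancestral. each internal node therefore enters with the same exponent, whichever one is chosen as the root.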
unfortunately the parameters that characterize our original pairwise alignment distribution ν cannot vary from branch to branch without inducing unequal length distributions. we therefore propose a new pairwise alignment prior that maintains a fixed sequence length distribution φ even when the indel probability varies from branch to branch. to accomplish this aim, we assume that each sequence consists of a series of unbreakable fragments, as in the tkf2 model. the fragment lengths are geometrically distributed with continuation probability ε and minimum length 1. the number of fragments is uniformly distributed over the non-negative integers. following an ancestral fragment at one end of a branch, a geometric number of new fragments are inserted in the descendant with continuation probability δ(t). each ancestral fragment is deleted in the descendant with probability δ'(t) = δ(t)/(1 - δ(t)). following our previous model and the tkf models, insertions and deletions are equally likely. this model can be expressed as a symmetrical pair-hmm (figure 1), implying that alignments can be considered non-directed, since the probability does not change when ancestor and descendant sequences are interchanged. this contrasts with the tkf models that induce irreversible distributions on pairwise alignments. a major advantage of this symmetry is that it is clear how to construct alignment models on an unrooted tree and leads to greater simplicity in model implementation and, arguably, decreased computation time. the model described here diverges from our previous model in that match fragments no longer contain only a single letter, but instead follow the same length distribution as gap fragments. this is represented graphically in the pair-hmm by the addition of a loop with non-zero weight ε from the match state (+/+) to itself. to facilitate dependence of pairwise alignment distribution ν_t on t, we seek a natural relationship between δ(t) and t.
we define λ as the indel rate per residue per unit branch length. (figure 1: pair-hmm representation of the fragment-based indel model.) in the fragment model, δ'(t) is the probability of a fragment being inserted or deleted. we wish to re-parameterize the fragment model in terms of a per-residue indel rate; the probability of an indel occurring between two residues is (1 - ε)δ'(t). however, if we attempt to set δ'(t) = (1 - e^{-λt}) / (1 - ε), (4) then the probability δ'(t) can become greater than 1. we therefore move the factor of (1 - ε) into the time scale, such that δ'(t) = 1 - e^{-λt/(1 - ε)}. (5) we note that equation (4) agrees with equation (5) to first order in λt_b and serves to connect fragment indel rates to per-residue indel rates. the product λt_b is in general << 1, so matching on higher order terms is unnecessary. the distribution ν_t naturally gives rise to two models. in the first model, denoted "fragments", we set ν^(b) = ν_λ for all b, making the probability of an indel independent of branch length again. in the second model, denoted as "fragments+t", we set ν^(b) = ν_{t_b}, making the probability of an indel roughly proportional to branch length t_b. we now show that the sequence length distribution induced by ν_t is independent of t. the pairwise alignment distribution is a uniform distribution on the number of fragments, with each fragment being a match (+/+), insertion (-/+) or deletion (+/-) with probabilities 1 - 2δ(t), δ(t) and δ(t), respectively, and with exit measure (1 - δ(t)). this results in the following probability generating function for the length of either sequence in the pair-hmm: g(z) = (1 - εz) / (1 - z) = 1 + (1 - ε)(z + z^2 + ...). therefore, the length distribution is independent of δ(t), and is uniform except for an anomaly at length 0. this allows us to specify a different value of δ(t) in the pair-hmm on each branch of the tree without affecting φ. defining L_1 and L_2 as the emitted sequence lengths from the pair-hmm, we note that p(L_1 = l_1) has finite measure and that the distribution p(L_2 = l_2 | L_1 = l_1) on l_2 is therefore proper.
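the two points made above, that dividing by (1 - ε) can push δ'(t) above 1 while rescaling time cannot, and that the induced length distribution does not depend on δ(t), can be checked numerically. the sketch below assumes the two parameterizations δ'(t) = (1 - e^{-λt})/(1 - ε) and δ'(t) = 1 - e^{-λt/(1 - ε)}, and computes the length measure as the power-series coefficients of (1 - δ)/(1 - h(z)), with h(z) = δ + (1 - δ)g(z) and g(z) the generating function of a geometric fragment length. all function names are hypothetical.

```python
import math


def delta_prime_naive(lam, t, eps):
    """Per-residue indel probability divided by (1 - eps); may exceed 1."""
    return (1.0 - math.exp(-lam * t)) / (1.0 - eps)


def delta_prime_rescaled(lam, t, eps):
    """Factor (1 - eps) moved into the time scale; always below 1."""
    return 1.0 - math.exp(-lam * t / (1.0 - eps))


def delta_from_delta_prime(dp):
    """Invert delta' = delta / (1 - delta)."""
    return dp / (1.0 + dp)


def length_measure(delta, eps, max_len):
    """Measure of each sequence length 0..max_len emitted by the pair-HMM.

    A fragment adds nothing to this sequence with probability delta
    (it belongs to the other sequence only); otherwise it adds a
    geometric length with continuation probability eps and minimum 1.
    """
    g = [0.0] + [(1.0 - eps) * eps ** (k - 1) for k in range(1, max_len + 1)]
    d = [1.0 - delta] + [-(1.0 - delta) * g[k] for k in range(1, max_len + 1)]
    r = [1.0 / d[0]]  # power-series inverse of d(z) = 1 - h(z)
    for k in range(1, max_len + 1):
        r.append(-sum(d[j] * r[k - j] for j in range(1, k + 1)) / d[0])
    return [(1.0 - delta) * x for x in r]
```

because δ = δ'/(1 + δ'), keeping δ' below 1 also keeps δ below 1/2, so the match probability 1 - 2δ stays positive.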
this implies that the posterior distribution of the joint model is proper because the distribution conditions on the observed leaf sequence lengths. we introduce a novel mcmc transition kernel that improves mixing between topologies and alignments. the new transition kernel uses the spr operator (figure 2) to propose new trees, but is extended to be alignment-aware. our previous approach used only nearest-neighbor-interchange (nni) operators to propose new trees [24]. this resulted in long convergence times and inefficient mixing when there were many taxa. the spr operator improves on this situation by proposing non-local topology rearrangements that would require several nni moves, and thus avoids several intermediates [27]. we resample the alignment a along with the topology τ. in our framework, it is necessary to alter a when τ is altered because a specifies the homology of internal sequences and this homology may be inconsistent with the proposed topology. this happens when some column of a contains a letter that would be deleted and reinserted given the new topology. after an spr tree proposal, we note that the alignment of the subset of sequences corresponding to taxa in the pruned subtree (figure 2, blue) must remain consistent because their phylogeny remains unchanged. likewise, the alignment of the other sequences (figure 2, green) must remain consistent because the phylogeny of that subset remains unchanged. however the alignment of the complete set of sequences may not be consistent. our solution to this problem involves collapsed gibbs sampling [28]. we collapse a set c of alignments into a single point by summing over it, so that τ can be compared with τ' as long as c is large enough to contain alignments consistent with both τ and τ'. then we sample a single point from the chosen collapsed point in proportion to its posterior probability. to satisfy detailed balance, the set c must be constructed so that it contains at least the current alignment a; full conditions under which this procedure satisfies detailed balance are described in the appendix.
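the collapsed sampling idea can be sketched on a toy problem in which the constraint set c is small enough to enumerate. the sketch assumes the unnormalized log posterior of every alignment in c is available under both topologies; in the method described above these sums would come from dynamic programming rather than enumeration. here the topology is drawn gibbs-style in proportion to the collapsed weights, which is an illustrative simplification and not necessarily the exact acceptance rule of the original kernel. all names are hypothetical.

```python
import math
import random


def log_sum_exp(values):
    """Stable log of a sum of exponentials (at least one finite value)."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def sample_index(log_weights, rng):
    """Draw an index with probability proportional to exp(log_weight)."""
    total = log_sum_exp(log_weights)
    u = rng.random()
    acc = 0.0
    last = 0
    for i, lw in enumerate(log_weights):
        p = math.exp(lw - total)
        if p > 0.0:
            acc += p
            last = i
            if u < acc:
                return i
    return last  # guards against rounding when u is very close to 1


def collapsed_topology_move(log_post_current, log_post_proposed, rng):
    """Choose a topology from collapsed weights, then resample the alignment.

    Each list holds the unnormalized log posterior of every alignment in
    the constraint set C; -inf marks an alignment that is inconsistent
    with that topology. Returns (topology, alignment_index), where
    topology is 0 for the current tree and 1 for the proposed tree.
    """
    collapsed = [log_sum_exp(log_post_current), log_sum_exp(log_post_proposed)]
    choice = sample_index(collapsed, rng)
    weights = log_post_proposed if choice == 1 else log_post_current
    return choice, sample_index(weights, rng)
```

summing over c before comparing the trees means that a proposed topology is not penalized simply because the current alignment happens to be inconsistent with it.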
we now seek a set c that is large enough to contain alignments consistent with τ' and yet small enough for integration and sampling to be computationally feasible. unfortunately, integration over the set of all alignments is not practical, even if we constrain the alignment of leaf sequences to be constant. therefore, we fix parts of the alignment and collapse only the remaining portions. allowing only the three branch alignments adjacent to node o (figure 2) to vary will certainly allow an alignment consistent with τ'. this is therefore a loose constraint, which we call c_3(a, τ, o). it requires an o(l^3) dynamic programming algorithm for integration and resampling. to decrease the order of the dynamic programming algorithm to o(l^1), we consider imposing the additional constraint, which we call c_1(a, τ, o), that the alignments between the three nodes connected to o remain fixed. this constraint is tight, however, and may not include any alignments consistent with τ'. as an alternative, we propose to fix the alignment between sequences in the pruned subtree and the alignment between sequences in the remainder, but allow the alignment between the two groups of sequences to vary. this constraint, which we call c_2(a, τ, op), leads to an o(l^2) dynamic programming algorithm that is significantly more computationally efficient than an o(l^3) algorithm. note that we have demonstrated above that the alignment within the two subgroups of sequences remains consistent under an spr proposal.
figure 2: the subtree-prune-and-regraft operator. (a) first a subtree (blue) and its associated node o are detached from the rest of the tree (green). (b) the subtree is then regrafted into a different branch through its node o. in both (a) and (b), three branches connect to node o. the phylogeny relating sequences at the pruned nodes (blue) and the phylogeny relating sequences at the remaining nodes (green) do not change. therefore alignments within each of these sequence subsets can remain unchanged from (a) to (b).
thus, the constraint set c_2(a, τ, op) contains an alignment that is consistent with τ' as well as τ, making c_2(a, τ, op) a useful constraint set for collapsed sampling. triplet models coalesce three adjacent nucleotide letters into a single triplet letter. the size of the triplet alphabet is therefore approximately the cube of the size of the singlet alphabet. the larger alphabet size allows a more complex substitution model such as the m0 codon model of goldman and yang [25]. triplet substitution models can prohibit stop codons, can make use of codon frequencies instead of nucleotide frequencies and can differentiate between synonymous and non-synonymous substitutions. triplet alphabets affect the alignment model as well as the substitution model by forcing indel lengths to be multiples of 3 singlet letters and by forcing indels to start and end between triplets. while the former is biologically realistic, the latter may not be. we describe a method of comparing triplet with singlet models to assess how forcing indels to begin and end between codons affects model fit. to accomplish this, we first remove the substitution benefits of the m0 model listed above to focus solely on the effects of the triplet alignment process. we construct a triplet substitution model that generates the same likelihood as a singlet substitution model given the same alignment. traditionally, both models are reversible and have a rate matrix q = {q_xy} that is constructed from the equilibrium letter frequencies π and a symmetric exchangeability matrix s = {s_xy} in the following way: q_xy = s_xy π_y^f / π_x^(1-f) for x ≠ y. the fraction f can vary from 0 to 1 but traditionally f is fixed to 1. the fraction specifies the relative importance of unequal conservation (f = 0) and unequal replacement (f = 1) in creating the equilibrium frequency distribution [29]. given a singlet nucleotide model with exchangeability matrix s^(s), we build a triplet model with exchangeability matrix s^(t) in the following fashion.
each allowable substitution from triplet α to triplet β involves only one nucleotide substitution from nucleotide i to nucleotide j; for such pairs we set s^(t)_αβ = s^(s)_ij, and s^(t)_αβ = 0 otherwise. we note that for branch lengths to agree between the singlet and triplet models, q^(t) must be scaled so that -Σ_α π_α q^(t)_αα = 3 instead of the usual 1, because q^(t) measures changes of each of the three sites in the triplet. we use hky as the singlet model in our comparison because the hky × 3 model is identical to the m0 codon model with ω = 1, stop codons included, and independent nucleotide frequencies. we analyze two data examples to demonstrate the advantages of joint bayesian estimation. while both data sets come from related genes, they differ in their sequence lengths, number of taxa, and sequence characteristics. we select these datasets for their relative sparseness of phylogenetic information, typical of rapidly evolving pathogens. thus, although the joint model makes full use of both indels and substitutions shared by descent, we do not expect to recover fully resolved trees. rather, we note substantial improvement over traditional, sequential methods. we first examine a data set drawn from siv, a non-human primate lentivirus. lentiviruses contain a single-stranded rna genome that is reverse transcribed into dna upon infection. the dna then inserts into the host genome before expression. reverse transcriptase is extremely error-prone, giving lentiviruses high mutation rates. the data set consists of 9 partial env sequences sampled from within a single macaque initially infected by injection with strain sivmac251 [31]. cheynier et al. (2001) have previously presented an alignment of these sequences as a typical example of phylogenetically informative indels in siv [11]. the env gene encodes glycoprotein gp160, which is split after translation to form the smaller glycoproteins gp120 and gp41.
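the singlet-to-triplet construction described above can be checked with a small sketch. it assumes the rate form q_xy = s_xy π_y^f / π_x^(1-f), hky-style exchangeabilities (κ for transitions, 1 for transversions), triplet frequencies equal to products of nucleotide frequencies, and a triplet exchangeability that is non-zero only for single-nucleotide changes. under these assumptions the triplet rate of a single-nucleotide change matches the singlet rate when f = 1/2, and the total triplet rate is three times the singlet rate. function names and numerical values are illustrative only.

```python
import itertools

NUCS = "acgt"
TRANSITIONS = {frozenset("ag"), frozenset("ct")}


def exchangeability(x, y, kappa):
    """HKY-style symmetric exchangeability between two nucleotides."""
    return kappa if frozenset((x, y)) in TRANSITIONS else 1.0


def rate(s_xy, pi_x, pi_y, f):
    """q_xy = s_xy * pi_y**f / pi_x**(1 - f)."""
    return s_xy * pi_y ** f / pi_x ** (1.0 - f)


def singlet_rates(pi, kappa, f):
    """Off-diagonal singlet rates, keyed by (from, to) nucleotide."""
    return {(x, y): rate(exchangeability(x, y, kappa), pi[x], pi[y], f)
            for x in NUCS for y in NUCS if x != y}


def triplet_rates(pi, kappa, f):
    """Off-diagonal triplet rates; only single-nucleotide changes are allowed."""
    triplets = ["".join(t) for t in itertools.product(NUCS, repeat=3)]
    freq = {t: pi[t[0]] * pi[t[1]] * pi[t[2]] for t in triplets}
    q = {}
    for a in triplets:
        for b in triplets:
            diffs = [k for k in range(3) if a[k] != b[k]]
            if len(diffs) == 1:
                k = diffs[0]
                s = exchangeability(a[k], b[k], kappa)
                q[(a, b)] = rate(s, freq[a], freq[b], f)
    return freq, q


def total_rate(freq, q):
    """Expected substitutions per unit time: sum_x pi_x * sum_y q_xy."""
    return sum(freq[a] * v for (a, _), v in q.items())
```

with f = 1 the frequencies of the two unchanged positions no longer cancel, so the triplet and singlet rates differ; this is consistent with the choice of f = 0.5 made for the comparison in the text.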
because gp120 and gp41 are displayed on the surface of mature virions, exposed to the host immune system, env tends to mutate more quickly than other siv genes through positive selection. from the data set, we remove a phylogenetically uninformative duplication in a single sequence because our model assumes insertions of random sequence but not duplications. all sequences then range in length from 57 to 69 nucleotides with an alignment length of 69 nucleotides independent of the method used to compute the fixed alignment. the data set contains 10 variable sites and 6 informative sites under the clustal w alignment, 12 variable and 7 informative sites under the muscle alignment, and 11 variable and 6 informative sites under the map estimate from the joint model (table 1 , -indel contribution). for a prior on ln κ, we assume a double-exponential distribution with median ln 2 and standard deviation . on ln λ, we assume a double-exponential distribution with median -5 and standard deviation . for ε we assume an exponential distribution with mean 5 on the expected indel length. we assume a uniform distribution over the topology τ. on the branch lengths we assume an exponential distribution with mean μ, and on μ we assume an exponential distribution with mean 0.04. continuous parameter estimates under the joint model are as follows: κ has median 2.4 with a 95% bayesian credible interval of (1.64,5.32). the median of ln λ is -3.4 and its 95% bci is (-4.99, -1.85). the median of ln ε is -0.71 with a 95% bci of (-1.12, -0.428). the mean branch length μ, has posterior median 0.0178 with a 95% bci of (0.00854,0.0368). to assess the usefulness of indel information and the importance of alignment ambiguity in phylogenetic inference, we compare the posterior topology distributions for the traditional sequential model, the joint model restricted to a fixed alignment, and the full joint model. 
we note that the joint model increases the number of resolved internal branches by 3, 2, and 2 at posterior probability (pp) > 0.9, > 0.95, and > 0.99, respectively, over the traditional model using the clustal w alignment. the joint model supports 4, 3, and 3 branches at these levels of posterior probability and we depict the tree with branches supported at pp > 0.99 in figure 3 . this increase in resolution is sensitive to the alignment estimation method. for example, the resolution increase changes to 0, 0, and 2 under the muscle alignment, and 1, 2, 2 under the joint map alignment. thus, even accounting for alignment uncertainty, we achieve an increase in the phylogenetic resolution. at high posterior probabilities indels become relatively more important because they are rarer than substitutions. we note that alignment ambiguity is significant in this data set. first, estimates under the traditional or restricted models are sensitive to the alignment method used (table 1). second, fixing the alignment under the joint model yields an increase in the number of supported branches if the alignment is fixed to the clustal w estimate or the joint map estimate, but a decrease if the muscle estimate is used. furthermore, the increased support when the clustal w alignment is used includes a branch that conflicts with the joint map model, and the conflicting branch is present in the guide tree. thus ignoring alignment ambiguity can lead to exaggerated support for branches and bias towards the guide tree, especially when indel information is used. figure 4 displays a "gold" plot [24] to summarize the posterior alignment distribution of a under the full joint model. we observe a high level of alignment uncertainty. this is borne out by the observation of only 4 unique indels under the full joint model, while the clustal w alignment contains 5 indels. 
this difference is reflected in the lower estimate of λ under the full joint model and in the restricted models not using the clustal w alignment (table 1).

figure 3 panels: (a) -indels; (b) +indels. tip labels in both panels: s1, s5, s9, s10, s11, s15, s16, s20, and ref; scale bar 0.02.

estimates of continuous parameters κ, ln λ, and ln ε are presented as a posterior median followed by a 95% bayesian credible interval.

figure 4. siv alignment uncertainty plot. we annotate the joint maximum a posteriori alignment estimate to indicate the approximate probability that each letter aligns to the root taxon in its column [24]. the 8 gaps in the alignment are a result of only 4 indel events under the joint model, whereas the clustal w alignment requires at least 5 indel events. colors other than red indicate that letters or gaps may shift to adjacent positions. the high frequency of the caa triplet is partially responsible for the level of alignment uncertainty. the color scale runs from uncertain to certain.
aaatcatcaacaacaacaacaacagcatcaacaacacc------aacatcaacaaagtcaataaacatg
s10 aaaccatcaacaacaacaacaacagcatcaacaacacc------aacatcaacaaagtcaataaacatg
s11 aaatcatcaacaataacaacaacagcaccaacaacaccaaatacaacatcaacaaagtcaataaacatg
s15 aaatcatcaacaacaacaacaacag---------caccaaatacaacatcaacagagtcaataaacatg
s16 aaatcatcaacaacaac---aacagcaccaacaccaacaaacacaacatcaacaaagtcaataaacatg
s20 aaatcatcaacaacaac---aacagcaccaacaccaacaaacacaacatcaacaaagacaataaacatg
s5 aaatcatcaacaacaacaacaacaa---------caccaagtacaacatcaacaaagtcaataaacatg
s9 aaatcatcaacaacaacaa---cac---------caccaagtacaacatcaacaaagtcaataaacatg

our second data set consists of comparatively longer sequences from hiv-1, a lentivirus closely related to siv. we consider a collection of 27 partial env gene sequences sampled serially at three time points from patient 1 reported by shankarappa et al. [4]. we first analyze these data using the m0 codon model [25] to assess the importance of selection in this region (table 2). we use the same prior distributions on λ, ε, κ, τ, and t as in example 1. we additionally place a double-exponential distribution on ln ω with median 0 and standard deviation 0.1. in addition to the standard m0 model in which f is fixed to 1, we consider the case in which f is a random variable with a uniform prior distribution. the posterior distribution of ω has median 0.996 and a 95% bci of (0.834, 1.20). this changes little when f is free. the estimated interval is quite close to the prior 95% bci of (0.84, 1.16), so we conclude that these data possess little information about ω. allowing ω to vary does not yield much benefit, and we henceforth consider only ω = 1. we also note that fixing f = 1/2 instead of the traditional value of 1 produces a decrease in marginal likelihood of 2 log units for the hky model and a substantial increase of [...]. we therefore assume f = 0.5 for the remainder of our analyses. under the hky model we find that κ has a posterior median of 7.2 with a 95% bci of (4.6, 11.7). the posterior median of ln λ is -3.3 with a 95% bci of (-4.1, -2.7), and ln ε has a median of -1.0 and a 95% bci of (-1.3, -0.78). this estimate of ε corresponds to a mean indel length of 1.58 nucleotides. the posterior median of μ is 0.0036 with a 95% bci of (0.00257, 0.00508). to examine the appropriateness of forcing indels to begin and end between codons, we compared the marginal likelihoods and posterior tree lengths for the hky singlet and hky × 3 triplet models. under both models, we fixed f = 1/2 for equivalence and set independent nucleotide frequencies to their empirical estimates. the log marginal likelihood is -1555.7 ± 0.3 for the singlet model and -1579.8 ± 0.3 for the triplet model (table 2). to examine the substantial decrease of 24.1 log units between models, we calculate the posterior distribution of parsimony tree lengths under both models.
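as a quick arithmetic check on two of the figures quoted above, the following sketch assumes that gap lengths follow a geometric distribution with extension probability ε, so that the mean gap length is 1/(1 - ε). this formula is an assumption on our part rather than something stated explicitly in the text, although it does reproduce the reported mean of 1.58 nucleotides from ln ε = -1.0. the function name mean_indel_length is hypothetical.

```python
import math


def mean_indel_length(ln_epsilon):
    """Mean of a geometric gap-length distribution.

    Assumes P(L = k) = (1 - eps) * eps**(k - 1) for k >= 1, so E[L] = 1 / (1 - eps).
    """
    eps = math.exp(ln_epsilon)
    return 1.0 / (1.0 - eps)


# posterior median of ln(epsilon) under the hky model, as reported in the text
hky_mean_length = mean_indel_length(-1.0)

# log marginal likelihoods reported for the singlet and triplet models
log_ml_singlet = -1555.7
log_ml_triplet = -1579.8
log_ml_difference = log_ml_singlet - log_ml_triplet

print(f"mean indel length: {hky_mean_length:.2f} nucleotides")
print(f"log marginal likelihood difference: {log_ml_difference:.1f} log units")
```

under these assumptions the script prints a mean indel length of 1.58 nucleotides and a difference of 24.1 log units, matching the values in the text.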
the posterior median tree length is 104 substitutions with a 95% bci of (103,106) for the singlet model and increases to 109 substitutions with a 95% bci of (108,110) for the triplet model. to verify that this increase results from forcing indels out of phase, we first calculate the posterior distribution of the number of indels under the singlet model. the posterior mean number of indels is 11.0 and the bci is (11, 11) . the posterior mean number of indels beginning 0, 1, or 2 nucleotides from the beginning of a codon is 2.6, 5.8, and 2.6 respectively. the 95% bci for the number of indels beginning inside a codon is (6, 10) . inspecting the alignment estimate from the map point using a "gold" plot demonstrates alignment uncertainty (data not shown). in the map alignment we observe 11 indel events. only 5 of the indels are consistently present with unambiguous phase, and none of these indels can be placed between codons. interestingly, one indel of 3 nucleotides occurs independently in clades (16, 17) , 19 , and 22 according to the map estimate ( figure 5 ). we note that augmented alignments such as those used in our model distinguish between indels shared by state and indel shared by descent through the inclusion of felsenstein wildcard sequences at internal nodes of τ. use of the triplet model instead of the singlet model has a discernible effect on phylogeny estimation. the posterior odds in favor of the clade (10, 12, 18) decrease by a factor of 8.0 from 24.6 to 3.1 ( table 2) . we note that in one column of the singlet map alignment estimate, only variants 10, 12 and 18 have an a residue, while other taxa have either a g residue or a gap (figure 5a) . however, the triplet model shifts these gaps out of the column to avoid breaking a codon. taxa that contain a gap in this column under the singlet alignment contain a residues according to the triplet map alignment, decreasing the support for (10, 12, 18) clade ( figure 5b) . 
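to make the change in support for the clade (10, 12, 18) easier to read, the sketch below converts the posterior odds quoted above into posterior probabilities using the standard relation odds = p / (1 - p). the helper name odds_to_probability is hypothetical, and the inputs are the rounded odds from the text, so the ratio is only approximate.

```python
def odds_to_probability(odds):
    """Convert posterior odds o = p / (1 - p) into the posterior probability p."""
    return odds / (1.0 + odds)


singlet_odds = 24.6  # clade (10, 12, 18) under the singlet model
triplet_odds = 3.1   # the same clade under the triplet model

singlet_probability = odds_to_probability(singlet_odds)
triplet_probability = odds_to_probability(triplet_odds)
odds_ratio = singlet_odds / triplet_odds

print(round(singlet_probability, 3), round(triplet_probability, 3), round(odds_ratio, 1))
```

with the rounded odds, the posterior probability of the clade falls from about 0.961 to about 0.756, and the ratio of the odds is about 7.9, in line with the reported factor of 8.0.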
thus, comparing marginal likelihoods for model selection between the singlet and triplet models may not provide the whole picture. triplet models have discernible effects on estimates of the indel parameters λ and ε, but little effect on the substitution parameters μ and κ. the hky × 3 model yields an estimate for ln ε that is significantly smaller than the hky estimate of -1.0. however, accounting for the fact that one triplet contains three nucleotides, the hky × 3 model predicts a mean indel length of 1.1 triplets, or 3.2 nucleotides, whereas the hky model predicts a mean indel length of 1.6 nucleotides. this may be because a geometric distribution on the number of nucleotides in a gap does not fit the data as well as a geometric distribution on the number of triplets in a gap. this is especially true in data sets such as the present one in which the number of triplets tends to be small. it may also be because the indel model used is fragment-based. finally, we note that estimates of 7.5 for κ in the hky × 3 model are quite similar to estimates of about 7.2 under the hky model. to assess how much indel information improves the resolution [...]. while the number of branches supported at pp > 0.9 is equal, not all supported branches are the same. the number of branches supported only under the joint model is 2, 2, and 2. the joint model supports the clades (16, 17) and (21, 24) over the traditional model at all three levels of pp. the traditional model supports the clade (19, 21, 24, 25) at a pp of 0.980 compared to 0.887 with indel information. the traditional model also supports the clade (10, 12, 15, 16, 18, 19, 21, 24, 25) at pp > 0.9, which has support < 0.5 when indel information is included. this occurs because the large clade conflicts with the clade (16, 17), which is supported by two shared indels. thus, the number of branches supported in only one of the two models at each level of pp is 4, 3, and 2.
since the joint model balances substitution and indel information as well as taking alignment ambiguity into account, we assume that these differences represent an improvement in the accuracy of estimation. however, because the true tree is not observed, we cannot be certain which, if any, of the predictions is correct. the partitions supported under the two models at pp > 0.99 are depicted in figure 6. in summary, indel information conflicts with one branch in the substitution-only tree and down-weights the evidence for another branch. the conflicting branch is ruled out by the support of 2 shared indels for the clade (16, 17), although one of these is homoplastic.

figure 5. triplet alignments may shift indels and cause misaligned residues. (a) maximum a posteriori (map) alignment estimate under the singlet hky model. (b) map alignment estimate under the triplet hky × 3 model. in the triplet alignment, two g residues (blue) and four a residues (red) are forced into a different column to avoid breaking the alignment-wide reading frame. the displaced a residues join a residues from strains 10, 12, and 18 (green), which were previously the only a residues in that column. under both models, the map alignment estimates display 8 gaps. the alignment of internal sequences (not shown) indicates that these gaps arose from 5 indel events on branches partitioning clades (20, 23), (21, 24), (16, 17), (19), and (22). thus, the gaps in sequences 19 and 22 arose independently of the gap in (16, 17) even though they have the same length and position. prefixes on sequence names indicate elapsed time in weeks between the initial infection and when the sequences were obtained.

we demonstrate that the novel mcmc transition kernel introduced in the sub-section sampling improves the computational efficiency of topology estimation when using indel information. the transition kernel improves the convergence properties of the markov chain substantially, so that fewer initial samples must be discarded as "burn-in". we compare the behavior of the estimation procedure when the new transition kernel is disabled (nni-only) or enabled (nni+spr) by running 15 instances of each chain starting from a randomly chosen tree and alignment. we use the data set from example 2, which consists of 27 hiv sequences with a maximum length of 612 nucleotides. to assess convergence for each markov chain, we count the number of iterations required for the sampled tree topology to approach its equilibrium distribution of tree topologies. to accomplish this task, we need to define a distance from a single tree topology to a distribution of tree topologies. we start with the robinson-foulds (rf) distance between two tree topologies, which we denote as $d_{rf}(\tau_1, \tau_2)$. this distance does not depend on branch lengths. we then define the distance $d(\tau_1, \xi)$ from a topology $\tau_1$ to a distribution of topologies $\xi$ as the average rf distance between $\tau_1$ and a tree $\tau_2 \sim \xi$:

$$d(\tau_1, \xi) = \mathrm{E}_{\tau_2 \sim \xi}\left[ d_{rf}(\tau_1, \tau_2) \right].$$

the expectation of $d(\tau_1, \xi)$ does not converge to 0 as the markov chain approaches stationarity; rather, the expectation approaches the average distance between two trees sampled from the equilibrium topology distribution. with this in mind, we consider a chain to have converged when the distance from the chain's current topology to the equilibrium distribution reaches the lower 25th percentile of distances from trees at stationarity to the equilibrium distribution. we approximate the equilibrium topology distribution with 200 topologies sampled at widely spaced intervals from a long-running mcmc analysis. we find that this distribution is not sensitive to the starting point of the markov chain, and does not change when the new transition kernel is enabled.
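the convergence measure described above can be sketched in a few lines. this is an illustrative sketch only, not the implementation used for the analyses: it assumes each unrooted topology is supplied as the list of taxon sets on one side of each internal branch, it estimates the distance to a distribution by a plain average over sampled topologies, and all function names are hypothetical. the convergence threshold (the lower 25th percentile in the text) is passed in by the caller rather than computed here.

```python
from statistics import mean


def canonical_splits(clades, taxa):
    """Turn the internal branches of an unrooted tree into a set of bipartitions.

    Each branch is given by the taxa on one of its sides. We always store the
    side that excludes a fixed reference taxon, so either side maps to the same key.
    """
    taxa = frozenset(taxa)
    reference = min(taxa)
    splits = set()
    for clade in clades:
        side = frozenset(clade)
        splits.add(taxa - side if reference in side else side)
    return splits


def rf_distance(splits_1, splits_2):
    """Robinson-Foulds distance: number of bipartitions present in only one tree."""
    return len(splits_1 ^ splits_2)


def distance_to_distribution(splits, sampled_splits):
    """Monte Carlo estimate of d(tau, xi): mean RF distance to topologies sampled from xi."""
    return mean(rf_distance(splits, other) for other in sampled_splits)


def iterations_to_convergence(chain, equilibrium_sample, threshold):
    """Index of the first sampled topology whose distance to the sample is <= threshold."""
    for iteration, splits in enumerate(chain):
        if distance_to_distribution(splits, equilibrium_sample) <= threshold:
            return iteration
    return None


# toy example on five taxa (not the hiv data)
taxa = "abcde"
tree_1 = canonical_splits([{"a", "b"}, {"d", "e"}], taxa)  # ((a,b),c,(d,e))
tree_2 = canonical_splits([{"a", "c"}, {"d", "e"}], taxa)  # ((a,c),b,(d,e))
tree_3 = canonical_splits([{"a", "b"}, {"c", "e"}], taxa)  # ((a,b),d,(c,e))

equilibrium_sample = [tree_1, tree_1, tree_3]
chain = [tree_2, tree_3, tree_1]
converged_at = iterations_to_convergence(chain, equilibrium_sample, threshold=1.0)
print(rf_distance(tree_1, tree_2), rf_distance(tree_2, tree_3), converged_at)
```

in this toy example the rf distances are 2 and 4, and the chain first falls within the threshold at index 2. because the distance uses only bipartitions, branch lengths play no role, as in the text.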
without the new transition kernel based on spr, the median time to convergence is 2112 iterations with an average of 1976.9. however, when the new transition kernel is enabled, the median time decreases to 66 iterations. to visualize the convergence properties of the two approaches, we project the tree samples from two typical chains into the plane using metric multidimensional scaling based on their rf distances (figure 7). some researchers question the ability of indel information to improve phylogenetic resolution of recently diverged taxa. golenberg et al. analyze non-coding spacer regions between chloroplast genes in a parsimony framework and claim that indels shared by state recur more often than substitutions shared by state [32], leading to a concern that indels are not reliable characters for phylogenetic analysis. however, simmons and ochoterena find indels to be reliable markers with low levels of homoplasy [33]. this contrast is partially explained by noting that the original golenberg study incorrectly codes overlapping gaps of different lengths as homologous, leading to false homoplasy. improved methods of coding indels when gaps overlap can lead to more accurate and more informative indel characters [33, 34]. in addition, researchers note that chloroplast intergenic spacers contain indel "hotspots" and that sequence duplications or changes in the number of tandem repeats occur at a significantly higher rate than non-repeat indels [32, 10]. this high rate can lead to identical but non-homologous insertions in different taxa, and so repeat indels experience higher homoplasy than non-repeat indels [9]. repeat indels should therefore be down-weighted, but unfortunately an appropriate weighting scheme has not yet been developed [35]. we also note that current alignment algorithms do not recognize duplications or indel hotspots, so that automatic alignments must be adjusted manually.
despite these difficulties with repeat indels, researchers have examined intergenic spacers in various plant species using improved indel coding and find that indel information is consistent with substitution information and largely reinforces it, improving phylogenetic resolution and support [9, 10] . in some analyses, indels are useful only in distinguishing larger groups [36] . despite the utility of indels in phylogeny estimation, most researchers note difficulties in indel coding that result from alignment ambiguity [35] . this can be true even when the number of substitutions is too small to yield well-resolved phylogenies. while alignment ambiguity causes general problems with gap placement, some specific problems are worthy of mention. for example, aligning insertions of questionable homology may create spurious evidence for common ancestry [35] . also, when the number of tandem sequence repeats decreases, it is unclear which repeat has been deleted. resolving these ambiguities to yield a single alignment can increase the support for some trees while decreasing the support for others, leading to bias, and so regions whose homology is uncertain should be thrown out [35] . the joint estimation approach we advocate sidesteps many of the issues through the assignment of uncertainty on alignments, indel existence and placement. although the indel model described here improves on common multiple alignment algorithms by allowing indels to be shared by descent, it has some limitations. first, the model assumes that the indel rate is spatially homogeneous. however, biological sequences contain indel "hotspots" where indels are more likely to occur as well as invariant regions where indels are prohibited. incorrectly accounting for rates at which indels occur in different regions can lead to over-weighting of the indel evidence. 
clustal w attempts to place indels in hydrophilic regions of amino acid sequences, but does not have a mechanism for locating hotspots in non-coding sequences or hotspots resulting from weak selection or positive selection. second, the indel model makes the common assumption that residues in a single sequence are never homologous. duplications violate this assumption and are treated as insertions of random sequence by the indel process. third, changes in the number of tandem repeats of a short sequence often occur at a higher rate than other indels via slipped-strand mispairing (ssm). however, no commonly used alignment program accounts for within-sequence homology or ssm. an improved stochastic process model that accounts for these properties of biological sequences is highly desirable in order to accurately weight shared indel evidence and to produce both more accurate alignments and phylogenies. we extend the joint bayesian estimation framework of redelings and suchard [24] for recently diverged siv and hiv sequences to incorporate indel information into phylogeny estimates. in both examples, the use of indel information increases the number of supported bi-partitions even though the branch lengths are small, especially at high posterior probabilities. while many indels in these data sets occur in a single taxon or on a branch supported by many substitutions, some indels occur on branches with few or no substitutions. the relative weight of indels and substitutions shared by descent is specified by the relative rate λ estimated from the data. this offers an improvement over existing methods that force the relative weight to be set a priori. alignment uncertainty is significant in the siv data set. this uncertainty is illustrated by the fact that the topology distribution under the traditional model varies significantly depending on the choice of alignment ( table 1) . 
the joint estimation framework does not suffer from this sensitivity to alignment choice and allows alignment uncertainty to be estimated (figure 4).

figure 7. alignment-aware spr transition kernel decreases burn-in time. we consider the 27-sequence data set of hiv sequences described in the results section as example 2. points represent 200 topologies sampled from markov chains with the alignment-aware spr transition kernel disabled (red; nni-only) or enabled (blue; nni+spr), or from the equilibrium distribution (green). while the convergence time for markov chains varies widely, this example illustrates the median convergence time. the nni-only chain takes 2112 iterations to converge versus only 66 iterations for the nni+spr chain. because the convergence times are so different, for the nni-only chain the figure depicts every 10th tree for the first 2000 iterations, whereas for the nni+spr chain it depicts every 2nd tree for the first 400 iterations. points represent trees projected onto the plane using multidimensional scaling based on the robinson-foulds distance. this distance depends only on the topology, not the branch lengths.

we note that including indel information in analyses exaggerates the bias that results from fixing a single alignment choice. the high level of alignment uncertainty in the siv data set is partially explained by a large number of occurrences of the triplet caa. we note that in the hiv data set alignment uncertainty does not significantly affect the topology posterior. models such as m0 assume that codons are unbreakable, but the hiv data set shows that this can be unrealistic. forcing indels to codon boundaries results in a decrease in model fit of 24.1 log units because of an increase in the number of inferred substitutions.
thus choosing a codon model over a singlet model involves a tradeoff between a substantially improved substitution model and a possibility of incorrect homology in the alignment. because the effects of the latter can be significant when the total number of substitutions is small, we welcome the development of an improved substitution model that does not force this tradeoff. such a substitution model would be able to calculate the likelihood of a singlet alignment while making use of codon frequencies and differentiating between synonymous and non-synonymous changes.

we begin by considering a probability distribution $\pi(x)$ on points $x \in \Omega$ and a function $f(x)$ that associates a subset of $\Omega$ to each point $x \in \Omega$. we call $f(x)$ a collapsing function if for any $x$ and $y$ in $\Omega$ we have $x \in f(x)$, and $f(x)$ and $f(y)$ are either identical or disjoint. if $f$ is a collapsing function, then it partitions $\Omega$ into a set of non-overlapping subsets, which we refer to as collapsed points. we denote the set of collapsed points as $f(\Omega)$, and note that the probability $\pi^*(f(x))$ of each collapsed point $f(x)$ can be naturally defined as the integral of the probabilities $\pi(y)$ of points $y \in f(x)$. because the collapsed points are disjoint sets, these probabilities sum to 1 and yield a probability distribution on collapsed points. we then consider a transition kernel $p$ on $\Omega$ that is defined in terms of a transition kernel $p^*$ on $f(\Omega)$. starting from the current point $x$, this transition kernel consists of collapsing $x$ to $f(x)$, moving to some other collapsed point $a$, and then selecting a point $y$ from $a$ in proportion to its probability $\pi(y)$. we note that $y \in a$ implies $a = f(y)$ and write the probability expression for this transition kernel as

$$p(x, y) = p^*(f(x), f(y)) \times \frac{\pi(y)}{\pi^*(f(y))}.$$

the condition for $p$ to satisfy detailed balance is

$$\pi(x) \times p^*(f(x), f(y)) \times \frac{\pi(y)}{\pi^*(f(y))} = \pi(y) \times p^*(f(y), f(x)) \times \frac{\pi(x)}{\pi^*(f(x))},$$

which, by cancelling common terms, simplifies to

$$\pi^*(f(x)) \times p^*(f(x), f(y)) = \pi^*(f(y)) \times p^*(f(y), f(x)). \qquad (13)$$

thus, the requirement for $p$ to satisfy detailed balance on $\Omega$ is simply that $p^*$ satisfies detailed balance on $f(\Omega)$.
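the argument above can be checked numerically on a small discrete example. the sketch below is purely illustrative and is not the alignment sampler itself: the six-point state space, the partition into blocks, and the choice of a metropolis kernel with a uniform proposal for the kernel on collapsed points are all assumptions made for the example, and the names p_star and p are hypothetical. exact fractions are used so that the equalities can be tested without rounding error.

```python
from fractions import Fraction

# toy target distribution pi on six points, in exact arithmetic
weights = [1, 2, 3, 4, 5, 6]
pi = [Fraction(w, sum(weights)) for w in weights]

# a collapsing function: every point belongs to exactly one block
blocks = [(0, 1), (2, 3, 4), (5,)]
block_of = {x: b for b, members in enumerate(blocks) for x in members}

# probability of each collapsed point is the total probability of its members
pi_star = [sum(pi[x] for x in members) for members in blocks]


def p_star(a, b):
    """Metropolis kernel on collapsed points, proposing uniformly among the other blocks."""
    n_other = len(blocks) - 1
    if a != b:
        return Fraction(1, n_other) * min(Fraction(1), pi_star[b] / pi_star[a])
    # probability of staying put is whatever mass is left over
    return 1 - sum(p_star(a, c) for c in range(len(blocks)) if c != a)


def p(x, y):
    """Collapse x, move between collapsed points, then pick y in proportion to pi(y)."""
    return p_star(block_of[x], block_of[y]) * pi[y] / pi_star[block_of[y]]


points = range(len(pi))
rows_sum_to_one = all(sum(p(x, y) for y in points) == 1 for x in points)
detailed_balance = all(
    pi[x] * p(x, y) == pi[y] * p(y, x) for x in points for y in points
)
print(rows_sum_to_one, detailed_balance)
```

both checks hold exactly in this example: the kernel on the full space is a valid transition kernel and inherits detailed balance from the kernel on collapsed points.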
we now demonstrate that the function $f$ that maps $(a, \tau, t, \theta, \lambda)$ to $(c_2(a, \tau, po), \tau, t, \theta, \lambda)$ is a collapsing function. the directed branch po partitions the nodes of τ into two subsets excluding node o (figure 2). the set $c_2(a, \tau, po)$ contains all alignments that are consistent with a on each of the two subsets. alignment a certainly fulfills this criterion, and therefore $a \in c_2(a, \tau, po)$, implying that $x \in f(x)$ for any $x$. in addition, $c_2(a', \tau, po) = c_2(a, \tau, po)$ for any $a'$ in $c_2(a, \tau, po)$, and so $f(y) = f(x)$ for any $y \in f(x)$, implying that $f(x)$ and $f(y)$ are either identical or non-overlapping. therefore $f(x)$ is a collapsing function. the transition kernel consisting of spr proposals for points collapsed using $c_2(a, \tau, po)$ therefore satisfies detailed balance when we use the mh rule for acceptance or rejection, since mh satisfies detailed balance on the collapsed points. our method for sampling alignments samples from a distribution η that approximates the correct distribution π but does not match it exactly [24]. we therefore define an mh transition kernel that uses collapsed sampling of alignments as a proposal distribution ρ. after selecting a new topology and the alignment that goes along with it, we reject this new point j and move back to the original alignment and topology i with a small probability $1 - \alpha_{ij}$. the mh acceptance ratio can be calculated as follows:

$$\alpha_{ij} = \min\left(1, \frac{\pi_j \, \rho_{ji}}{\pi_i \, \rho_{ij}}\right).$$

the $\rho_{ij}$ satisfy detailed balance with respect to another probability $\eta_i = \pi_i f_i$.
thus, $\pi_i f_i \, \rho_{ij} = \pi_j f_j \, \rho_{ji}$. therefore the acceptance ratio is: $\alpha_{ij} = \min\left(1, f_i / f_j\right)$. the distribution $f_i$ is proportional to the product of the length distributions on the internal nodes and changes very slowly in i. therefore $f_i / f_j$ is usually quite close to one and there are few rejections.

to assess alignment ambiguity we compare the posterior topology distribution for the full joint model to the distribution generated under models restricted to a fixed alignment. as these distributions may be sensitive to the specific alignment chosen, we use three different choices. these alignments are the estimates obtained from clustal w [37], muscle [38], and bali-phy [24]. in the latter case, we fix the alignment to its maximum a posteriori (map) point determined jointly. we use the default parameters for clustal w and muscle. parameters and models used by bali-phy are described in the results section.

the inference method described in this paper and implemented in the bali-phy software [24] requires significant computation time in order to handle alignment uncertainty and incorporate indel information. this means that it is often impractical to analyze data sets with more than 12 taxa or sequence lengths longer than about 750 letters (nucleotide, amino acid, or codon). analyzing data sets of this size often takes about a week on current hardware. however, we wish to emphasize two points. first, the long computation time is not required to make a simple estimate, but to obtain measures of confidence that are accurate enough to publish. for simple estimates or unpublished results, significantly larger data sets can be analyzed. second, the amount of time required to analyze a data set depends not just on the size of the data set, but on various characteristics such as the level of uncertainty. for example, the second example in this paper contains 27 taxa of maximum length 612 and took about 3 weeks to analyze.

[1] origin of hiv-1 in the chimpanzee pan troglodytes troglodytes
[2] the causes and consequences of hiv evolution
[3] integrating ambiguously aligned regions of dna sequences in phylogenetic analyses without violating positional homology
[4] consistent viral evolutionary changes associated with the progression of human immunodeficiency virus type 1
[5] phylogenetic analysis of polyomavirus simian virus 40 from monkeys and humans reveals genetic variation
[6] timing and reconstruction of the most recent common ancestor of the subtype c clade of human immunodeficiency virus type
[7] evidence that eukaryotes and eocyte prokaryotes are immediate relatives
[8] rare genomic change as a tool for phylogenetics
[9] the evolution of the atpβ-rbcl intergenic spacer in the epacrids (ericales) and its systematic and evolutionary implications
[10] molecular evolution of insertions and deletions in the chloroplast genome of silene
[11] insertion/deletion frequencies match those of point mutations in the hypervariable regions of the simian immunodeficiency virus surface envelope gene
[12] long-term follow-up study of core gene deletion mutants in children with chronic hepatitis b virus infection
[13] in vivo dynamics of equine infectious anemia viruses emerging during febrile episodes: insertions/duplications at the principal neutralizing domain
[14] epidemiology consortium: molecular evolution of the sars coronavirus during the course of the sars epidemic in china
[15] reassortment and insertion-deletion are strategies for the evolution of influenza b viruses in nature
[16] alignment-ambiguous nucleotide sites and the exclusion of systematic data. molecular phylogenetics and evolution
[17] elision: a method for accommodating multiple molecular sequence alignments with alignment-ambiguous sites
[18] freeing phylogenies from artifacts of alignment
[19] frequency of insertion-deletion, transversion, and transition in the evolution of 5s ribosomal rna
[20] iterative pass optimization of sequence data
[21] the order of sequence alignment can bias the selection of tree topology
[22] an evolutionary model for maximum likelihood alignment of dna sequences
[23] inching towards reality: an improved likelihood model of sequence evolution
[24] joint bayesian estimation of alignment and phylogeny
[25] a codon-based model of nucleotide substitution for protein-coding dna sequences
[26] mathematical and statistical methods for genetic analysis
[27] subtree transfer operations and their induced metrics on evolutionary trees
[28] monte carlo strategies in scientific computing
[29] a novel use of equilibrium frequencies in models of sequence evolution
[30] dating of the human-ape splitting by a molecular clock of mitochondrial dna
[31] wain-hobson s: antigenic stimulation by bcg vaccine as an in vivo driving force for siv replication and dissemination
[32] evolution of a noncoding region of the chloroplast genome
[33] gaps as characters in sequence-based phylogenetic analyses
[34] incorporating information from length-mutational events into phylogenetic analysis
[35] the evolution of the non-coding chloroplast dna and its application in plant systematics
[36] indel patterns of the plastid dna trnl-trnf region within the genus poa (poaceae)
[37] clustal w: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position specific gap penalties and weight matrix choice
[38] muscle: multiple sequence alignment with high accuracy and high throughput

we would like to thank vladimir minin for many helpful discussions. b.d.r. is supported by nsf training grant dge9987641 and nih training grant gm008185. m.a.s. is supported in part by nih grant gm068955, the ucla aids institute and the james b. pendleton charitable trust.

ms formulated the problem and provided project management.
br designed the algorithms and models. br performed the actual programming and computations. ms and br analyzed the data. ms and br wrote the paper. all authors read and approved the final manuscript.

key: cord-321762-7kiahjyy authors: nandy, ashesh title: chapter 5 the granch techniques for analysis of dna, rna and protein sequences date: 2015-12-31 journal: advances in mathematical chemistry and applications doi: 10.1016/b978-1-68108-053-6.50005-3 sha: doc_id: 321762 cord_uid: 7kiahjyy abstract: the very rapid growth in molecular sequence data from the daily accretion of large gene and protein sequencing projects has led to issues regarding viewing and analyzing the massive amounts of data. graphical representation and numerical characterization of dna, rna and protein sequences have exhibited great potential to address these concerns. we review here in brief several different formulations of these representations and examples of applications to diverse problems, based on what this author had presented at the second mathematical chemistry workshop of the americas in bogota, colombia in 2010. in particular, we note several insights that were gained from such representations, and the applications to the bio-medicinal field.

my first brush with a dna sequence, in around 1990, left me totally puzzled: i could not "see" nor get a "feel" of anything noteworthy in the apparent jumble of characters that symbolized a dna, not least because i had never studied biology myself. my background was physics, and i began a search for, to me, a more meaningful exposition of the sequence of characters that represented the dna sequence.
my studies led me to appreciate and anticipate the immense potential opening up with the sequencing of genome-length sequences and the concomitant need for rapidly scanning and analyzing dna sequences for matters of interest [1], and to get excited at the new insights being gained from a global perspective of the dna sequences: jeffrey [2] had shown through his chaos game representation that such sequences had an inherent fractal nature; peng et al. [3] speculated that dna sequences had long-range correlations, an observation that raised a storm of papers in very short order; and voss [4] showed that long-range fractal correlations existed in dna sequences with the degree of correlation varying with evolutionary divergence. but a close-up look at a dna sequence and how the bases were distributed along it still lacked an appealing representation. experimenting with various formats, i determined that a 2d graphical representation, as explained later in this chapter, was what i could relate to on a purely personal basis. after many graphs of various sequences and discussions with some eminent persons to ensure that such a simple stratagem was not already familiar to cutting-edge biologists, i published a paper on it in current science (bangalore) in 1994 [5]. imagine my consternation when i was informed soon after that gates had already anticipated such a device, albeit with different axes assignments, way back in 1986 [6], but which seemed to have been in limbo since! a short note had to be published soon after informing of this oversight and explaining the differences, although both used a cartesian co-ordinate system to plot the graphs [7]. however, a physics background demands some quantitative appraisal of whatever nature has to offer. i had observed certain similarities and changes in plots of conserved gene sequences of various species, but coming up with some way to measure the changes posed difficulties with these plots of discrete numbers.
i had done some number crunching with individual gene segments like introns and exons [8, 9], but now the need was for whole sequences, for which we came up with a geometrical interpretation to describe in general a macro-molecular sequence and measure sequence differences. we presented our scheme at the first indo-us workshop on mathematical chemistry in shantiniketan, west bengal, india in 1998 [10] where we reported, as stated in the abstract, that "geometrisation of macromolecular sequences in the form of a graphical representation provides one … technique where the nucleotides in a gene sequence can be viewed as objects in a 4-dimensional space; the method can be extended, in principle, to include, say proteins, in a 20-dimensional space. we have found a reduced 2-dimensional representation of dna sequences very useful in studies of nucleotide distribution and composition. … we here propose a new measure of the dispersion of dna graphs that can be used to quantify the differences between two or more graphs of genes of various organisms … it also appears that once standardized the proposed scheme may help study molecular phylogeny in evolutionary time scale." although the participants in the shantiniketan workshop included stalwarts in the field like prof. milan randić, prof. haruo hosoya, prof. paul mezey and others, our scheme did not seem to evoke any response, not surprising since they apparently did not know about dna issues. but prof. subhash basak of the university of minnesota, usa, co-organiser of the workshop, was intrigued enough by our work and its potential to describe dna sequences through graph invariants to meet me in kolkata after the workshop to discuss the possibility of using invariants for dna sequences as descriptors. subsequently prof. basak invited me the following summer to duluth to carry out further research on dna mathematical descriptors in his group funded by the natural resources research institute (nrri). prof.
milan randić and some other distinguished scientists were also invited there to begin work on dna descriptors in a project funded by nrri. it began with a talk i gave at the university of minnesota, duluth about my work on mathematical descriptors of dnas arising from my graphical representation method. among the attendees was prof. milan randić who, with prof. basak, immediately saw the potential for converting a dna sequence graph to a matrix and thereby extracting numerical invariants, which could be a more meaningful way to characterize dna sequences. we then collaborated on a proposal for a 3d graphical representation and a matrix method for extracting graph invariants for the first exons of beta globin sequences of several species. this was published in 2000 in the journal of chemical information and computer sciences [12] and led very soon to a whole host of papers on dna graphical representations, numerical characterisations and their applications, which continues still as more and more areas keep opening up and a new field of research seems to have begun. this review is a brief introduction for readers to this new and exciting field of research on graphical representation and numerical characterization (granch) of bio-molecular sequences, based on the talk i presented at the second mathematical chemistry workshop of the americas in bogota, colombia, in july 2010 [13]. some of the various applications made to date using these techniques are also covered briefly, with special emphasis on our recent work that provides a possible approach to anti-viral vaccine design that could be expected to be less susceptible to invalidation through mutational changes in the viral proteins. more details can be found in the several reviews [14] [15] [16] [17] [18] and book chapters [19] [20] [21] that have appeared on the subject, and of course there are always the original papers. (note added in proof: see also bibliography in ref [82].)
as sequence data on long stretches of dna began to become available in the late 1980s, there arose a problem of how to view them and a curiosity to know whether any systematics lay hidden in the apparently random arrangement of characters representing the bases in the sequence. h j jeffrey [2] came up with the idea of plotting them in a square grid where the four corners were identified with the four bases a, c, g, t. the algorithm was to start from the center of the square and for the first base plot a point midway between the center and the home corner of the base. for the second base he started from the point representing the first base and plotted a point midway between it and the home corner of the second base. continuing in this way filled up the square with a series of points until the entire sequence was plotted. this diagram he called the chaos game representation (cgr) of the dna sequence. he noticed that different groups of organisms showed different patterns: double scoop depletion regions for vertebrates, striped patterns for plant sequences, an apparently random distribution for bacterial genomes. overall, each sub-square section of the cgr pattern seemed a replica of the whole, i.e. dna sequences had properties of self-similarity or fractal nature. the cgr diagrams of various sequences were investigated by several researchers to find different properties of biological interest. burma et al. [22] showed that the structures observed in cgr diagrams arise from skews in base composition and the presence of repetitive sequences or specific motifs. dutta and das [23] reported that a cgr plot can be reproduced by suitable algorithms by manipulating different combinations of strings of bases with appropriate frequencies. thus, the double scoop depletion patterns seen in vertebrate cgrs arise from scarcity of cg dinucleotides, and so on. baranidharan et al.
[24] developed quantitative methods to generate similarity/dissimilarity maps of genomic sequences and showed that for certain mitochondrial genomes species-wise characteristic features could be seen when nucleotide stretches of 7 or more bases at a time were analysed. in a slightly different vein, peng et al. [3] considered the structure of the bases in a dna sequence and analysed them on the basis of their being pyrimidines (c,t) or purines (a,g) only. on an x-y graph where the x-axis counted the nucleotide number, they plotted the appearance of the bases in a sequence by taking a step diagonally upwards if it was a purine or downwards if it was a pyrimidine to the next nucleotide number. plotting the whole sequence step by step in this manner they generated a graph with an irregular up-down structure which they called a "dna landscape". by taking subsections of the graph they found that the subsections also looked similar to the up-down structure of the whole, and the same was true of sub-sub-sections and so on, showing that the purine-pyrimidine structure of a dna sequence had self-similarity, which was what jeffrey had remarked two years earlier. peng et al. then searched for possible correlations by estimating frequencies of different lengths of nucleotide stretches and found that all gene sequences with the mosaic structure of introns and exons had long-range correlations whereas intronless genes did not show this feature. the implications of such an observation, on the face of it, are huge: the beginning of the dna sequence should, theoretically, know what the end would be like! such an observation quite naturally led to a storm of papers on the subject until it quietened down after the observation that since dna sequences are known to elongate by duplicating long stretches of subsequences, it was possible that such sequences showed apparent long-range correlations.
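the two constructions described above, the corner-midpoint plot of jeffrey and the purine/pyrimidine landscape of peng et al., can be sketched in a few lines. this is only an illustrative sketch: the placement of the four bases on particular corners of the unit square and the unit step height are assumptions, since the text fixes neither, and the function names are hypothetical.

```python
# Illustrative sketch only. The corner assignment below is an assumption;
# the text only says the four corners are identified with a, c, g, t.
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def cgr_points(seq):
    """Corner-midpoint plot: each point lies midway between the previous
    point and the home corner of the current base."""
    x, y = 0.5, 0.5  # start from the center of the unit square
    points = []
    for base in seq.upper():
        if base not in CORNERS:
            raise ValueError(f"unexpected base: {base!r}")
        cx, cy = CORNERS[base]
        x, y = (x + cx) / 2.0, (y + cy) / 2.0
        points.append((x, y))
    return points


def dna_landscape(seq):
    """Purine/pyrimidine walk: one step up for a purine (a, g),
    one step down for a pyrimidine (c, t). Returns the height after each base."""
    height = 0
    heights = []
    for base in seq.upper():
        if base in "AG":
            height += 1
        elif base in "CT":
            height -= 1
        else:
            raise ValueError(f"unexpected base: {base!r}")
        heights.append(height)
    return heights
```

for instance, `dna_landscape("AGCT")` climbs two steps for the purines and then descends two steps for the pyrimidines, returning to zero.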
around the same time voss [4] conducted a rigorous analysis of 25000 dna sequences with over 50 million bases covering organisms of all classes to search for long-range correlations. using a spectral density function analysis he concluded that "(a) long range fractal correlations exist in dna sequences, (b) the degree of correlation as measured by a spectral exponent varies systematically with evolutionary category and (c) short range periodicities of period 3 are prominent while other periods, e.g. 9, are also present. the fractal correlations have been seen to extend over long ranges of nucleotide positions, with the smallest for phage and bacteria and extending to over 100,000 bases for the higher classes" [14]. to get a feel for the actual distribution of bases along a dna sequence one needs a more direct graphical depiction than what the abstract representations of jeffrey [2] or peng et al. [3] can offer. this problem was addressed many years ago by hamori and ruskin [25] with their proposal for a 3-dimensional graphical representation of a dna sequence. they proposed a hypothetical square on the xy plane with four corners (nw, ne, se, sw) identified with the four bases a, c, g, t and the nucleotide number to be counted along the z-axis. thus for a dna sequence like acggt, one would plot a point on the a-corner at z-coordinate 1, then draw a line to the next base, c in this case, in its corner with z now equal to 2, and so on. for a sequence like acgtacgtacgt this would generate a spiral around the z-axis; in case there was a preponderance of one base or the other the curve would flow along those corners. these curves the authors called h-curves. visualizing such a 3d image on the 2d plane of the paper is admittedly difficult. however, the authors suggested that drawing two such curves at slightly different angles would allow stereoscopic vision so that the dna could be seen in 3d.
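as a rough illustration of the h-curve construction just described, the sketch below maps each base to a corner of a square in the xy plane and counts the nucleotide number along z. the numerical corner coordinates, and the reading that a, c, g, t sit at nw, ne, se, sw respectively, are assumptions made for the example; `h_curve` is a hypothetical name.

```python
# Assumed corner coordinates: a = nw, c = ne, g = se, t = sw on a square
# of side 2 centred on the z-axis. Only the ordering comes from the text.
H_CORNERS = {"A": (-1, 1), "C": (1, 1), "G": (1, -1), "T": (-1, -1)}


def h_curve(seq):
    """Return the (x, y, z) vertices of the curve, with z equal to the
    1-based nucleotide number."""
    points = []
    for z, base in enumerate(seq.upper(), start=1):
        x, y = H_CORNERS[base]
        points.append((x, y, z))
    return points
```

with this assignment the sequence acgtacgt visits the four corners in cyclic order while z increases, which is the spiral around the z-axis mentioned above.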
taking the bacteriophage m13 as an example they showed that in their representation they could easily identify regions of sharp changes in base composition through visualization that would be difficult to determine from the normal character representation. this author's search for a meaningful display of dna sequence information led him to propose a 2-dimensional graphical representation where the four cardinal directions are associated with the four bases [5]. the method is to take a walk in the negative x-direction if there is an adenine in the sequence, in the positive y-direction for a cytosine, positive x-direction for a guanine and the negative y-direction for thymine. proceeding to walk in succession in the appropriate direction in the order of the bases making up the particular dna sequence generates a path that visually depicts the arrangement of bases in the sequence. these dna plots were found to be characteristic of the types of gene sequences, and the same genes from different species showed almost the same pattern. since we know that specific genes from different species have significant homology, and in fact that is how new genes are often recognized, it is not surprising that their graphical plots show basically the same shape. it was found later that gates [6] had already proposed a similar scheme for depicting gene sequences, although his assignment of bases was different from nandy's scheme. a year later leong and morgenthaler [26] independently proposed another 2d scheme, where the base assignments were again different from the two just mentioned. on a 2d cartesian co-ordinate system the assignments of the bases with the cardinal directions in the three schemes are, starting from the negative x-direction and going clockwise, gtca (gates), acgt (nandy) and ctag (leong and morgenthaler).
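the three axis assignments listed above translate directly into a small lookup table. the sketch below is a minimal illustration of the walk, with the direction vectors read off from the clockwise ordering (negative x, positive y, positive x, negative y); the function and dictionary names are hypothetical.

```python
# Unit steps for each base, read clockwise from the negative x-direction:
# gtca (gates), acgt (nandy), ctag (leong and morgenthaler).
SCHEMES = {
    "nandy": {"A": (-1, 0), "C": (0, 1), "G": (1, 0), "T": (0, -1)},
    "gates": {"G": (-1, 0), "T": (0, 1), "C": (1, 0), "A": (0, -1)},
    "leong_morgenthaler": {"C": (-1, 0), "T": (0, 1), "A": (1, 0), "G": (0, -1)},
}


def walk_2d(seq, scheme="nandy"):
    """Return the path of the 2d walk, starting from the origin."""
    steps = SCHEMES[scheme]
    x, y = 0, 0
    path = [(x, y)]
    for base in seq.upper():
        dx, dy = steps[base]
        x, y = x + dx, y + dy
        path.append((x, y))
    return path
```

for example, `walk_2d("ACGT")` traces a unit square and returns to the origin, while the same sequence under the gates assignment follows a different path.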
it is interesting to note that these three axis assignments exhaust all possible 2d schemes of this type, and these can be seen to be like 2d projections of the hamori-ruskin h-curves. the 2d plots can be scaled to accommodate from the largest to the smallest dna sequences depending on the level of detail one wishes to observe. illustration 3 in reference 27 depicts the 73326 base long human beta globin sequence that contains the beta, eta, delta and the gamma globins of less than 2000 bases each, which can individually be plotted on a smaller scale. plots such as these provide a quick estimation of base composition and distribution along a dna sequence. an inspection of the human beta globin sequence graph shows that it has two sections that are mainly a-t rich with one part in between that is t-dominated. a plot of a sequence like the chicken myosin heavy chain gene, represented in illustration 2 (loc. cit.), shows that it also is at-rich; from the angle it makes with the axes, it is evident that the sequence is dominated by a larger percentage of t's than a's, and likewise one can determine the preponderance of structures like $a_m t_n$ from inspections of such plots. further applications of these graphs are taken up later in this chapter. the 2d representations, however, suffer from degeneracy in that nucleotide pairs, like ag or ct in the nandy scheme, retrace the same step so that only one step is seen instead of two. bielińska-wąż et al. [28] have shown that this can be accounted for by a mathematical method of using a weight parameter for each visit to the same location, but a number of researchers have proposed different ways to represent dna sequences graphically that reduce or remove this degeneracy.
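the degeneracy mentioned above is easy to see numerically. the sketch below, using the nandy axis assignment, counts how often each lattice point is visited; it shows that ag and agag occupy exactly the same points and differ only in the number of visits. the visit counter is only meant to illustrate why per-location weights carry extra information; it is not the specific weighting of bielińska-wąż et al., and the names are hypothetical.

```python
from collections import Counter

# nandy assignment: a = -x, c = +y, g = +x, t = -y
NANDY = {"A": (-1, 0), "C": (0, 1), "G": (1, 0), "T": (0, -1)}


def visit_counts(seq):
    """Count how many times the walk lands on each lattice point
    (the starting origin itself is not counted as a visit)."""
    x, y = 0, 0
    counts = Counter()
    for base in seq.upper():
        dx, dy = NANDY[base]
        x, y = x + dx, y + dy
        counts[(x, y)] += 1
    return counts


# Two different sequences that draw the same picture:
same_points = set(visit_counts("AG")) == set(visit_counts("AGAG"))
```

here `same_points` is true even though the sequences differ in length, whereas the visit counts (1 versus 2 per point) still tell them apart.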
an extensive coverage of these methods can be found in nandy, harle and basak [16], but we may mention here that one of the first proposals to reduce the degeneracy was the scheme of guo, randić and basak [29] where the unit vectors for the four bases were aligned at a small angle to the cardinal directions. yau et al. [30] used a two-quadrant representation in 2d space where a,g were inclined to the x-axis in the fourth quadrant and t,c were inclined to the x-axis but in the first quadrant, and the nucleotide count was recorded along the x-axis; this generated a dna graph extending in the positive x-direction and had no degeneracy. he et al. [31] proposed to characterize a dna sequence by its chemical (amino, keto), structural (purine, pyrimidine) and bond-strength (weak, strong) properties and plotted this set of three reduced sequences as characteristic curves that extended along the x-axis with nucleotide number, thus avoiding degeneracy altogether. randić proposed several constructs, among them a 4 horizontal line scheme [32] where the four bases were plotted in order of the sequence along four lines parallel to the x-axis and placed unit distance apart while the nucleotide number was again counted along the x-axis, a compact "worm curve" representation [33], four-color maps [34] and "spectrum-like" curves [35], among others, which reduced or eliminated the degeneracy inherent in the classical 2d approach. 3d and higher dimensional representations have been proposed to more faithfully reproduce the features of a dna sequence or enable more accurate calculations. hamori and ruskin [25] had originally proposed a 5d model where the four bases were plotted in four dimensions and the fifth was for nucleotide count, but since this was difficult to visualize they had moved to the 3d h-curve representation. 3d representations and variations were also proposed by randić, vracko, nandy and basak [36], and li and wang [37] to name a few.
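of the degeneracy-free constructs mentioned above, the four horizontal line scheme is the simplest to write down. the sketch below is a minimal reading of that description: the nucleotide number runs along x and each base sits on its own horizontal line. which base goes on which line is an assumption made here for illustration, and `four_line_points` is a hypothetical name.

```python
# Assumed ordering of the four lines (unit distance apart); the text does
# not say which base is placed on which line.
LINE_OF_BASE = {"A": 1, "C": 2, "G": 3, "T": 4}


def four_line_points(seq):
    """Return (nucleotide number, line height) for every base in the sequence."""
    return [(i, LINE_OF_BASE[base]) for i, base in enumerate(seq.upper(), start=1)]
```

because x increases by one at every step, no point is ever revisited, so a sequence such as agagag that retraces itself in the classical 2d walk yields six distinct points here.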
a 4d method was proposed by chi and ding [38], a 6d method by liao and wang [39] and an 8d method by liu and wang [40]. the interested reader can refer to the reviews [e.g. ref 16] and the literature for details of these interesting developments. thus, the study of dna sequences is facilitated in many ways by graphical representation, but making intra- and inter-sequence comparisons becomes meaningful only when the similarities and differences can be quantified in some manner. the difficulty is that since the graphical plots are composed of a set of discrete points, one has to apply either novel geometrical methods or use graph theory where the points are considered as nodes and the connections between the nodes as edges. we describe below first the geometrical methods and then the graph theoretic methods. two techniques were devised, one for intra-sequence comparison and another for inter-sequence comparison. for the variations within a sequence arising from the base distribution, we had observed [27] that coding regions of mammalian gene sequences appeared as a dense cluster of points in the 2d graphical representations, implying a high degree of mixing of the four bases in almost equal proportions, whereas the non-coding regions that were a-t or g-c rich usually appeared as long filaments. we therefore devised a cluster density measurement by enclosing such regions in a square grid and dividing the number of points in the grid by the area of the square. this was complemented by an inverse displacement method and a fractal coefficient method to numerically assess the differences between these two types of regions.
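a minimal sketch of the cluster density idea described above follows. several details are assumptions not fixed by the text: the square is taken as the smallest axis-aligned square enclosing the region, each lattice point is treated as occupying a unit cell (so that the side is the largest extent plus one and the density cannot exceed 1), and only distinct points are counted. the function names are hypothetical.

```python
# nandy assignment: a = -x, c = +y, g = +x, t = -y
NANDY = {"A": (-1, 0), "C": (0, 1), "G": (1, 0), "T": (0, -1)}


def nandy_walk(seq):
    """Points of the 2d walk, including the starting origin."""
    x, y = 0, 0
    path = [(x, y)]
    for base in seq.upper():
        dx, dy = NANDY[base]
        x, y = x + dx, y + dy
        path.append((x, y))
    return path


def cluster_density(points):
    """Distinct points divided by the area of the enclosing square.
    Assumption: side = largest extent + 1, i.e. one unit cell per lattice point."""
    distinct = set(points)
    xs = [p[0] for p in distinct]
    ys = [p[1] for p in distinct]
    side = max(max(xs) - min(xs), max(ys) - min(ys)) + 1
    return len(distinct) / float(side * side)
```

under these assumptions a well-mixed stretch such as acgt fills its enclosing square completely (density 1.0), whereas a filament such as aaaa gives 5 points in a 5 by 5 square (density 0.2), mirroring the contrast between dense clusters and long filaments.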
analysis of 386 introns (noncoding regions of a gene) and exons (coding regions) of 35 genes from various species by these measures showed [27] that (a) the cluster density of non-coding regions is very small and falls off exponentially, (b) the cluster density of coding regions grows to about 0.8 per unit area and falls off gradually, (c) exons of evolutionarily later genes have higher cluster densities, (d) cluster densities of intronless genes like the phage m13 genome or the bacteriophage lambda are very low, closely paralleling intron densities and (e) more recent genes show greater fragmentation and smaller lengths of the exons. the cluster density measure also enabled us to propose a way of predicting protein coding regions in new dna sequences [41] and was used to analyse the human chromosome 3 contig 7 and predict the existence of several genes [42]. gates [6] had proposed a manhattan distance computation to compare two or more sequences, but this method is suitable only for equal length sequences, whereas gene sequences are not generally of equal lengths. to study similarities and dissimilarities of genes from various species we devised a new and different methodology, which was reported for the first time in the first indo-us workshop 1998 [10] and published the following year [11], as mentioned earlier. since we have in the 2d graphical representation a set of discrete points comprising each gene sequence, we defined a function to describe the sequence in terms of a series of terms $s_0, s_1, s_2, s_3, \ldots$, where $s_0$ is the zeroth-order term representing the coordinates $x_f, y_f$ of the end point, $s_1$ is the linear term representing the first-order moments about the two axes, $s_2$ the second-order term representing the variance about the mean, $s_3$ the third-order term representing the skewness, etc., all of which taken together became a descriptor for the sequence.
for the initial presentation we computed the first-order moments as the weighted center of mass only and defined a graph radius, $g_r$, the distance of the weighted center of mass from the origin, for each sequence and a $\Delta g_r$ to estimate the difference between two sequences plotted on the same scale; this scheme gave a reasonable fit to the dispersion of the beta globin genes from various species [11]. because of the cumulative nature of the sequence plots, differences in base distributions will lead to progressively increasing differences in the plots. closely related sequences with fewer mutational changes between them will have smaller $\Delta g_r$ while unrelated sequences can be expected to lead to larger values of $\Delta g_r$. as remarked by the authors, this method could clearly be generalized to apply to the case of protein and other sequences where one may represent the sequences in a multidimensional hyperspace with a view to eventually developing phylogenetic trees. these techniques have been used by several authors (e.g., [43, 48]). bielińska-wąż et al. [28] have computed the moments to various higher orders in a 2d dynamic graph with statistical moments of mass-density distributions as new descriptors. computing the moments for a set of histone genes, they showed that the larger number of descriptors improved the characterization of the object and different aspects of the dna could be compared separately while retaining the simplicity of the 2d graphs. nandy and nandy [44] showed that the $g_r$ values were quite sensitive measures where base composition or base arrangement differences caused the $g_r$ to change and that two or more sequences will not have the same $g_r$ value except in some pathological cases. the graph theoretic method arose out of deliberations after the first presentation of the 2d graphical representations in duluth in 1999.
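the graph radius described above can be sketched as follows for the 2d walk with the nandy axis assignment. two points are assumptions made for the example: the center of mass is taken as the plain average of the coordinates of the plotted points (one point per base), and the difference measure between two sequences is taken as the distance between their centers of mass. the function names are hypothetical.

```python
import math

# nandy assignment: a = -x, c = +y, g = +x, t = -y
NANDY = {"A": (-1, 0), "C": (0, 1), "G": (1, 0), "T": (0, -1)}


def centre_of_mass(seq):
    """Average of the coordinates of the plotted points, one per base."""
    if not seq:
        raise ValueError("empty sequence")
    x = y = 0
    sum_x = sum_y = 0
    for base in seq.upper():
        dx, dy = NANDY[base]
        x, y = x + dx, y + dy
        sum_x += x
        sum_y += y
    return sum_x / len(seq), sum_y / len(seq)


def graph_radius(seq):
    """Distance of the center of mass from the origin."""
    mx, my = centre_of_mass(seq)
    return math.hypot(mx, my)


def delta_graph_radius(seq_a, seq_b):
    """Assumed difference measure: distance between the two centers of mass."""
    ax, ay = centre_of_mass(seq_a)
    bx, by = centre_of_mass(seq_b)
    return math.hypot(ax - bx, ay - by)
```

note that the two sequences need not have the same length, which is the practical advantage over a position-by-position distance.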
the method described in the paper by randić, vracko, nandy and basak [36] was to first represent a dna sequence graphically in a 3d cartesian grid and then convert the points to elements of a matrix by computing the ratio of euclidean distance to graph theoretic distance between all possible pairs of points taken systematically. matrix methods are well studied and have well recognized properties. the d/d matrix generated by the distance measures was analysed to yield a set of eigenvalues, with the leading eigenvalue being taken as an invariant of the matrix and therefore of the sequence. differences between the leading eigenvalues of various gene sequences could then be taken as indicative of their evolutionary distances, although this seminal paper limited itself to computation on the basis of the first exons of 11 beta globin genes only. the interesting point to note is that this paper generated intense interest among researchers, and many different ways of representing dna sequences and computing evolutionary distances subsequently ensued (see review nandy et al. [16]). authors such as randić et al. [33], randić [35], he and wang [45], song and tan [46] and many others proposed different ways to graphically represent dna sequences, convert the plots to mathematical objects, and derive leading eigenvalues as invariants of the sequences. for example, he and wang [45] reduced the dna sequences to a set of three sequences based on their structural, chemical and bonding nature and devised a vector of the three leading eigenvalues of the matrices associated with each of the reduced sequences, which they proposed as being characteristic of the original dna sequence. distances between two sequences were then computed by determining the distances between the end points of the two vectors.
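the d/d matrix and its leading eigenvalue can be illustrated with a short, dependency-free sketch. it assumes that the graph theoretic distance between two points is the length measured along the curve between them, and it takes the curve as a ready-made list of 3d points rather than fixing any particular assignment of bases to directions. the eigenvalue is obtained by a shifted power iteration, which is a generic numerical technique chosen here for simplicity and not necessarily the one used in the original work; all names are hypothetical.

```python
import math


def dd_matrix(points):
    """Matrix of (euclidean distance) / (distance along the curve) for all
    pairs of points; the diagonal is zero."""
    n = len(points)
    # cumulative path length along the curve up to each point
    cum = [0.0]
    for p, q in zip(points, points[1:]):
        cum.append(cum[-1] + math.dist(p, q))
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            along = cum[j] - cum[i]
            ratio = math.dist(points[i], points[j]) / along if along > 0 else 0.0
            matrix[i][j] = matrix[j][i] = ratio
    return matrix


def leading_eigenvalue(matrix, iterations=1000):
    """Largest eigenvalue of a symmetric non-negative matrix by power
    iteration on (matrix + identity); the shift prevents oscillation."""
    n = len(matrix)
    v = [1.0 / math.sqrt(n)] * n
    for _ in range(iterations):
        w = [v[i] + sum(matrix[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(c * c for c in w))
        v = [c / norm for c in w]
    # Rayleigh quotient of the unshifted matrix (v has unit length)
    mv = [sum(matrix[i][j] * v[j] for j in range(n)) for i in range(n)]
    return sum(v[i] * mv[i] for i in range(n))
```

for a straight segment every ratio equals 1, so three collinear points give a leading eigenvalue of 2; any bending of the curve lowers the off-diagonal entries and hence the eigenvalue.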
song and tan [46] similarly devised a 24-component vector characterizing a sequence; others came up with other ways of computing the intersequence distances based on vectors devised out of the matrix eigenvalues. such matrix invariants from their own representations were used by liao et al. [47], wang et al. [43], liao et al. [48] and others to draw phylogenetic trees for mitochondrial genes, sars coronavirus genomes, etc. the graph theoretic method, however, does not seem to have been applied so far to determine specific features within a sequence. developments in the graphical representation and numerical characterization of dna sequences raised the possibility of using similar analysis for protein sequences, albeit with difficulty arising from the fact that now we have to contend with 20 amino acids making up a protein chain whereas dna sequences are made up of only four nucleotides. although meeta rani [49] had shown as early as 1998 the presence of statistical self-affinity, a kind of self-similarity, in protein sequences that implies a fractal nature, graphical representation methods for proteins drew attention with the paper of randić [50]. the basic idea here was to start with the cgr method of jeffrey to plot an rna sequence, drawing triangles for every triplet of bases, i.e., the genetic codes, and taking the center of each such triangle as corresponding to the residue the triplet would code for. thus starting with the mrna, this method generates a cgr-equivalent 2d graphical representation for the protein sequence. randić et al. [51] carried the method further to construct a zigzag curve for the a-chain of human insulin which allows a direct conversion of a protein sequence into a numerical sequence of (x,y) coordinates that can be used subsequently for construction of the graph-theoretic matrices and sequence invariants.
the technique was refined to remove some arbitrariness that was inherent in the 2d scheme by converting the 2d graph to a 3d graphical representation where the triplets were assigned to the corners of a tetrahedron structure; although visual inspection of the graphical patterns had to be discarded in this scheme, the authors claimed that construction of graph invariants in this manner was more accurate and unique. randić et al. [52] proposed a magic circle representation where the protein sequence graph starts from the centre and follows the sequence by moving half way towards the corresponding amino acids, which are positioned equally spaced on the circumference of a unit circle. the complete execution of the protein sequence within the circle produces a typical graph for a particular protein, except for large protein sequences, for which the visual benefits are often smaller. bai and wang [53] considered the triplet codon concept and, using a complex coordinate scheme, constructed a purine-pyrimidine graph on the right half of the complex plane, with purines (a and g) in the first quadrant and pyrimidines (t and c) in the fourth quadrant. a protein sequence can then be drawn from the triplet codons extending along the x-axis, allowing visual inspection of the trends, and the co-ordinates can also be used to generate graph-theoretic matrices and their leading eigenvalues as descriptors of the sequences. bai and wang [54] next proposed a 3d graphical representation for protein sequences where the 20 amino acids are represented as end points in a dodecahedron embedded in the 3d space, i.e. each amino acid is represented at one of the vertices of the dodecahedron. this allows construction of a sequence graph following the amino acids in the sequence where each point in the plot can be considered as a node of the graph, from which one can again generate matrices and sequence invariants. liao et al.
[48] used a 2d graphical representation method to compare 24 coronavirus sequences where the four cardinal directions were associated with particular properties of the amino acids. they classified the 20 amino acids of a protein sequence into four separate groups according to the chemistry of their r groups: amino acids a,v,f,p,m,i,l to the hydrophobic chemical group; amino acids d,e,k,r to the charged chemical group; amino acids s,t,y,h,c,n,q,w to the polar chemical group; and the g amino acid to the glycine chemical group. starting with the nucleotide sequence, this enabled them to construct three 2d graphs (one for each reading frame) for each gene sequence and compute a distance matrix. in a similar construction, aguero-chapin et al. [55] grouped the 20 amino acids into four categories (acidic, basic, polar and non-polar) and assigned the four groups to the four cardinal directions of a cartesian frame to compute numeric descriptors of 108 sequences of polygalacturonases. in recent years the field has progressed rapidly to numerically characterize protein sequences for application to different issues. gonzález-díaz and collaborators have extended these representations to the study of protein sequences [56] and applied them to mass spectral data of proteins and protein serum profiles in parasites [57]. gonzález-díaz has found that using different types of numerical indices derived from the protein 2d molecular graphics to perform qsar studies is simpler than having to work with the protein 3d structures [58]. integrated qsars [59] developed using chemodescriptors for ligands and biodescriptors of a molecular entity connect structural information of drug molecules, dna and rna sequences or rna secondary and protein tertiary structures. basak et al.
[60], using a new differential qsar approach for the study of dihydrofolate reductases (dhfr) from multiple strains of plasmodium falciparum, showed that dhfr from the wild strain is substantially different from the four mutant strains of their study; this suggested that the protocols described in the paper can be used for the development of drugs to combat drug-resistant pathogens arising continuously in nature due to mutations. nandy et al. [61] showed that their 20d graphical representation of protein sequences (explained later) was useful in generating phylogenetic relationships between sequences without the necessity of multiple alignments and for determining conserved surface exposed stretches on viral proteins that could be useful in drug and vaccine designs [62]. we mention in passing that randić [63] and basak and gute [64] had developed mathematical techniques for analysis of proteomics data drawing parallels with dna granch techniques, but we do not go into any details about this topic in this review. a detailed review of graphical representations of proteins, including proteomics, has been made by randić et al. [65]. any new technique needs to be tested through applications to real problems and these methods of graphical representation and numerical characterization of biomolecular sequences are no exception. the intense interest which these granch techniques have evoked amongst researchers has led to many and varied applications which show the wide applicability and great potential of the methods. we cover some of these applications in brief here, with a novel application to anti-virus drug targeting in slightly more detail. as a natural application of the graphical representation of dna sequences, consider the visualization of patterns in base arrangements that are otherwise difficult to see in the normal character representation.
as already mentioned, gates [6] had noticed large scale repeats that were revealed by his 2d graphical plots and nandy [5] showed that conserved genes have shapes on the 2d maps that are similar across species. a detailed analysis of the graphical plots of families of conserved gene sequences showed that these altered with evolution such that the constituent bases appear to tend to greater homogeneity in base composition and higher complexity in base composition in the protein coding sequences [66]. also, visual inspection of the graphical plots can enable new insights into similarities of different stretches of dna sequences. larionov et al. [67] had thus found long-range palindromes in the mouse and human chromosomes. nandy, gute and basak [68] reported on a stretch of the h5n1 avian flu neuraminidase gene that appeared to be well conserved among the various strains of the avian flu and reported on the possibility of using this site as a drug or vaccine target so that these can be effective over many mutational changes (see below). further observations and numerical computations on over 600 h5n1 neuraminidase sequences showed the wide dispersion and mutations of the gene sequences and especially the possible exchange of structural parts of the genes, which was a new observation for this type of virus [69]. based on the observations of the plots of several conserved gene sequences, nandy [70] showed that the base arrangements of these sequences could be conceived as bound by a characteristic function of the instantaneous population of the four bases as one moves along the sequence. based on spot mutations, nandy proposed an equation connecting the instantaneous values of the purine and pyrimidine population asymmetries. it was hypothesized that this may have important consequences for genetic engineering since it implied that stability of engineered gene sequences required these constraints to be followed.
an important issue in molecular biology is identification of protein coding regions in dna sequences. nandy showed from the 2d graphical representations that exon and intron regions of mammalian genes showed distinctly different patterns and how these could be used to discriminate between the exons and introns [41]. this method was used by ghosh et al. [42] to analyse a newly sequenced human chromosome 3 contig 7 dna to identify coding regions and predict, using web-based tools, possible genes in the sequence. he, li and wang [31] used the numerical characterization of the characteristic sequence representation of he and wang [45] to suggest a protein coding gene finding algorithm specific for the yeast genome and found that the total number of protein coding genes in the yeast genome was 5897, which matches very well with estimates of 5800-6000 from other methods. discrimination between protein coding and non-coding regions was also proposed using an entropy-based approach [71], differentiating the dna sequence into three subsequences and using shannon's formula. wiesner and wiesnerova [72] made an interesting application of granch techniques to study plant germplasm identificators. for their study of multiallelic marker loci from 18 begonia × tuberhybridas, they used a 2d random walk digitization of the dna sequences by three transform classes according to the prescription of bai et al. [73] and derived invariants from the respective matrices to compute sequence similarities and dissimilarities. principal component analysis done to compare the 18 marker loci to the dna invariants found statistical correlations between the genetic diversity of the marker loci and the random walk invariants. based on their results, the authors concluded that "dna walk representation may function as an efficient pre-scanning procedure, which can predict allele-rich genomic loci as highly informative dna markers solely using the information from their primary sequence."
one of the early observations was that these graphical and numerical techniques allowed comparison of dna and protein sequences without having to do multiple sequence alignment, since here we are dealing with numbers derived from the method rather than having to compare base by base or residue by residue. almost all proposals of schemes for graphical representations have computed distances between dna sequences to determine similarities and dissimilarities without multiple sequence alignment and obtained fairly good, though not uniform, results. for example, liao et al. [47] used a 2d graphical representation proposed by liao [74] to derive a phylogenetic tree from the elements of a similarity matrix for eleven mitochondrial gene sequences without having to go through any multiple alignment procedure. they constructed a 2×2 covariance matrix of the weighted centers of mass from the co-ordinates of each base of a sequence and computed the euclidean distance between pairs of sequences to obtain their similarity/dissimilarity matrix. liao et al. [48] also investigated the phylogeny of 24 sars coronavirus genomes by their 2d graphical representations for protein sequences where they could draw three plots for each sequence by considering the three reading frames. these generate three eigenvalues for each sequence which are then used to compute a distance matrix from which they could diagrammatically show the relationships of various strains of the virus. in another exercise, bai and wang [75] compared nine different neurocan nerve protein sequences in their 3d dodecahedron representation scheme. a direct comparison of these protein sequences through alignments is difficult since these protein sequences have different lengths. using 10- and 35-component vectors from their model, they compared the distances between end-points of the vectors corresponding to each of the nine genes and built phylogenetic trees. nandy et al.
[61] used their 20d representation of protein sequences to compute distances between sequences of the families of globin, the rat and human voltage gated sodium channel alpha subunit and their phylogenetic relationships. it is to be noted that deriving phylogenetic trees from protein sequences is usually a difficult matter when the sequences are of different lengths; but with the granch techniques, where d/d and other matrices can be computed for any length sequence and only the eigenvalues compared, the sequence length differences become irrelevant. jayalakshmi et al. [76] generalized these methods to compute alignment-free sequence comparison using n-dimensional similarity space. h. gonzalez-diaz and his group have used 2d graphical methods for extensive work in the bio-medicine field. based on pseudo-folding lattice network (ln) and star-graphs (sg) topological indices, they proposed two dna promoter qsar models to predict promoter sequences in the function regulation of several mycobacterial pathogens [57]. aguero-chapin et al. [55], using their reduced four groups of amino acids on a 2d cartesian co-ordinate framework, computed numerical descriptors for 108 polygalacturonases through a markov model and were able to discriminate between these and other proteins and predict polygalacturonase activity of a new protein. comparison of rna secondary structures is important to understand their catalytic properties. bai et al. [77] considered a 3d graphical representation of rna characteristic sequences taken 2 bases at a time to compare similarities and dissimilarities in viral rnas of nine species. they computed three modular lengths and three phases for each sequence, from which they constructed a 6-component vector characteristic for each viral sequence.
two sequences were considered to be similar if their vectors pointed in the same direction, and the difference between sequences could be quantified by computing the euclidean distance between the end points of the two vectors: the bigger the distance, the less similar the sequences. the resultant difference table showed how methods such as these could be used to do cluster analysis without having to use alignment tools, which are time consuming and require several assumptions. in another instance, gonzalez-diaz et al. [78] have computed 2d-rna coupling numbers by adapting the 2d graphical representation method for dna sequences. a novel application of granch techniques was proposed by ghosh et al. [62] to determine targets on viral proteins for drug and vaccine design. viruses are known to mutate very fast and therefore become resistant to drugs and vaccines in short time scales; the virulence of the avian flu led to an apprehension that it might mutate to a form that would enable human to human transmission of the disease and thus cause widespread infection and possible death, as had happened in the case of the spanish flu outbreak in 1918 when millions died. new drugs and vaccines, especially ones that could be readily moved from table to dispensaries, were badly needed. we had already noticed in early 2006 that certain parts of the neuraminidase gene appeared to be fairly well conserved [68]. the neuraminidase and hemagglutinin are surface proteins that enable the viral particles to enter and leave the human cells where they proliferate, and of these the neuraminidase is the preferred target of the currently available drug, tamiflu. we therefore determined to search the neuraminidase protein for surface residues that were well conserved.
our procedure was to scan a small stretch of the neuraminidase protein sequences of 600+ strains of the h5n1 virus and then slide the window by one base and scan again to calculate the protein graph radius in our 20d representation system. we know that these radii are very sensitive to any changes in the sequence, so equal values of the radii in one stretch over all the strains implied that this stretch was conserved. by covering the entire sequence for all strains we could get a good profile of regions of least variability. the next step was to determine which parts of the sequences were surface exposed. there are several on-line engines available to scan a sequence and assign parameters to predict the degree of probability that certain portions are surface exposed. matching these predictions with the hard facts we had on low variability, we were able to identify six regions in the neuraminidase protein that were surface exposed and largely stable to mutational changes. these included the peptide we had identified earlier as being exceptionally stable. however, in a recent report on influenza virus rna structure [79], it has been noted that the structures seen in the crystalline form may be one of several structural forms in vivo, and confirmation will need to be experimentally determined. the results of the analysis on the h5n1 neuraminidase protein sequence were published in 2010 [62]. subsequently we have done a similar study on the vp7 protein of the rotavirus, which causes a mainly tropical disease responsible for the deaths of over half a million children every year. we identified four regions on the vp7 which appeared to be surface situated and quite stable. our findings were reported at the 2nd mathematical chemistry workshop at bogota, colombia in 2010 [13] and the indian biophysics conference, delhi 2011 [80].
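the sliding-window conservation scan described above can be sketched in a few lines. this is a minimal illustration, not the authors' implementation: it assumes that the 20d walk takes one unit step along the axis assigned to each residue and that the graph radius is the distance from the origin to the mean of the visited points. the axis order in `AMINO_ACIDS`, the helper names `graph_radius` and `conserved_windows`, and the window length are all hypothetical choices made for the example.

```python
import math

# Assumed axis order for the 20D walk; any fixed order works for this sketch.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def graph_radius(segment):
    """Distance from the origin to the centre of mass of the 20D walk.

    Assumed definition: each residue moves the walker one unit along its
    own axis, and the centre of mass is the mean of all visited points.
    Raises ValueError for characters outside AMINO_ACIDS.
    """
    position = [0] * len(AMINO_ACIDS)
    totals = [0] * len(AMINO_ACIDS)
    for residue in segment:
        position[AMINO_ACIDS.index(residue)] += 1
        for axis, value in enumerate(position):
            totals[axis] += value
    n = len(segment)
    return math.sqrt(sum((t / n) ** 2 for t in totals))


def conserved_windows(sequences, window):
    """Return the start offsets at which every strain gives the same radius."""
    length = min(len(s) for s in sequences)
    hits = []
    for start in range(length - window + 1):
        radii = {
            round(graph_radius(s[start:start + window]), 9) for s in sequences
        }
        if len(radii) == 1:
            hits.append(start)
    return hits


# Three made-up strains; the second differs at position 7 (I -> A).
strains = ["MNPNQKIITIG", "MNPNQKIATIG", "MNPNQKIITIG"]
print(conserved_windows(strains, window=4))
```

under this simplified definition, equal radii flag a window as a candidate conserved stretch but do not strictly prove that the residues are identical, so comparing the window strings directly would be a stricter check.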
while a number of applications have shown the usefulness of the granch approach to analyzing dna, rna and protein sequences, this remains as yet a nascent field where many issues need to be looked into and problems resolved for the potential to be well realized. an early indication of some of these areas was outlined some years ago [81], but they are worth recapitulating along with some more issues that may bear scrutiny. the intense interest in this field of graphical representation and numerical characterization of bio-molecular sequences has led to proposals for a vast array of models for depicting the sequences, some real and some virtual, more for dna sequences, less for protein and rna sequences. this has almost become an intellectual sport, with new ideas being propounded on a regular basis, generally without a proper rationale for yet another method or critical comparison with earlier proposals. what appears to be lost in the process is the target: how useful are these representations to the practicing biologist? critical to this issue is the problem of determining the domains of applicability of the various representations, i.e., which model is best suited to address which classes of problems. as of now, the vast majority of proposals have addressed themselves to comparisons of similarity and dissimilarity, but as we have seen in the previous section, the issues that we can address and which biologists need answers for are more varied. from the applications made to date, the 2d graphical representations where the sequence data are easily viewed have generated the most interest. even aside from the global characteristics revealed by the investigations of jeffreys [3] and peng et al. [3], the particular patterns of intron and exon segments [9] or characteristic curves of he et al.
[31] have led to models to predict protein coding regions; determination of long-range palindromes [67], identification of target segments for vaccine development for viral proteins [62] and determination of allele-rich genomic loci for plants [72], among other applications, have also been based on 2d representation schemes. hamori had identified regions of sharp changes in base compositions from his 3d h-curves [25], but for almost all other 3d, 4d and higher dimensionality representations, applications have been restricted to sequence similarities and generation of phylogenetic trees. the mathematical techniques involved in generating the descriptors and characterizers for dna sequences are still at a preliminary level. while the first moments in the geometrical method for generating descriptors have generally yielded reasonable results in comparing intra- and inter-genic sequences, attempts to calculate higher moments to increase the accuracy and effectiveness of these descriptors have only lately begun [28, 82]. the leading eigenvalues from the euclidean and graph theoretic distance ratios matrix have so far been used mainly to compute inter-sequence distances; given the rigorous mathematics of matrix mechanics, it may be worthwhile to try and extend the applications to other areas. for the benefit of users of these methods, it would be useful to have a comparison of the geometrical and graph theoretic models to determine at what level the two could give comparable results. in the case of 2d graphical representations using cartesian co-ordinates, we had seen that gene sequences take characteristic shapes [5]. this raises the possibility that some day we could create an atlas of gene sequences where samples of each gene would be depicted and the descriptor parameters listed for easy reference and rapid visual identification.
we have described quantitatively the gross features of the graphical plots in the 2d representations by using the first moments in a geometrical method [10, 11] ; better descriptors can be determined through higher order moments [28] to quantify the curvature, skewness and other properties. these, and the leading eigenvalues from the graph theoretic approach, could be considered as a list of parameters describing the sequence, akin to the quantum numbers that are used to describe elementary particles. such a scheme then provides a method to electronically store, retrieve and compare data between various sequences more efficiently, especially with a view to quickly scan newly sequenced dna, rna and proteins to determine the genes and functions. we have considered the moments calculated from the geometrical approach to 2d graphical representations as numerical "descriptors" of the dna sequence and taken tentative steps to enhance the number of descriptors of a sequence by computing higher order moments to more completely describe the sequences. in the matrix method applied to different varieties of graphical representation, leading eigenvalues arising from the matrices have been taken as "invariants" of the sequence in the strict mathematical sense. however, the concept of invariants derived from these matrix methods of numerically characterizing dna sequences may require some modification to account for the fact that dna sequences constantly change due to mutations in the bases. the vast majority of these changes do not affect the functioning of the protein or the enzyme coded by the gene due to synonymous mutations in the coding segments or in the non-coding part; e.g. 
in the case of an intronless gene like the neuraminidase of the avian flu h5n1, we had found [69] that 447 out of a total of 682 sequences prevalent over the period 1997 to 2008 had undergone mutations in one or more bases in the gene, but even then, all of these variants coded for a functioning flu neuraminidase protein. for a beta globin gene, the common standard example of most graphical representation schemes, a sample from one person may differ by a base or more from the next person due to mutational changes. determining an "invariant" from one sample sequence of these genes, while being mathematically precise, may not adequately express or characterize a gene sequence from a practical point of view. perhaps a biologically more relevant measure would be to sample several such sequences, compute from them an average eigenvalue with a standard deviation, and derive a numerical value to characterize the gene. in fact, in the absence of a sensitivity analysis or a standard deviation, it would be difficult to accept that the computations through leading eigenvalues of distances between several sequences that are only a few percentage points apart could be statistically meaningful. the descriptors are no exception either. once these basic issues are attended to, the granch techniques can become a useful tool in the medicare field. since the computations of the numerical descriptors/characteristics are quite simply done, they can be incorporated into the dna sequencing schemes so that there will be automatic computations of, e.g., g r and p r values which would enable the physician to immediately ascertain the presence of any harmful genetic disorders; huntington's potential to degenerate into a disease for the patient, or some similar genetic problem areas, could be easily read out as the genome is sequenced, provided we know the characteristic locus and have a standard genome, for example the readout for a normal person from the family, available for comparison.
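the suggestion above, characterizing a gene by an average value with a standard deviation over several sampled sequences rather than by one sample, can be illustrated with a short sketch. everything here is an assumption made for the example: the descriptor values are invented numbers, the helper names `summarize` and `separable` are hypothetical, and the threshold `k` is an illustrative rule of thumb rather than a formal statistical test.

```python
from statistics import mean, stdev


def summarize(values):
    """Mean and sample standard deviation of one descriptor over several samples."""
    return mean(values), stdev(values)


def separable(values_a, values_b, k=2.0):
    """True when the two means differ by more than k times the larger spread.

    k is an illustrative threshold (an assumption), not a formal test.
    """
    mean_a, sd_a = summarize(values_a)
    mean_b, sd_b = summarize(values_b)
    return abs(mean_a - mean_b) > k * max(sd_a, sd_b)


# Invented leading-eigenvalue samples for three hypothetical genes.
gene_a = [10.0, 10.2, 9.8]
gene_b = [10.1, 10.3, 9.9]
gene_c = [12.0, 12.2, 11.8]

print(summarize(gene_a))
print(separable(gene_a, gene_b))  # means closer than the within-gene spread
print(separable(gene_a, gene_c))  # means well outside the within-gene spread
```

a check of this kind makes explicit when a distance between two sequences is smaller than the variation seen among samples of the same gene.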
the viral application already discussed in detail in the previous section could be automated and extended to other viruses and bacterial genomes to promote a new generation of drugs and vaccines. the researches of gonzalez-diaz [55, 58, 78] and basak [59, 60] are already pointers to new directions. many potential application areas remain to be explored. since the numerical descriptors mentioned previously are seen to be quite sensitive to changes in base composition and distribution, the potential exists to devise schemes to index various aspects related to the bio-molecular sequences. initial attempts have been made to index toxic chemicals that have damaging effects on dna sequences [83], and to index snp gene sequences measured against some standard sequences [84]. however, these need to be refined and made more useful to generate confidence in their use in laboratory situations. one area that requires in-depth study is how to address non-contiguous sequence segments. for example, in the case of epitopes, it is found that there could be continuous epitopes and discontinuous epitopes; in the latter case the folded protein brings residues from different parts of the amino acid sequence close together, which then become sites for the antibodies to act upon. the methods delineated so far for g r and p r or leading eigenvalue evaluation require a contiguous span of the bases or residues for the numbers to be calculated. one way to circumvent this difficulty is to work on small segments of the sequence at a time, as had been done in ref. [62]. however, this is time consuming and inefficient, and improved methods that focus on regions of interest and calculate a minimum number of the parameters could offer better rewards. in summary, it is apparent that graphical representation and numerical characterization of molecular sequences hold far-reaching potential for rapidly analyzing the sequences to extract a wide range of information.
it opens up new ways to look at these sequences, and to gain new insights such as long range palindromes, fractal properties and intra-purine intra-pyrimidine relationships not seen by any other means. it allows one to compute many aspects of biological and medicinal interest and provides novel methods of tackling old problems; we have seen examples of gene identification, analysis of evolutionary trends and generation of phylogenetic trees, identification of conserved sites on viral proteins for drug and vaccine targeting, and prediction of promoter sequences and new properties of polygalacturonase proteins, among others, and many possibilities remain unexplored or barely scratched. still, from plants to viruses, from mammalian genes to mitochondrial genomes, a varied series of applications has been formulated. although many issues doubtless remain, such as handling non-contiguous stretches of bases and residues like discontinuous epitopes, it is apparent that the granch techniques hold a lot of promise for a new direction in molecular analysis.
recent investigations into characteristics of long dna sequences. ind chaos game representation of gene structure long range correlation in nucleotide sequences evolution of long-range fractal correlations and 1/f noise in dna base sequences a new graphical representation and analysis of dna sequence structure: i. methodology and application to globin genes simple way to look at dna graphical representation of long dna sequences graphical analysis of dna sequence structure: iii. indications of evolutionary distinctions and characteristics of introns and exons two dimensional graphical representation of dna sequences and intron-exon discrimination in intron-rich sequences indexation schemes and similarity measures for macromolecular sequences.
paper presented at the indo-us workshop on mathematical chemistry indexing scheme and similarity measures for macromolecular sequences on 3-d representation of dna primary sequences novel analysis of dna and protein sequences through graphical representation and numerical characterization techniques novel techniques of graphical representation and analysis of dna sequences -a review visualization and analysis of dna sequences using dna walks mathematical descriptors of dna sequences: development and applications new approaches to drug-dna interactions based on graphical representation and numerical characterization of dna sequences graphical representation and mathematical characterization of protein sequences and applications to viral proteins dna sequence visualization characterizations of dna primary sequences molecular descriptors for chemoinformatics, methods and principles in medicinal chemistry genome analysis: a new approach for visualisation of sequence organisation in genomes mathematical characterisation of chaos game representation: new algorithms for nucleotide sequence analysis chaos game representation of similarities and differences between genomic sequences h curves, a novel method of representation of nucleotide series especially suited for long dna sequences random walk and gap plots of dna sequences graphical analysis of dna sequence structure: iii. 
indications of evolutionary distinctions and characteristics of introns and exons distribution moments of 2d-graphs as descriptors of dna sequences a novel 2-d graphical representation of dna sequences of low degeneracy dna sequence representation without degeneracy finding protein coding genes in the yeast genome based on the characteristic sequences analysis of similarity/dissimilarity of dna sequences based on novel 2-d graphical representation compact 2-d graphical representation of dna four-color map representation of dna or rna sequences and their numerical characterization spectrum-like graphical representation of dna based on codons on 3-d representation of dna primary sequences on a 3-d representation of dna primary sequences novel 4d numerical representation of dna sequences analysis of similarity/dissimilarity of dna sequences based on nonoverlapping triplets of nucleotide bases vector representations and related matrices of dna primary sequence based on l-tuple two-dimensional graphical representation of dna sequences and intron-exon discrimination in intron-rich sequences identification of new genes in human chromosome 3 contig 7 by graphical representation technique a graphical method to construct a phylogenetic tree on the uniqueness of quantitative dna difference descriptors in 2d graphical representation models characteristic sequences for dna primary sequence a new 2-d graphical representation of dna sequences and their numerical characterization application of 2-d graphical representation of dna sequence coronavirus phylogeny based on triplets of nucleic acids bases dynamics of protein evolution 2-d graphical representation of proteins based on virtual genetic code. 
sar & qsar unique graphical representation of protein sequences based on nucleotide triplet codons novel 2-d graphical representation of proteins a 2-d graphical representation of protein sequences based on nucleotide triplet codons on graphical and numerical representation of protein sequences novel 2d maps and coupling numbers for protein sequences. the first qsar study of polygalacturonases; isolation and prediction of a novel sequence from psidium guajava l alignment-free prediction of polygalacturonases with pseudofolding topological indices: experimental isolation from coffea arabica and prediction of a new sequence predicting antimicrobial drugs and targets with the march-inside approach generalized lattice graphs for 2d-visualization of biological information predicting pharmacological and toxicological activity of heterocyclic compounds using qsar and molecular modeling characterization of dihydrofolate reductases from multiple strains of plasmodium falciparum using mathematical descriptors of their inhibitors numerical characterization of protein sequences and application to voltage-gated sodium channel alpha subunit phylogeny computational analysis and determination of a highly conserved surface exposed segment in h5n1 avian flu and h1n1 swine flu neuraminidase on graphical and numerical characterization of proteomics maps mathematical biodescriptors of proteomics maps: background and applications graphical representation of proteins investigations on evolutionary changes in base distributions in gene sequences chromosome evolution with naked eye: palindromic context of the life origin graphical representation and numerical characterization of h5n1 avian flu neuraminidase gene sequence computational study of dispersion and extent of mutated and duplicated sequences of the h5n1 influenza neuraminidase over the period 1997-2008 empirical relationship between intra-purine and intra-pyrimidine differences in conserved gene sequences relative entropy of dna 
and its application 2d random walk representation of begonia × tuberhybrida multiallelic loci used for germplasm identification a representation of dna primary sequences by random walk a 2d graphical representation of dna sequence on graphical and numerical representation of protein sequences alignment-free sequence comparison using n-dimensional similarity space analysis of similarity between rna secondary structures 2d-rna-coupling numbers: a new computational chemistry approach to link secondary structure topology with biological function influenza virus rna structure: unique and common features characterization of conserved regions in rotaviral vp7 proteins: a graphical representation approach towards epitope prediction theory and computation: old problems and new challenges, g. maroulis and t. simos graphical and numerical representations of dna sequences: statistical aspects of similarity simple numerical descriptor for quantifying effect of toxic substances on dna sequences quantitative descriptor for snp related gene sequences the author confirms that this chapter contents have no conflict of interest. key: cord-291156-zxg3dsm3 authors: bernasconi, anna; canakoglu, arif; pinoli, pietro; ceri, stefano title: empowering virus sequences research through conceptual modeling date: 2020-05-01 journal: biorxiv doi: 10.1101/2020.04.29.067637 sha: doc_id: 291156 cord_uid: zxg3dsm3 the pandemic outbreak of the coronavirus disease has attracted attention towards the genetic mechanisms of viruses. we hereby present the viral conceptual model (vcm), centered on the virus sequence and described from four perspectives: biological (virus type and hosts/sample), analytical (annotations and variants), organizational (sequencing project) and technical (experimental technology). vcm is inspired by gcm, our previously developed genomic conceptual model, but it introduces many novel concepts, as viral sequences significantly differ from human genomes. 
when applied to the sars-cov2 virus, complex conceptual queries upon vcm are able to replicate the search results of recent articles, hence demonstrating huge potential in supporting virology research. in addition to vcm, we also illustrate the data dictionary for patient's phenotype used by the covid-19 host genetics initiative. our effort is part of a broad vision: availability of conceptual models for both human genomics and viruses will provide important opportunities for research, especially if interconnected by the same human being, playing the role of virus host as well as provider of genomic and phenotype information. despite the advances in drug and vaccine research, diseases caused by viral infection pose serious threats to public health, both as emerging epidemics (e.g., zika virus, middle east respiratory syndrome coronavirus, measles virus, or ebola virus) and as globally well-established epidemics (such as human immunodeficiency virus, dengue virus, hepatitis c virus). the pandemic outbreak of the coronavirus disease covid-19, caused by the "severe acute respiratory syndrome coronavirus 2" virus species sars-cov2 (according to the genbank [35] acronym), has brought unprecedented attention towards the genetic mechanisms of coronaviruses. thus, understanding viruses from a conceptual modeling perspective is very important. the sequence of the virus is the central information, along with its annotated parts (known genes, coding and untranslated regions...) and the nucleotides' variants with respect to the reference sequence for the specific species. each sequence is identified by a strain name, which belongs to a specific virus species. viruses have complex taxonomies, as they belong to genus, sub-families, and finally families (e.g., coronaviridae).
other important aspects include the host organisms and isolation sources from which viral materials are extracted, the sequencing project, and the scientific and medical publications related to the discovery of sequences; virus strains may be searched and compared intra- and cross-species. luckily, all these data are made available publicly by various resources, from which they can be downloaded and re-distributed. our recent work is focused on data-driven genomic computing, providing contributions in the area of modeling, integration, search and query answering. we have previously proposed a conceptual model focused on human genomics [6], which was based on a central entity item, representing files of genomic regions. the simple schema evolved into a knowledge graph [5], including ontological representation of many relevant attributes (e.g., diseases, cell lines, tissue types...). the approach was validated through the practical implementation of the integration pipeline meta-base, which feeds an integrated database, searchable through the genosurf interface [8]. very recently we have also been involved in the covid-19 host genetics initiative, a collaborative effort that aims at joining forces of the broader human genetics community to generate, share, and analyze data to learn the genetic characteristics and outcomes of covid-19. in this project, we built a conceptual data definition (and related questionnaire) for describing the phenotype of covid-19, to be used by clinicians who contribute to the project. thus, we created a conceptually solid definition of the clinical information of patients affected by covid-19, acting as hosts to the sars-cov2 virus.
based on these considerations, in this paper we contribute as follows:
- we propose a new viral conceptual model (vcm), a general conceptual model for describing viral sequences, organized along specific dimensions that highlight a conceptual schema similar to gcm [6];
- focusing on sars-cov2, we show how vcm can be profitably linked to a phenotype database with information on covid-19 infected patients;
- we provide a list of interesting queries replicating newly released literature on infectious diseases; these can be easily performed on vcm.
the manuscript is organized as follows: section 2 overviews current technologies available for virus sequence data management. section 3 proposes our vcm, while section 4 shows its possible intersection with a general clinical database. we show examples of applications in section 5 and review related works in section 6. section 7 discloses our vision for future developments. the landscape of relevant resources and initiatives dedicated to data collection, retrieval and analysis of virus sequences is shown in fig. 1. we partitioned the space of contributors by considering: institutions that host data sequences, main sequence databases, tools provided for querying and searching them, and then organizations and tools hosting data analysis interfaces that also connect to viral sequence databases. the three main organizations providing open-source viral sequences are ncbi (us), ddbj (japan), and embl-ebi (europe); they operate within the broader context provided by the international nucleotide sequence database collaboration. ncbi hosts the two most relevant viral sequence databases so far: genbank [35] contains the annotated collection of all publicly available dna and rna sequences; refseq [28] provides a stable reference for genome annotation, gene identification and characterization, and mutation/polymorphism analysis.
genbank is continuously updated thanks to the abundant sharing of multiple laboratories and data contributors around the world (note that sars-cov2 nucleotide sequences have increased from about 300 around the end of march 2020, to 1,624 as of april 27th). embl-ebi hosts the european nucleotide archive [1], which has a broader scope, accepting submissions of nucleotide sequencing information, including raw sequencing data, sequence assembly information and functional annotations. several tools are available for querying and searching these databases; among them, e-utilities [34], ncbi virus [19], and pathogens are tools and portals directly provided by the insdc institutions for supporting the access to their viral resources, but they do not support querying based on annotations and variants. a number of databases and data analysis tools refer to these viral sequence databases. we mention: viralzone [20] by the sib swiss institute of bioinformatics, which provides access to sars-cov2 proteome data as well as cross-links to complementary resources; the virus pathogen database and analysis resource (vipr, [31]), an integrated repository of data and analysis tools for multiple virus families, supported by the bioinformatics resource centers program; virusite [39], an integrated database for viral genomics; the viral genome organizer, implemented by the canadian viral bioinformatics research centre, focusing on search for sub-sequences within genomes. while the insdc consortium provides full open access to sequences, the gisaid initiative [13, 37] was created in 2008 with the explicit purpose of offering an alternative to traditional public-domain data archives, as many scientists hesitated to share influenza data due to their legitimate concern about not being properly acknowledged, among others.
gisaid hosts epiflu™, a large sequence database, which started its mission for influenza data and is now expanding with epicov™, having a particular focus on the sars-cov2 pandemic (12,645 sequences for sars-cov2 on april 27th). some interesting portals have become interfaces to gisaid data with particular focuses: nextstrain [18] overviews emergent viral outbreaks based on the visualization of sequence data integrated with geographic information, serology, and host species; cov-glue, part of the glue suite [38], contains a database of replacements, insertions and deletions observed in sequences sampled from the pandemic. many other resources link to viral sequence data, including: drug databases, particularly interesting as they provide information about clinical studies (see clinicaltrials), protein sequence databases (e.g., uniprotkb/swiss-prot [32]), and cell line databases (e.g., cellosaurus [3]). we previously proposed the genomic conceptual model (gcm, [6]), an entity-relationship diagram that recognizes a common organization for a limited set of concepts supported by most genomic data sources, although with different names and formats. the model is centered on the item entity, representing an elementary experimental file of genomic regions and their attributes. four views depart from the central entity, recalling a classic star-schema organization that is typical of data warehouses [7]; they respectively describe: i) the biological elements involved in the experiment: the sequenced sample and its preparation, the donor or patient; ii) the technology used in the experiment, including a specific assay (i.e., technique); iii) the management aspects: the projects/organizations involved in the preparation and production; iv) the extraction parameters used for internal selection and organization of items. gcm is employed as a driver of integration pipelines for genomic datasets, fuelling user search-interfaces such as genosurf [8].
lessons learnt from that experience include the benefits of having: a central fact entity that helps structure the search; a number of surrounding dimensions capturing organization, biological and experimental conditions to describe the facts; a data layout that is easy to learn for first-time users and that helps in answering practical questions (as demonstrated in [4]). we hereby propose the viral conceptual model (vcm), which is influenced by our past experience with human genomes. there are significant differences between the two conceptual models. the human dna sequence is long (3 billion base pairs) and has been understood in terms of reference genomes (named hg19 and grch38) to which all other information is referred, including genetic and epigenetic signals. instead, viruses are many, their sequences are short (order of thousands of base pairs) and each virus has its own reference sequence; moreover, virus sequences are associated with a host sample of another species. from a bird's-eye view, the vcm conceptual model is centered on the sequence entity that describes individual virus sequences; sequences are analyzed from the biological perspective (hostsample and virus), the technological perspective (experimenttype), and the organizational perspective (sequencingproject). two other entities, annotation and variant, represent an analytical perspective of the sequence, allowing the analysis of its characteristics, its sub-parts, and the differences with respect to reference sequences for the specific virus species. we next illustrate the central entity and the four perspectives. central entity. a viral sequence can regard either dna or rna; in either case, databases and sequencing data write the sequence as a dna nucleotidesequence (i.e., guanine (g), adenine (a), cytosine (c), and thymine (t)) that has a specific strand (positive or negative), length (typically thousands), and a percentage of read g and c bases (gc%).
each sequence is uniquely identified by an accessionid, which is retrieved directly from the source database (genbank's are usually formed by two capital letters, followed by six digits, gisaid's by the string "epi_isl_" and six digits). sequences can be complete or partial (as encoded by the boolean flag iscomplete) and they can be a reference sequence (stored in refseq) or a regular one (encoded by isreference). in the latter case, sequences have a corresponding strainname assigned by the sequencing lab, which often hard-codes relevant information (e.g., hcov-19/nepal/61/2020 or 2019-ncov ph ncov 20 026). technological perspective. the sequence derives from one experiment or assay, described in the experimenttype entity (cardinality is 1:n from the dimension towards the fact). it is performed on biological material analysed with a given sequencingtechnology platform (e.g., illumina miseq) and an assemblymethod, collecting algorithms that have been applied to obtain the final sequence, for example: bwa-mem, to align sequence reads against a large reference genome; bcftools, to manipulate variant calls; megahit, to assemble ngs reads. another technical measure is captured by coverage (e.g., 100x or 77000x). biological perspective. each sequence belongs to a specific virus, which is described by a complex taxonomy. the most precise definition is the speciesname (e.g., severe acute respiratory syndrome coronavirus 2), corresponding to a speciestaxonid (e.g., 2697049), related to a simpler genbankacronym (e.g., sars-cov2) and to many comparable forms, contained in the equivalentlist (e.g., 2019-ncov, covid-19, sars-cov-2, sars2, wuhan coronavirus, wuhan seafood market pneumonia virus, ...). the species belongs to a genus (e.g., betacoronavirus), part of a subfamily (e.g., orthocoronavirinae), finally falling under the most general category of family (e.g., coronaviridae).
each virus species corresponds to a specific moleculetype (e.g., genomic rna, viral crna, unassigned dna), which has either double- or single-stranded structure; in the second case the strand may be either positive or negative. these possibilities are encoded within the issinglestranded and ispositivestranded boolean variables. an assay is performed on a tissue extracted from an organism that has hosted the virus for an amount of time; this information is collected in the hostsample entity. the host is defined by a species, corresponding to a speciestaxonid, usually represented using the ncbi taxonomy [14] (e.g., 9606 for homo sapiens). the sample is extracted on a collectiondate, from an isolationsource that is a specific host tissue (e.g., nasopharyngeal or oropharyngeal swab, lung), in a certain location identified by the quadruple originatinglab (when available), region, country, and geogroup (such as continent). both entities of this perspective are in 1:n cardinality with the sequence. organizational perspective. the entity sequencingproject describes the management aspects of the production of the sequence. each sequence is connected to a number of studies, usually represented by a research publication (with authors, title, journal, publicationdate and possibly a pubmedid referring to the most important biomedical literature portal). when a study is not available, just the sequencinglab and submissiondate are provided. on rare occasions, a project is associated with a popset number, which identifies a collection of related sequences derived from population studies (submitted to genbank), or with a bioprojectid (an identifier to the bioproject external database). we also include the name of databasesource, denoting the organization that primarily stores the sequence. in this perspective all cardinalities are 1:n as sequences can be part of multiple projects; conversely, sequencing projects contain various sequences. analytical perspective.
this perspective allows storing information that is useful during the secondary analysis of genomic sequences. annotations include a number of sub-sequences representing segments (defined by start and stop coordinates) of the original sequence with a particular featuretype (e.g., gene, peptide, coding dna region, or untranslated region, molecule patterns such as stem loops and so on), the recognized genename to which it belongs (e.g., gene "e"), the product it contributes to producing (e.g., leader protein, nsp2 protein, rna-dependent rna polymerase, membrane glycoprotein, envelope protein...), and possibly an externalreference when the protein is present in a separate database such as uniprotkb. the variant entity contains subsequences of the main sequence that differ from the reference sequence of the same virus species. they can be identified with respect to the reference one, just by using the altsequence (i.e., the nucleotides used in the analyzed sequence at position start coordinate for an arbitrary length, typically just equal to 1) and a specific type, which can correspond to insertion (ins), deletion (del), single-nucleotide polymorphism (snp) or others. the content of the attributes of this entity is not retrieved from existing databases; instead it is computed in-house by our procedures. specifically, we use the well-known dynamic programming algorithm of needleman-wunsch [26], which computes the optimal alignment between two sequences. from a technical point of view, we compute the pair-wise alignment of every sequence to the reference sequence of refseq (nc_045512); from such alignment we then extract all insertions, deletions, and substitutions that transform (edit) the reference sequence into the considered sequence. a similar computation is performed within cov-glue (http://covglue.cvr.gla.ac.uk/). after the spread of the covid-19 pandemic, several informal consortia have been created to foster international cooperation among researchers.
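the variant computation described above for the analytical perspective (pair-wise needleman-wunsch alignment against the reference, followed by extraction of the edits) can be sketched as follows. this is a minimal illustration, not the authors' procedure: the scoring values (match +1, mismatch -1, linear gap -1), the function names needleman_wunsch and extract_variants, and the convention of reporting one variant per alignment column are all assumptions, since the text does not state them. the quadratic score table also makes this sketch impractical for full-length genomes of tens of thousands of bases without a banded or library implementation.

```python
from typing import List, Tuple


def needleman_wunsch(ref: str, seq: str, match: int = 1,
                     mismatch: int = -1, gap: int = -1) -> Tuple[str, str]:
    """Globally align seq to ref; return both strings padded with '-' gaps.

    The scoring values are illustrative assumptions, not taken from the paper.
    """
    n, m = len(ref), len(seq)
    # score[i][j] holds the best score for aligning ref[:i] with seq[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if ref[i - 1] == seq[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_ref: List[str] = []
    out_seq: List[str] = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if ref[i - 1] == seq[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_ref.append(ref[i - 1])
            out_seq.append(seq[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_ref.append(ref[i - 1])   # base present only in the reference
            out_seq.append("-")
            i -= 1
        else:
            out_ref.append("-")          # base present only in the sequence
            out_seq.append(seq[j - 1])
            j -= 1
    return "".join(reversed(out_ref)), "".join(reversed(out_seq))


def extract_variants(aligned_ref: str, aligned_seq: str) -> List[Tuple[int, str, str]]:
    """List (start, type, altsequence) edits turning the reference into the sequence.

    start is a 1-based reference coordinate; for an insertion it is the position
    of the last reference base before the inserted nucleotide (0 at the start).
    """
    variants: List[Tuple[int, str, str]] = []
    ref_pos = 0
    for r, s in zip(aligned_ref, aligned_seq):
        if r != "-":
            ref_pos += 1
        if r == s:
            continue
        if r == "-":
            variants.append((ref_pos, "INS", s))
        elif s == "-":
            variants.append((ref_pos, "DEL", ""))
        else:
            variants.append((ref_pos, "SNP", s))
    return variants


aligned_ref, aligned_seq = needleman_wunsch("ACGTACGT", "ACGTTCGT")
print(extract_variants(aligned_ref, aligned_seq))  # [(5, 'SNP', 'T')]
```

each returned tuple maps onto the start, type and altsequence attributes of the variant entity; a production pipeline would more likely merge adjacent columns into multi-base insertions and deletions.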
we participate in the covid-19 host genetics initiative, aiming at bringing together the human genetics community to generate, share and analyze data to learn the genetic determinants of covid-19 susceptibility, severity and outcomes. in this setting, we are coordinating the production of a data dictionary for the phenotype definition, which will be used as a reference by participating institutions, hosted by ega [15], the european genome-phenome archive of embl-ebi. the dictionary, illustrated in fig. 3, contains patient phenotype information, collected at admission and during the course of hospitalizations (hosted by a given hospital); each patient can be connected to a virus sequence (in that case, the patient is the host organism providing the hostsample of vcm) and can have multiple encounters. for ease of visualization, attributes are clustered within attribute groups, indicated with white squares instead of black circles. note that the dictionary representation deviates from a classic entity-relationship diagram as some attribute groups would typically deserve the role of entity; however, this simple format allows an easy mapping of the dictionary to questionnaires and an implementation by ega in the form of a spreadsheet. attribute groups of patients describe: demography&exposure, riskfactors, comorbidities, admissionsymptoms, hospitalizationcourse; attribute groups of encounters describe: encountersymptoms, treatments, laboratoryresults. attributes within groups can be further clustered within subgroups; for instance, comorbidities include the subgroups immunesystem, respiratory, genitourinary, cardiovascular, neurological, cancer. the data dictionary includes two possible uses in further analysis (i.e., the course of hospitalization and longitudinal studies); for these uses we set each attribute to either mandatory or optional.
in addition to very general questions that can be easily asked through our conceptual model (e.g., retrieve all viruses with given characteristics), in the following we propose a list of interesting application studies that could be backed by the use of our conceptual model. in particular, they refer to the sars-cov2 virus as it is receiving most of the attention of the scientific community. fig. 4 represents the reference sequence of sars-cov2, highlighting the major structural sub-sequences that are relevant for the encoding of proteins and other functions. it has 56 region annotations, of which fig. 4 represents only the 11 genes (orf1ab, s, orf3a, e, m, orf6, orf7a, orf7b, orf8, n, orf10) plus the rna-dependent rna polymerase enzyme, with approximate indication of the corresponding coordinates. we next describe biological queries supported by vcm, from the easiest to the most complex ones, typically suggested by existing studies. q1. the most common variants found in sars-cov2 sequences can be selected for us patients; the query can be performed only on specific genes. country is in blue as samples will be distributed according to such field. q3. according to [9], e and rdrp genes are highly mutated and thus crucial in diagnosing covid-19 disease; first-line screening tools of 2019-ncov should perform an e gene assay, followed by confirmatory testing with the rdrp gene assay. conceptual queries are concerned with retrieving all sequences with mutations within genes e or rdrp and relating them to given hosts, e.g., affected humans in china. q4. tang et al. [41] claim that there are two clearly definable "major types" (s and l) of sars-cov2 in this outbreak, which can be differentiated by transmission rates. intriguingly, the s and l types can be clearly distinguished by just two tightly linked snps at positions 8,782 (within the orf1ab gene from c to t) and 28,144 (within orf8 from t to c).
then, queries can correlate these snps to other variants or the outbreak of covid-19 in specific countries (e.g., [16]). q5. to inform sars-cov2 vaccine design efforts, it may be necessary to track antigenic diversity. typically, pathogen genetic diversity is categorised into distinct clades (i.e., monophyletic groups on a phylogenetic tree). these clades may refer to 'subtypes', 'genotypes', or 'groups', depending on the taxonomic level under investigation. in [16], specific sequence variants are used to define clades/haplogroups (e.g., the a group is characterized by the 20,229 and 13,064 nucleotides, originally c mutated to t, by the 18,483 nucleotide t mutated to c, and by the 8,017, from a to g). vcm supports all the information required to replicate the definition of sars-cov2 clades requested in the study. fig. 6 illustrates the conjunctive selection of sequences with all four variants corresponding to the a clade group defined in [16] and the resulting retrieved sequences. q6. morais junior et al. [21] propose a subdivision of the global sars-cov2 population into sixteen subtypes, defined using "widely shared polymorphisms" identified in nonstructural (nsp3, nsp4, nsp6, nsp12, nsp13 and nsp14) cistrons, structural (spike and nucleocapsid), and accessory (orf8) genes. vcm supports all the information required to replicate the definition of such subtypes. the above examples of complex queries refer to virus sequences and can be answered by vcm (fig. 2). due to the pressing interest in sars-cov2, we are currently making an effort to collect sars-cov2 sequences and provide a search interface for a first release of a vcm-based query engine. even more interesting queries will be enabled by combining phenotypes with virus sequences; along this direction, we also contributed to the data dictionary effort (fig. 3). when both datasets are accessible, other more powerful studies will be possible.
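the conjunctive selection described for q5 can be sketched as a set-containment test over the variants of each sequence. this is only an illustration under stated assumptions: the in-memory dictionary, the function name select_with_all, and the accession labels are hypothetical, and the actual query engine is not described in the text. variants are represented as (start, type, altsequence) tuples, following the attributes of the variant entity; the four positions and substitutions are those reported for the a group in [16].

```python
from typing import Dict, FrozenSet, List, Set, Tuple

Variant = Tuple[int, str, str]  # (start, type, altsequence)

# The four variants that define the A group according to [16].
CLADE_A: FrozenSet[Variant] = frozenset({
    (20229, "SNP", "T"),  # C mutated to T
    (13064, "SNP", "T"),  # C mutated to T
    (18483, "SNP", "C"),  # T mutated to C
    (8017, "SNP", "G"),   # A mutated to G
})


def select_with_all(variants_by_sequence: Dict[str, Set[Variant]],
                    required: FrozenSet[Variant]) -> List[str]:
    """Return the accession ids whose variant sets contain every required variant."""
    return sorted(accession
                  for accession, variants in variants_by_sequence.items()
                  if required <= variants)


# Toy data with hypothetical accession labels, for illustration only.
toy_variants: Dict[str, Set[Variant]] = {
    "hypothetical_seq_1": set(CLADE_A) | {(241, "SNP", "T")},
    "hypothetical_seq_2": {(20229, "SNP", "T"), (13064, "SNP", "T")},
    "hypothetical_seq_3": set(),
}

print(select_with_all(toy_variants, CLADE_A))  # ['hypothetical_seq_1']
```

the same containment test covers q4 by swapping in the two snps that separate the s and l types.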
some early findings connecting virus sequences with phenotypes have already been published, so far with very small datasets (e.g., [22] with only 5 patients, [24] with 9 patients, and [41] with 103 sequenced sars-cov2 genomes). as reaffirmed by these works, there is a need for additional comprehensive studies linking the viral sequences of sars-cov2 to the phenotype of patients affected by covid-19. we are confident that in the near future there will be many more studies like [22, 24, 41]. the use of conceptual modeling to describe genomics databases dates back to the late nineties, including a functional model for dna databases named "associative information structure" [27]; a model representing genomic sequences [25]; and a set of data models for describing transcription/translation processes [30]. later on, a stream of works on conceptual modeling-based data warehouses includes the gedaw uml conceptual schema [17], driving the construction of a gene-centric data warehouse for microarray expression measurements; the genomics unified schema [2]; the genome information management system [10], a genome-centric data warehouse; and the genmapper warehouse [12], integrating expression data from a number of genomic sources. more recently, there has been a solid stream of works dedicated to data quality-oriented conceptual modeling: [33] presents the human genome conceptual model and [29] applies it to uncover relevant information hidden in genomics data lakes. conceptual modeling has been mainly concerned with aspects of the human genome, even when more general approaches were adopted; in [6] we presented the genomic conceptual model (gcm), describing the metadata associated with genomic experimental datasets available for humans or other model organisms; gcm was essential for driving the data integration pipeline and building search interfaces [8].
among the variety of types of genomic databases [11], several resources are dedicated to viruses [36]; however, very few works relate to conceptual data modeling. among them, [40] considers host information and normalized geographical location, and [23] focuses on influenza a viruses. the closest work to ours, described in [38], is a flexible software system for querying virus sequences; it includes a basic conceptual model. in comparison, vcm covers more dimensions, which are very useful for supporting research queries on virus sequences. this paper responds to an urgent need: understanding the conceptual properties of sars-cov2 so as to facilitate research studies. however, the model applies to any type of virus, and will be the basis for the development of new instruments. in the past, we first presented the conceptual model for human genomics [6], then we developed the web-based search system genosurf [8]; our ongoing effort is to develop a search system for viral conceptual schemas, inspired by genosurf. while the need for data is pressing, there is also a need for conceptually well-organized information. in our broad vision, the availability of conceptual models for both human genomics and viruses will provide important opportunities for research, amplified to the maximum when human and viral sequences are interconnected by the same human being, playing the role of host of a given virus sequence as well as provider of genomic and phenotype information. in the future we will continue our modeling and integration efforts for virus genetics in the context of humans, by interacting with the community of scholars who study viruses. we may add more discovery-oriented entities to the model, which could be of use in a future scenario, e.g., a new pandemic offspring. a user researching diagnosis could ask, for example, what sequence patterns are unique to the whole or sub-part of the database (i.e., do not appear in viruses within the database).
a user working on vaccine development, instead, could be interested in which epitopes (i.e., antigen parts to which antibodies attach) cover the whole database or a partition of it, for mhc types prevalent in different infected humans. possibly, other dimensions will be necessary, such as drug resistance information and drug resistance-associated mutations.
the european nucleotide archive in 2019; gus the genomics unified schema a platform for genomics databases; the cellosaurus, a cell-line knowledge resource; exploiting conceptual modeling for searching genomic metadata: a quantitative and qualitative empirical study; from a conceptual model to a knowledge graph for genomic datasets; conceptual modeling for genomics: building an integrated repository of open data; designing data marts for data warehouses; genosurf: metadata driven semantic search system for integrated genomic datasets; detection of 2019 novel coronavirus (2019-ncov) by real-time rt-pcr; gims: an integrated data storage and analysis environment for genomic and functional data; a summary of genomic databases: overview and discussion; flexible integration of molecular-biological annotation data: the genmapper approach; data, disease and diplomacy: gisaid's innovative contribution to global health; the ncbi taxonomy database; the european genotype archive: background and implementation; spread of sars-cov-2 in the icelandic population; integrating and warehousing liver gene expression data and related biomedical resources in gedaw; nextstrain: real-time tracking of pathogen evolution; virus variation resource-improved response to emergent viral outbreaks; viralzone: a knowledge resource to understand virus diversity; the global population of sars-cov-2 is composed of six major subtypes; clinical and virological data of the first cases of covid-19 in europe: a case series; influenza a virus informatics: genotype-centered database and genotype annotation; genomic characterisation and epidemiology of 2019 novel coronavirus: implications for virus origins and receptor binding; imagene: an integrated computer environment for sequence annotation and analysis; a general method applicable to the search for similarities in the amino acid sequence of two proteins; formal design and implementation of an improved ddbj dna database with a new schema and object-oriented library; reference sequence (refseq) database at ncbi: current status, taxonomic expansion, and functional annotation; a method to identify relevant genome data: conceptual modeling for the medicine of precision; conceptual modelling of genomic information; vipr: an open bioinformatics database and analysis resource for virology research; uniprot: a worldwide hub of protein knowledge; applying conceptual modeling to better understand the human genome; the e-utilities in-depth: parameters, syntax and more. entrez programming utilities help; unraveling the web of viroinformatics: computational tools and databases in virus research; gisaid: global initiative on sharing all influenza data-from vision to reality; glue: a flexible software system for virus sequence data; virusite-integrated database for viral genomics; named entity linking of geospatial and host metadata in genbank for advancing biomedical research; on the origin and continuing evolution of sars-cov-2.
acknowledgements. this research is funded by the erc advanced grant 693174 geco (data-driven genomic computing), 2016-2021. the authors thank prof. limsoon wong for his precious suggestions and inspiration for future works.
key: cord-296691-cg463fbn authors: wang, ren; xu, sheng; jiang, yumei; jiang, jingwei; li, xiaodan; liang, lijian; he, jia; peng, feng; xia, bing title: de novo sequence assembly and characterization of lycoris aurea transcriptome using gs flx titanium platform of 454 pyrosequencing date: 2013-04-09 journal: plos one doi: 10.1371/journal.pone.0060449 sha: doc_id: 296691 cord_uid: cg463fbn background: lycoris aurea, also called golden magic lily, is an ornamentally and medicinally important species of the amaryllidaceae family. to date, no whole-genome sequence is available for this non-model organism. transcriptomic information is also scarce for this species. in this study, we performed de novo transcriptome sequencing to produce the first comprehensive expressed sequence tag (est) dataset for l. aurea using high-throughput sequencing technology. methodology and principal findings: total rna was isolated from leaves with sodium nitroprusside (snp), salicylic acid (sa), or methyl jasmonate (meja) treatment, stems, and flowers at the bud, blooming, and wilting stages. equal quantities of rna from each tissue and stage were pooled to construct a cdna library. using 454 pyrosequencing technology, a total of 937,990 high quality reads (308.63 mb) with an average read length of 329 bp were generated. clustering and assembly of these reads produced a non-redundant set of 141,111 unique sequences, comprising 24,604 contigs and 116,507 singletons. all of the unique sequences were classified into the biological process, cellular component and molecular function categories by go analysis. potential genes and their functions were predicted by kegg pathway mapping and cog analysis. based on our sequence analysis and the published literature, many putative genes involved in amaryllidaceae alkaloids synthesis, including pal, tydc, omt, nmt, p450, and other potentially important candidate genes, were identified for the first time in this lycoris.
furthermore, 6,386 ssrs and 18,107 high-confidence snps were identified in this est dataset. conclusions: the transcriptome provides invaluable new data as a functional genomics resource for future biological research in l. aurea. the molecular markers identified in this study will provide a material basis for future genetic linkage and quantitative trait loci analyses, and will provide useful information for functional genomic research in the future. the genus lycoris is an important group of amaryllidaceae composed of approximately 20 species of flowering plants native to the moist warm temperate woodlands of eastern and southern asia, of which 15 (10 endemic) are distributed in china. most of the lycoris species are commonly cultivated in china, korea, japan and vietnam as bulbous plants [1, 2]. in comparison with other well-known bulb flowers, such as narcissi and lilies, lycoris has its own characteristics and merits. it comes into flower at a time when few other bulbous plants are active. the flowers are characterized by their pastel and plentiful colors as well as by beautiful and varied shapes [1]. the lycoris species are therefore all very popular and widely accepted as ornamental plants [3], and most of them have been successfully cultivated. in the past several decades, some of the lycoris species, cultivars, and hybrids such as lycoris radiata and lycoris aurea have been used worldwide. meanwhile, the demand for lycoris as a commercial horticultural product has been increasing steadily, so the breeding of varieties with new flower forms and/or colors has become desirable for lycoris. moreover, all lycoris species are of medicinal value. the bulbs of lycoris have been used in traditional chinese medicine (tcm) for a long time and some amaryllidaceae-type alkaloids isolated from these plants have been reported to exhibit immunostimulatory, anti-tumor, anti-viral and anti-malarial activities [4-7].
for example, lycorine, a pyrrolophenanthridine alkaloid, has been demonstrated to suppress cell growth of the human leukemia cell line hl-60 [8] as well as the multiple myeloma cell line km3 [9] by arresting the cell cycle, subsequently inducing apoptosis of tumor cells. more recently, lycorine was reported to cause a rapid turnover of protein levels of myeloid cell leukemia-1 (mcl-1), which may play an important survival role in a variety of tumor cells including leukemia [10]. so lycorine might be a good candidate therapeutic agent against leukemia. additionally, it has also been reported that lycorine was an active component in the alkaloid portion and a good candidate for the development of new antiviral medicine in the treatment of severe acute respiratory syndrome (sars) [11]. galanthamine, another major amaryllidaceae alkaloid, has also been widely used in medicine as a strong reversible inhibitor of cholinesterase to increase acetylcholine sensitivity [12]. it is a specific remedy for myasthenia gravis and poliomyelitis sequela and has also been used in the therapy of glaucoma [13] and alzheimer's disease [13-15]. hence, galanthamine also has important medicinal value and broad application prospects [16]. at the same time, because of their several biological activities and their potential diversity in pharmacology, amaryllidaceae alkaloids have also attracted great interest from synthetic organic chemists [17-25]. it is well known that the generation of large-scale expressed sequence tags (ests) is a very useful approach to describe the gene expression profile and sequence of mrna from a specific organism and stage (especially in non-model species). ests represent a valuable sequence resource for research and breeding, as they provide comprehensive information regarding the transcriptome [26].
they have played significant roles in functional genomics research for discovery of novel genes together with identifying different protein groups (e.g. proteins with signal peptides) other than the whole genome [27] [28] [29] , developing ssrs and snps markers [30] [31] [32] [33] [34] , allowing large-scale expression analysis [35] , improving genome annotation [36] , and elucidating phylogenetic relationships [37] . next-generation sequencing (ngs) technologies such as the illumina solexa, roche 454, and abi solid platforms have greatly decreased the cost and time required for receiving genomic and transcriptomic data [38] . by generating sufficiently long sequence reads, roche 454 pyrosequencing using genome sequencing (gs) flx technology makes it possible to compensate for the lack of a reference genome during de novo sequence assembly with the concurrent improvements of de novo assembly software [39] . meanwhile, it is particularly useful as a shotgun method for generating est data and a powerful method for whole genome transcriptome analysis and gene discovery with pyrosequencing of uncloned cdnas [40] . so far, a large number of plants [26, 34, [40] [41] [42] including arabidopsis [43] , artemisia annua [44] , cucumber [45] , medicago [46] , maize [47] , barley [48] and jatropha curcas (barbados nut) [49] have been performed for transcriptome analyses by roche 454 pyrosequencing. also, many est libraries of a wide range of plant species have been constructed for genes involved in plant growth and differentiation [46, 50] , biochemical pathways [47, 48] , secondary metabolism [51, 52] as well as responses to environmental stresses and pathogen attack [53] . the goal of this study was to characterize the transcriptome of l. aurea in lycoris species using high-throughput roche 454 pyrosequencing. as one of the amaryllidaceae plants, l. aurea is an indigenous and popular ornamental herb in china [54] . 
it is well known not only for its high economic value in horticulture but also for the alkaloids it produces, among which galanthamine and lycorine are the major ingredients [55, 56]. in recent years, studies have reported that l. aurea is a good material for extraction of galanthamine and other alkaloids [55, 57]. however, little research has been performed to address the amaryllidaceae alkaloids biosynthesis-related genes (especially for galanthamine biosynthesis). additionally, to date, there are fewer than 9,000 ests available for lycoris. limited by the availability of genomic information, studies of lycoris have mainly focused on karyotype analysis [1, 3, 58, 59], morphology [60], medicine [11, 13-15], and molecular aspects [2, 61-64]. hence, determination of the genetic pathways and specific genes involved in amaryllidaceae alkaloids biosynthesis and some other aspects of lycoris could be beneficial for humans and enrich our knowledge and understanding of functional genomics and biological research. transcriptome sequencing might provide such a useful tool. after preparing a cdna library by pooling total rna from various organs and tissues, including leaves with sodium nitroprusside (snp), salicylic acid (sa), or methyl jasmonate (meja) treatment, stems and flowers at the bud, blooming, and wilting stages, we sequenced ests from this library. the transcriptome sequences were then annotated by blasting against public databases. subsequently, the annotated sequences were clustered into putative functional categories using the gene ontology (go) framework and grouped into pathways using the kyoto encyclopedia of genes and genomes (kegg). this transcriptome dataset represents the first exploration of the l. aurea transcriptome and provides an invaluable new resource for functional genomics and biological research in l. aurea.
the results described herein provide a material basis for future genetic linkage and quantitative trait loci (qtl) analysis and may serve to guide further gene expression and functional genomic research in the future. to obtain the l. aurea transcriptome, total rna was extracted from a variety of adult organs and tissues, including the leaves, stem, and flowers. it has been reported that in some plants of the amaryllidaceae family, improved production of galanthamine was observed in meja-treated tissues and 1-aminocyclopropane-1-carboxylic acid (acc)-treated somatic embryos, respectively [65, 66]. in our previous study, we also found that the content of galanthamine in lycoris chinensis and lycoris radiata seedlings was affected after treatment with sodium nitroprusside (snp), salicylic acid (sa), or meja [67, 68]. for the purpose of improving mrna abundance of genes related to amaryllidaceae alkaloids biosynthesis, the leaves were treated with those abiotic elicitors for rna extraction. quality of the rna as determined by agarose gel electrophoresis and od260/od280 ratio (2.0 ± 0.10) was found to be suitable for cdna synthesis. after that, equal quantities of rna from different samples were mixed together and normalized cdna was synthesised. it has been reported that normalization of the cdna greatly reduces the frequency of abundant transcripts, and increases the rate of recovery of unique transcripts [69]. after quality control, the normalized cdna was used to construct a cdna library. then the library was sequenced by a roche 454 gs flx. a one-plate 454 pyrosequencing reaction of the normalized cdna was done using the gs flx titanium platform. the reads produced by the roche 454 gs flx were used for clustering and de novo assembly.
after eliminating primer and adapter sequences and filtering out the low-quality reads, a total of 937,990 high-quality transcriptomic raw sequence reads with a total size of 308,633,593 bp were obtained. the size distribution of these reads is shown in figure 1a. read lengths ranged between 150 and 854 bases, with an average length of 329 bp per read (figure 1a). clustering and assembly of these raw reads was done using the gs de novo assembler [70, 71]. this assembler can assemble the data under the genomic or cdna option. after clustering and assembly, a non-redundant set of 141,111 expressed sequence tags (ests), comprising 24,604 contigs and 116,507 singletons, was obtained (table 1). most of these contigs (95.04%) were distributed in the 200 to 1,400 bp region (figure 1b), and most of the singletons fell between 161 and 500 bp in length (figure 1c). so far, the number of ests available from lycoris is less than 9,000. recently, by sequencing clones from three non-normalized cdna libraries, 32,521 est sequences were obtained, and most of them were used for predicting floral transcription factors in lycoris longituba [64]. therefore, this transcriptome dataset provides a useful resource for future analyses of genes related to amaryllidaceae alkaloid synthesis. to the best of our knowledge, this is the first comprehensive study of the transcriptome of l. aurea. the gc content (the proportion of guanine and cytosine bases) of all unique sequences of l. aurea was determined. the gc content was 40.42% in contigs and 39.58% in singletons, giving an overall gc content of 40.03%, which indicates a low gc content in the cdna of l. aurea. the l. aurea contigs were further assembled into 16,828 isogroups (all the splice variants of individual transcripts). more contigs than isogroups were found because some contigs (called isotigs, 24,463) are attributed to the same isogroups due to alternative splicing.
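the gc content figures above can be illustrated with a minimal sketch. it is not the authors' pipeline: the function names `gc_content` and `pooled_gc_content` are hypothetical, and the sketch assumes that gc content is computed over unambiguous bases only and that an overall value is pooled over all bases (length-weighted) rather than averaged per sequence.

```python
def gc_content(seq: str) -> float:
    """Percentage of G and C among the unambiguous bases (A, C, G, T) of one sequence."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    total = gc + seq.count("A") + seq.count("T")
    # Sequences without any unambiguous base get 0.0 instead of a division error.
    return 100.0 * gc / total if total else 0.0


def pooled_gc_content(seqs) -> float:
    """Length-weighted GC percentage over a collection of sequences."""
    # Concatenating pools the bases, so long sequences weigh more than short ones.
    return gc_content("".join(seqs))


if __name__ == "__main__":
    toy_contigs = ["GGCC", "ATATATAT"]  # made-up sequences, not data from the study
    print([round(gc_content(s), 2) for s in toy_contigs])
    print(round(pooled_gc_content(toy_contigs), 2))
```

in general, a pooled value differs from the unweighted mean of per-sequence values, which is why the two helper functions are kept separate.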
a large number of alternative splicing events could improve the utilization rate of the encoding genes. alternative splicing is an important mechanism for regulating gene expression in eukaryotic cells, and it contributes to protein diversity. a similarity search for all the unique sequences was done against the genbank non-redundant protein sequence database (nr) using blastx. a total of 66,197 (46.91%) l. aurea unigenes, including 18,397 contigs and 47,800 singletons, were significantly matched to known genes in the public databases (with an e-value of 10^-6) (table s1), representing putative functional identifications for almost half of the assembled sequences. previous studies have shown that approximately 87% of arabidopsis 454-derived ests could be aligned to predicted genes [43], while 72% could be similarly identified in cucumber [45] and 54.9% in bamboo [50]. as such, our results succeeded in assigning putative identifications to a significant proportion of the discovered l. aurea transcripts, given the lack of genomic information for this species. amongst the unique sequences derived from contigs and singletons, coding sequences with homology to 'nadh dehydrogenase', 'cytochrome c oxidase', 'atp synthase', 'splicing factor', 'cytochrome p450', 'ubiquitin-protein ligase', and 'zinc finger protein' were the most abundant. although our research mainly focused on finding putative genes related to amaryllidaceae alkaloid synthesis, other putative functional transcripts identified here could provide a foundation for future investigations of stress response, reproduction and defense reactions. the transcriptomic findings could also be a valuable source for deciphering the putative functions of novel genes, but further studies are needed to understand their molecular functions. go provides a structured and controlled vocabulary for describing gene products in three categories: molecular function, biological process and cellular component [72].
we added go terms using blast2go [73], which is based on the automated annotation of each unigene using blast results against the genbank non-redundant protein database (nr) from ncbi. according to the database, a total of 36,188 unigenes could be assigned to one or more ontologies based on their similarity to sequences with previously known functions, including 43,970 sequences assigned to the molecular function category, 72,628 to the biological process category and 79,853 to the cellular component category. the assigned sequences were divided into 58 functional terms (table s2). because several of the sequences were assigned to more than one go term, the total number of go terms obtained in our dataset was larger than the total number of unique sequences. in total, 196,451 go terms were retrieved, with 22.38%, 40.65% and 36.97% in the molecular function, cellular component and biological process categories, respectively. we used the go annotations to assign each unigene to a set of go slims in the three categories; go slims are lists of go terms providing a broad overview of the ontology content. a summary with the number and percentage of unigenes annotated in each go slim term is shown in figure 2. go annotations for the unigenes showed fairly consistent sampling of functional classes. in the molecular function category, 'binding', 'catalytic activity', 'transporter activity' and 'structural molecule activity' comprised the largest proportion, accounting for 93.35% of the total. in the cellular component category, many unique sequences were likely to be associated with 'cell' (29.88%), 'cell part' (29.88%) and 'organelle' (21.38%). moreover, 'metabolic process' (27.75%) and 'cellular process' (27.29%) were among the most highly represented groups in the biological process category. this may indicate that the analyzed tissues were undergoing rapid growth and extensive metabolic activity.
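the category shares of go terms can be recomputed from the three counts reported above. the short sketch below does only this arithmetic; the dictionary name `go_counts` is an illustrative choice, and the counts are the ones stated in the text.

```python
# GO term counts per category, as reported in the text.
go_counts = {
    "molecular function": 43970,
    "biological process": 72628,
    "cellular component": 79853,
}

# Total number of GO term assignments across the three categories.
total_terms = sum(go_counts.values())

# Share of each category, in percent, rounded to two decimals.
go_shares = {name: round(100 * n / total_terms, 2) for name, n in go_counts.items()}

if __name__ == "__main__":
    print(total_terms)
    for name, share in go_shares.items():
        print(f"{name}: {share}%")
```

the sum of the three counts reproduces the reported total of 196,451 go terms, and the shares come out as 22.38% (molecular function), 36.97% (biological process) and 40.65% (cellular component).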
genes involved in other important biological processes such as biological regulation (6.59%), regulation of biological process (6.27%) and response to stimulus (5.83%) were also identified (figure 2). in summary, these terms account for a large fraction of the overall assignments in the l. aurea transcriptomic dataset. understandably, genes encoding these functions may be more conserved across different species and are thus easier to annotate in the database. cog assignments were used to predict and classify possible functions of the unique sequences. based on sequence homology, 2,142 unique sequences had a cog functional classification. these sequences were classified into 23 cog categories (figure 3). 'translation, ribosomal structure and biogenesis' represented the most common category (426, 19.89%), followed by 'posttranslational modification, protein turnover, chaperones' (362, 16.90%) and 'general function prediction only' (254, 11.86%). 'cell motility' (1, 0.05%), 'defense mechanisms' (2, 0.09%) and 'cell wall/membrane/envelope biogenesis' (7, 0.93%) were the smallest cog categories. besides go analysis, kegg [74] pathway mapping based on enzyme commission (ec) numbers was also carried out for the assembled sequences; this is an alternative approach to categorizing gene functions, with an emphasis on biochemical pathways. ortholog assignment and mapping of the contigs and singletons to the biological pathways were performed using the kegg automatic annotation server (kaas). according to the kegg results, 21,274 l. aurea unigenes comprising 7,097 contigs and 14,177 singletons were mapped onto a total of 295 predicted metabolic pathways, representing compound biosynthesis, degradation, utilization and metabolism (table s3). ec numbers were also assigned to 3,222 contigs and singletons, which were mapped to their respective pathways.
transcripts identified as related to the following global map or cellular processes were the most abundant: metabolic pathways (6,048 unigenes), biosynthesis of secondary metabolites (2,606), ribosome (1,444), microbial metabolism in diverse environments (1,305) and protein processing in endoplasmic reticulum (793). the largest category was metabolism (13,923), which included carbohydrate metabolism (3,541), energy metabolism (2,289), amino acid metabolism (2,044), lipid metabolism (1,647), nucleotide metabolism (875), metabolism of cofactors and vitamins (659), biosynthesis of other secondary metabolites (625) and other subcategories (figure 4). in the secondary metabolism category, the most represented subcategories were phenylpropanoid biosynthesis (226), terpenoid backbone biosynthesis (161), tropane, piperidine and pyridine alkaloid biosynthesis (112), metabolism of xenobiotics by cytochrome p450 (102), carotenoid biosynthesis (99), limonene and pinene degradation (96), flavonoid biosynthesis (84), stilbenoid, diarylheptanoid and gingerol biosynthesis (76), and chloroalkane and chloroalkene degradation (69). in addition to the metabolism pathways, genetic information processing genes (6,850) were also highly represented. this category included transcription, translation, folding, sorting and degradation, and replication and repair. kegg pathway analysis and cog analysis are helpful for predicting potential genes and their functions at the whole-transcriptome level. the predicted metabolic pathways, together with the cog analysis, are useful for further investigations of gene function in future studies. the transcriptome of l. aurea was primarily examined to identify a wide range of candidate genes that might be functionally associated with amaryllidaceae alkaloid biosynthesis.
since the isolation of the first alkaloid, lycorine, from narcissus pseudonarcissus in 1877, substantial progress has been made in examining the amaryllidaceae plants, although they still remain a relatively untapped phytochemical source. at present, over 100 alkaloids have been isolated from different amaryllidaceae plants [75]. although their structures vary considerably, these alkaloids are considered to be biogenetically related. the large number of structurally diverse amaryllidaceae alkaloids are mainly classified into 9 skeleton types, for which the representative alkaloids are: norbelladine, lycorine, homolycorine, crinine, haemanthamine, narciclasine, tazettine, montanine and galanthamine. most of the biosynthetic research on amaryllidaceae alkaloids was carried out in the 1960s and 1970s. since then, studies have reported on the biosynthesis of amaryllidaceae alkaloids belonging to different ring-type subgroups [75-78]. a noteworthy example is the study of the biosynthesis of galanthamine and related alkaloids [76]. for example, l-phenylalanine (l-phe) and l-tyrosine (l-tyr) are considered to be the precursors of amaryllidaceae alkaloid biosynthesis. although l-phe and l-tyr are closely related in chemical structure, they are not interchangeable in plants. the presence of the enzyme phenylalanine ammonia-lyase (pal) has been demonstrated in amaryllidaceae plants [68, 79], and the elimination of ammonia mediated by this enzyme is known to occur in an antiperiplanar manner to give trans-cinnamic acid, with loss of the β-pro-s hydrogen [80]. besides, it has been proposed that amaryllidaceae alkaloids could be regarded as derivatives of the key intermediate 4'-o-methylnorbelladine [77]. there are three different groups of amaryllidaceae alkaloids, which are biosynthesized by three modes of intramolecular oxidative phenol coupling (para-ortho', ortho-para' and para-para') [75, 76, 78].
moreover, plant cytochromes p450 (p450s), as one of the biggest gene superfamilies in plant genomes, might also be involved in amaryllidaceae alkaloid biosynthesis. it is well known that p450s catalyze a wide variety of monooxygenation/hydroxylation reactions in primary and secondary metabolism. they participate in a variety of biochemical pathways to produce primary and secondary metabolites such as phenylpropanoids, alkaloids, terpenoids, lipids, cyanogenic glycosides, and glucosinolates, as well as plant hormones [81-84]. for example, in some plants, several p450s in the cyp80 and cyp719 families, known to catalyze reactions atypical for p450s (such as c-o and c-c phenol-coupling reactions), function in the biosynthesis of benzylisoquinoline alkaloids (bias) [85-88]. although little is known about the relationship between p450s and amaryllidaceae alkaloid biosynthesis, it could be postulated that p450s catalyze the stereospecific reactions in some steps of the amaryllidaceae alkaloid biosynthesis pathways. additionally, o-methyltransferase (omt), an important enzyme, may also participate in galanthamine biosynthesis [76]. according to our sequence analysis and the published literature, many genes that might be involved in amaryllidaceae alkaloid synthesis were identified, including phenylalanine ammonia-lyase (pal), tyrosine decarboxylase (tydc), omt, p450s, n-methyltransferase (nmt), and other potential candidates (table 2). for example, 26 unique sequences were identified as pal1, pal2, and pal3, with similarities ranging from 62% to 100%. 91 unique sequences were identified as omt, with similarities ranging from 51% to 100%. additionally, only 6 unique sequences were identified as tydc, with similarities ranging from 55% to 88%. to the best of our knowledge, these putative tydc genes are reported here for the first time in l. aurea.
ssrs, or microsatellites, are neutral molecular markers that are widely distributed in a genome. they consist of repeated core sequences of 2 to 6 base pairs in length. among the various molecular markers, ssrs have been proven to be an efficient tool for performing qtl analysis, constructing genetic linkage maps and evaluating the level of genetic variation in a species because of the high diversity, abundance, neutrality and co-dominance of microsatellite dna [31-33]. in total, 9,740 ssrs were obtained from the transcriptomic dataset. of these, the most frequent repeat motifs were trinucleotides, which accounted for 68.37% of all ssrs, followed by dinucleotides (19.83%), tetranucleotides (6.98%), pentanucleotides (2.77%), and hexanucleotides (2.05%) (figure 5). based on the distribution of ssr motifs, (ga/ag)n, (ct/tc)n and (ca/ac)n were the three predominant types among the dinucleotide repeat motifs, with frequencies of 31.12%, 27.76% and 15.12%, respectively. among the 20 types of trinucleotide repeats, ctt (19.39%) was the most common motif, followed by aag (13.47%), gat (8.50%) and atc (7.94%). to date, only a few microsatellites for l. aurea are available from ncbi. thus, the development of ssrs for this species is highly desirable. snps were identified from alignments of the multiple sequences used for contig assembly. after excluding those with a base mutation frequency lower than 1%, we obtained a total of 55,800 snps, of which 5,160 were putative indels (in), 32,440 were putative transitions (ts) and 18,220 were putative transversions (tv), giving a mean in:ts:tv ratio of 1:6.29:3.53 across the transcriptome of l. aurea (figure 6). the ag/ga, ct/tc and at/ta snp types were the most common. in contrast, the gc/cg types were the least frequent snp types because of the differences in base structure and in the number of hydrogen bonds between different bases.
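the grouping of variants into indels, transitions and transversions used above can be sketched as follows. this is an illustration of the standard definitions (a transition exchanges two purines or two pyrimidines; a transversion exchanges a purine and a pyrimidine), not the varscan-based procedure of the study; `classify_variant` and `ratio_to_first` are hypothetical helper names, and encoding a gap as "-" is an assumption.

```python
PURINES = frozenset("AG")
PYRIMIDINES = frozenset("CT")


def classify_variant(ref: str, alt: str) -> str:
    """Classify a variant as 'indel', 'transition' or 'transversion'."""
    ref, alt = ref.upper(), alt.upper()
    # A gap symbol or alleles of different length indicate an insertion/deletion.
    if len(ref) != len(alt) or "-" in ref + alt:
        return "indel"
    if len(ref) != 1 or ref not in "ACGT" or alt not in "ACGT" or ref == alt:
        raise ValueError(f"not a single-base substitution: {ref!r} -> {alt!r}")
    pair = {ref, alt}
    # Purine<->purine or pyrimidine<->pyrimidine is a transition; the rest are transversions.
    same_class = pair <= PURINES or pair <= PYRIMIDINES
    return "transition" if same_class else "transversion"


def ratio_to_first(counts):
    """Express counts relative to the first entry, rounded to two decimals."""
    first = counts[0]
    return [round(c / first, 2) for c in counts]


if __name__ == "__main__":
    print(classify_variant("A", "G"), classify_variant("A", "T"), classify_variant("A", "-"))
    # Indel, transition and transversion counts as reported in the text.
    print(ratio_to_first([5160, 32440, 18220]))
```

applied to the reported counts of 5,160 indels, 32,440 transitions and 18,220 transversions, `ratio_to_first` reproduces the in:ts:tv ratio of 1:6.29:3.53.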
multiple sequence alignment also identified a total of 5,160 indels across the transcriptome. these indels should be treated with caution because of technical problems associated with roche 454 gs flx pyrosequencing [42]. in this study, de novo transcriptome sequencing of l. aurea using the 454 gs flx was performed for the first time. a total of 937,990 high-quality transcriptomic reads were obtained, with an average length of 329 bp per read. a significant number of putative metabolic pathways and functions associated with the unique sequences were identified. moreover, a large number of snps and ssrs were predicted and can be used for subsequent marker development, genetic linkage and qtl analysis. many candidate genes that are potentially involved in amaryllidaceae alkaloid synthesis were identified for the first time and are worthy of further investigation. our study provides the largest number of ests to date and lays the initial groundwork for in-depth, functional transcriptomic profiling of l. aurea. the l. aurea plants used in this study were collected from the institute of botany, jiangsu province & chinese academy of sciences, nanjing, china. to obtain the l. aurea transcriptome, samples were collected from a variety of adult organs and tissues, including the stem, flowers, and leaves. the stems and flowers collected for rna extraction were at their bud, blooming, and wilting stages. for the leaf collection, seedlings grown in an illuminated incubator (25 ± 1 °c; 14/10 h photoperiod) were treated with 500 mm sodium nitroprusside (snp), 250 mm salicylic acid (sa), or 100 mm methyl jasmonate (meja) for 1, 6, 12, 24, and 48 h. the samples were harvested at each of the indicated time points. all of the samples were immediately frozen in liquid nitrogen and stored at -80 °c until use. total rna was extracted from these materials using trizol reagent (invitrogen, usa) according to the manufacturer's instructions.
the quality of total rna was determined using a nanodrop spectrophotometer (thermo, usa), and rna samples with a 260/280 ratio from 1.9 to 2.1 were selected for the next analysis. after that, equal quantities of total rna from each sample (~0.35 mg total rna) were mixed together and delivered to shanghai majorbio bio-pharm biotechnology co., ltd. (shanghai, china) for the construction of the cdna library. the cdna library was constructed using the creator smart cdna library construction kit (clontech laboratories inc., mountain view, ca, usa), following the manufacturer's protocol step by step. after agarose gel electrophoresis and extraction of dna from the gels, dna bands (500 to 800 bp) were purified, blunt-ended, ligated with adapters and finally immobilized on beads. the quality control of a double dna library was performed using a high sensitivity chip (agilent technologies). the concentration and the proper ligation of the adapters were examined using a tbs 380 fluorometer. after the examination, one-plate, whole-run sequencing was performed with roche 454 gs flx titanium chemistry (roche diagnostics, indianapolis, in, usa) by shanghai majorbio bio-pharm biotechnology co., ltd., following the manufacturer's protocol. the initial assembly comprised 937,990 reads. for each sequence, low-quality bases and the sequencing adapter were trimmed using lucy (http://lucy.sourceforge.net/) and seq-clean (http://compbio.dfci.harvard.edu). the remaining sequencing reads were assembled using the newbler software package (a de novo sequence assembly software) with the ''extend low depth overlaps'' parameter. all of the ests from the roche 454 were used to run the final assembly of l. aurea. blastx searches [89] of the genbank nr database hosted by ncbi (http://www.ncbi.nlm.nih.gov/) were performed on all unique sequences to identify the putative mrna functions.
additionally, go terms (http://www.geneontology.org) were extracted from the best hits obtained from the blastx search against the nr database using blast2go. these results were then sorted by go categories using in-house perl scripts. blastx was also used to align unique sequences to the swiss-prot database (http://web.expasy.org/docs/swiss-prot_guideline.html), the kyoto encyclopedia of genes and genomes (kegg) and clusters of orthologous groups (cog) (http://www.ncbi.nlm.nih.gov/cog/) (with an e-value of 10^-6) to predict possible functional classifications and molecular pathways [90, 91]. the unique sequences were screened for microsatellites using the software mreps (http://bioinfo.lifl.fr/mreps/) with default parameters. perfect di-, tri-, tetra-, penta-, and hexa-nucleotide motifs were detected, and all ssr types required a minimum of 6 repeats. potential snps were extracted using varscan (http://varscan.sourceforge.net) with default parameters, and only when both alleles were detected in the 454 reads. since no reference sequences were available, snps were identified as superimposed nucleotide peaks where two or more reads contained polymorphisms at the variant allele. the roche 454 reads of l. aurea were submitted to the ncbi sequence read archive under the accession number srp018374.
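the ssr screening criteria stated above (perfect di- to hexa-nucleotide motifs with at least 6 tandem repeats) can be sketched with a regular expression. this is a simplified stand-in, not the mreps algorithm used in the study; the function name `find_ssrs`, its parameters and the returned tuple layout are assumptions made for illustration.

```python
import re


def find_ssrs(seq: str, min_repeats: int = 6, min_motif: int = 2, max_motif: int = 6):
    """Return (start, motif, repeat_count) for perfect tandem repeats in seq.

    The lazy quantifier tries the shortest motif first, so a run such as
    ATATAT... is reported with motif AT rather than ATAT.
    """
    seq = seq.upper()
    pattern = re.compile(
        r"([ACGT]{%d,%d}?)\1{%d,}" % (min_motif, max_motif, min_repeats - 1)
    )
    hits = []
    for match in pattern.finditer(seq):
        motif = match.group(1)
        # Skip homopolymer runs (e.g. AAAAAA...), which are not di- to hexa-nucleotide SSRs.
        if len(set(motif)) == 1:
            continue
        hits.append((match.start(), motif, len(match.group(0)) // len(motif)))
    return hits


if __name__ == "__main__":
    toy = "TTGAGAGAGAGAGACC" + "CTT" * 7  # made-up sequence, not data from the study
    for start, motif, count in find_ssrs(toy):
        print(start, motif, count)
```

the sketch reports only non-overlapping perfect repeats and ignores compound or imperfect microsatellites, which dedicated tools handle more carefully.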
synopsis of the genus lycoris (amaryllidaceae) phylogenetic relationships and possible hybrid origin of lycoris species (amaryllidaceae) revealed by its sequences karyotypes of six populations of lycoris radiata and discovery of the tetraploid amaryllidaceae and sceletium alkaloids amaryllidaceae and sceletium alkaloids amaryllidaceae and sceletium alkaloids ethanol extract of lycoris radiata induces cell death in b16f10 melanoma via p38-mediated ap-1 activation effects of lycorine on hl-60 cells via arresting cell cycle and inducing apoptosis apoptosis induced by lycorine in km3 cells is associated with the g0/g1 cell cycle arrest lycorine induces apoptosis and down-regulation of mcl-1 in human leukemia cells identification of natural compounds with antiviral activities against sars-associated coronavirus physicochemical methods for the analysis of galanthamine (review) the pharmacology of galanthamine and its analogues pharmacological evaluation of novel alzheimer's disease therapeutics: acetylcholinesterase inhibitors related to galanthamine plants used in chinese and indian traditional medicine for improvement of memory and cognitive function galantamine for alzheimer's disease phenol oxidation and biosynthesis. part v. the synthesis of galanthamine the total synthesis of (6)-lycoramine. part i the total synthesis of (6)-lycoramine. part ii total synthesis of dl-lycoramine general methods for alkaloid synthesis. total synthesis of racemic lycoramine total synthesis of racemic lycoramine a short stereospecific synthesis of (dl)-lycoramine. 
control of relative stereochemistry by dipole effects oxidative intramolecular phenolic coupling reaction induced by a hypervalent iodine (iii) reagent: leading to galanthamine-type amaryllidaceae alkaloids an efficient total synthesis of (6)-lycoramine transcriptome characterization and high throughput ssrs and snps discovery in cucurbita pepo (cucurbitaceae) generation and analysis of expressed sequence tags from the tender shoots cdna library of tea plant (camellia sinensis) analysis of expressed sequence tags in apomictic guineagrass (panicum maximum) the molecular ecologist's guide to expressed sequence tags the first set of est resource for gene discovery and marker development in pigeonpea (cajanus cajan l.) exploiting est databases for the development and characterization of gene-derived ssr-markers in barley (hordeum vulgare l.) development of est-ssrs in cucumis sativus from sequence database analysis of expressed sequence tags from grapevine flower and fruit and development of simple sequence repeat markers de novo assembly of chickpea transcriptome using short reads for gene discovery and marker identification melogen: an est database for melon functional genomics a populus est resource for plant functional genomics generation and analysis of expressed sequence tags from six developing xylem libraries in pinus radiata d. don sequencing technologies-the next generation evaluating characteristics of de novo assembly software on 454 transcriptome data: a simulation approach transcriptome analysis of sarracenia, an insectivorous plant transcriptomic signatures of ash (fraxinus spp.) 
phloem transcriptome sequencing in an ecologically important tree species: assembly, annotation, and marker discovery sampling the arabidopsis transcriptome with massively parallel pyrosequencing global characterization of artemisia annua glandular trichome transcriptome using 454 pyrosequencing transcriptome sequencing and comparative analysis of cucumber flowers with different sex types sequencing medicago truncatula expressed sequenced tags using 454 life sciences technology deep sampling of the palomero maize transcriptome by a high throughput strategy of pyrosequencing sequencing put to the test using the complex genome of barley de novo assembly and transcriptome analysis of five major tissues of jatropha curcas l. using gs flx titanium platform of 454 pyrosequencing de novo sequencing and characterization of the floral transcriptome of dendrocalamus latiflorus (poaceae: bambusoideae) rapid transcriptome characterization for a nonmodel organism using 454 pyrosequencing deep sequencing of the camellia sinensis transcriptome revealed candidate genes for major metabolic pathways of tea-specific compounds analysis of the pythium ultimum transcriptome using sanger and pyrosequencing approaches growth and photosynthetic responses of three lycoris species to levels of irradiance alkaloids from the bulbs of lycoris aurea extracts of lycoris aurea induce apoptosis in murine sarcoma s180 cells photosynthetic characteristics of lycoris aurea and monthly dynamics of alkaloid contents in its bulbs a new chromosome number and karyotype in l. radiata variation and evolution in the karyotype of lycoris, amaryllidaceae. iv. intraspecific variation in the karyotype of l. radiata (l'herit.) herb. 
and the origin of this triploid species natural variation in petal color in lycoris longituba revealed by anthocyanin components polymorphic microsatellite loci for the genetic analysis of lycoris radiata (amaryllidaceae) and cross-amplification in other congeneric species genome differentiation in lycoris species (amaryllidaceae) identified by genomic in situ hybridization genetic variations in the chloroplast genome and phylogenetic clustering of lycoris species analysis of floral transcription factors from lycoris longituba improved production of galanthamine and related alkaloids by methyl jasmonate in narcissus confusus shoot-clumps effects of ethylene on somatic embryogenesis and galanthamine content in leucojum aestivum l. cultures effect of abiotic and biotic elicitors on growth and alkaloid accumulation of lycoris chinensis seedlings molecular cloning and characterization of a phenylalanine ammonia-lyase gene (lrpal) from lycoris radiata gene discovery from jatropha curcas by sequencing of ests from normalized and full-length enriched cdna library from developing seeds genome sequencing in microfabricated high-density picolitre reactors comparing de novo assemblers for 454 transcriptome data gene ontology: tool for the unification of biology. the gene ontology consortium blast2go: a comprehensive suite for functional analysis in plant genomics kegg: kyoto encyclopedia of genes and genomes chemistry, biology, and medicinal potential of narciclasine and its congeners biosynthesis of the amaryllidaceae alkaloid galanthamine phenol oxidation and biosynthesis. part vi. the biogenesis of amaryllidaceae alkaloids the biosynthesis of plant alkaloids and nitrogenous microbial metabolites biogenesis of the amaryllidaceae alkaloids. ii. studies with whole plants, floral primordia and cell free extracts studies of enzyme-mediated reactions. part i. 
syntheses of deuterium-or tritium-labelled (3s)-and (3r)-phenylalanines: stereochemical course of the elimination catalysed by l-phenylalanine ammonia-lyase molecular-genetic analysis of plant cytochrome p450-dependent monooxygenases comparison of cytochrome p450 genes from six plant genomes functional genomics of p450s diversification of p450 genes during land plant evolution molecular cloning and characterization of cyp719, a methylenedioxy bridgeforming enzyme that belongs to a novel p450 family, from cultured coptis japonica cells molecular cloning and characterization of cyp80g2, a cytochrome p450 that catalyzes an intramolecular c-c phenol coupling of (s)-reticuline in magnoflorine biosynthesis, from cultured coptis japonica cells molecular cloning and heterologous expression of a cdna encoding berbamunine synthase, a c-o phenol-coupling cytochrome p450 from the higher plant berberis stolonifera cyp719b1 is salutaridine synthase, the c-c phenol-coupling enzyme of morphine biosynthesis in opium poppy gapped blast and psi-blast: a new generation of protein database search programs from genomics to chemical genomics: new developments in kegg kegg for linking genomes to life and the environment we thank dr. zhengzhi zhang (south dakota state university, south dakota, united states of america) for his kindly help in writing this manuscript. we thank yan cheng (shanghai majorbio bio-pharm biotechnology co., ltd.) for her kindly help in sequencing and bioinformatics analysis. key: cord-193356-hqbstgg7 authors: widrich, michael; schafl, bernhard; ramsauer, hubert; pavlovi'c, milena; gruber, lukas; holzleitner, markus; brandstetter, johannes; sandve, geir kjetil; greiff, victor; hochreiter, sepp; klambauer, gunter title: modern hopfield networks and attention for immune repertoire classification date: 2020-07-16 journal: nan doi: nan sha: doc_id: 193356 cord_uid: hqbstgg7 a central mechanism in machine learning is to identify, store, and recognize patterns. 
how to learn, access, and retrieve such patterns is crucial in hopfield networks and the more recent transformer architectures. we show that the attention mechanism of transformer architectures is actually the update rule of modern hopfield networks that can store exponentially many patterns. we exploit this high storage capacity of modern hopfield networks to solve a challenging multiple instance learning (mil) problem in computational biology: immune repertoire classification. accurate and interpretable machine learning methods solving this problem could pave the way towards new vaccines and therapies, which is currently a very relevant research topic intensified by the covid-19 crisis. immune repertoire classification based on the vast number of immunosequences of an individual is a mil problem with an unprecedentedly massive number of instances, two orders of magnitude larger than currently considered problems, and with an extremely low witness rate. in this work, we present our novel method deeprc that integrates transformer-like attention, or equivalently modern hopfield networks, into deep learning architectures for massive mil such as immune repertoire classification. we demonstrate that deeprc outperforms all other methods with respect to predictive performance on large-scale experiments, including simulated and real-world virus infection data, and enables the extraction of sequence motifs that are connected to a given disease class. source code and datasets: https://github.com/ml-jku/deeprc transformer architectures (vaswani et al., 2017) and their attention mechanisms are currently used in many applications, such as natural language processing (nlp), imaging, and also in multiple instance learning (mil) problems . in mil, a set or bag of objects is labelled rather than objects themselves as in standard supervised learning tasks (dietterich et al., 1997) . 
examples for mil problems are medical images, in which each sub-region of the image represents an instance, video classification, in which each frame is an instance, text classification, where words or sentences are instances of a text, point sets, where each point is an instance of a 3d object, and remote sensing data, where each sensor is an instance (carbonneau et al., 2018; uriot, 2019). attention-based mil has been successfully used for image data, for example to identify tiny objects in large images (ilse et al., 2018; pawlowski et al., 2019; tomita et al., 2019; kimeswenger et al., 2019), and transformer-like attention mechanisms have been used for sets of points and images. however, in mil problems considered by machine learning methods up to now, the number of instances per bag is in the range of hundreds or a few thousand (carbonneau et al., 2018; lee et al., 2019) (see also tab. a2). at the same time, the witness rate (wr), the rate of discriminating instances per bag, is already considered low at 1% to 5%.
figure caption (beginning not recovered): a pooling function f is used to obtain a repertoire-representation z for the input object. finally, an output network o predicts the class label ŷ. b) deeprc uses stacked 1d convolutions for a parameterized function h due to their computational efficiency. potentially, millions of sequences have to be processed for each input object. in principle, recurrent neural networks (rnns), such as lstms (hochreiter et al., 2007), or transformer networks (vaswani et al., 2017) may also be used but are currently computationally too costly. c) attention-pooling is used to obtain a repertoire-representation z for each input object, where deeprc uses weighted averages of sequence-representations. the weights are determined by an update rule of modern hopfield networks that allows the retrieval of exponentially many patterns.
we will tackle the problem of immune repertoire classification with hundreds of thousands of instances per bag without instance-level labels and with extremely low witness rates down to 0.01% using an attention mechanism. we show that the attention mechanism of transformers is the update rule of modern hopfield networks (krotov & hopfield, 2016; demircigil et al., 2017) that are generalized to continuous states in contrast to classical hopfield networks (hopfield, 1982). a detailed derivation and analysis of modern hopfield networks is given in our companion paper (ramsauer et al., 2020). these novel continuous state hopfield networks can store and retrieve exponentially (in the dimension of the space) many patterns (see next section). thus, modern hopfield networks with their update rule, which are used as an attention mechanism in the transformer, enable immune repertoire classification in computational biology.

immune repertoire classification, i.e. classifying the immune status based on the immune repertoire sequences, is essentially a textbook example of a multiple instance learning problem (dietterich et al., 1997; maron & lozano-pérez, 1998; wang et al., 2018). briefly, the immune repertoire of an individual consists of an immensely large bag of immune receptors, represented as amino acid sequences. usually, the presence of only a small fraction of particular receptors determines the immune status with respect to a particular disease (christophersen et al., 2014; emerson et al., 2017). this is because the immune system has already acquired a resistance if one or a few particular immune receptors that can bind to the disease agent are present. therefore, classification of immune repertoires is highly difficult, since each immune repertoire can contain millions of sequences as instances with only a few indicating the class.
further properties of the data that complicate the problem are: (a) the overlap of immune repertoires of different individuals is low (in most cases, maximally low single-digit percentage values) (greiff et al., 2017; elhanati et al., 2018), (b) multiple different sequences can bind to the same pathogen (wucherpfennig et al., 2007), and (c) only subsequences within the sequences determine whether binding to a pathogen is possible (dash et al., 2017; glanville et al., 2017; akbar et al., 2019; springer et al., 2020; fischer et al., 2019). in summary, immune repertoire classification can be formulated as multiple instance learning with an extremely low witness rate and large numbers of instances, which represents a challenge for currently available machine learning methods. furthermore, the methods should ideally be interpretable, since the extraction of class-associated sequence motifs is desired to gain crucial biological insights.

the acquisition of human immune repertoires has been enabled by immunosequencing technology (georgiou et al., 2014; brown et al., 2019), which makes it possible to obtain the immune receptor sequences and immune repertoires of individuals. each individual is uniquely characterized by their immune repertoire, which is acquired and changed during life. this repertoire may be influenced by all diseases that an individual is exposed to during their life and hence contains highly valuable information about those diseases and the individual's immune status. immune receptors enable the immune system to specifically recognize disease agents or pathogens. each immune encounter is recorded as an immune event into immune memory by preserving and amplifying immune receptors in the repertoire used to fight a given disease. this is, for example, the working principle of vaccination. each human has about $10^7$ to $10^8$ unique immune receptors with low overlap across individuals, sampled from a potential diversity of $> 10^{14}$ receptors (mora & walczak, 2019).
the ability to sequence and analyze human immune receptors at large scale has led to fundamental and mechanistic insights into the adaptive immune system and has also opened the opportunity for the development of novel diagnostics and therapy approaches (georgiou et al., 2014; brown et al., 2019) . immunosequencing data have been analyzed with computational methods for a variety of different tasks (greiff et al., 2015; shugay et al., 2015; miho et al., 2018; yaari & kleinstein, 2015; wardemann & busse, 2017) . a large part of the available machine learning methods for immune receptor data has been focusing on the individual immune receptors in a repertoire, with the aim to, for example, predict the antigen or antigen portion (epitope) to which these sequences bind or to predict sharing of receptors across individuals (gielis et al., 2019; springer et al., 2020; jurtz et al., 2018; moris et al., 2019; fischer et al., 2019; greiff et al., 2017; sidhom et al., 2019; elhanati et al., 2018) . recently, jurtz et al. (2018) used 1d convolutional neural networks (cnns) to predict antigen binding of t-cell receptor (tcr) sequences (specifically, binding of tcr sequences to peptide-mhc complexes) and demonstrated that motifs can be extracted from these models. similarly, konishi et al. (2019) use cnns, gradient boosting, and other machine learning techniques on b-cell receptor (bcr) sequences to distinguish tumor tissue from normal tissue. however, the methods presented so far predict a particular class, the epitope, based on a single input sequence. immune repertoire classification has been considered as a mil problem in the following publications. a deep learning framework called deeptcr (sidhom et al., 2019) implements several deep learning approaches for immunosequencing data. the computational framework, inter alia, allows for attention-based mil repertoire classifiers and implements a basic form of attention-based averaging. ostmeyer et al. 
(2019) already suggested a mil method for immune repertoire classification. this method considers 4-mers, fixed sub-sequences of length 4, as instances of an input object and trains a logistic regression model with these 4-mers as input. the predictions of the logistic regression model for each 4-mer are max-pooled to obtain one prediction per input object. this approach is characterized by (a) the rigidity of the k-mer features as compared to convolutional kernels (alipanahi et al., 2015; zhou & troyanskaya, 2015; zeng et al., 2016), (b) the max-pooling operation, which constrains the network to learn from a single, top-ranked k-mer for each iteration over the input object, and (c) the pooling of prediction scores rather than representations (wang et al., 2018). our experiments also support that these choices in the design of the method can lead to constraints on the predictive performance (see table 1).

our proposed method, deeprc, also uses a mil approach but considers sequences rather than k-mers as instances within an input object and a transformer-like attention mechanism. deeprc sets out to avoid the above-mentioned constraints of current methods by (a) applying transformer-like attention-pooling instead of max-pooling and learning a classifier on the repertoire rather than on the sequence-representation, (b) pooling learned representations rather than predictions, and (c) using less rigid feature extractors, such as 1d convolutions or lstms.

in this work, we contribute the following: we demonstrate that continuous generalizations of binary modern hopfield networks (krotov & hopfield, 2016; demircigil et al., 2017) have an update rule that is known as the attention mechanism in the transformer. we show that these modern hopfield networks have exponential storage capacity, which allows them to extract patterns among a large set of instances (next section).
based on this result, we propose deeprc, a novel deep mil method based on modern hopfield networks for large bags of complex sequences, as they occur in immune repertoire classification (section "deep repertoire classification"). we evaluate the predictive performance of deeprc and other machine learning approaches for the classification of immune repertoires in a large comparative study (section "experimental results").

exponential storage capacity of continuous state modern hopfield networks with transformer attention as update rule

in this section, we show that modern hopfield networks have exponential storage capacity, which will later allow us to approach massive multiple-instance learning problems, such as immune repertoire classification. see our companion paper (ramsauer et al., 2020) for a detailed derivation and analysis of modern hopfield networks.

we assume patterns $x_1, \ldots, x_N \in \mathbb{R}^d$ that are stacked as columns to the matrix $X = (x_1, \ldots, x_N)$ and a query pattern $\xi$ that also represents the current state. the largest norm of a pattern is $M = \max_i \|x_i\|$. the separation $\Delta_i$ of a pattern $x_i$ is defined as its minimal dot product difference to any of the other patterns:

$$\Delta_i = \min_{j,\, j \neq i} \left( x_i^T x_i - x_i^T x_j \right).$$

we consider a modern hopfield network with current state $\xi$ and the energy function

$$E = -\beta^{-1} \log \left( \sum_{i=1}^{N} \exp(\beta x_i^T \xi) \right) + \beta^{-1} \log N + \frac{1}{2} \xi^T \xi + \frac{1}{2} M^2.$$

for energy $E$ and state $\xi$, the update rule

$$\xi^{\mathrm{new}} = X \, \mathrm{softmax}(\beta X^T \xi) \qquad (1)$$

is proven to converge globally to stationary points of the energy $E$, which are local minima or saddle points (see ramsauer et al. (2020), appendix, theorem a2). surprisingly, the update rule eq. (1) is also the formula of the well-known transformer attention mechanism. to see this more clearly, we simultaneously update several queries $\xi_i$. furthermore, the queries $\xi_i$ and the patterns $x_i$ are linear mappings of vectors $y_i$ into the space $\mathbb{R}^d$. for matrix notation, we set $x_i = W_K^T y_i$, $\xi_i = W_Q^T y_i$ and multiply the result of our update rule with $W_V$. using $Y = (y_1, \ldots, y_N)^T$, we define the matrices

$$X^T = K = Y W_K, \qquad Q = Y W_Q, \qquad V = Y W_K W_V = X^T W_V,$$

and the patterns are now mapped to the hopfield space with dimension $d = d_k$. we set $\beta = 1/\sqrt{d_k}$ and change softmax to a row vector. the update rule eq. (1) multiplied by $W_V$ performed for all queries simultaneously becomes in row vector notation:

$$\mathrm{att}(Q, K, V; \beta) = \mathrm{softmax}(\beta Q K^T) \, V = \mathrm{softmax}\left( \frac{1}{\sqrt{d_k}} Q K^T \right) V. \qquad (2)$$

this formula is the transformer attention.

if the patterns $x_i$ are well separated, the iterate eq. (1) converges to a fixed point close to a pattern to which the initial $\xi$ is similar. if the patterns are not well separated, the iterate eq. (1) converges to a fixed point close to the arithmetic mean of the patterns. if some patterns are similar to each other but well separated from all other vectors, then a metastable state between the similar patterns exists. iterates that start near a metastable state converge to this metastable state. for details see ramsauer et al. (2020), appendix, sect. a2. typically, the update converges after one update step (see ramsauer et al. (2020), appendix, theorem a8) and has an exponentially small retrieval error (see ramsauer et al. (2020), appendix, theorem a9).

our main concern for application to immune repertoire classification is the number of patterns that can be stored and retrieved by the modern hopfield network, or equivalently by the transformer attention head. the storage capacity of an attention mechanism is critical for massive mil problems. we first define what we mean by storing and retrieving patterns from the modern hopfield network.

definition 1 (pattern stored and retrieved). we assume that around every pattern $x_i$ a sphere $S_i$ is given. we say $x_i$ is stored if there is a single fixed point $x_i^* \in S_i$ to which all points $\xi \in S_i$ converge, and $S_i \cap S_j = \emptyset$ for $i \neq j$. we say $x_i$ is retrieved if the iteration eq. (1) converges to the single fixed point $x_i^* \in S_i$.

for randomly chosen patterns, the number of patterns that can be stored is exponential in the dimension $d$ of the space of the patterns ($x_i \in \mathbb{R}^d$).

theorem 1. we assume a failure probability $0 < p \leq 1$ and randomly chosen patterns on the sphere with radius $M = K \sqrt{d - 1}$.
we define $a := \frac{2}{d-1} \left( 1 + \ln(2 \beta K^2 p (d-1)) \right)$, $b := \frac{2 K^2 \beta}{5}$, and $c = \frac{b}{W_0(\exp(a + \ln(b)))}$, where $W_0$ is the upper branch of the lambert $W$ function, and ensure

$$c \geq \left( \frac{2}{\sqrt{p}} \right)^{\frac{4}{d-1}}.$$

then with probability $1 - p$, the number of random patterns that can be stored is

$$N \geq \sqrt{p} \; c^{\frac{d-1}{4}}.$$

examples are $c \geq 3.1546$ for $\beta = 1$, $K = 3$, $d = 20$ and $p = 0.001$ ($a + \ln(b) > 1.27$) and $c \geq 1.3718$ for $\beta = 1$, $K = 1$, $d = 75$, and $p = 0.001$ ($a + \ln(b) < -0.94$). see ramsauer et al. (2020), appendix, theorem a5 for a proof.

we have established that a modern hopfield network or a transformer attention mechanism can store and retrieve exponentially many patterns. this allows us to approach mil with massive numbers of instances from which we have to retrieve a few with an attention mechanism.

deep repertoire classification

problem setting and notation. we consider a mil problem, in which an input object $X$ is a bag of $N$ instances $X = \{s_1, \ldots, s_N\}$. the instances have neither dependencies nor orderings between them, and $N$ can be different for every object. we assume that each instance $s_i$ is associated with a label $y_i \in \{0, 1\}$, assuming a binary classification task, to which we do not have access. we only have access to a label $Y = \max_i y_i$ for an input object or bag. note that this poses a credit assignment problem, since the sequences that are responsible for the label $Y$ have to be identified, and that the relation between instance-label and bag-label can be more complex (foulds & frank, 2010).

a model $\hat{y} = g(X)$ should be (a) invariant to permutations of the instances and (b) able to cope with the fact that $N$ varies across input objects (ilse et al., 2018), which is a problem also posed by point sets (qi et al., 2017). two principled approaches exist. the first approach is to learn an instance-level scoring function $h: \mathcal{S} \to [0, 1]$, which is then pooled across instances with a pooling function $f$, for example by average-pooling or max-pooling (see below).
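as a numerical aside to theorem 1 above, the following sketch evaluates the constants $a$, $b$, and $c$ for the two parameter settings quoted in the theorem, together with the theorem's condition $c \geq (2/\sqrt{p})^{4/(d-1)}$ and capacity bound $N \geq \sqrt{p}\, c^{(d-1)/4}$. it is an illustration only and not part of the original method: the helper names `lambert_w0` and `theorem1_constants` are hypothetical, and the lambert $W$ function is approximated with a plain newton iteration instead of a library routine.

```python
import math


def lambert_w0(x, tol=1e-12, max_iter=100):
    """Upper (principal) branch W_0 of the Lambert W function for x >= 0.

    Solves w * exp(w) = x with Newton's method (hypothetical helper).
    """
    if x < 0:
        raise ValueError("this sketch only handles x >= 0")
    w = math.log1p(x)  # starting value at or above the root for x >= 0
    for _ in range(max_iter):
        ew = math.exp(w)
        step = (w * ew - x) / (ew * (w + 1.0))
        w -= step
        if abs(step) < tol:
            break
    return w


def theorem1_constants(beta, K, d, p):
    """Return (a, b, c, c_min, n_lower) as defined in theorem 1."""
    a = 2.0 / (d - 1) * (1.0 + math.log(2.0 * beta * K**2 * p * (d - 1)))
    b = 2.0 * K**2 * beta / 5.0
    c = b / lambert_w0(math.exp(a + math.log(b)))
    c_min = (2.0 / math.sqrt(p)) ** (4.0 / (d - 1))  # required lower bound on c
    n_lower = math.sqrt(p) * c ** ((d - 1) / 4.0)    # bound on stored patterns
    return a, b, c, c_min, n_lower


# the two settings quoted in the theorem
for beta, K, d in [(1.0, 3.0, 20), (1.0, 1.0, 75)]:
    a, b, c, c_min, n_lower = theorem1_constants(beta, K, d, p=0.001)
    print(f"d={d}: a+ln(b)={a + math.log(b):.4f}, c={c:.4f}, "
          f"c_min={c_min:.4f}, N>={n_lower:.2f}")
```

under these assumptions the computed values of $c$ should agree, up to rounding, with the quoted 3.1546 and 1.3718. because the bound grows like $c^{(d-1)/4}$, it is the exponent in $d$, rather than the small example dimensions used here, that matters for large key dimensions.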
the second approach is to construct an instance representation $z_i$ of each instance by $h: \mathcal{S} \to \mathbb{R}^{d_v}$ and then encode the bag, or the input object, by pooling these instance representations (wang et al., 2018) via a function $f$. an output function $o: \mathbb{R}^{d_v} \to [0, 1]$ subsequently classifies the bag. the second approach, the pooling of representations rather than scoring functions, is currently best performing (wang et al., 2018).

in the problem at hand, the input object $X$ is the immune repertoire of an individual that consists of a large set of immune receptor sequences (t-cell receptors or antibodies). immune receptors are primarily represented as sequences $s_i$ from a space $s_i \in \mathcal{S}$. these sequences act as the instances in the mil problem. although immune repertoire classification can readily be formulated as a mil problem, it is as yet unclear how well machine learning methods solve the above-described problem with a large number of instances $N \gg 10{,}000$ and with instances $s_i$ being complex sequences. next we describe currently used pooling functions for mil problems.

pooling functions for mil problems. different pooling functions equip a model $g$ with the property to be invariant to permutations of instances and with the ability to process different numbers of instances. typically, a neural network $h_\theta$ with parameters $\theta$ is trained to obtain a function that maps each instance onto a representation, $z_i = h_\theta(s_i)$, and then a pooling function $z = f(\{z_1, \ldots, z_N\})$ supplies a representation $z$ of the input object $X = \{s_1, \ldots, s_N\}$. the following pooling functions are typically used:

average-pooling: $z = \frac{1}{N} \sum_{i=1}^{N} z_i$,

max-pooling: $z = \sum_{m=1}^{d_v} e_m \left( \max_{i,\, 1 \leq i \leq N} \{ z_{im} \} \right)$, where $e_m$ is the standard basis vector for dimension $m$, and

attention-pooling: $z = \sum_{i=1}^{N} a_i z_i$, where $a_i$ are non-negative ($a_i \geq 0$), sum to one ($\sum_{i=1}^{N} a_i = 1$), and are determined by an attention mechanism.

these pooling functions are invariant to permutations of $\{1, \ldots, N\}$ and are differentiable.
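the three pooling functions can be sketched in a few lines of plain python. this is a toy illustration under stated assumptions, not the deeprc implementation: instance representations are given as lists of floats, the function names are hypothetical, and the attention weights $a_i$ are obtained here by a softmax over arbitrary per-instance scores, which is one possible way to satisfy $a_i \geq 0$ and $\sum_i a_i = 1$.

```python
import math


def average_pool(Z):
    """z = (1/N) * sum_i z_i, computed per feature dimension."""
    n = len(Z)
    return [sum(column) / n for column in zip(*Z)]


def max_pool(Z):
    """z_m = max_i z_im, computed per feature dimension m."""
    return [max(column) for column in zip(*Z)]


def attention_pool(Z, scores):
    """z = sum_i a_i z_i with a = softmax(scores) (assumed weighting)."""
    top = max(scores)                      # subtract max for stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    a = [e / total for e in exps]          # a_i >= 0 and sum(a) == 1
    d_v = len(Z[0])
    z = [sum(a_i * z_i[m] for a_i, z_i in zip(a, Z)) for m in range(d_v)]
    return z, a


# a bag with N = 3 instances and d_v = 2 features
Z = [[1.0, 4.0], [3.0, 2.0], [5.0, 0.0]]
print(average_pool(Z))
print(max_pool(Z))
print(attention_pool(Z, scores=[0.3, -1.0, 2.0]))
```

all three functions return a vector of length $d_v$ regardless of the number of instances, and reordering the instances (together with their scores) leaves the result unchanged up to floating-point rounding. with equal scores, attention-pooling reduces to average-pooling.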
therefore, they are suited as building blocks for deep learning architectures. we employ attention-pooling in our deeprc model as detailed in the following.

modern hopfield networks viewed as transformer-like attention mechanisms. the modern hopfield networks, as introduced above, have a storage capacity that is exponential in the dimension of the vector space and converge after just one update (see ramsauer et al. (2020), appendix). additionally, the update rule of modern hopfield networks is known as the key-value attention mechanism, which has been highly successful through the transformer (vaswani et al., 2017) and bert (devlin et al., 2019) models in natural language processing. therefore, using modern hopfield networks with the key-value attention mechanism as update rule is the natural choice for our task. in particular, modern hopfield networks are theoretically justified for storing and retrieving the large number of vectors (sequence patterns) that appear in the immune repertoire classification task.

instead of using the terminology of modern hopfield networks, we explain our deeprc architecture in terms of key-value attention (the update rule of the modern hopfield network), since it is well known in the deep learning community. the attention mechanism assumes a space of dimension $d_k$ in which keys and queries are compared. a set of $N$ key vectors are combined to the matrix $K$. a set of $d_q$ query vectors are combined to the matrix $Q$. similarities between queries and keys are computed by inner products, therefore queries can search for similar keys that are stored. another set of $N$ value vectors are combined to the matrix $V$. the output of the attention mechanism is a weighted average of the value vectors for each query $q$. the $i$-th vector $v_i$ is weighted by the similarity between the $i$-th key $k_i$ and the query $q$. the similarity is given by the softmax of the inner products of the query $q$ with the keys $k_i$.
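the key-value attention just described can be sketched for a single query as follows. the code is a minimal illustration in plain python with hypothetical names and toy data; it is not taken from the deeprc code base. it also shows the role of the scaling parameter $\beta$: for well separated keys and large $\beta$ the output is close to the value of the best-matching key, while for $\beta = 0$ all values are averaged uniformly.

```python
import math


def softmax(xs):
    """Numerically stable softmax of a list of floats."""
    top = max(xs)
    exps = [math.exp(x - top) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(query, keys, values, beta):
    """Single-query key-value attention: softmax(beta * q K^T) V.

    query:  list of d_k floats
    keys:   N lists of d_k floats (rows of K)
    values: N lists of d_v floats (rows of V)
    """
    sims = [beta * sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(sims)                       # one weight per key
    d_v = len(values[0])
    out = [sum(w * v[m] for w, v in zip(weights, values)) for m in range(d_v)]
    return out, weights


# toy example: N = 3 keys in d_k = 2 dimensions, d_v = 3
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
values = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
query = [0.1, 0.9]                                # most similar to the second key

print(attention(query, keys, values, beta=50.0))  # nearly one-hot weights
print(attention(query, keys, values, beta=0.0))   # uniform weights
```

stacking several queries as rows of a matrix $Q$ and applying the same computation row by row gives the matrix form $\mathrm{softmax}(\beta Q K^T) V$.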
all queries are calculated in parallel via matrix operations. consequently, the attention function $\mathrm{att}(Q, K, V; \beta)$ maps queries $Q$, keys $K$, and values $V$ to $d_v$-dimensional outputs: $\mathrm{att}(Q, K, V; \beta) = \mathrm{softmax}(\beta Q K^T) V$ (see also eq. (2)). while this attention mechanism was originally developed for sequence tasks (vaswani et al., 2017), it can be readily transferred to sets (ye et al., 2018). this type of attention mechanism will be employed in deeprc.

the deeprc method. we propose a novel method, deep repertoire classification (deeprc), for immune repertoire classification with attention-based deep massive multiple instance learning and compare it against other machine learning approaches. for deeprc, we consider immune repertoires as input objects, which are represented as bags of instances. in a bag, each instance is an immune receptor sequence and each bag can contain a large number of sequences. note that we will use $z_i$ to denote the sequence-representation of the $i$-th sequence and $z$ to denote the repertoire-representation. at the core, deeprc consists of a transformer-like attention mechanism that extracts the most important information from each repertoire. we first give an overview of the attention mechanism and then provide details on each of the sub-networks $h_1$, $h_2$, and $o$ of deeprc.

attention mechanism in deeprc. this mechanism is based on the three matrices $K$ (the keys), $Q$ (the queries), and $V$ (the values) together with a parameter $\beta$.

values. deeprc uses a 1d convolutional network $h_1$ (lecun et al., 1998; hu et al., 2014; kelley et al., 2016) that supplies a sequence-representation $z_i = h_1(s_i)$, which acts as the values $V = Z = (z_1, \ldots, z_N)$ in the attention mechanism (see figure 2).

keys. a second neural network $h_2$, which shares its first layers with $h_1$, is used to obtain keys $K \in \mathbb{R}^{N \times d_k}$ for each sequence in the repertoire.
this network uses 2 self-normalizing layers (klambauer et al., 2017) with 32 units per layer (see figure 2).

query. we use a fixed $d_k$-dimensional query vector $\xi$ which is learned via backpropagation. for more attention heads, each head has a fixed query vector.

with the quantities introduced above, the transformer attention mechanism (eq. (2)) of deeprc is implemented as follows:

$$z = \mathrm{att}\left( \xi^T, K, Z; \frac{1}{\sqrt{d_k}} \right) = \mathrm{softmax}\left( \frac{\xi^T K^T}{\sqrt{d_k}} \right) Z,$$

where $Z \in \mathbb{R}^{N \times d_v}$ are the sequence-representations stacked row-wise, $K$ are the keys, and $z$ is the repertoire-representation and at the same time a weighted mean of sequence-representations $z_i$. the attention mechanism can readily be extended to multiple queries; however, computational demand could constrain this depending on the application and dataset. theorem 1 demonstrates that this mechanism is able to retrieve a single pattern out of several hundreds of thousands.

attention-pooling and interpretability. each input object, i.e. repertoire, consists of a large number $N$ of sequences, which are reduced to a single fixed-size feature vector of length $d_v$ representing the whole input object by an attention-pooling function. to this end, a transformer-like attention mechanism adapted to sets is realized in deeprc, which supplies $a_i$, the importance of the sequence $s_i$. this importance value is an interpretable quantity, which is highly desired for the immunological problem at hand. thus, deeprc allows for two forms of interpretability methods. (a) a trained deeprc model can compute attention weights $a_i$, which directly indicate the importance of a sequence. (b) deeprc furthermore allows for the usage of contribution analysis methods, such as integrated gradients (ig) (sundararajan et al., 2017) or layer-wise relevance propagation (montavon et al., 2018; arras et al., 2019). see sect. a8 for details.

classification layer and network parameters.
the repertoire-representation $z$ is then used as input for a fully-connected output network $\hat{y} = o(z)$ that predicts the immune status, where we found it sufficient to train single-layer networks. in the simplest case, deeprc predicts a single target, the class label $y$, e.g. the immune status of an immune repertoire, using one output value. however, since deeprc is an end-to-end deep learning model, multiple targets may be predicted simultaneously in classification or regression settings or a mix of both. this allows for the introduction of additional information into the system via auxiliary targets such as age, sex, or other metadata.

[figure 2: deeprc architecture as used in table 1, with sub-networks $h_1$, $h_2$, and $o$. $d_l$ indicates the sequence length.]

network parameters, training, and inference. deeprc is trained using standard gradient descent methods to minimize a cross-entropy loss. the network parameters are $\theta_1$, $\theta_2$, $\theta_o$ for the sub-networks $h_1$, $h_2$, and $o$, respectively, and additionally $\xi$. in more detail, we train deeprc using adam (kingma & ba, 2014) with a batch size of 4 and dropout of input sequences.

implementation. to reduce computational time, the attention network first computes the attention weights $a_i$ for each sequence $s_i$ in a repertoire. subsequently, the top 10% of sequences with the highest $a_i$ per repertoire are used to compute the weight updates and prediction. furthermore, computation of $z_i$ is performed in 16-bit precision and all other computations in 32-bit precision to ensure numerical stability in the softmax. see sect. a2 for details.

in this section, we report and analyze the predictive power of deeprc and the compared methods on several immunosequencing datasets. the roc-auc is used as the main metric for the predictive power.

methods compared. we compared previous methods for immune repertoire classification (ostmeyer et al., 2019) ("log. mil (kmer)", "log. mil (tcrb)") and a burden test (emerson et al., 2017), as well as the baseline methods logistic regression ("log.
regr."), k-nearest neighbour ("knn"), and support vector machines ("svm") with kernels designed for sets, such as the jaccard kernel ("j") and the minmax ("mm") kernel (ralaivola et al., 2005). for the simulated data, we also added baseline methods that search for the implanted motif either in binary or continuous fashion ("known motif b.", "known motif c."), assuming that this motif was known (for details, see sect. a4).

datasets. we aimed at constructing immune repertoire classification scenarios with varying degrees of difficulty and realism in order to compare and analyze the suggested machine learning methods. to this end, we either use simulated or experimentally-observed immune receptor sequences and we implant signals, specifically, sequence motifs or sets thereof (weber et al., 2020), at different frequencies into sequences of repertoires of the positive class. these frequencies represent the witness rates and range from 0.01% to 10%. overall, we compiled four categories of datasets: (a) simulated immunosequencing data with implanted signals, (b) lstm-generated immunosequencing data with implanted signals, (c) real-world immunosequencing data with implanted signals, and (d) real-world immunosequencing data with known immune status, the cmv dataset (emerson et al., 2017). the average number of instances per bag, which is the number of sequences per immune repertoire, is ≈300,000 except for category (c), in which we consider the scenario of low-coverage data with only 10,000 sequences per repertoire. the number of repertoires per dataset ranges from 785 to 5,000. in total, all datasets comprise ≈30 billion sequences or instances. this represents the largest comparative study on immune repertoire classification (see sect. a3).

hyperparameter selection. we used a nested 5-fold cross validation (cv) procedure to estimate the performance of each of the methods.
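one way to generate the index splits of such a nested 5-fold procedure is sketched below. the layout is an assumption made for illustration: each outer fold serves once as test set, and each of the remaining four folds serves once as validation set in the inner loop. the source only states that hyperparameters are adjusted on a validation set in the inner loop (see sect. a5), so the exact inner-loop design may differ. the function names are hypothetical, and 785 is used because it is the smallest number of repertoires per dataset mentioned above.

```python
import random


def make_folds(n_items, n_folds=5, seed=0):
    """Shuffle item indices and deal them into n_folds disjoint folds."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    return [indices[k::n_folds] for k in range(n_folds)]


def nested_cv_splits(n_items, n_folds=5, seed=0):
    """Yield (outer, inner, train, val, test) index splits.

    Assumed layout: the outer fold is held out for testing; each remaining
    fold is used once for validation, the others for training.
    """
    folds = make_folds(n_items, n_folds, seed)
    for outer in range(n_folds):
        test = folds[outer]
        rest = [fold for k, fold in enumerate(folds) if k != outer]
        for inner, val in enumerate(rest):
            train = [i for k, fold in enumerate(rest) if k != inner
                     for i in fold]
            yield outer, inner, train, val, test


# example with 785 repertoires
for outer, inner, train, val, test in nested_cv_splits(785):
    if inner == 0:
        print(outer, len(train), len(val), len(test))
```

hyperparameters would be chosen from the validation results of the inner loop, and only the selected setting is then evaluated on the held-out test fold, so that test repertoires never influence model selection.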
all methods could adjust their most important hyperparameters on a validation set in the inner loop of the procedure. see sect. a5 for details.

table 1: results in terms of auc of the competing methods on all datasets. the reported errors are standard deviations across 5 cross-validation (cv) folds (except for the column "simulated"). real-world cmv: average performance over 5 cv folds on the cmv dataset (emerson et al., 2017). real-world data with implanted signals: average performance over 5 cv folds for each of the four datasets. a signal was implanted with a frequency (= witness rate) of 1% or 0.1%. either a single motif ("om") or multiple motifs ("mm") were implanted. lstm-generated data: average performance over 5 cv folds for each of the 5 datasets. in each dataset, a signal was implanted with a frequency of 10%, 1%, 0.5%, 0.1%, or 0.05%, respectively. simulated: here we report the mean over 18 simulated datasets with implanted signals and varying difficulties (see tab. a9 for details). the error reported is the standard deviation of the auc values across the 18 datasets.

results. in each of the four categories, "real-world data", "real-world data with implanted signals", "lstm-generated data", and "simulated immunosequencing data", deeprc outperforms all competing methods with respect to average auc. across categories, the runner-up methods are either the svm for mil problems with minmax kernel or the burden test (see table 1 and sect. a6).

results on simulated immunosequencing data. in this setting the complexity of the implanted signal is in focus and varies throughout 18 simulated datasets (see sect. a3). some datasets are challenging for the methods because the implanted motif is hidden by noise, and others because only a small fraction of sequences carries the motif, so that they have a low witness rate. these difficulties become evident by the method called "known motif binary", which assumes the implanted motif is known.
the performance of this method ranges from a perfect auc of 1.000 in several datasets to an auc of 0.532 in dataset '17' (see sect. a6). deeprc outperforms all other methods with an average auc of 0.846 ± 0.223, followed by the svm with minmax kernel with an average auc of 0.827 ± 0.210 (see sect. a6). the predictive performance of all methods suffers if the signal occurs only in an extremely small fraction of sequences. in datasets in which only 0.01% of the sequences carry the motif, all auc values are below 0.550.

results on lstm-generated data. on the lstm-generated data, in which we implanted noisy motifs with frequencies of 10%, 1%, 0.5%, 0.1%, and 0.05%, deeprc yields almost perfect predictive performance with an average auc of 1.000 ± 0.001 (see sect. a6 and a7). the second best method, svm with minmax kernel, has a similar predictive performance to deeprc on all datasets, but the other competing methods have a lower predictive performance on datasets with low frequency of the signal (0.05%).

results on real-world data with implanted motifs. in this dataset category, we used real immunosequences and implanted single or multiple noisy motifs. again, deeprc outperforms all other methods with an average auc of 0.980 ± 0.029, with the second best method being the burden test with an average auc of 0.883 ± 0.170. notably, all methods except for deeprc have difficulties with noisy motifs at a frequency of 0.1% (see tab. a11).

results on real-world data. on the real-world dataset, in which the immune status of persons affected by the cytomegalovirus has to be predicted, the competing methods yield predictive aucs between 0.515 and 0.825 (see table 1). we note that this dataset is not the exact dataset that was used in emerson et al. (2017). it differs in pre-processing and also comprises a different set of samples and a smaller training set due to the nested 5-fold cross-validation procedure, which leads to a more challenging dataset.
the best performing method is deeprc with an auc of 0.831 ± 0.002, followed by the svm with minmax kernel (auc 0.825 ± 0.022) and the burden test with an auc of 0.699 ± 0.041. the top-ranked sequences by deeprc significantly correspond to those detected by emerson et al. (2017), which we tested by a mann-whitney u-test with the null hypothesis that the attention values of the sequences detected by emerson et al. (2017) would be equal to the attention values of the remaining sequences (p-value of $1.3 \cdot 10^{-93}$). the sequence attention values are displayed in tab. a14.

we have demonstrated how modern hopfield networks and attention mechanisms enable successful classification of the immune status of immune repertoires. for this task, methods have to identify the discriminating sequences amongst a large set of sequences in an immune repertoire. specifically, even motifs within those sequences have to be identified. we have shown that deeprc, a modern hopfield network and an attention mechanism with a fixed query, can solve this difficult task despite the massive number of instances. deeprc furthermore outperforms the compared methods across a range of different experimental conditions.

impact on machine learning and related scientific fields.
we envision that with (a) the increasing availability of large immunosequencing datasets (kovaltsuk et al., 2018; corrie et al., 2018; christley et al., 2018; zhang et al., 2020; rosenfeld et al., 2018; shugay et al., 2018), (b) further fine-tuning of ground-truth benchmarking immune receptor datasets (weber et al., 2020; olson et al., 2019; marcou et al., 2018), (c) accounting for repertoire-impacting factors such as age, sex, ethnicity, and environment (potential confounding factors), and (d) increased gpu memory and increased computing power, it will be possible to identify discriminating immune receptor motifs for many diseases, potentially even for the current sars-cov-2 (covid-19) pandemic (minervina et al., 2020; galson et al., 2020). such results would greatly benefit ongoing research on antibody and tcr-driven immunotherapies and immunodiagnostics as well as rational vaccine design (brown et al., 2019). in the course of this development, the experimental verification and interpretation of machine-learning-identified motifs could receive additional focus, as for most of the sequences within a repertoire the corresponding antigen is unknown. nevertheless, recent technological breakthroughs in high-throughput antigen-labeled immunosequencing are beginning to generate large-scale antigen-labeled single-immune-receptor-sequence data, thus resolving this longstanding problem (setliff et al., 2019). from a machine learning perspective, the successful application of deeprc on immune repertoires with their large number of instances per bag might encourage the application of modern hopfield networks and attention mechanisms on new, previously unsolved or unconsidered, datasets and problems.

impact on society. if the approach proves itself successful, it could lead to faster testing of individuals for their immune status w.r.t. a range of diseases based on blood samples. this might motivate changes in the pipeline of diagnostics and tracking of diseases, e.g.
automated testing of the immune status in regular intervals. it would furthermore make the collection and screening of blood samples for larger databases more attractive. in consequence, the improved testing of immune statuses might identify individuals that do not have a working immune response towards certain diseases to government or insurance companies, which could then push for targeted immunisation of the individual. similarly to compulsory vaccination, such testing for the immune status could be made compulsory by governments, possibly violating privacy or personal self-determination in exchange for increased overall health of a population. ultimately, if the approach proves itself successful, the insights gained from the screening of individuals that have successfully developed resistances against specific diseases could lead to faster targeted immunisation, once a certain number of individuals with resistances can be found. this might strongly decrease the harm done by e.g. pandemics and lead to a change in the societal perception of such diseases. consequences of failures of the method. as is common with methods in machine learning, potential danger lies in the possibility that users rely too much on our new approach and use it without reflecting on the outcomes. however, the full pipeline in which our method would be used includes wet lab tests after its application, to verify and investigate the results, gain insights, and possibly derive treatments. failures of the proposed method would lead to unsuccessful wet lab validation and negative wet lab tests. since the proposed algorithm does not directly suggest treatment or therapy, human beings are not directly at risk of being treated with a harmful therapy. substantial wet lab and in-vitro testing would follow and would indicate wrong decisions by the system. leveraging of biases in the data and potential discrimination.
as for almost all machine learning methods, confounding factors, such as age or sex, could be used for classification. this might lead to biases in predictions or uneven predictive performance across subgroups. as a result, failures in the wet lab would occur (see paragraph above). moreover, insights into the relevance of the confounding factors could be gained, leading to possible therapies or counter-measures concerning said factors. furthermore, the amount of data available with respect to relevant confounding factors could lead to better or worse performance of our method. for example, a dataset consisting mostly of data from individuals within a specific age group might yield better performance for that age group, possibly resulting in better or exclusive treatment methods for that specific group. here again, the application of deeprc would be followed by in-vitro testing and development of a treatment, where all target groups for the treatment have to be considered accordingly. all datasets and code are available at https://github.com/ml-jku/deeprc. the cmv dataset is publicly available at https://clients.adaptivebiotech.com/pub/emerson-2017-natgen. in section a2 we provide details on the architecture of deeprc, in section a3 we present details on the datasets, in section a4 we explain the methods that we compared, in section a5 we elaborate on the hyperparameters and their selection process. then, in section a6 we present detailed results for each dataset category in tabular form, in section a7 we provide information on the lstm model that was used to generate antibody sequences, in section a8 we show how deeprc can be interpreted, in section a9 we show the correspondence of previously identified tcr sequences for cmv immune status with attention values by deeprc, and finally we present variations and an ablation study of deeprc in section a10. input layer. for the input layer of the cnn, the characters in the input sequence, i.e.
the amino acids (aas), are encoded in a one-hot vector of length 20. to also provide information about the position of an aa in the sequence, we add 3 additional input features with values in range [0, 1] to encode the position of an aa relative to the sequence. these 3 positional features encode whether the aa is located at the beginning, the center, or the end of the sequence, respectively, as shown in figure a1 . we concatenate these 3 positional features with the one-hot vector of aas, which results in a feature vector of size 23 per sequence position. each repertoire, now represented as a bag of feature vectors, is then normalized to unit variance. since the cytomegalovirus dataset (cmv dataset) provides sequences with an associated abundance value per sequence, which is the number of occurrences of a sequence in a repertoire, we incorporate this information into the input of deeprc. to this end, the one-hot aa features of a sequence are multiplied by a scaling factor of log(c a ) before normalization, where c a is the abundance of a sequence. we feed the sequences with 23 features per position into the cnn. sequences of different lengths were zero-padded to the maximum sequence length per batch at the sequence ends. 1d cnn for motif recognition. in the following, we describe how deeprc identifies patterns in the individual sequences and reduces each sequence in the input object to a fixed-size feature vector. deeprc employs 1d convolution layers to extract patterns, where trainable weight kernels are convolved over the sequence positions. in principle, also recurrent neural networks (rnns) or transformer networks could be used instead of 1d cnns, however, (a) the computational complexity of the network must be low to be able to process millions of sequences for a single update. additionally, (b) the learned network should be able to provide insights in the recognized patterns in form of motifs. 
both properties (a) and (b) are fulfilled by 1d convolution operations that are used by deeprc. we use one 1d cnn layer (hu et al., 2014) with selu activation functions (klambauer et al., 2017) to identify the relevant patterns in the input sequences with a computationally light-weight operation. the larger the kernel size, the more surrounding sequence positions are taken into account, which influences the length of the motifs that can be extracted. we therefore adjust the kernel size during hyperparameter search. in prior works (ostmeyer et al., 2019) , a k-mer size of 4 yielded good predictive performance, which could indicate that a kernel size in the range of 4 may be a proficient choice. for d v trainable kernels, this produces a feature vector of length d v at each sequence position. subsequently, global max-pooling over all sequence positions of a sequence reduces the sequence-representations z i to vectors of the fixed length d v . given the challenging size of the input data per repertoire, the computation of the cnn activations and weight updates is performed using 16-bit floating point values. a list of hyperparameters evaluated for deeprc is given in table a3 . a comparison of rnn-based and cnn-based sequence embedding for motif recognition in a smaller experimental setting is given in sec. a10. regularization. we apply random and attention-based subsampling of repertoire sequences to reduce over-fitting and decrease computational effort. during training, each repertoire is subsampled to 10, 000 input sequences, which are randomly drawn from the respective repertoire. this can also be interpreted as random drop-out (hinton et al., 2012) on the input sequences or attention weights. during training and evaluation, the attention weights computed by the attention network are furthermore used to rank the input sequences. based on this ranking, the repertoire is reduced to the 10% of sequences with the highest attention weights. 
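the two subsampling steps described above can be sketched as follows. this is a minimal numpy illustration and not the authors' pytorch implementation: the function names, the feature matrix layout (one row per sequence) and the softmax pooling at the end are assumptions made for the example.

```python
import numpy as np

def subsample_repertoire(features, n_keep=10_000, rng=None):
    """Randomly draw at most n_keep sequences (rows) without replacement."""
    rng = np.random.default_rng() if rng is None else rng
    n = features.shape[0]
    if n <= n_keep:
        return features
    idx = rng.choice(n, size=n_keep, replace=False)
    return features[idx]

def top_attention_fraction(features, attention_values, fraction=0.1):
    """Keep the fraction of sequences with the highest attention values."""
    n_top = max(1, int(round(fraction * features.shape[0])))
    order = np.argsort(attention_values)[::-1][:n_top]
    return features[order], attention_values[order]

def attention_pool(features, attention_values):
    """Softmax over sequences followed by an attention-weighted average
    (assumed pooling step, shown only to complete the example)."""
    shifted = attention_values - attention_values.max()
    weights = np.exp(shifted) / np.exp(shifted).sum()
    return weights @ features
```

in this sketch a repertoire of 50,000 sequence vectors would first be reduced to 10,000 random sequences and then to the 1,000 sequences with the highest attention values before pooling.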
these top 10% of sequences are then used to compute the weight updates and the prediction for the repertoire. additionally, one might employ further regularization techniques, which we only partly investigated further in a smaller experimental setting in sec. a10 due to high computational demands. such regularization techniques include l1 and l2 weight decay, noise in the form of random aa permutations in the input sequences, noise on the attention weights, or random shuffling of sequences between repertoires that belong to the negative class. the last regularization technique assumes that the sequences in positive-class repertoires carry a signal, such as an aa motif corresponding to an immune response, whereas the sequences in negative-class repertoires do not. hence, the sequences can be shuffled randomly between negative class repertoires without obscuring the signal in the positive class repertoires. hyperparameters. for the hyperparameter search of deeprc for the category "simulated immunosequencing data", we only conducted a full hyperparameter search on the more difficult datasets with motif implantation probabilities below 1%, as described in table a3 . this process was repeated for all 5 folds of the 5-fold cross-validation (cv) and the average score on the 5 test sets constitutes the final score of a method. table a3 provides an overview of the hyperparameter search, which was conducted as a grid search for each of the datasets in a nested 5-fold cv procedure, as described in section a4. computation time and optimization. we took measures on the implementation level to address the high computational demands, especially gpu memory consumption, in order to make the large number of experiments feasible. we train the deeprc model with a small batch size of 4 samples and perform computation of inference and updates of the 1d cnn using 16-bit floating point values. the rest of the network is trained using 32-bit floating point values. 
the adam parameter for numerical stability was therefore increased from the default value of $\epsilon = 10^{-8}$ to $\epsilon = 10^{-4}$. training was performed on various gpu types, mainly nvidia rtx 2080 ti. computation times were highly dependent on the number of sequences in the repertoires and the number and sizes of cnn kernels. a single update on an nvidia rtx 2080 ti gpu took approximately 0.0129 to 0.0135 seconds, while requiring approximately 8 to 11 gb gpu memory. taking these optimizations and gpus with larger memory (≥ 16 gb) into account, it is already possible to train deeprc, possibly with multi-head attention and a larger network architecture, on larger datasets (see sec. a10). our network implementation is based on pytorch 1.3.1 (paszke et al., 2019). incorporation of additional inputs and metadata. additional metadata in the form of sequence-level or repertoire-level features could be incorporated into the input via concatenation with the feature vectors that result from taking the maximum of the 1d cnn outputs w.r.t. the sequence positions. this has the benefit that the attention mechanism and output network can utilize the sequence-level or repertoire-level features for their predictions. sparse metadata or metadata that is only available during training could be used as auxiliary targets to incorporate the information via gradients into the deeprc model. limitations. the current methods are mostly limited by computational complexity, since both hyperparameter and model selection are computationally demanding. for hyperparameter selection, a large number of hyperparameter settings have to be evaluated. for model selection, a single repertoire requires the propagation of many thousands of sequences through a neural network and keeping those quantities in gpu memory in order to perform the attention mechanism and weight update. thus, increased gpu memory would significantly boost our approach.
increased computational power would also allow for more advanced architectures and attention mechanisms, which may further improve predictive performance. another limiting factor is over-fitting of the model due to the currently relatively small number of samples (bags) in real-world immunosequencing datasets in comparison to the large number of instances per bag and features per instance. we aimed at constructing immune repertoire classification scenarios with varying degrees of realism and difficulty in order to compare and analyze the suggested machine learning methods. to this end, we either use simulated or experimentally-observed immune receptor sequences and we implant signals, which are sequence motifs (weber et al., 2020), into sequences of repertoires of the positive class. it has been shown previously that interaction of immune receptors with antigens occurs via short sequence stretches. thus, implantation of short motif sequences simulating an immune signal is biologically meaningful. our benchmarking study comprises four different categories of datasets: (a) simulated immunosequencing data with implanted signals (where the signal is defined as sets of motifs), (b) lstm-generated immunosequencing data with implanted signals, (c) real-world immunosequencing data with implanted signals, and (d) real-world immunosequencing data. each of the first three categories consists of multiple datasets with varying difficulty depending on the type of the implanted signal and the ratio of sequences with the implanted signal. the ratio of sequences with the implanted signal, where each sequence carries at most 1 implanted signal, corresponds to the witness rate (wr). we consider binary classification tasks to simulate the immune status of healthy and diseased individuals. we randomly generate immune repertoires with varying numbers of sequences, where we implant sequence motifs in the repertoires of the diseased individuals, i.e. the positive class.
the sequences of a repertoire are also randomly generated by different procedures (detailed below). each sequence is composed of 20 different characters, corresponding to amino acids, and has an average length of 14.5 aas. in the first category, we aim at investigating the impact of the signal frequency, i.e. the wr, and the signal complexity on the performance of the different methods. to this end, we created 18 datasets, whereas each dataset contains a large number of repertoires with a large number of random aa sequences per repertoire. we then implanted signals in the aa sequences of the positive class repertoires, where the 18 datasets differ in frequency and complexity of the implanted signals. in detail, the aas were sampled randomly independent of their respective position in the sequence, while the frequencies of aas, distribution of sequence lengths, and distribution of the number of sequences per repertoire, i.e. the number of instances per bag, are following the respective distributions observed in the real-world cmv dataset (emerson et al., 2017) . for this, we first sampled the number of sequences for a repertoire from a gaussian n (µ = 316k, σ = 132k) distribution and rounded to the nearest positive integer. we re-sampled if the size was below 5k. we then generated random sequences of aas with a length of n (µ = 14.5, σ = 1.8), again rounded to the nearest positive integers. each simulated repertoire was then randomly assigned to either the positive or negative class, with 2, 500 repertoires per class. in the repertoires assigned to the positive class, we implanted motifs with an average length of 4 aas, following the results of the experimental analysis of antigenbinding motifs in antibodies and t-cell receptor sequences by . we varied the characteristics of the implanted motifs for each of the 18 datasets with respect to the following parameters: (a) ρ, the probability of a motif being implanted in a sequence of a positive repertoire, i.e. 
the average ratio of sequences containing the motif, which is the witness rate. in this way, we generated 18 different datasets of variable difficulty containing in total roughly 28.7 billion sequences. see table a1 for an overview of the properties of the implanted motifs in the 18 datasets. in the second dataset category, we investigate the impact of the signal frequency and complexity in combination with more plausible immune receptor sequences by taking into account the positional aa distributions and other sequence properties. to this end, we trained an lstm (hochreiter & schmidhuber, 1997 ) in a standard next character prediction (graves, 2013) setting to create aa sequences with properties similar to experimentally observed immune receptor sequences. in the first step, the lstm model was trained on all immuno-sequences in the cmv dataset (emerson et al., 2017) that contain valid information about sequence abundance and have a known cmv label. such an lstm model is able to capture various properties of the sequences, including positiondependent probability distributions and combinations, relationships, and order of aas. we then used the trained lstm model to generate 1, 000 repertoires in an autoregressive fashion, starting with a start sequence that was randomly sampled from the trained-on dataset. based on a visual inspection of the frequencies of 4-mers (see section a7), the similarity of lstm generated sequences and real sequences was deemed sufficient for the purpose of generating the aa sequences for the datasets in this category. further details on lstm training and repertoire generation are given in section a7. after generation, each repertoire was assigned to either the positive or negative class, with 500 repertoires per class. we implanted motifs of length 4 with varying properties in the center of the sequences of the positive class to obtain 5 different datasets. 
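a minimal sketch of such a motif implantation step is given here. it assumes a per-sequence implantation probability (the witness rate) and a per-position probability of writing each motif character, with the original aa kept otherwise. the function names and the handling of sequences shorter than the motif are assumptions of this example, not part of the published pipeline.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids

def implant_motif(sequence, motif, rng, position_prob=0.9):
    """Write the motif into the center of the sequence. Each motif position
    is written with probability position_prob; otherwise the original AA stays."""
    if len(sequence) < len(motif):
        return sequence  # assumption: too-short sequences are left unchanged
    start = (len(sequence) - len(motif)) // 2
    chars = list(sequence)
    for offset, aa in enumerate(motif):
        if rng.random() < position_prob:
            chars[start + offset] = aa
    return "".join(chars)

def implant_in_repertoire(sequences, motif, witness_rate, rng, position_prob=0.9):
    """Implant the motif into each sequence with probability witness_rate."""
    out = []
    for seq in sequences:
        if rng.random() < witness_rate:
            seq = implant_motif(seq, motif, rng, position_prob)
        out.append(seq)
    return out
```

setting position_prob to 1.0 gives a noise-free motif, and a witness rate of 0.01 would implant the motif in roughly 1% of the sequences of a positive-class repertoire.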
each sequence in the positive repertoires has a probability ρ to carry the motif, which was varied throughout 5 datasets and corresponds to the wr (see table a1). each position in the motif has a probability of 0.9 to be implanted and consequently a probability of 0.1 that the original aa in the sequence remains, which can be seen as noise on the motif.
table a1: properties of simulated repertoires, variations of motifs, and motif frequencies, i.e. the witness rate, for the datasets in categories "simulated immunosequencing data", "lstm-generated data", and "real-world data with implanted signals". noise types for * are explained in paragraph "real-world data with implanted signals".
in the third category, we implanted signals into experimentally obtained immuno-sequences, where we considered 4 dataset variations. each dataset consists of 750 repertoires for each of the two classes, where each repertoire consists of 10k sequences. in this way, we aim to simulate datasets with a low sequencing coverage, which means that only relatively few sequences per repertoire are available. the sequences were randomly sampled from healthy (cmv negative) individuals from the cmv dataset (see the paragraph below for an explanation). two signal types were considered: (a) one signal with one motif. the aa motif ldr was implanted in a certain fraction of sequences. the pattern is randomly altered at one of the three positions with probabilities 0.2, 0.6, and 0.2, respectively. (b) one signal with multiple motifs. one of the three possible motifs ldr, cas, and gl-n was implanted with equal probability. again, the motifs were randomly altered before implantation. the aa motif ldr was altered as described above. the aa motif cas was altered at the second position with probability 0.6 and with probability 0.3 at the first position.
the pattern gl-n, where "-" denotes a gap location, is randomly altered at the first position with probability 0.6 and the gap has a length of 0, 1, or 2 aas with equal probability. additionally, the datasets differ in the values for ρ, the average ratio of sequences carrying a signal, which were chosen as 1% or 0.1%. the motifs were implanted at positions 107, 109, and 114 according to the imgt numbering scheme for immune receptor sequences (lefranc et al., 2003) with probabilities 0.3, 0.35 and 0.2, respectively. with the remaining 0.15 chance, the motif is implanted at any other sequence position. this means that the motif occurrence in the simulated sequences is biased towards the middle of the sequence. we used a real-world dataset of 785 repertoires, each containing between 4,371 and 973,081 (avg. 299,319) tcr sequences with a length of 1 to 27 (avg. 14.5) aas, originally collected and provided by emerson et al. (2017). 340 out of 785 repertoires were labelled as positive for cytomegalovirus (cmv) serostatus, which we consider as the positive class, 420 repertoires with negative cmv serostatus, considered as negative class, and 25 repertoires with unknown status. we changed the number of sequence counts per repertoire from −1 to 1 for 3 sequences. furthermore, we exclude a total of 99 repertoires with unknown cmv status or unknown information about the sequence abundance within a repertoire, reducing the dataset for our analysis to 686 repertoires, 312 of which with positive and 374 with negative cmv status. we give a non-exhaustive overview of previously considered mil datasets and problems in table a2. to our knowledge, the datasets considered in this work pose the most challenging mil problems with respect to the number of instances per bag (column 5). table a2: mil datasets with their numbers of bags and numbers of instances. "total number of instances" refers to the total number of instances in the dataset.
the simulated and real-world immunosequencing datasets considered in this work contain a by orders of magnitudes larger number of instances per bag than mil datasets that were considered by machine learning methods up to now. we evaluate and compare the performance of deeprc against a set of machine learning methods that serve as baseline, were suggested, or can readily be adapted to immune repertoire classification. in this section, we describe these compared methods. this method serves as an estimate for the achievable classification performance using prior knowledge about which motif was implanted. note that this does not necessarily lead to perfect predictive performance since motifs are implanted with a certain amount of noise and could also be present in the negative class by chance. the known motif method counts how often the known implanted motif occurs per sequence for each repertoire and uses this count to rank the repertoires. from this ranking, the area under the receiver operator curve (auc) is computed as performance measure. probabilistic aa changes in the known motif are not considered for this count, with the exception of gap positions. we consider two versions of this method: (a) known motif binary: counts the occurrence of the known motif in a sequence and (b) known motif continuous: counts the maximum number of overlapping aas between the known motif and all sequence positions, which corresponds to a convolution operation with a binary kernel followed by max-pooling. since the implanted signal is not known in the experimentally obtained cmv dataset, this method cannot be applied to this dataset. the support vector machine (svm) approach uses a fixed mapping from a bag of sequences to the corresponding k-mer counts. the function h kmer maps each sequence s i to a vector representing the occurrence of k-mers in the sequence. 
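the two variants of the known motif method described above can be sketched as follows. this is an illustrative example only: gap positions are not handled, and summing the per-sequence scores into one repertoire score is an assumption of this sketch.

```python
def known_motif_binary(sequence, motif):
    """Return 1 if the motif occurs exactly somewhere in the sequence, else 0."""
    return int(motif in sequence)

def known_motif_continuous(sequence, motif):
    """Maximum number of matching AAs between the motif and any window of the
    sequence, i.e. a binary-kernel convolution followed by max-pooling."""
    k = len(motif)
    best = 0
    for start in range(len(sequence) - k + 1):
        window = sequence[start:start + k]
        best = max(best, sum(a == b for a, b in zip(window, motif)))
    return best

def repertoire_score(sequences, motif, per_sequence_score):
    """Aggregate per-sequence scores into one repertoire score (assumed: sum)."""
    return sum(per_sequence_score(seq, motif) for seq in sequences)
```

the repertoire scores obtained this way can then be used to rank repertoires, from which an auc can be computed.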
to avoid confusion with the sequence-representation obtained from the cnn layers of deeprc, we denote $u_i = h_{\text{kmer}}(s_i)$, which is analogous to $z_i$. specifically, $u_{im} = \#\{p_m \in s_i\}$, where $\#\{p_m \in s_i\}$ denotes how often the k-mer pattern $p_m$ occurs in sequence $s_i$. afterwards, average-pooling is applied to obtain $u = \frac{1}{N} \sum_{i=1}^{N} u_i$, the k-mer representation of the input object $x$. for two input objects $x^{(n)}$ and $x^{(l)}$ with representations $u^{(n)}$ and $u^{(l)}$, respectively, we implement the minmax kernel (ralaivola et al., 2005) as follows: $k_{\text{minmax}}(u^{(n)}, u^{(l)}) = \frac{\sum_m \min(u^{(n)}_m, u^{(l)}_m)}{\sum_m \max(u^{(n)}_m, u^{(l)}_m)}$, where $u^{(n)}_m$ is the m-th element of the vector $u^{(n)}$. the jaccard kernel (levandowsky & winter, 1971) is identical to the minmax kernel except that it operates on binary $u^{(n)}$. we used a standard c-svm, as introduced by cortes & vapnik (1995). the corresponding hyperparameter c is optimized by random search. the settings of the full hyperparameter search as well as the respective value ranges are given in table a4a. the same k-mer representation of a repertoire, as introduced above for the svm baseline, is used for the k-nearest neighbor (knn) approach. as this method clusters samples according to distances between them, the previous kernel definitions cannot be applied directly. it is therefore necessary to transform the minmax as well as the jaccard kernel from similarities to distances by constructing the following (levandowsky & winter, 1971): $d_{\text{minmax}}(u^{(n)}, u^{(l)}) = 1 - k_{\text{minmax}}(u^{(n)}, u^{(l)})$ (a1), $d_{\text{jaccard}}(u^{(n)}, u^{(l)}) = 1 - k_{\text{jaccard}}(u^{(n)}, u^{(l)})$ (a2). the number of neighbors is treated as the hyperparameter and optimized by an exhaustive grid search. the settings of the full hyperparameter search as well as the respective value ranges are given in table a5. we implemented logistic regression on the k-mer representation u of an immune repertoire. the model is trained by gradient descent using the adam optimizer (kingma & ba, 2014). the learning rate is treated as the hyperparameter and optimized by grid search.
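the k-mer representation and the kernels used by the svm and knn baselines above can be sketched as follows. the sketch assumes the standard minmax form (sum of element-wise minima divided by sum of element-wise maxima) and a dense count vector over all possible k-mers; the helper names are hypothetical and a small k is advisable for the dense layout.

```python
import itertools
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids

def kmer_index(k):
    """Map every possible k-mer to a position m in the count vector."""
    return {"".join(p): m
            for m, p in enumerate(itertools.product(AMINO_ACIDS, repeat=k))}

def repertoire_representation(sequences, index, k):
    """Average-pool the per-sequence k-mer count vectors u_i into u."""
    u = np.zeros(len(index))
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            u[index[seq[i:i + k]]] += 1.0
    return u / max(len(sequences), 1)

def minmax_kernel(u_n, u_l):
    """Sum of element-wise minima divided by sum of element-wise maxima."""
    denom = np.maximum(u_n, u_l).sum()
    return np.minimum(u_n, u_l).sum() / denom if denom > 0 else 0.0

def jaccard_kernel(u_n, u_l):
    """MinMax kernel applied to binarized (presence/absence) vectors."""
    return minmax_kernel((u_n > 0).astype(float), (u_l > 0).astype(float))

def kernel_distance(kernel, u_n, u_l):
    """Turn a similarity in [0, 1] into a distance for the KNN baseline."""
    return 1.0 - kernel(u_n, u_l)
```

two repertoires with identical k-mer content have a kernel value of 1 and therefore a distance of 0 under both kernels.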
furthermore, we explored two regularization settings using combinations of l1 and l2 weight decay. the settings of the full hyperparameter search as well as the respective value ranges are given in table a6 . we implemented a burden test (emerson et al., 2017; li & leal, 2008; wu et al., 2011) in a machine learning setting. the burden test first identifies sequences or k-mers that are associated with the individual's class, i.e., immune status, and then calculates a burden score per individual. concretely, for each k-mer or sequence, the phi coefficient of the contingency table for absence or presence and positive or negative immune status is calculated. then, j k-mers or sequences with the highest phi coefficients are selected as the set of associated k-mers or sequences. j is a hyperparameter that is selected on a validation set. additionally, we consider the type of input features, sequences or k-mers, as a hyperparameter. for inference, a burden score per individual is calculated as the sum of associated k-mers or sequences it carries. this score is used as raw prediction and to rank the individuals. hence, we have extended the burden test by emerson et al. (2017) to k-mers and to adaptive thresholds that are adjusted on a validation set. the logistic multiple instance learning (mil) approach for immune repertoire classification (ostmeyer et al., 2019) applies a logistic regression model to each k-mer representation in a bag. the resulting scores are then summarized by max-pooling to obtain a prediction for the bag. each amino acid of each k-mer is represented by 5 features, the so-called atchley factors (atchley et al., 2005) . as k-mers of length 4 are used, this gives a total of 4 × 5 = 20 features. one additional feature per 4-mer is added, which represents the relative frequency of this 4-mer with respect to its containing bag, resulting in 21 features per 4-mer. 
two options for the relative frequency feature exist, which are (a) whether the frequency of the 4-mer ("4mer") or (b) the frequency of the sequence in which the 4-mer appeared ("tcrβ") is used. we optimized the learning rate, batch size, and early stopping parameter on the validation set. the settings of the full hyperparameter search as well as the respective value ranges are given in table a8 . for all competing methods a hyperparameter search was performed, for which we split each of the 5 training sets into an inner training set and inner validation set. the models were trained on the inner training set and evaluated on the inner validation set. the model with the highest auc score on the inner validation set is then used to calculate the score on the respective test set. here we report the hyperparameter sets and search strategy that is used for all methods. deeprc. the set of hyperparameters of deeprc is shown in table a3 . these hyperparameter combinations are adjusted via a grid search procedure. table a3 : deeprc hyperparameter search space. every 5 · 10 3 updates, the current model was evaluated against the validation fold. the early stopping hyperparameter was determined by selecting the model with the best loss on the validation fold after 10 5 updates. * : experiments for {64; 128; 256} kernels were omitted for datasets with motif implantation probabilities ≥ 1% in the category "simulated immunosequencing data". known motif. this method does not have hyperparameters and has been applied to all datasets except for the cmv dataset. the corresponding hyperparameter c of the svm is optimized by randomly drawing 10 3 values in the range of [−6; 6] according to a uniform distribution. these values act as the exponents of a power of 10 and are applied for each of the two kernel types (see table a4a ). knn. 
the number of neighbors is treated as the hyperparameter and optimized by grid search operating in the discrete range of [1; min{n, $10^{3}$}] with a step size of 1. the corresponding tight upper bound is automatically defined by the total number of samples $n \in \mathbb{N}_{>0}$ in the training set, capped at $10^{3}$ (see table a5).
table a5: settings used in the hyperparameter search of the knn baseline approach. the number of trials (per type of kernel) is automatically defined by the total number of samples $n \in \mathbb{N}_{>0}$ in the training set, capped at $10^{3}$.
number of neighbors: [1; min{n, $10^{3}$}]
type of kernel: {minmax; jaccard}
logistic regression. the hyperparameter optimization strategy was a grid search across the hyperparameters given in table a6.
table a6: settings used in the hyperparameter search of the logistic regression baseline approach.
learning rate: $10^{-\{2;3;4\}}$
batch size: 4
max. updates: $10^{5}$
coefficient β1 (adam): 0.9
coefficient β2 (adam): 0.999
weight decay weightings: {(l1 = $10^{-7}$, l2 = $10^{-3}$); (l1 = $10^{-7}$, l2 = $10^{-5}$)}
burden test. the burden test selects two hyperparameters: the number of features in the burden set and the type of features, see table a7.
table a7: settings used in the hyperparameter search of the burden test approach.
number of features in burden set: {50, 100, 150, 250}
type of features: {4mer; sequence}
logistic mil. for this method, we adjusted the learning rate as well as the batch size as hyperparameters by randomly drawing 25 different hyperparameter combinations from a uniform distribution. the corresponding range of the learning rate is [−4.5; −1.5], which acts as the exponent of a power of 10. the batch size lies within the range of [1; 32]. for each hyperparameter combination, a model is optimized by gradient descent using adam, whereas the early stopping parameter is adjusted according to the corresponding validation set (see table a8).
learning rate 10 {−4.5;−1.5} batch size {1; 32} relative abundance term {4mer; tcrβ} number of trials 25 max. epochs 10 2 coefficient β 1 (adam) 0.9 coefficient β 2 (adam) 0.999 table a8 : settings used in the hyperparameter search of the logistic mil baseline approach. the number of trials (per type of relative abundance) defines the quantity of combinations of random values of the learning rate as well as the batch size. in this section, we report the detailed results on all four categories of datasets (a) simulated immunosequencing data (table a9 ) (b) lstm-generated data (table a10) , (c) real-world data with implanted signals (table a11) , and (d) real-world data on the cmv dataset (table a12) , as discussed in the main paper. ± 0.000 ± 0.000 ± 0.271 ± 0.000 ± 0.000 ± 0.218 ± 0.000 ± 0.000 ± 0.029 ± 0.000 ± 0.001 ± 0.017 ± 0.001 ± 0.002 ± 0.023 ± 0.001 ± 0.048 ± 0.013 ± 0.223 svm (minmax) 1.000 1.000 0.764 1.000 1.000 0.603 1.000 0.998 0.539 1.000 0.994 0.529 1.000 0.741 0.513 1.000 0.706 0.503 0.827 ± 0.000 ± 0.000 ± 0.016 ± 0.000 ± 0.000 ± 0.021 ± 0.000 ± 0.002 ± 0.024 ± 0.000 ± 0.004 ± 0.016 ± 0.000 ± 0.024 ± 0.006 ± 0.000 ± 0.013 ± 0.013 ± 0.013 ± 0.013 ± 0.014 ± 0.011 ± 0.009 ± 0.007 ± 0.008 ± 0.011 ± 0.012 ± 0.012 ± 0.007 ± 0.014 ± 0.017 ± 0.010 ± 0.020 ± 0.012 ± 0.016 ± 0.016 ± 0.074 known motif b. 1.000 1.000 0.973 1.000 1.000 0.865 1.000 1.000 0.700 1.000 0.989 0.609 1.000 0.946 0.570 1.000 0.834 0.532 0.890 ± 0.000 ± 0.000 ± 0.004 ± 0.000 ± 0.000 ± 0.004 ± 0.000 ± 0.000 ± 0.020 ± 0.000 ± 0.002 ± 0.017 ± 0.000 ± 0.010 ± 0.024 ± 0.000 ± 0.016 ± 0.020 ± 0.001 ± 0.014 ± 0.020 ± 0.001 ± 0.013 ± 0.017 ± 0.001 ± 0.012 ± 0.012 ± 0.001 ± 0.018 ± 0.018 ± 0.002 ± 0.010 ± 0.009 ± 0.002 ± 0.012 ± 0.013 ± 0.202 table a9 : auc estimates based on 5-fold cv for all 18 datasets in category "simulated immunosequencing data". 
the reported errors are standard deviations across the 5 cross-validation folds except for the last column "avg.", in which they show standard deviations across datasets. wildcard characters in motifs are indicated by z, characters with 50% probability of being removed by d.

table a10: auc estimates based on 5-fold cv for all 5 datasets in category "lstm-generated data". the reported errors are standard deviations across the 5 cross-validation folds except for the last column "avg.", in which they show standard deviations across datasets. characters affected by noise, as described in a3, paragraph "lstm-generated data", are indicated by r.

table a12: results on the cmv dataset (real-world data) in terms of auc, f1 score, balanced accuracy, and accuracy. for f1 score, balanced accuracy, and accuracy, all methods use their default thresholds. each entry shows mean and standard deviation across 5 cross-validation folds.

we trained a conventional next-character lstm model (graves, 2013) based on the implementation in https://github.com/spro/practical-pytorch (access date 1st of may, 2020) using pytorch 1.3.1 (paszke et al., 2019). for this, we applied an lstm model with 100 lstm blocks in 2 layers, which was trained for 5,000 epochs using the adam optimizer (kingma & ba, 2014) with learning rate 0.01, an input batch size of 100 character chunks, and a character chunk length of 200. as input we used the immuno-sequences in the cdr3 column of the cmv dataset, where we repeated sequences according to their counts in the repertoires, as specified in the templates column of the cmv dataset. we excluded repertoires with unknown cmv status and unknown sequence abundance from training. after training, we generated 1,000 repertoires using a temperature value of 0.8. the number of sequences per repertoire was sampled from a gaussian n(µ = 285k, σ = 156k) distribution, where the whole repertoire was generated by the lstm at once.
that is, the lstm can base the generation of the individual aa sequences in a repertoire, including the aas and the lengths of the sequences, on the generated repertoire. a random immuno-sequence from the trained-on repertoires was used as initialization for the generation process. this immuno-sequence was not included in the generated repertoire. finally, we randomly assigned 500 of the generated repertoires to the positive (diseased) and 500 to the negative (healthy) class. we then implanted motifs in the positive class repertoires as described in section a3.2. as illustrated in the comparison of histograms given in fig. a2 , the generated immuno-sequences exhibit a very similar distribution of 4-mers and aas compared to the original cmv dataset. real-world data deeprc allows for two forms of interpretability methods. (a) due to its attention-based design, a trained model can be used to compute the attention weights of a sequence, which directly indicates its importance. (b) deeprc furthermore allows for the usage of contribution analysis methods, such as integrated gradients (ig) (sundararajan et al., 2017) or layer-wise relevance propagation (montavon et al., 2018; arras et al., 2019; montavon et al., 2019; preuer et al., 2019) . we apply ig to identify the input patterns that are relevant for the classification. to identify aa patterns with high contributions in the input sequences, we apply ig to the aas in the input sequences. additionally, we apply ig to the kernels of the 1d cnn, which allows us to identify aa motifs with high contributions. in detail, we compute the ig contributions for the aas and positional features in the kernels for every repertoire in the validation and test set, so as to exclude potential artifacts caused by over-fitting. averaging the ig values over these repertoires then results in concise aa motifs. we include qualitative visual analyses of the ig method on different datasets below. 
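the integrated gradients computation referred to above can be illustrated on a toy model. this is a minimal sketch under stated assumptions: the logistic model, its weights, the all-zero baseline and the midpoint approximation of the path integral are illustrative choices, not the deeprc implementation; only the use of 50 integration steps follows the text.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def model(x, w):
    """Toy differentiable model: logistic output of a linear score."""
    return sigmoid(sum(xi * wi for xi, wi in zip(x, w)))


def model_grad(x, w):
    """Analytic gradient of the toy model with respect to the input x."""
    s = model(x, w)
    return [s * (1.0 - s) * wi for wi in w]


def integrated_gradients(x, baseline, w, n_steps=50):
    """Approximate IG by averaging gradients along the straight path
    from the baseline to x (midpoint rule with n_steps points)."""
    total = [0.0] * len(x)
    for k in range(n_steps):
        alpha = (k + 0.5) / n_steps
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        grad = model_grad(point, w)
        total = [t + g for t, g in zip(total, grad)]
    return [(xi - b) * t / n_steps for xi, b, t in zip(x, baseline, total)]


w = [2.0, -1.0, 0.5]          # hypothetical model weights
x = [1.0, 0.5, -1.0]          # hypothetical input features
baseline = [0.0, 0.0, 0.0]    # all-zero baseline (an assumption)
attributions = integrated_gradients(x, baseline, w)
```

a useful sanity check is the completeness property: the attributions should sum to approximately the difference between the model output at the input and at the baseline.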
here, we provide examples for the interpretation of trained deeprc models using integrated gradients (ig) (sundararajan et al., 2017) as contribution analysis method. the following illustrations were created using 50 ig steps, which we found sufficient to achieve stable ig results. a visual analysis of deeprc models on the simulated datasets, as illustrated in tab. a13 and fig. a3 , shows that the implanted motifs can be successfully extracted from the trained model and are straightforward to interpret. in the real-world cmv dataset, deeprc finds complex patterns with high variability in the center regions of the immuno-sequences, as illustrated in figure a4 . real-world data with implanted signals extracted motif implanted motif(s) g r s r a r f r l r d r r r {l r d r r r ; c r a r s; g r l-n} motif freq. ρ 0.05% 0.1% 0.1% table a13 : visualization of motifs extracted from trained deeprc models for datasets from categories "simulated immunosequencing data", "lstm-generated data", and "real-world data with implanted signals". motif extraction was performed using integrated gradients on the 1d cnn kernels over the validation set and test set repertoires of one cv fold. wildcard characters are indicated by z, random noise on characters by r , characters with 50% probability of being removed by d , and gap locations of random lengths of {0; 1; 2} by -. larger characters in the extracted motifs indicate higher contribution, with blue indicating positive contribution and red indicating negative contribution towards the prediction of the diseased class. contributions to positional encoding are indicated by < (beginning of sequence), ∧ (center of sequence), and > (end of sequence). only kernels with relatively high contributions are shown, i.e. with contributions roughly greater than the average contribution of all kernels. b) c) figure a3 : integrated gradients applied to input sequences of positive class repertoires. 
three sequences with the highest contributions to the prediction of their respective repertoires are shown. a) input sequence taken from "simulated immunosequencing data" with implanted motif sz d z d n and motif implantation probability 0.1%. the deeprc model reacts to the s and n at the 5 th and 8 th sequence position, thereby identifying the implanted motif in this sequence. b) and c) input sequence taken from "real-world data with implanted signals" with implanted motifs {l r d r r r ; c r a r s; g r l-n} and motif implantation probability 0.1%. the deeprc model reacts to the fully implanted motif cas (b) and to the partly implanted motif aas c and a at the 5 th and 7 th sequence position (c), thereby identifying the implanted motif in the sequences. wildcard characters in implanted motifs are indicated by z, characters with 50% probability of being removed by d , and gap locations of random lengths of {0; 1; 2} by -. larger characters in the sequences indicate higher contribution, with blue indicating positive contribution and red indicating negative contribution towards the prediction of the diseased class. figure a4 : visualization of the contributions of characters within a sequence via ig. each sequence was selected from a different repertoire and showed the highest contribution in its repertoire. the model was trained on cmv dataset, using a kernel size of 9, 32 kernels and 137 repertoires for early stopping. larger characters in the extracted motifs indicate higher contribution, with blue indicating positive contribution and red indicating negative contribution towards the prediction of the disease class. table a14 : tcrβ sequences that had been discovered by emerson et al. (2017) with their associated attention values by deeprc. these sequences have significantly (p-value 1.3e-93) higher attention values than other sequences. 
the column "quantile" provides the quantile values of the empirical distribution of attention values across all sequences in the dataset.

in this section we investigate the impact of different variations of deeprc on the performance on the cmv dataset. we consider both a cnn-based sequence embedding, as used in the main paper, and an lstm-based sequence embedding. in both cases we vary the number of attention heads and the β parameter for the softmax function of the attention mechanism (see eq. 2 in main paper). for the cnn-based sequence embedding we also vary the number of cnn kernels and the kernel sizes used in the 1d cnn. for the lstm-based sequence embedding we use a single one-directional lstm layer, of which the output values at the last sequence position (without padding) are taken as embedding of the sequence. here we vary the number of lstm blocks in the lstm layer. to counter over-fitting due to the increased complexity of these deeprc variations, we added an l2 weight penalty to the training loss. the factor with which the l2 weight penalty contributes to the training loss is varied over 3 orders of magnitude, where suitable value ranges were manually determined on one of the training folds beforehand. to reduce the computational effort, we do not consider all numbers of kernels that were considered in the main paper. furthermore, we only compute the auc scores on 3 of the 5 cross-validation folds. the hyperparameters, which were used in a grid search procedure, are listed in tab. a15 for the cnn-based sequence embedding and tab. a16 for the lstm-based sequence embedding.

results. we show performance in terms of auc score with single hyperparameters set to fixed values so as to investigate their influence in tab. a18 for the cnn-based sequence embedding and tab. a17 for the lstm-based sequence embedding.
we note that due to restricted computational resources this study was conducted with fewer different numbers of cnn kernels, with the auc estimated from only 3 of the 5 cross-validation folds, which leads to a slight decrease of performance in comparison to the full hyperparameter search and cross-validation procedure used in the main paper. as can be seen in tab. a18 and a17, the lstm-based sequence embedding generalizes slightly better than the cnn-based sequence embedding. table a17 : impact of hyperparameters on deeprc with lstm for sequence encoding. mean ("mean") and standard deviation ("std") for the area under the roc curve over the first 3 folds of a 5-fold nested cross-validation for different sub-sets of hyperparameters ("sub-set") are shown. the following sub-sets were considered: "full": full grid search over hyperparameters; "beta=*": grid search over hyperparameters with reduction to specific value * of beta value of attention softmax; "heads=*": grid search over hyperparameters with reduction to specific number * of attention heads; "lstms=*": grid search over hyperparameters with reduction to specific number * of lstm blocks for sequence embedding. table a18 : impact of hyperparameters on deeprc with 1d cnn for sequence encoding. mean ("mean") and standard deviation ("std") for the area under the roc curve over the first 3 folds of a 5-fold nested cross-validation for different sub-sets of hyperparameters ("sub-set") are shown. 
the following sub-sets were considered: "full": full grid search over hyperparameters; "beta=*": grid search over hyperparameters with reduction to specific value * of beta value of attention softmax; "heads=*": grid search over hyperparameters with reduction to specific number * of attention heads; "ksize=*": grid search over hyperparameters with reduction to specific kernel size * of 1d cnn kernels for sequence embedding; "kernels=*": grid search over hyperparameters with reduction to specific number * of 1d cnn kernels for sequence embedding. a compact vocabulary of paratope-epitope interactions enables predictability of antibody-antigen binding predicting the sequence specificities of dna-and rna-binding proteins by deep learning explaining and interpreting lstms solving the protein sequence metric problem rank-loss support instance machines for miml instance annotation augmenting adaptive immunity: progress and challenges in the quantitative engineering and analysis of adaptive immune receptor repertoires multiple instance learning: a survey of problem characteristics and applications vdjserver: a cloud-based analysis portal and data commons for immune repertoire sequences and rearrangements tetramer-visualized gluten-specific cd4+ t cells in blood as a potential diagnostic marker for coeliac disease without oral gluten challenge ireceptor: a platform for querying and analyzing antibody/b-cell and t-cell receptor repertoire data across federated repositories support-vector networks quantifiable predictive features define epitope-specific t cell receptor repertoires on a model of associative memory with huge storage capacity bert: pre-training of deep bidirectional transformers for language understanding solving the multiple instance problem with axis-parallel rectangles predicting the spectrum of tcr repertoire sharing with a data-driven model of recombination immunosequencing identifies signatures of cytomegalovirus exposure history and hla-mediated effects 
on the t cell repertoire predicting antigen-specificity of single t-cells based on tcr cdr3 regions. biorxiv a review of multi-instance learning assumptions deep sequencing of b cell receptor repertoires from covid-19 evaluation and benchmark for biological image segmentation the promise and challenge of high-throughput sequencing of the antibody repertoire tcrex: detection of enriched t cell epitope specificity in full t cell receptor sequence repertoires. biorxiv identifying specificity groups in the t cell receptor repertoire generating sequences with recurrent neural networks. arxiv a bioinformatic framework for immune repertoire diversity profiling enables detection of immunological status learning the high-dimensional immunogenomic features that predict public and private antibody repertoires improving neural networks by preventing co-adaptation of feature detectors long short-term memory fast model-based protein homology detection without alignment neural networks and physical systems with emergent collective computational abilities convolutional neural network architectures for matching natural language sentences attention-based deep multiple instance learning nettcr: sequence-based prediction of tcr binding to peptide-mhc complexes using convolutional neural networks basset: learning the regulatory code of the accessible genome with deep convolutional neural networks detecting cutaneous basal cell carcinomas in ultra-high resolution and weakly labelled histopathological images self-normalizing neural networks capturing the differences between humoral immunity in the normal and tumor environments from repertoire-seq of b-cell receptors using supervised machine learning observed antibody space: a resource for data mining next-generation sequencing of antibody repertoires dense associative memory for pattern recognition dense associative memory is robust to adversarial inputs gradient-based learning applied to document recognition set transformer: a framework 
for attention-based permutation-invariant neural networks imgt unique numbering for immunoglobulin and t cell receptor variable domains and ig superfamily v-like domains distance between sets methods for detecting associations with rare variants for common diseases: application to analysis of sequence data the extended cohnkanade dataset (ck+): a complete dataset for action unit and emotion-specified expression high-throughput immune repertoire analysis with igor a framework for multiple-instance learning computational strategies for dissecting the high-dimensional complexity of adaptive immune repertoires longitudinal high-throughput tcr repertoire profiling reveals the dynamics of t cell memory formation after mild covid-19 infection. biorxiv methods for interpreting and understanding deep neural networks layer-wise relevance propagation: an overview how many different clonotypes do immune repertoires contain? current opinion in systems biology treating biomolecular interaction as an image classification problem -a case study on t-cell receptorepitope recognition prediction. biorxiv sumrep: a summary statistic framework for immune receptor repertoire comparison and model validation biophysicochemical motifs in t-cell receptor sequences distinguish repertoires from tumor-infiltrating lymphocyte and adjacent healthy tissue pytorch: an imperative style, high-performance deep learning library needles in haystacks: on classifying tiny objects in large images interpretable deep learning in drug discovery pointnet: deep learning on point sets for 3d classification and segmentation graph kernels for chemical informatics cov-abdab: the coronavirus antibody database. 
biorxiv immunedb, a novel tool for the analysis, storage, and dissemination of immune repertoire sequencing data a $$k$$-nearest neighbor based algorithm for multi-instance multi-label active learning machine learning in automated text categorization high-throughput mapping of b cell receptor sequences to antigen specificity vdjtools: unifying post-analysis of t cell receptor repertoires vdjdb: a curated database of t-cell receptor sequences with known antigen specificity deeptcr: a deep learning framework for understanding t-cell receptor sequence signatures within complex t-cell repertoires prediction of specific tcr-peptide binding from large dictionaries of tcr-peptide pairs. biorxiv axiomatic attribution for deep networks attention-based deep neural networks for detection of cancerous and precancerous esophagus tissue on histopathological slides learning with sets in multiple instance regression applied to remote sensing attention is all you need revisiting multiple instance neural networks novel approaches to analyze immunoglobulin repertoires immunesim: tunable multi-feature simulation of b-and t-cell receptor repertoires for immunoinformatics benchmarking genome-wide protein function prediction through multiinstance multi-label learning rare-variant association testing for sequencing data with the sequence kernel association test polyspecificity of t cell and b cell receptor recognition practical guidelines for b-cell receptor repertoire sequencing analysis learning embedding adaptation for few-shot learning convolutional neural network architectures for predicting dna-protein binding pird: pan immune repertoire database multi-instance multi-label learning with application to scene classification predicting effects of noncoding variants with deep learning-based sequence model the ellis unit linz, the lit ai lab and the institute for machine learning are supported by the land oberösterreich, lit grants deeptoxgen ( in the following, the appendix to the paper 
"modern hopfield networks and attention for immune key: cord-326225-crtpzad7 authors: neill, john d.; bayles, darrell o.; ridpath, julia f. title: simultaneous rapid sequencing of multiple rna virus genomes date: 2014-06-01 journal: j virol methods doi: 10.1016/j.jviromet.2014.02.016 sha: doc_id: 326225 cord_uid: crtpzad7 comparing sequences of archived viruses collected over many years to the present allows the study of viral evolution and contributes to the design of new vaccines. however, the difficulty, time and expense of generating full-length sequences individually from each archived sample have hampered these studies. next generation sequencing technologies have been utilized for analysis of clinical and environmental samples to identify viral pathogens that may be present. this has led to the discovery of many new, uncharacterized viruses from a number of viral families. use of these sequencing technologies would be advantageous in examining viral evolution. in this study, a sequencing procedure was used to sequence simultaneously and rapidly multiple archived samples using a single standard protocol. this procedure utilized primers composed of 20 bases of known sequence with 8 random bases at the 3′-end that also served as an identifying barcode that allowed the differentiation each viral library following pooling and sequencing. this conferred sequence independence by random priming both first and second strand cdna synthesis. viral stocks were treated with a nuclease cocktail to reduce the presence of host nucleic acids. viral rna was extracted, followed by single tube random-primed double-stranded cdna synthesis. the resultant cdnas were amplified by primer-specific pcr, pooled, size fractionated and sequenced on the ion torrent pgm platform. the individual virus genomes were readily assembled by both de novo and template-assisted assembly methods. 
this procedure consistently resulted in near-full-length, if not full-length, genomic sequences and was used to sequence multiple bovine pestivirus and coronavirus isolates simultaneously.

next generation sequencing technologies are used to screen environmental and clinical samples for the presence of viral pathogens. this has resulted in the discovery of new, previously unknown viruses that are often uncultivable (victoria et al., 2008; li et al., 2009; phan et al., 2012; reuter et al., 2012). these technologies have allowed screening of samples collected from domestic animals (shan et al., 2011; phan et al., 2011; reuter et al., 2012) and from animals in the wild to determine if they are carriers of potentially zoonotic viral pathogens (donaldson et al., 2010; li et al., 2010; phan et al., 2011; wu et al., 2012). additionally, deep sequence analysis was applied to screening of commercially available vaccines for the presence of contaminants and variants. methods to sequence complete viral genomes have been developed. these include methodologies based on pcr amplification of viral sequences, either in fragments (rao et al., 2013) or as full-length genomes (christenbury et al., 2010). these have primarily focused on specific, closely related viruses. the sequence-independent, single primer amplification (sispa) procedure was first used to amplify dna populations through the use of random priming and ligation of adaptors for pcr amplification (reyes and kim, 1991). this was modified for amplification of viral sequences from serum to include a step where dnase i was used to first degrade host dna (allander et al., 2001). this protocol has since been used in the identification of a novel pestivirus (kirkland et al., 2007), bocaviruses (cheng et al., 2010) and viruses associated with cytopathic effect in cell cultures of unknown origin (abed and boivin, 2009).
many virological laboratories, especially those associated with diagnostic laboratories, have freezers that contain large numbers of virus isolates collected over varying time spans. often these isolates have clinical histories associated with them that detail where and when the isolations were made and describe the disease state of the host. there is a wealth of information in these isolates, but until now it has been time consuming and expensive to sequence these viral genomes, often requiring sets of strain-specific primers for pcr amplification and sequencing. the protocol outlined in this study provides the means to rapidly sequence archival collections of viruses to study the evolution of specific viruses over time. this information is important in determining the relevance of virus strains used in currently marketed vaccines.

viral isolates passaged in cell culture and stored frozen at −80 °c were used in this study. the two virus groups sequenced were 21 bovine viral diarrhea virus (bvdv) isolates and 19 bovine coronavirus (bocv) isolates. after isolation/propagation of the viruses on appropriate host cells, the medium was frozen/thawed twice, cell debris removed by centrifugation, and aliquots frozen at −80 °c. routinely, the libraries were constructed from groups of 10 viruses. the virus stock was thawed at room temperature and 180 µl was mixed with 20 µl of 10× dnase i buffer (100 mm tris-hcl (ph 7.6), 25 mm mgcl2, 5 mm cacl2). a cocktail of nucleases was added to degrade host nucleic acids as previously described (victoria et al., 2008). the cocktail consisted of (per virus sample) 5 units dnase i (life technologies, grand island, ny, usa), 10 units turbo-dnase (life technologies), 5 units base-line zero dnase (epicenter, madison, wi, usa), 25 units benzonase (emd millipore, billerica, ma, usa), 2 µg rnase a/5 units rnase t1 (thermo scientific, waltham, ma, usa), and 4 units rnase one (promega, madison, wi, usa).
the virus stock/nuclease mixture was incubated at 37 °c for at least 90 min. the viral rna was purified using the minelute virus spin filter kit (qiagen, valencia, ca, usa) according to manufacturer's specifications. the rna was eluted in 22 µl of ultra-pure, rnase-free h2o (final yield ∼20 µl) and immediately placed on ice. twenty microliters of rna were mixed with 100 pmol of a 28mer primer consisting of 20 nucleotides of a known sequence followed by 8 random nucleotides as previously described. table 1 shows the sequence of the 20 primers used in this study. these primers were developed so that the 20 base known sequence was used for pcr amplification of the library as well as served as a barcode for identifying each viral library following pooling and sequencing. the rna/primer mix was heated at 75 °c for 5 min and immediately placed on ice. first strand synthesis was done in a reaction mix consisting of 0.5 mm dntps, 1× first strand buffer, 2 units rnasin (promega) and 200 units superscript ii reverse transcriptase (life technologies). the first strand reactions were incubated at 42 °c for 1 h, heated at 95 °c for 2 min and immediately quenched on ice. the second strand reaction was carried out using sequenase 2.0 (affymetrix, santa clara, ca, usa) and the nucleotides and primers still present following the first strand reaction (yozwiak et al., 2012). an equal volume of 2× sequenase buffer with 4 units of sequenase per reaction was added on ice. the reaction tubes were placed in a thermocycler that was prechilled to 4 °c and were ramped from 4 °c to 37 °c at 0.1 °c/s. the reaction was continued at 37 °c for 10 min. the tubes were heated to 95 °c for 2 min and placed immediately on ice. a second aliquot of sequenase was added (4 units/reaction) and the ramping and incubation repeated with the exception that the 37 °c reaction was done for 30 min.
the double-stranded cdna was purified using minelute pcr spin columns (qiagen) and eluted with 13 µl of ultra-pure h2o. the cdna was amplified by pcr using primers consisting of the first 20 known bases of the 28mer used in the cdna synthesis. the pcr reactions consisted of 10 µl of double-stranded cdna, 1× pfx50 buffer, 400 µm dntps, 100 pm of appropriate 20mer primer and 5 units of pfx50 polymerase (life technologies). the reactions were cycled at 95 °c for 2 min and 35 cycles of 95 °c for 30 s, 55 °c for 30 s and 72 °c for 1 min. the number of cycles was empirically determined so as not to overamplify the cdna; 35 cycles worked well. the dna was again purified and eluted in 10 µl ultra-pure h2o, and 1 µl was subjected to agilent bioanalyzer analysis (agilent technologies, santa clara, ca, usa). the percentage of the dna that was 300-400 bp in length was determined for each reaction, as was the total dna concentration in each sample using picogreen (life technologies). all ten reactions were pooled so that the 300-400 bp cdnas from each were present in equimolar amounts. the pooled dna was placed in an end-repair reaction (ion plus fragment library kit, life technologies) followed by ligation of ion torrent sequencing adaptors. the dna was size-fractionated with 1.6 volumes of agencourt ampure beads (beckman coulter, indianapolis, in, usa) according to manufacturer specifications. this step removed dna of 100 bp and less, which eliminated adaptor dimers and unligated adaptors. at this point, two 10-genome libraries, for a total of 20 genomes, were pooled for each sequencing run. the 300-400 bp fraction was isolated from the dna pool using the pippin prep (sage science, beverly, ma, usa). the 300-400 bp dna was subjected to sequence analysis with the ion torrent pgm and standard chemistries (life technologies) using the 316 sequencing chip.
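the equimolar pooling step described above can be expressed as a small calculation. this is a sketch under stated assumptions, not the authors' procedure: the function name, the library labels, the target mass and the example concentrations are hypothetical, and equal mass of the 300-400 bp fraction is treated as approximately equimolar because all fragments fall in the same narrow size range.

```python
def pooling_volumes(libraries, target_ng):
    """Return the volume (in µl) to take from each library so that every
    library contributes the same mass of 300-400 bp dna.

    libraries maps a label to (total concentration in ng/µl,
    fraction of dna in the 300-400 bp range, between 0 and 1).
    """
    volumes = {}
    for name, (conc_ng_per_ul, fraction_in_range) in libraries.items():
        effective = conc_ng_per_ul * fraction_in_range  # ng/µl in range
        if effective <= 0:
            raise ValueError(f"library {name} has no dna in the target range")
        volumes[name] = target_ng / effective
    return volumes


# hypothetical picogreen concentrations and bioanalyzer fractions
example_libraries = {
    "bc01": (10.0, 0.5),
    "bc02": (20.0, 0.25),
    "bc03": (4.0, 0.25),
}
volumes = pooling_volumes(example_libraries, target_ng=5.0)
```

in this example the first two libraries need the same volume even though their total concentrations differ, because the product of concentration and in-range fraction is identical.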
the raw data files that contained the sequence data from each sequencing run of the pgm were demultiplexed using the 20 base barcode sequence. the raw sequencing reads obtained from the ion torrent sequencer were exported as sff files. barcode sequences unique to this study were used to populate a custom configuration file that was used in conjunction with the roche "sfffile" tool to separate the reads into individual barcode-specific sff files. all the sequences in each barcode-specific sff file were extracted to a fastq file by applying the "sff_extract" utility (http://bioinf.comav.upv.es/sff_extract/index.html) and specifying the "-clip" option to remove the sequencing tags, barcode sequences, and low quality sequences. the "fastx_trimmer" program (part of the fastx toolkit; http://hannonlab.cshl.edu/fastx_toolkit/index.html) was used to remove the random primer (8-mer) by trimming an additional eight nucleotides from the 5′-end of each read. as a final post-trimming step, a custom script was used to remove all reads that were less than 40 nucleotides in length. the resulting sequence files were used to assemble full-length genomic sequences using seqman ngen and were edited with the seqman software of the lasergene 10 package (dnastar, inc., madison, wi, usa) using related viral sequences obtained from genbank as the assembly reference. assembled genomic sequences were further edited using aligner (codoncode, inc., centerville, ma, usa). the sequences from each library were submitted to the ncbi sequence read archive with the bioproject number prjna237789, biosamples samn02630193, and samn02639534 through samn02639572.

the ion torrent raw sequence data was trimmed of sequencing adaptors, barcodes, the 8 base random primer sequence, low quality sequences and sequences less than 40 bases. after preassembly processing of the reads, an average of 82% of initial reads remained for downstream analysis.
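the logic of the read preprocessing described above (barcode demultiplexing, removal of the 8 base random primer, and the 40 nucleotide length filter) can be sketched in plain python. this is an illustration only and does not reproduce sfffile, sff_extract or fastx_trimmer: the function name and the example barcodes are hypothetical, barcodes are matched exactly at the 5′ end (an assumption), and quality clipping is omitted.

```python
BARCODE_LEN = 20        # known primer sequence that doubles as the barcode
RANDOM_PRIMER_LEN = 8   # random bases that follow the barcode
MIN_READ_LEN = 40       # reads shorter than this are discarded


def preprocess_reads(reads, barcodes):
    """Assign reads to libraries by their 5' barcode, strip the barcode and
    random primer, and keep only reads of at least MIN_READ_LEN bases.

    reads: iterable of sequence strings.
    barcodes: dict mapping library label -> 20-base barcode sequence.
    """
    lookup = {seq: name for name, seq in barcodes.items()}
    by_library = {name: [] for name in barcodes}
    for read in reads:
        name = lookup.get(read[:BARCODE_LEN])
        if name is None:
            continue  # no exact barcode match: read is left unassigned
        trimmed = read[BARCODE_LEN + RANDOM_PRIMER_LEN:]
        if len(trimmed) >= MIN_READ_LEN:
            by_library[name].append(trimmed)
    return by_library


# hypothetical barcodes; the study's primers are listed in its table 1
example_barcodes = {"lib1": "ACGT" * 5, "lib2": "TTGCA" * 4}
example_reads = [
    "ACGT" * 5 + "NNNNNNNN" + "A" * 40,   # kept (exactly 40 bases remain)
    "ACGT" * 5 + "NNNNNNNN" + "C" * 39,   # dropped (too short)
    "TTGCA" * 4 + "GGGGGGGG" + "T" * 50,  # kept
    "G" * 80,                             # dropped (unknown barcode)
]
result = preprocess_reads(example_reads, example_barcodes)
```

real demultiplexers often tolerate one or two barcode mismatches; exact matching is used here only to keep the sketch short.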
the bvdv and bocv sequencing runs produced 954,798 and 2,268,798 individual sequences, respectively. the percentage of viral sequences in each barcoded library varied but was as high as 87% (table 2). the ratio of virus:host sequences was highly dependent on the titer of virus in the stock sample. the number of sequences for each barcoded sample in the library varied, even following pooling at concentrations that should have resulted in equimolar amounts of each sample in the library. to test the accuracy of the ion torrent sequencing, a virus previously sequenced by pcr amplification was included to compare sequences generated by this method and by sequencing pcr amplicons. this virus, a bvdv 1b strain isolated from alpaca (genbank accession jx297520.1; table 2, library 3, barcode 10), was assembled from ion torrent data and was found to have only 1 base difference from the sequence determined earlier (data not shown). this base was ambiguous in the amplicon sequenced by sanger chemistry (equal peak height of a and g) but was clearly a g by ion torrent sequencing. the sequences from two libraries were analyzed to determine the source of any non-viral sequences. the remaining non-viral sequences were assembled into contigs and were used in blast searches of genbank. this revealed that the non-viral sequences were derived from host nucleic acids, both dna and rna. the rna sequences were primarily ribosomal rnas but some were from more abundant mrna transcripts. assembly of genomic sequences was done using template-assisted assembly. bvdv and bocv genomic sequences used as templates for assembly were obtained from genbank (accession numbers jn380086.1 and ef424617.1, respectively). all of the individual virus library sequences resulted in assembled sequences that spanned at least 94% of the genome from which they were derived (table 2). in libraries 1 and 2, bvdv genomic sequences were determined for 20 virus isolates.
the percentage of viral sequences for each barcoded library ranged from 34.8 to 87.4%. bvdv assemblies contained no internal gaps but the extreme 5' and 3' termini were not present, with the exception of the 3' sequences of four viruses (table 2). the number of nucleotides missing from the termini ranged from 1 to 98 bases. one virus, library 1, barcode 9, had only 658 viral sequence reads but 94.4% of the genome was assembled. this genomic assembly had 8 internal gaps, ranging from 10 to 219 nucleotides. the other 19 viruses had no assembly gaps and genomic coverages of 99.1-99.9%, depending on variable coverage of the genome termini. the low number of sequences for library 1, barcode 9 was believed to be a result of a pipetting error during the quantitation or pooling of the 10 genomic libraries. the results of sequencing of 19 bocv isolates are listed in table 2, under libraries 3 and 4. the coronavirus genomic rna is 2.5 times longer than that of bvdv, roughly 31,000 nucleotides versus 12,300. thus, it required more sequences overall to assemble the complete genome to the same coverage depth. additionally, because bocv typically grows to a lower titer, there were lower ratios of virus:host sequences, with viral sequences ranging from 2.7 to 37.8% of the total sequences. a major difference in genome assembly between bvdv and bocv was that, with only a few exceptions, both the 5' and 3' terminal sequences were determined for each bocv isolate. also, due to the longer length of the bocv genome, more assembly gaps were observed. however, in all cases, assembled sequences spanned more than 97% of the viral genome. more assembly gaps were observed in individual virus libraries having fewer total sequences. the bvdv isolate sequenced in library 3, barcode 10 gave results similar to those observed with bvdv isolates sequenced in libraries 1 and 2.
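the relationship between genome length and the number of reads needed for a given coverage depth can be made concrete with a short calculation. the sketch below uses the approximate genome lengths given in the text; the target depth and mean read length are illustrative assumptions and do not come from the study.

```python
import math

BVDV_GENOME_LEN = 12_300  # nucleotides, approximate value from the text
BOCV_GENOME_LEN = 31_000  # nucleotides, approximate value from the text


def reads_for_depth(genome_len: int, depth: float, mean_read_len: float) -> int:
    """Viral reads needed so that (reads * read length) / genome length >= depth."""
    return math.ceil(genome_len * depth / mean_read_len)


# Assumed values for illustration only: 30x depth, 200-base reads.
bvdv_reads = reads_for_depth(BVDV_GENOME_LEN, depth=30, mean_read_len=200)
bocv_reads = reads_for_depth(BOCV_GENOME_LEN, depth=30, mean_read_len=200)
length_ratio = BOCV_GENOME_LEN / BVDV_GENOME_LEN  # about 2.5
```

because the required read count scales linearly with genome length, the ratio of reads needed for the two viruses equals the ratio of their genome lengths regardless of the depth and read length chosen.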
the advent of new dna sequencing technologies has led to the development of improved sequencing protocols for detection and characterization of viruses in environmental and diagnostic samples. many of these protocols use deep sequencing techniques to detect and identify novel, often uncultivable, viruses. this sequencing protocol was designed to sequence viruses rapidly that were isolated from clinical specimens and archived. often, a clinical report and history is available for isolates making it possible to begin association of genotype with specific phenotypes and time of isolation. particularly valuable is that this protocol confers the ability to sequence archival collections of specific viruses rapidly in order to study the evolution of the pathogen over time and determine the genotype of the viruses currently in circulation. this information is useful in determining relevance of vaccine strains based on sequences of currently circulating viruses. the viruses used for sequence analysis were all isolated and passaged in cell culture. thus, the abundance of host nucleic acids posed a problem in obtaining sufficient numbers of viral sequences to assemble the viral genomes. it was necessary to pre-digest virus samples with high levels of nucleases before purification of viral rna. the viral rna was protected within the intact virions while the host rna and dna were degraded. the rna purification procedure included a protease digestion step aiding in the removal of the nucleases before the viral rna was released from the virion. the nuclease digestion step aided in reaching as much as 87% of all sequences in a barcoded library being of viral origin. again, the titer of the virus in the original sample was also important in the number of viral sequences obtained. viruses that grow routinely to low titers require less than 20 libraries in the sequencing run to ensure sufficient reads in each barcoded library to assemble the genomes. 
an important aspect of this procedure was sequence independence, requiring no previous knowledge of sequence or pathogens present in a sample. the use of a 20 base barcode sequence with 8 base random nucleotides (victoria et al., 2008) served both to prime cdna synthesis and to differentiate each library following sequencing. the 20 base known sequence was also used for pcr amplification of the individual libraries. the ion torrent platform was chosen because it allowed variation in the depth of sequencing based on the size of the chip that was used in the sequencing run. for the most part, the 316 chip worked well, routinely giving sufficient sequencing depth to assemble the genomes of 20 viruses in each sequencing run. the ion torrent is known to have indels in the sequence data, especially associated with homopolymeric runs of bases (bragg et al., 2013). use of template assisted assembly of the genomes greatly assisted in producing accurate genome assemblies. limited numbers of de novo assemblies were done and low numbers of errors were found that were attributed to indels. the accuracy of sequences generated by this method was demonstrated by comparing them to sequences previously generated by sanger sequencing of pcr amplicons. only 1 nucleotide difference was detected in sequences generated using the two different methods, demonstrating the accuracy of the sequencing using the protocol described in this paper. the genomic sequences of the bvdv isolates were readily assembled from the sequence data, partially owing to the relatively small size of the genomic rna. however, the extreme 5' and 3' terminal nucleotides were difficult to obtain, most likely due to the high degree of secondary structure known to be present in these sequences. this may have made priming of first strand cdna synthesis near the termini more difficult. with the exception of one virus, all bvdv genomes were assembled with no internal gaps in the sequence.
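the way the 20 base barcode differentiates libraries after sequencing can be sketched as a simple prefix lookup. in the study this step was performed with the roche sfffile tool on sff files; the python below is only an illustration of the idea, and the function name, exact-match rule and "unassigned" bin are assumptions.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

BARCODE_LEN = 20  # barcode length stated in the text


def demultiplex(
    reads: Iterable[Tuple[str, str]],  # (read id, bases)
    barcodes: Dict[str, str],          # barcode sequence -> library name
) -> Dict[str, List[Tuple[str, str]]]:
    """Assign each read to a library by exact match of its first 20 bases."""
    bins: Dict[str, List[Tuple[str, str]]] = defaultdict(list)
    for read_id, seq in reads:
        library = barcodes.get(seq[:BARCODE_LEN])
        if library is None:
            # No exact barcode match: keep the read untouched for inspection.
            bins["unassigned"].append((read_id, seq))
        else:
            # Strip the barcode so only the insert sequence is kept.
            bins[library].append((read_id, seq[BARCODE_LEN:]))
    return dict(bins)
```

exact matching is the simplest possible rule; given the indel errors noted for this platform, a production tool would likely tolerate a small number of mismatches in the barcode.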
the bvdv genome that was not fully assembled had only 658 total viral sequences, yet this was sufficient to assemble greater than 94% of the genome. the low number of sequences was most likely due to a pipetting error during the pooling of the 10 individual genomic libraries. this illustrates the importance of accurate quantitation and pipetting of the individual libraries in making the dna pool used in the template preparation for sequencing. the bocv assemblies tended to have more assembly gaps, due in part to the longer genomic rna of these viruses. conversely, the genome termini of bocv were determined more often than those of bvdv, the reasons for which are unclear. the assemblies containing the most gaps were those that had lower numbers of total sequences. still, the sequences spanned greater than 97% of the total genome of all bocv isolates examined in this study. these gaps were easily filled by either resequencing the library or by pcr amplicon sequencing of selected regions. routinely, there was sufficient double-stranded dna to be included in a second pooled library if additional sequence data were necessary. in addition to sequencing archived viruses, this protocol can readily be applied to diagnostic applications. with modifications of the sample preparation, most clinical samples submitted to diagnostic laboratories can be used for analysis. one difference in a diagnostic application is that complete genome assembly would not be necessary; detection of specific viral sequences would be indicative of the presence of the virus. this would require, at least initially, de novo assembly and blast analysis to determine if viral sequences were present. preliminary studies showed that viruses can be successfully detected, and in some cases, near full-length genomic sequences assembled, from serum and fecal samples (data not shown).
additionally, with minor modification of the initial double-stranded cdna synthesis step, both dsrna and dna viruses can be sequenced. a deep sequencing method for rapid determination of genomic sequences of multiple viruses simultaneously is described. the practical application of this protocol was demonstrated by the generation of full-length or near full length genomic sequences of archived pestivirus and coronavirus isolates. with minor modifications, this protocol could be used to sequence multiple dsrna and dna viruses. this procedure is sequence independent, requiring no previous knowledge of sequence or pathogens present in a sample. additionally, this shows promise for the rapid identification of multiple viral pathogens in diagnostic settings and comparison of viruses currently in circulation with those in vaccines. molecular characterization of viruses from clinical respiratory samples producing unidentified cytopathic effects in cell culture a virus discovery method incorporating dnase treatment and its application to the identification of two bovine parvovirus species shining a light on dark sequencing: characterising errors in ion torrent pgm data identification and nearly full-length genome characterization of novel porcine bocaviruses a method for full genome sequencing of all four serotypes of the dengue virus metagenomic analysis of the viromes of three north american bat species: viral diversity among different bat species that share a common habitat identification of a novel virus in pigs-bungowannah virus: a possible new species of pestivirus genomic characterization of novel human parechovirus type bat guano virome: predominance of dietary viruses from insects and plants plus novel mammalian viruses the fecal viral flora of wild rodents acute diarrhea in west-african children: diverse enteric viruses and a novel parvovirus genus deep sequencing as a method of typing bluetongue virus isolates identification of a novel astrovirus in domestic 
sheep in hungary sequence-independent, single-primer amplification (sispa) of complex dna populations the fecal virome of pigs on a high-density farm rapid identification of known and new rna viruses from animal tissues metagenomic analyses of viruses in stool samples from children with acute flaccid paralysis viral nucleic acids in live-attenuated vaccines: detection of minority variants and an adventitious virus virome analysis for identification of novel mammalian viruses in bat species from chinese provinces virus identification in unknown tropical febrile illness cases using deep sequencing the authors would like to thank kathryn mcmullen, renae lesan and patricia federico for excellent technical assistance. additional thanks go to kerrie franzen and mary lea killian of the national veterinary services lab, aphis, usda for their help in sequencing of the libraries. key: cord-330067-ujhgb3b0 authors: huang, yi; lau, susanna k. p.; woo, patrick c. y.; yuen, kwok-yung title: covdb: a comprehensive database for comparative analysis of coronavirus genes and genomes date: 2007-10-02 journal: nucleic acids res doi: 10.1093/nar/gkm754 sha: doc_id: 330067 cord_uid: ujhgb3b0 the recent sars epidemic has boosted interest in the discovery of novel human and animal coronaviruses. by july 2007, more than 3000 coronavirus sequence records, including 264 complete genomes, are available in genbank. the number of coronavirus species with complete genomes available has increased from 9 in 2003 to 25 in 2007, of which six, including coronavirus hku1, bat sars coronavirus, group 1 bat coronavirus hku2, groups 2c and 2d coronaviruses, were sequenced by our laboratory. to overcome the problems we encountered in the existing databases during comparative sequence analysis, we built a comprehensive database, covdb (http://covdb.microbiology.hku.hk), of annotated coronavirus genes and genomes. 
covdb provides a convenient platform for rapid and accurate batch sequence retrieval, the cornerstone and bottleneck for comparative gene or genome analysis. sequences can be directly downloaded from the website in fasta format. covdb also provides detailed annotation of all coronavirus sequences using a standardized nomenclature system, and overcomes the problems of duplicated and identical sequences in other databases. for complete genomes, a single representative sequence for each species is available for comparative analysis such as phylogenetic studies. with the annotated sequences in covdb, more specific blast search results can be generated for efficient downstream analysis. coronaviruses are found in a wide variety of animals and are associated with respiratory, enteric, hepatic and neurological diseases of varying severity. based on genotypic and serological characterization, coronaviruses were divided into three distinct groups (1) (2) (3) . as a result of the unique mechanism of viral replication, coronaviruses have a high frequency of recombination (2, 4) . the recent severe acute respiratory syndrome (sars) epidemic, the discovery of sars coronavirus (sars-cov) and identification of sars-cov-like viruses from himalayan palm civets and a raccoon dog from wild live markets in china have led to a boost in interest on discovery of novel coronaviruses in both humans and animals (5-9) ( figure 1 ). for human coronaviruses, a novel group 1 human coronavirus, human coronavirus nl63 (hcov-nl63) was reported in 2004 (10, 11) , while we described the discovery, complete genome sequence and genetic diversity of a novel group 2 human coronavirus, coronavirus hku1 (cov-hku1) in 2005 (4, (12) (13) (14) . as for animal coronaviruses, six group 1 (15) (16) (17) , four group 2, including bat sars-cov and two new subgroups of group 2 coronaviruses (6, 8, 18, 19) , and 11 group 3 (20-23) coronaviruses have recently been described. 
by july 2007, more than 3000 coronavirus sequence records, including a total of 264 complete genomes, are available in genbank (24) . among the 25 coronavirus species with complete genome sequence available, six were sequenced by our group, including cov-hku1 and bat sars-cov (13, 16, 18, 19) . furthermore, we defined two novel subgroups of group 2 coronavirus (18) . during the process of batch sequence retrieval for comparative genome analysis of the coronavirus genomes that we sequenced, we encountered several major problems about the coronavirus sequences in genbank as well as other coronavirus databases (coronaviridae bioinformatics resource, http://athena.bioc.uvic.ca/database.php?db= coronaviridae; patric http://patric.vbi.vt.edu) (25) . first, in genbank, the non-structural proteins in the polyprotein encoded by orf1ab were not annotated. second, in all databases, for the non-structural proteins encoded by orfs downstream to orf1ab, the annotations are often confusing because they are not annotated using a standardized system. third, multiple accession numbers are often present for reference sequences (26) . these problems often lead to confusion when sequence retrieval is performed. fourth, coronaviruses, especially sars-cov, amplified from different specimens may contain the same genome or gene sequences. these sequences usually lead to redundant work when they are analyzed. in view of these problems, we started to develop our own database for coronavirus gene and genome sequences in 2005. in this database, covdb, we sought to create a user-friendly platform for efficient batch sequence retrieval, which is crucial for comparative genome analysis. in this article, we describe this comprehensive database of annotated coronavirus genes and genomes, which provides a central source of information about coronaviruses. to further increase the usefulness of covdb, commonly used bioinformatics tools were also included for analysis of the sequence data. 
sequence data. covdb is a web-based coronavirus database. covdb data are stored and managed in a mysql database management system. as of july 2007, covdb contains 3982 coronavirus sequences and one torovirus genome sequence. two hundred and sixty-four of them are complete genomes and the rest are partial genomes or genes. all data were retrieved from genbank using modules of bioperl. we annotated sequences without gene information or non-structural protein boundaries and labeled the 5' and 3' untranslated regions (utrs) of the genomes. as of july 2007, covdb contains 12 344 genes and utrs. information on coronavirus genome characteristics. in addition to the two sequence retrieval pages, covdb collects information on coronavirus sequence characteristics, including genome organization, a brief description of each complete coronavirus genome, gc content, polyprotein cleavage sites, transcription regulatory sequences, acidic tandem repeat sequences and known rna structures. these pieces of information can be accessed by clicking 'genome' in the top menu bar of covdb. in the 'tools' page, blast similarity search (27) against annotated coronavirus sequences in covdb can be performed and other commonly used tools are also provided. batch sequence retrieval. the main goal for setting up covdb is to provide a convenient and efficient platform for retrieving batches of coronavirus gene sequences. the interfaces of the database are simple and user friendly. all genes and genomes contain links to genbank and/or pubmed. covdb contains two main pages for sequence retrieval. from the homepage, one can enter the first main page for retrieval of complete genomes and their genes by clicking 'covdb' (figure 2a). from this page, users can obtain genes from specific coronavirus species by selecting the corresponding check boxes. we defined one representative genome from each species as the 'type strain'.
most of the time, this 'type strain' is the one assigned as the reference sequence in genbank. by choosing the 'type strain only' option, users can obtain one gene sequence per species and construct phylogenetic tree or perform other comparisons. an example of retrieving complete genome or a specific gene of complete genome of selected species is shown in figure 2b and c. from the page for retrieval of complete genomes and their genes, one can enter the second main page for retrieval of all complete and/or incomplete genes of a coronavirus ( figure 3a ) by clicking 'from all groups of genes'. in this page, all the gene sequences are grouped vertically according to which coronavirus group and subgroup they belong to, and horizontally by the names of the genes. the option 'exclude partial cds' can be used if only complete genes are required. an example of retrieving all the sequence of a particular gene for a group of coronavirus is shown in figure 3b . if the translated sequence of a selected gene has more than one stop codon which is probably due to sequencing error, the number in the 'length' column of this gene will be marked in red. polyprotein annotation. in all coronavirus genomes, orf1ab occupies two-thirds of the genome and it is translated as a polyprotein. this polyprotein is posttranslationally cleaved by 3c-like protease (3cl pro ) and papain-like protease (pl pro ) into 15-16 non-structural proteins. some of the non-structural proteins, such as rna-dependent rna polymerase, helicase, 3cl pro and pl pro are essential for replication or virulence of the coronavirus, although the functions of others are still unclear. due to the essentiality of the non-structural proteins, these sequences are often used for evolutionary analysis, primer design, etc. however, except for the reference sequences, detailed cleavage site information is not provided for the non-structural proteins in other sequences in genbank. 
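the stop-codon check mentioned above for the gene retrieval page (a gene is flagged when its translated sequence contains more than one stop codon) can be sketched as follows. this is a minimal illustration assuming the standard genetic code and translation in the first reading frame; the function names are hypothetical and covdb's actual implementation is not described in the text.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G
# (first base varies slowest, third base fastest).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(cds: str) -> str:
    """Translate a coding sequence in frame 1; unknown codons become 'X'."""
    cds = cds.upper().replace("U", "T")
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    return "".join(
        CODON_TABLE.get(cds[i:i + 3], "X") for i in range(0, usable, 3)
    )


def has_extra_stops(cds: str) -> bool:
    """True if the translation contains more than one stop codon ('*')."""
    return translate(cds).count("*") > 1
```

a complete gene normally ends with exactly one stop codon, so a count above one points to an internal stop, which the text attributes to probable sequencing error.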
since it has been shown that 3cl pro and pl pro of coronavirus cleave at conserved specific amino acids, the putative cleavage sites of the 15-16 non-structural proteins can be predicted by multiple sequence alignment. using these pieces of information, we have annotated these non-structural proteins in all the coronavirus sequences for easy retrieval in covdb. protein/gene name unification. by convention, all non-structural proteins in the polyprotein encoded by orf1ab are named as 'nsp', with each protein numbered consecutively starting from the 5' end (nsp1-nsp16). the structural proteins after the polyprotein are hemagglutinin esterase (he, in group 2a coronaviruses), spike glycoprotein (s), envelope protein (e), membrane protein (m) and nucleocapsid protein (n). however, there is no unified naming system for the non-structural proteins encoded by orfs downstream to orf1ab. this lack of a unified system greatly reduces the stability and accuracy of ortholog retrieval. in covdb, with the aim of facilitating gene retrieval, we tried to unify the naming of these non-structural proteins from different groups of coronaviruses. on the other hand, we have also tried to avoid radical changes in the names that may lead to confusion. in covdb, these non-structural proteins are named as ns2a, ns3x, ns4x, ns5x and ns7x (x = a, b, c, ...). ns2a denotes the orf between orf1ab and he of group 2a coronaviruses. ns3x denotes the orfs between s and e of groups 1, 2c, 2d and 3 coronaviruses. in most of these coronaviruses, there are two ns3x, named ns3a and ns3b. however, in group 1 coronaviruses, the genomes of some members (e.g. hcov-nl63, pedv) contain only one orf between s and e.
when we compared their putative amino acid sequences to the corresponding ones in other group 1 coronavirus genomes using blast, as well as searching for conserved domains using motifscan, results showed that the putative proteins encoded by these orfs belonged to a protein family in pfam originally assigned as 'corona_ns3b' (accession number pf03053). therefore, we named these orfs as ns3b. ns4x denotes the orfs between s and e of group 2a coronaviruses. ns5x denotes the orfs between m and n of group 3 coronaviruses. one exception is ns5a of group 2a coronaviruses. traditionally, this name denotes an orf upstream of e in group 2a coronaviruses. therefore, we have kept this name for that orf in covdb. ns7x denotes the orfs downstream of n gene. it is important to note that due to variations in genome organizations among different groups of coronaviruses (table 1) , ns genes with the same name in different coronavirus groups may not be orthologs of each other. the complete genome gene search page of covdb contains a link to a gene synonyms page, which includes a list of synonymous names of the various genes in the coronavirus genomes. identical sequence labeling. sequence redundancy is another problem of coronavirus sequences in public nucleotide databases. different strains of the same species from samples collected in different locations or at different times may possess completely or partially identical sequences. these sequences, though containing important epidemiological information, increase the workload during sequence analysis. in covdb, we compared all nucleotide sequences and labeled the identical ones to mitigate this problem. users can choose to show or not to show strains with identical sequences by clicking on the check boxes to the left of the page (figure 3b ). blast similarity search. 
during the process of coronavirus gene sequences analysis, we encountered a major problem when coronavirus gene sequences, especially those of orf1ab, were used for blast search against genbank or any other coronavirus databases. when part of the orf1ab gene (e.g. nsp5) is used as the query sequence, instead of getting the gene for the specific non-structural protein that the query sequence is homologous to, the results will only show that the hits are within orf1ab, or in some cases, shown to be within the entire coronavirus genome. much time will be needed for further analyzing the results manually in order to locate the positions of the cleavage sites of the corresponding genes for the nonstructural proteins, making it very inefficient for further downstream work. this problem has been overcome by the annotated sequences in covdb. the blast search page of covdb is an interface for facilitating coronavirus similarity search. the background support program, blastall, is from the ncbi blast package. the blast search page can be entered by clicking 'tools' in the top menu bar in any page of covdb. since all sequences in covdb are annotated, they can be grouped into different datasets for blast search. users can choose one of the three nucleotide and two protein sequence datasets as the database for comparison (figure 4) . the three nucleotide sequence datasets are: cov genes (nsp + genes after 1ab), cov genes (1ab + genes after 1ab) and cov genbank strains, which are the original sequences retrieved from genbank. the two protein sequence datasets are the translated sequences of the first two nucleotide datasets: cov proteins (nsp + aa after 1ab) and cov proteins (1ab + aa after 1ab). myblast. 'myblast' employs the same blast program as the blast page mentioned above. however, instead of selecting a predefined nucleotide or amino acid sequence database, multiple sequences can be pasted into the second sequence input box to generate a temporary sequence database. 
one or more query sequences can be pasted into the first sequence input box for blastn or blastp search against the temporary sequence database. orf finder for coronavirus. this orf finder is specifically designed for coronavirus genome analysis. the result page shows the positions and lengths of each putative orf and the position of the putative ribosomal frameshift site for translation of orf1ab. the nucleotide or amino acid sequences of the orfs can be shown by selecting the corresponding check boxes. to facilitate genome comparison and annotation, the most closely related coronavirus, which had been annotated in covdb, can be chosen from a pull-down list for comparison using blast search. this function is particularly useful for determining the range of nsp in orf1ab. figure legend: the first column is covdb gene id. in the uniq column, 'uniq' will be shown if there is no other identical sequence in covdb. otherwise, gene id of the sequences identical to it will be shown. rapid and accurate batch sequence retrieval is both the cornerstone and bottleneck for comparative gene or genome analysis. during the process of complete genome sequencing and comparative analysis of the various novel human and animal coronavirus genomes in the past 2 years, we have developed a comprehensive database, covdb, of annotated coronavirus genes and genomes, which offers efficient batch sequence retrieval and analysis. as shown by our experience in using covdb for comparative genome analysis of novel coronaviruses we have discovered (4, 13, 16, 18, 19), we find that covdb is more rapid and efficient than other existing coronavirus databases for batch sequence retrieval for the following reasons. first, we have performed annotation on all non-structural proteins in the polyprotein encoded by orf1ab of every single sequence.
second, annotation was performed for the non-structural proteins encoded by orfs downstream to orf1ab using a standardized system, with some exceptions given to some names that have been used for a long time so as to minimize confusion. third, all sequences with identical nucleotide sequences were labeled where one can choose to show or not to show strains with identical sequences. fourth, covdb contains not only complete coronavirus genome sequences, but also incomplete genomes and their genes. some genes of coronaviruses, such as pol, spike and nucleocapsid are sequenced much more frequently than others because they are either most conserved or least conserved. these gene sequences are particularly important for evolutionary analysis, single nucleotide polymorphism studies and design of primers for rt-pcr or quantitative rt-pcr amplification. covdb is constructed by the department of microbiology, the university of hong kong. it is available at no charge at http://covdb.microbiology.hku.hk. 
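the identical sequence labeling described above can be sketched as a grouping of gene ids by their exact nucleotide sequence. this is a minimal python illustration, not covdb's implementation (the data are held in mysql and were processed with bioperl); the function name and the input layout are assumptions. the output follows the convention of the 'uniq' column: 'uniq' when no other identical sequence exists, otherwise the ids of the identical sequences.

```python
from collections import defaultdict
from typing import Dict, List


def label_identical(sequences: Dict[str, str]) -> Dict[str, str]:
    """Map each gene id to 'uniq' or to the ids of its identical sequences."""
    # Group gene ids by case-normalized nucleotide sequence.
    groups: Dict[str, List[str]] = defaultdict(list)
    for gene_id, seq in sequences.items():
        groups[seq.upper()].append(gene_id)

    labels: Dict[str, str] = {}
    for ids in groups.values():
        for gene_id in ids:
            others = [g for g in ids if g != gene_id]
            # Comma-separated ids of duplicates, or 'uniq' if there are none.
            labels[gene_id] = ",".join(others) if others else "uniq"
    return labels
```

only completely identical sequences are grouped here; handling of the partially identical sequences mentioned in the text would need an additional comparison step that is not specified.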
figure legend: screenshot of blast similarity search page. five datasets can be chosen as the database for comparison. coronavirus genome structure and replication the molecular biology of coronaviruses molecular biology of severe acute respiratory syndrome coronavirus comparative analysis of 22 coronavirus hku1 genomes reveals a novel genotype and evidence of natural recombination in coronavirus hku1 isolation and characterization of viruses related to the sars coronavirus from animals in southern china the genome sequence of the sars-associated coronavirus coronavirus as a possible cause of severe acute respiratory syndrome characterization of a novel coronavirus associated with severe acute respiratory syndrome relative rates of non-pneumonic sars coronavirus infection and sars coronavirus pneumonia a previously undescribed coronavirus associated with respiratory disease in humans identification of a new human coronavirus in silico analysis of orf1ab in coronavirus hku1 genome reveals a unique putative cleavage site of coronavirus hku1 3c-like protease characterization and complete genome sequence of a novel coronavirus, coronavirus hku1, from patients with pneumonia clinical and molecular epidemiological features of coronavirus hku1-associated community-acquired pneumonia molecular diversity of coronaviruses in bats complete genome sequence of bat coronavirus hku2 from chinese horseshoe bats revealed a much smaller spike gene with a different evolutionary lineage from the rest of the genome prevalence and genetic
diversity of coronaviruses in bats from china comparative analysis of twelve genomes of three novel group 2c and group 2d coronaviruses reveals unique group and subgroup features severe acute respiratory syndrome coronavirus-like virus in chinese horseshoe bats coronaviruses from pheasants (phasianus colchicus) are genetically closely related to coronaviruses of domestic fowl (infectious bronchitis virus) and turkeys coronavirus infection of spotted hyenas in the serengeti ecosystem molecular identification and characterization of novel coronaviruses infecting graylag geese (anser anser), feral pigeons (columbia livia) and mallards (anas platyrhynchos) isolation of avian infectious bronchitis coronavirus from domestic peafowl patric: the vbi pathosystems resource integration center ncbi reference sequences (refseq): a curated non-redundant sequence database of genomes, transcripts and proteins basic local alignment search tool conflict of interest statement. none declared. key: cord-302161-ytr7ds8i authors: lutz, mirjam; steiner, aline r.; cattori, valentino; hofmann-lehmann, regina; lutz, hans; kipar, anja; meli, marina l. title: fcov viral sequences of systemically infected healthy cats lack gene mutations previously linked to the development of fip date: 2020-07-24 journal: pathogens doi: 10.3390/pathogens9080603 sha: doc_id: 302161 cord_uid: ytr7ds8i feline infectious peritonitis (fip)—the deadliest infectious disease of young cats in shelters or catteries—is induced by highly virulent feline coronaviruses (fcovs) emerging in infected hosts after mutations of less virulent fcovs. previous studies have shown that some mutations in the open reading frames (orf) 3c and 7b and the spike (s) gene have implications for the development of fip, but mainly indirectly, likely also due to their association with systemic spread. 
the aim of the present study was to determine whether fcov detected in organs of experimentally fcov infected healthy cats carry some of these mutations. viral rna isolated from different tissues of seven asymptomatic cats infected with the field strains fcov zu1 or fcov zu3 was sequenced. deletions in the 3c gene and mutations in the 7b and s genes that have been shown to have implications for the development of fip were not detected, suggesting that these are not essential for systemic viral dissemination. however, deletions and single nucleotide polymorphisms leading to truncations were detected in all nonstructural proteins. these were found across all analyzed orfs, but with significantly higher frequency in orf 7b than orf 3a. additionally, a previously unknown homologous recombination site was detected in fcov zu1. feline coronaviruses (fcovs) are endemic in the domestic cat population and are found worldwide with high seroprevalences. fcovs are positive-stranded enveloped rna viruses with a non-segmented genome of~29 kilobases (kb) encoding for a polymerase (replicase complex 1a and 1b), four structural proteins (spike (s), membrane/matrix (m), envelope (e), and nucleocapsid (n)), and five nonstructural/accessory proteins (nsp) 3abc and 7ab [1] . due to infidelity of the rna polymerase, fcovs show high mutation rates upon replication. this leads to the formation of quasispecies, a cloud-like appearance of multiple genetic virus variants that are linked by mutations [2, 3] . virus variants that acquire high virulence can lead to feline infectious peritonitis (fip), an immune-mediated disease that is currently the most frequent fatal infectious disease in young pedigree cats and cats in shelters [4, 5] . however, the pathogenesis of fip is still not fully understood [6] . fcovs occur as two pathotypes, the low virulence so-called feline enteric coronaviruses (fecv) and the virulent fip viruses (fipv). 
the former dominates since fcov infection is via the fecal-oral route and primarily targets the intestine; it does, if at all, only induce mild enteritis. however, 4-5% of adult cats and 5-10% of kittens develop fip at some point after infection [7, 8] as a consequence of de novo occurrence of highly virulent fcovs that arise from low virulent fcovs by mutations in the individual infected cat [9] ; these are generally not horizontally transmitted. it was initially thought that the essential process for the development of fip would be a broadening of the viral target cell spectrum from enterocytes to also include monocytes/macrophages, which would be mediated by specific viral mutations and would allow systemic infection, leading to fip [10] . in vitro studies subsequently revealed that it is effective and sustainable replication in macrophages rather than the capacity to enter the cells that confers virulence of fcovs [11, 12] . also, it was soon shown that healthy carrier cats become systemically infected and carry low amounts of virus in different organs [13, 14] . irrespective of the pathotype, fcovs can be classified into two serotypes based on their neutralization reactivity with s protein-specific monoclonal antibodies [15, 16] . serotype ii fcovs have arisen from a double rna recombination event between canine coronavirus (ccov) and serotype i fcovs [17] . only type ii fcov can be efficiently propagated in cell cultures [5] . however, viruses of both serotypes can cause fip. previous studies found deletions in open reading frame (orf) 3c and mutations in orf 7b in fcovs from organs of cats that succumbed to fip [9, 18] . these were not found in fecal fcov isolates from healthy carrier cats and were hence considered to be linked to the fip pathotype and, consequently, to the development of fip. in fact, an intact orf 3c was shown to be critical for replication in enterocytes and thus, shedding of the low virulent pathotype fcovs [19, 20] . 
the fact that the deletions and mutations were unique to each cat supported the de novo mutation theory [21] . fcovs found in fip often exhibit mutations in orf 3c that lead to truncation of the protein; however, a significant proportion (up to 40%) of fipv still have an intact orf 3c [9, 18, 20, 22, 23] . hence, a virulence switch cannot be explained with orf 3c deletions alone [18, 20] . nevertheless, mutations that affect the expression of the accessory proteins might contribute to increased viral replication in monocytes/macrophages [24, 25] and, thereby, to the development of fip. conflicting evidence exists regarding the role of orf 7a and b. while some studies found deletions in orf7 to be associated with reduced in vivo virulence or in vitro replication [24] [25] [26] [27] , others did not detect such links [28] [29] [30] . a high n protein diversity was identified as evidence of quasispecies formation, but not as a marker of pathotypes [31] . mutations in the e or m genes were not found to be critical, neither in vivo [21] nor for in vitro replication [30] . the fcov s glycoprotein mediates viral cell entry through receptor binding and is thus responsible for cell tropism [32] . early in vitro studies found that macrophage tropism can be promoted by introducing the s protein of a highly virulent type ii fcov strain into its low virulent counterpart, suggesting that this protein plays a role in virulence. however, the study did not associate a distinct mutation with the new features of the recombinant strain [30] . specific s gene mutations were only identified in a more recent full genome sequencing approach, analyzing fcovs of both pathotypes. the majority of fipv pathotype fcovs were found to carry a mutation in a nucleotide position that was associated with an amino acid change in the putative fusion peptide of the s protein [33] . 
the same mutation plus a substitution leading to an amino acid change in the heptad repeat 1 (hr1) region, possibly leading to an altered fusogenic activity of the s protein and thus an altered cellular tropism, was found in another study [34]. however, it was subsequently shown that the amino acid change described by chang and coworkers [33] correlates with systemic spread of the virus and not directly with fip [35]. another study examined a furin cleavage site between the receptor binding and fusion domain of the s gene. the site was found to be conserved in less virulent fcovs, whereas fip pathotype fcovs exhibited a higher variability in the core motif and surrounding residues. these mutations alter cleavability of the s protein by furin. depending on the exact position and the exchanged nucleotide, cleavage was either enhanced or suppressed [36]. again, the study design did not consider systemically infected healthy carrier cats in this context. reverse genetics approaches with recombinant chimeric viruses in which genes of an avirulent (fcov black) strain were successively substituted by genes from a highly virulent serotype ii fipv strain (79-1146) led to seroconversion and systemic infection, but did not induce fip [37, 38]. despite the fact that many comparative studies have described mutations only present in fip pathotype fcovs, their relevance for the development of fip remains unclear. based on the hypothesis that certain mutations are essential for the capacity of fcovs to spread systemically, the present study investigated a cohort of systemically infected healthy carrier cats at different time points post experimental infection for the presence of a range of mutations in the genes encoding for the s protein, nsp 3abc, and nsp 7b, which have been shown to have implications for the development of fip.
it also assessed the changes in the swarm composition of the challenge stock virus present in the organs to determine if its complexity would increase or rather decrease with time. to track viral sequence mutations in organs of healthy fcov carrier cats, we investigated fcov sequences detected in the colon, liver, and thymus, as well as feces of seven experimentally fcov infected cats. in two animals, the thymus was found to be negative by fcov reverse transcription quantitative real-time pcr (rt-qpcr) and lung and tonsil were included as alternatives (table 1) . fcov rt-qpcr was performed from undiluted and 10-fold diluted rna samples to detect any potential pcr inhibition and the rna dilutions with the lower cycle threshold (ct) value, corresponding to the higher viral load, were selected for analysis by one-step rt-pcr (table 1) for further cloning and sequencing. due to low tissue viral loads and/or lack of primer hybridization due to possible sequence mismatches, in five of the 28 samples, the analysis of mutations in either orfs 3abc or 7b was not possible (cat 3257b: amplicon nsp7ab from liver and tonsil; cat d4: amplicon nsp7ab from the liver; cats y1 and 3138b: amplicons s-e and s-m from the thymus, respectively). in total, 51 sequences were obtained. the plan was to sequence three colonies from each amplicon. due to the often low concentrations of eluted viral cdna, this was not possible for all samples. however, sequences for 129 of the expected 168 amplification reactions were obtained. we determined the sequence of the putative fusion region of the s protein from at least one amplicon from each organ of five cats (3132b, 3138b, 3312b, d4, y1). 
the one-step rt-pcr for the s gene already yielded an amplicon of the expected length (615 bp) from the colon of five cats (3132b, 3138b, 3312b, d4, y1) and the liver of cat 3138b, and after the nested pcr, amplicons with the expected length (134 bp) were obtained from the liver of four cats (3132b, 3312b, y1, 3257b), the thymus of four cats (3132b, 3138b, 3312b, y1), and the tonsil of cat 3257b. from one cat (y2), no amplicon was obtained. to determine how a virus swarm changes in its composition and viral sequences during in vivo infection and with systemic spread, the genomic sequences of fcovs in the tissue and fecal samples were compared to those of challenge stock virus. the number of sequences identical to the consensus was always higher in the latter than in the tissue/fecal counterparts of the same strain and orf (table 2). the number of single nucleotide polymorphisms (snps) per nucleotides analyzed did not differ significantly between challenge stock virus and tissue/fecal viruses. however, a tendency towards higher mutation frequencies was observed for orf7b in the tissue/fecal viruses compared to the challenge stock virus (table 2, figure 1). interestingly, zu3 orf3 showed higher mutation frequencies in the stock virus than in the tissue/fecal viruses. to determine whether there was a higher selective pressure on a certain gene after in vivo infection, we compared the mutation frequencies and resulting amino acid changes of genes 3a, 3b, 3c, and 7b (table 3). the variability per gene of the tissue/fecal fcovs of each cat and in the challenge stock virus was small and ranged from 0.051 to 0.553%.
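the two swarm measures used above (sequences identical to the consensus, and snps per nucleotides analyzed) can be illustrated with a minimal python sketch. it assumes that the clone sequences of one amplicon are already aligned to equal length and that only substitutions are counted; the function names `consensus` and `swarm_summary` and the toy sequences are hypothetical and are not taken from the study.

```python
from collections import Counter


def consensus(seqs):
    """Majority-rule consensus of pre-aligned, equal-length sequences."""
    if len({len(s) for s in seqs}) != 1:
        raise ValueError("sequences must be aligned to the same length")
    # For each alignment column, keep the most frequent nucleotide.
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*seqs))


def swarm_summary(seqs):
    """Summarize a set of clone sequences relative to their own consensus."""
    cons = consensus(seqs)
    # Count every position at which a clone differs from the consensus.
    snps = sum(a != b for s in seqs for a, b in zip(s, cons))
    nucleotides = sum(len(s) for s in seqs)
    return {
        "consensus": cons,
        "identical_to_consensus": sum(s == cons for s in seqs),
        "snps": snps,
        "snp_frequency_percent": 100.0 * snps / nucleotides,
    }


# Toy data (illustrative only): four clones, one carrying a single SNP.
clones = ["atgcatgcat", "atgcatgcat", "atgcatgcat", "atgtatgcat"]
summary = swarm_summary(clones)
```

in this toy case, one snp in 40 analyzed nucleotides gives a frequency of 2.5%, far above the 0.051 to 0.553% range reported for the real data; the example only shows how such a percentage is formed.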
while the 3a, 3b, and 3c genes seemed to have equal variability, the 7b gene exhibited a significantly higher mutation frequency than the 3a gene (figure 2). we wanted to determine whether the length of infection had an effect on mutation frequencies. the overall comparison of fcov gene sequences from the different cats euthanized at different time points after infection did not reveal any significant differences (table 3). nevertheless, when the cats [...] previous studies have provided evidence of a link between truncated 3c proteins and the development of fip [9, 18]. therefore, we investigated the effect of the mutations and deletions found in the present study on the encoded proteins. table 4 shows all deletions and snps detected in viral sequences from tissue and fecal samples and the challenge stock virus that led to a major alteration, i.e., leading to more than just one amino acid change, of the encoded protein. deletions were detected in five of the 129 viral gene tissue/fecal sequences. most mutations of viral sequences led to truncated 7b proteins. two deletions and one snp led to truncated 3a and 3c proteins. two snps and one deletion caused the start codon or the stop codon to disappear. the challenge virus only contained one truncated 3b protein.
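the kinds of major alteration listed above (premature stop codon, loss of the start codon, loss of the stop codon) can be checked mechanically once an orf sequence is available. the following python sketch is an illustration under stated assumptions, not the procedure used in the study: it assumes the input spans the annotated orf from its start codon to its stop codon, it uses the standard genetic code, and the name `classify_orf` and the toy sequences are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in t, c, a, g order.
BASES = "tcag"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def classify_orf(seq):
    """Classify an ORF sequence as intact, truncated, or lacking start/stop."""
    seq = seq.lower()
    if not seq.startswith("atg"):
        return "start codon lost"
    # Split into complete codons; a trailing partial codon is ignored.
    codons = [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]
    protein = "".join(CODON_TABLE.get(c, "X") for c in codons)
    stop = protein.find("*")
    if stop == -1:
        return "stop codon lost"
    if stop < len(codons) - 1:
        return f"truncated after {stop} residues"
    return "intact"


# Toy sequences (illustrative only).
reference = "atggccttaaagcattaa"        # M A L K H stop
frameshift = "atggcttaaagcattaa"        # one-base deletion, early stop codon
no_start = "ttggccttaaagcattaa"         # SNP in the start codon
```

a single-base deletion shifts the reading frame, which is why the toy `frameshift` sequence runs into a stop codon after two residues; deletions in multiples of three would shorten the protein without truncating it in this sense.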
to determine the serotype of the strains fcov zu1 and fcov zu3 and their tissue/fecal derivate strains, we constructed bootstrap phylogenetic trees with fcov and ccov sequences from the ncbi database (genbank). consensus sequences of virus sequences from the same cat and origin and the consensus sequence of the challenge stock virus were aligned to the reference sequences from the genbank for genes 3a, 3b, 3c, and 7b (figure 4a-d, respectively). fcov zu3 and consensus sequences obtained from cat 3138b inoculated with fcov zu3 clustered with type i fcovs in all genes, whereas in tissue and fecal samples of cats inoculated with the fcov zu1 strain, only viral consensus sequences encoding the 3a gene clustered with type i fcovs. sequences related to fcov zu1 encoding genes 3c and 7b clearly clustered with type ii fcovs.
while sequences of the 3c and 7b genes could undoubtedly be assigned to serotype ii, because they clustered with the sequences of the corresponding serotype, this was not the case for the 3a and 3b gene. here, sequences clustered together but could not be unquestionably assigned to a given serotype. therefore, we speculate that the 3a and/or 3b gene of the challenge virus fcov zu1 harbors a new recombination site. based on sequence alignments with defined serotype i (fecv-ucd3 (fj943761)) and serotype ii (fipv 79-1146 (dq010921)) sequences, the recombination site could be tentatively mapped to the overlapping region between orf3a and orf3b (figure s1).
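one common way to narrow down a recombination site from such alignments is to compare the query with each reference in consecutive windows and to look for the region where the closer reference changes. the python sketch below illustrates this idea only; it is not the analysis behind figure s1. it assumes three pre-aligned, gap-free sequences of equal length, and the function names, window sizes, and toy sequences are hypothetical.

```python
def identity(a, b):
    """Fraction of identical positions between two equal-length strings."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def window_assignments(query, ref_i, ref_ii, window=4, step=4):
    """Label each window by the reference the query resembles more closely."""
    if not (len(query) == len(ref_i) == len(ref_ii)):
        raise ValueError("all sequences must be aligned to the same length")
    labels = []
    for start in range(0, len(query) - window + 1, step):
        end = start + window
        id_i = identity(query[start:end], ref_i[start:end])
        id_ii = identity(query[start:end], ref_ii[start:end])
        # "?" marks windows that cannot be assigned to either reference.
        label = "I" if id_i > id_ii else "II" if id_ii > id_i else "?"
        labels.append((start, label))
    return labels


# Toy sequences (illustrative only): the query switches from the
# type I-like to the type II-like reference in the middle.
ref_type_i = "aaaaaaaaaaaa"
ref_type_ii = "cccccccccccc"
query = "aaaaaacccccc"
result = window_assignments(query, ref_type_i, ref_type_ii)
```

with the toy data, the first window is assigned to reference i, the last to reference ii, and the middle window is ambiguous, which is where a breakpoint would be suspected. real data would need longer windows and a statistical test before any such claim.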
to evaluate the sequence evolution patterns of the fcov zu1 and zu3 challenge stock viruses for both the 3abc and 7b genes in each cat (i.e., at different time points post infection), bootstrap phylogenetic trees were constructed, with each available sequence obtained from different colonies in relation to the respective challenge stock virus consensus sequence (supplementary figures s2-s5) . in cats 3132b, 3138b, 3312b ( figures s2 and s3) , and d4 ( figures s4 and s5 ), the variability between the different sequences was very small and no subtree resulted from the analysis of either gene. the virus variation seemed to be evenly distributed across tissue and fecal samples independent of the time interval between infection and euthanasia (14, 28, or 48 days) . in cat y2, the gene 3abc based analysis revealed an equal distribution of virus variability in all examined tissue and fecal samples ( figure s4 ). however, for the 7b gene, the analysis revealed a higher variability, with viral sequences from different origin grouping to a subtree. the evolutionary pressure on the 7b gene seemed to be similar in all samples of this cat ( figure s5 ). phylogenetic analysis of cat y1 sequences based on genes 3abc and 7b showed subtrees in both genes. the sequence variability in the 3abc genes was smaller than in the 7b gene ( figures s4 and s5 ). analysis based on the 7b gene led to three different subtrees where viral sequences of the same tissue and feces grouped together and were clearly distinct from each other. also, in cat 3257b, the variability in the 7b gene was slightly higher than in the 3abc genes ( figures s4 and s5 ). trees constructed with both genes formed subtrees. two colonies sequenced from the colon of this cat grouped together when the analysis was based on the 3abc genes. analysis of the 7b gene yielded two subtrees, one consisting of two sequences from the feces and one of each feces and a colon sequence. 
the genetic difference of the viral sequences identified in the colon of cat 3257b to fcov zu1 (figures s4 and s5 ) was the highest of all sequences analyzed, potentially reflecting the longer time span after infection (80 days) in this animal. to assess if the fcov zu1 strain evolved divergently in different sites (tissues and/or feces) of individual cats, bootstrap phylogenetic trees were constructed with all sequences from feces and tissues. figures 5 and 6 show the analysis based on 3abc and 7b genes, with all viral sequences of the same origin. also, these trees revealed a higher variability in the viral 7b genes ( figure 6 ) compared to the 3abc genes ( figure 5 ). with the exception of the tree based on the analysis of the 3abc genes from thymus, all trees generated subtrees (figure 5: colon; feces; liver). subtrees consisted of sequences originating from the same cat, except for one subtree for liver sequences where two sequences of cat y1 and y2 grouped together ( figure 5 : liver). sequences that diverged in the analysis based on genes 3abc did not originate from the same animals and/or colonies as those diverging in the analysis based on gene 7b. sequences with mutations that led to truncated proteins did not cluster together but were rather dispersed and the majority were not represented in the subtrees ( figures 5 and 6: red dots) . when the analysis was based on genes 3abc of the colon, feces, and liver of cat 3132b, i.e., the cat that was sacrificed after the shortest time post infection (14 days), the sequences always formed a subtree that was clearly distinct from the challenge virus fcov zu1 and sequences found in other cats infected with the same virus (figure 5: colon; feces; liver). this was not seen in the 7b gene analysis (figure 6 : colon; feces; liver). the trees based on the 7b gene indicated that in cats y1, y2, d4, and 3257b, the 7b genes were generally more distant from parent fcov zu1 than from those of the other cats. 
figure 6. phylogenetic analysis based on the sequences encoding for the nonstructural protein 7b of all sequences of colon, feces, liver, or thymus, respectively; panels: orf 7b colon, orf 7b feces, orf 7b liver, orf 7b thymus (red dot, sequence carries a deletion that leads to a premature stop codon (3312b c3, y1 c2, d4 c1); blue dot, snp causes the disappearance of a start codon (y1 c2, y1 c1); bar, mean number of differences per 1000 sites).
unfortunately, none of the nested pcr products could be sequenced, neither by direct sequencing nor after cloning. however, sequences were obtained for five of the six samples that were positive after the first rt-pcr step. four of these sequences derived from the colon and one from the liver. all sequences had an atg codon at position 1058, resulting in a methionine residue (figure 7).
the fipv c1je strain exhibits a ttg codon at this position, resulting in a leucine residue [39]. at position 1060, all sequences had a tct codon. none of our samples exhibited the serine to alanine mutation described in a minority of fipvs at this position [33]. other snps detected in the immediate surrounding of position 1058 and position 1060 in our samples did not alter the amino acid composition. overall, none of the mutations identified by chang and coauthors in the putative fusion region of the s protein were present [33]. during the last years, many studies have been performed with the aim to identify mutations responsible for the virulence switch in fcovs towards the fipv pathotype. there has been evidence that mutations in accessory genes and the s gene of fcovs are associated with fip development. so far, most of these studies compared less virulent fcovs from healthy cats with fcovs in animals suffering from fip [18, 20, 22, [40] [41] [42] [43]. in the present study, we investigated the orfs 3abc and 7b and the s gene mutations of viral sequences identified in different organs and feces of healthy cats that had been experimentally infected with different fcov field strains and had developed a systemic fcov infection [14]. in the original study, the challenge virus used for experimental infection had been isolated from the feces of naturally infected cats. since fcovs show high mutation rates, due to the infidelity of their rna-dependent rna polymerase and homologous recombination events [44], various fcov variants are transmitted in natural fcov infections [45]; therefore, due to our experimental setup, we hypothesized a similar scenario. such virus swarms are called quasispecies, defined by a dominant nucleotide sequence and its mutant spectrum.
evolutionary selection acts on the entire quasispecies rather than on single viral mutants [46] . in our study, we compared the swarm composition of the challenge stock viruses used for infection to the viruses detected in organs and feces at different time points after infection, focusing on the sequences of accessory genes. in parallel, we investigated the s genes for the presence of mutations described to have an impact on the development of fip [33] . the analysis of the accessory genes of more than 20 colonies of the challenge stock viruses confirmed the existence of one dominant sequence and several mutated derivates in each orf and strain (data not shown); this observation aligns with the quasispecies theory. in the viral genomes detected in the tissue and fecal samples of the infected cats, the extent of sequences identical to the consensus sequence was lower than in the challenge stock virus. additionally, viral sequences identified in cats that were euthanized at a later time point after infection overall showed significantly more variation than those from cats that were euthanized earlier, although due to the small number of cats analyzed, this significance could subsequently not be assigned to a specific accessory gene. a limitation for the interpretation of the results is the fact that only few cats (n = 7) were included in this study, of which just one was infected with the fcov zu3 strain. furthermore, two of these animals had been subcutaneously vaccinated with the inactivated fcov zu1 strain before challenge with the homologous strain. these cats were included in the study to add a further time point to monitor mutation frequencies over time. as the vaccine was inactivated and unable to replicate, we did not expect to find it in the tissues and feces. 
the animals seroconverted and started to shed virus in the feces only upon challenge and at the same time, with no differences in titers between vaccinated and non-vaccinated animals. for these reasons, we did not expect to see a difference in mutation frequencies due to vaccination, but we also cannot completely exclude it. only three colonies per amplicon of each tissue/fecal sample were sequenced. this sequencing approach was most likely picking the most represented sequences, which might not be enough to appropriately characterize the quasispecies swarm.
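the limitation of sequencing only three colonies can be made concrete with a simple sampling argument. if a variant makes up a fraction f of the swarm and each colony is treated as an independent random draw, the chance that none of n colonies carries it is (1 - f)^n. the python sketch below works this out; the independence assumption ignores pcr and cloning bias, and the function names and example frequencies are hypothetical, not values from the study.

```python
import math


def prob_variant_missed(freq, n_colonies=3):
    """Probability that a variant at frequency `freq` appears in no colony."""
    if not 0.0 <= freq <= 1.0:
        raise ValueError("freq must be between 0 and 1")
    return (1.0 - freq) ** n_colonies


def colonies_needed(freq, detection_prob=0.95):
    """Smallest number of colonies that detects the variant with the given probability."""
    if not 0.0 < freq < 1.0 or not 0.0 < detection_prob < 1.0:
        raise ValueError("freq and detection_prob must lie strictly between 0 and 1")
    return math.ceil(math.log(1.0 - detection_prob) / math.log(1.0 - freq))


# Illustrative values: a variant present in 10% of the swarm.
missed = prob_variant_missed(0.10, n_colonies=3)   # 0.9 ** 3
needed = colonies_needed(0.10, detection_prob=0.95)
```

under these assumptions, a variant at 10% frequency is missed by three colonies about 73% of the time, and 29 colonies would be needed to detect it with 95% probability.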
a targeted high throughput sequencing study would allow for much greater sensitivity for detecting sequence variations in the virus population. nevertheless, taken together, the results point towards an expansion of the viral swarm with increasing infection time. yet, stabilization of the virus swarm over a longer period might subsequently occur. this has been shown in a previous study, which also provided evidence that persistently infected cats are likely no source of novel virus variants, since they carry highly conserved fcov swarms [45] . single nucleotide polymorphisms (snp) were found randomly scattered across all accessory genes without defined patterns (hot spots) both in the viral rna from tissues/feces and the challenge viruses. this suggests that they are a consequence of the infidelity of the viral rna-dependent rna polymerase. the present study found a generally higher mutation rate in the 7b gene than in the 3abc genes; the difference was significant for orf 3a. this confirms previous findings suggesting that the amino acid sequence of the nsp 7b is less conserved than that of other nsps [27, 47] . a bias could have been introduced since a non-proofreading taq polymerase was used for pcr amplification to increase the efficiency of the one-step rt-pcr because of the low viral loads in the tissues. but as sequencing errors are supposed to be equally randomly distributed across the genes, this would not impact on the difference of the mutation frequencies. interestingly, we found deleterious mutations and snps that led to truncated nsp 7b proteins in four viral sequences from the feces and colon of different cats infected with the fcov zu1 strain. truncated 7b proteins were previously shown to correlate with attenuated virulence: four well-characterized fcov strains (fecv 79-1683, fipv tn406-hp, fipv ucd2, and fipv df2) were found to be avirulent after they had acquired orf 7b gene deletions during in vitro passage in cell culture. 
in contrast, field fcov isolates consistently carried an intact 7b gene regardless of their pathotype [21]. thus, it was postulated that an intact 7b gene confers a selective advantage in natural infection but is not necessary for in vitro growth [27]. our study detected truncated 7b proteins solely in either feces or the colon, i.e., the main site of replication and persistence of less virulent fcovs, possibly representing virus variants that were indeed of low virulence and potentially even unable to infect monocytes/macrophages and to spread systemically. this would support the hypothesis that orf 7b mutations are not involved in the development of fip [29]. we also identified two sequences each with truncated 3a, 3b, and 3c proteins. these results differ from those of a previous study which sequenced the structural and accessory genes of fcov from feces and organs of fip cats and mainly found truncated 3c proteins in the diseased tissue of fip cats [21]. the same study also claimed that orfs 3a, 3b, and 7a show the least variability among the accessory and structural genes. the present study does not support this assumption, but rather indicates that the variability of the orf 3 genes is equally distributed at least in systemic infections with fcovs of low virulence. at the same time, the absence of orf 3c deletions and orf 7b mutations that have previously been linked to the development of fip in fcovs identified in the organs of our cats provides indirect support for the previous hypothesis that these might indeed be a feature of fipvs [19]. however, since healthy carrier cats also harbor virus in different organs, they are likely not a prerequisite of systemic spread in an infected animal. the orfs 3abc and 7b based phylogenetic analysis showed clustering of fcov zu3 with serotype i fcovs, whereas fcov zu1 sequences clustered with serotype ii fcovs.
this was remarkable since both viruses were identified as type i fcovs after parts of the s gene had been sequenced in 2004 (dq256137 and dq256139) [14] . serotype definition is commonly based on the reactivity of antibodies to the s protein, hence it is generally accepted that fcov type i and ii can be distinguished based on the s gene sequences [15, 48] . the rna-dependent rna polymerase of covs is well known to be error-prone and to incorporate one mutation per 10 kb [49] ; this allows for three mutations per replication cycle of the 30 kb fcov genome and can lead to deleterious mutations that yield non-viable viruses. to overcome this problem, the viruses might rely on homologous recombination. the latter mostly occurs at specific hotspots in the genome, where secondary rna structures are formed that cause the polymerase to pause. four such hotspots, leading to double recombination events that result in type ii fcovs, have previously been identified; upstream ones in the polymerase sequence and downstream ones in genes e and m, respectively [17] . according to this configuration, after a double recombination event, the spike and nsp 3 genes should belong to type ii, whereas the nsp 7 should belong to type i. however, the exact locations of these recombination sites vary in the different strains, indicating that serotype ii fcovs continuously arise through independent recombination events [17, 50, 51] . therefore, homologous recombination can also generate new variants as long as they do not have evolutionary disadvantages [52] . this might also hold true for fcov zu1 where a recombination event in the 3a/3b gene resulted in a genome with type i fcov s gene, whereas the adjacent 3c gene as well as the 7b gene are those of a type ii fcov. controversial evidence exists concerning a set of defined s gene mutations first described in 2012 [33] in fipvs for which a role in the development of fip was suggested. 
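the error-rate arithmetic quoted above (one misincorporation per 10 kb, hence about three per copy of a 30 kb genome) can be made concrete with a short calculation. the sketch below is illustrative only: it assumes, beyond what the text states, that the number of mutations per genome copy follows a poisson distribution, and the function names are hypothetical.

```python
import math


def expected_mutations(genome_kb: float, kb_per_mutation: float) -> float:
    """Mean number of mutations introduced per copied genome."""
    return genome_kb / kb_per_mutation


def prob_error_free_copy(mean_mutations: float) -> float:
    """Probability of a copy with zero mutations.

    Assumption (not from the source): mutations per copy are Poisson
    distributed, so P(0) = exp(-mean).
    """
    return math.exp(-mean_mutations)


# Figures quoted in the text: 30 kb genome, one mutation per 10 kb.
mean_per_copy = expected_mutations(genome_kb=30.0, kb_per_mutation=10.0)
p_error_free = prob_error_free_copy(mean_per_copy)

print(f"expected mutations per genome copy: {mean_per_copy:.1f}")
print(f"probability of an error-free copy (Poisson assumption): {p_error_free:.3f}")
```

under this assumption only about one genome copy in twenty would be free of mutations, which is consistent with the text's point that a sizeable fraction of progeny genomes may carry deleterious changes.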
two years later, two studies investigated these mutations in more detail. one study compared fecal samples of healthy fcov carriers with fecal and/or ascites samples of natural fip cases by sequencing the accessory and the s genes. this revealed a conserved methionine residue at amino acid position 1,058 in 9/10 fcovs from healthy carriers, whereas the m1058l mutation was found at this position in 5/6 ascites samples of fip cats whose feces generally displayed both genotypes. the authors also investigated the animals for orf 3c mutations/truncations and concluded that these together with mutations at amino acid position 1,058, account for the pathotype switch [22] . another study, published in the same year, concluded that the spike mutations in question are relevant for systemic spread and are not associated with fip. the authors found fcovs that carried the m1058l mutations in the majority of extra-intestinal tissue samples obtained from both cats with fip and healthy cats, whereas the majority of fcovs derived from fecal samples had a methionine residue. percentages of mutated versus conserved sequences in the different types of samples did not significantly differ between fip cats and controls [35] . we were also interested in the potential presence of these mutations in fcovs in our cohort of healthy carrier cats. sequence information covering the region of interest was obtained from four cats and two different organs: colon (four sequences, one per cat) and liver (one sequence). interestingly, none of these samples exhibited one or both mutations in question (m1058l and s1060a) [33] . these results could indicate that the above-mentioned mutations are not even markers of systemic spread per se [35] , but rather of another relevant aspect in the development of fip that, like systemic spread, is mediated by and/or occurs in monocytes/macrophages. 
however, our results are based on a single successfully sequenced extra-intestinal sample and are therefore of very limited value. mutations in the s gene are presumed to be associated with infection of monocytes and macrophages [53] , however unfortunately only few samples (2 out of 75) showed a monocyte-associated viremia [14] and due to the rna degradation, could not be further analyzed. at present, the relevance of the above-mentioned viral mutations for fip is still not clear. complementary transparent studies using clearly defined animal populations to obtain sufficient data for a definitive conclusion are clearly lacking and before these are available, no given mutation should be propagated as a diagnostic marker. a real-time rt-pcr has been marketed since august 2014 as a confirmatory diagnostic test for fip (fip virus realpcr™ test; idexx reference laboratories). a detailed description of the investigations on which the test validation was based seems to be lacking; test sensitivity (98.7%) and specificity (100%) were generated in a population of 186 cats "who were either healthy or had confirmed fip based on biopsy". it is questionable whether healthy cats are adequate controls for fip cases. after all, porter and coworkers found fcovs with the mutation described by chang and coworkers [33] not only in 91% of the tissue samples from cats with fip, but also in 89% of the tissue samples from fcov positive cats without fip [35] . therefore, with the current knowledge, we recommend considering the detection of the s gene mutations in question solely as further support of a diagnosis of fip, but not as definite proof. the exact role of the fcov spike protein in the pathogenesis of fip is still unknown. it is most likely the key determinant of cell tropism and is crucial for viral host cell entry [32] . 
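the concern raised above about the s gene mutations as a diagnostic marker can be illustrated with likelihood ratios. the sketch below treats the two percentages reported by porter and coworkers (mutation present in 91% of tissue samples from cats with fip and in 89% of tissue samples from fcov positive cats without fip) as if they were per-test positive rates; that is a simplifying assumption, since the figures are sample-level rather than cat-level, and the function names are hypothetical.

```python
def likelihood_ratios(rate_in_fip: float, rate_in_non_fip: float) -> tuple[float, float]:
    """Return (LR+, LR-) for a marker given its positive rate in each group."""
    lr_pos = rate_in_fip / rate_in_non_fip
    lr_neg = (1.0 - rate_in_fip) / (1.0 - rate_in_non_fip)
    return lr_pos, lr_neg


def post_test_probability(pre_test_probability: float, likelihood_ratio: float) -> float:
    """Update a pre-test probability with a likelihood ratio via odds."""
    pre_odds = pre_test_probability / (1.0 - pre_test_probability)
    post_odds = pre_odds * likelihood_ratio
    return post_odds / (1.0 + post_odds)


# Percentages quoted in the text, used here as illustrative inputs.
lr_pos, lr_neg = likelihood_ratios(rate_in_fip=0.91, rate_in_non_fip=0.89)

# Hypothetical pre-test probability of 50%, chosen only for illustration.
updated = post_test_probability(0.50, lr_pos)

print(f"LR+ = {lr_pos:.3f}, LR- = {lr_neg:.3f}")
print(f"post-test probability after a positive result: {updated:.3f}")
```

a positive likelihood ratio this close to 1 barely shifts the probability of disease, which matches the recommendation to treat detection of these mutations as supporting evidence rather than proof when fcov positive cats without fip are the comparison group.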
substitution by recombination of the s2 region of fipv 79-1146 into fecv 79-1683 was found to be sufficient to enable the hybrid virus to effectively infect macrophages in vitro [30] . thus, it is likely not the receptor binding (via s1) but the fusion process that is critical in this context. also, for type i fcovs, which, unlike type ii fcovs, hardly grow in cell culture, the receptor is not yet known. interestingly, a cell culture adapted type i fcov strain (ucd 1) was found to carry a mutation in the furin cleavage site that renders the s protein susceptible to cleavage by heparan sulfate [54] , leading to the speculation that the amino acid composition of the furin cleavage site is critical for cell culture adaption and thus cell tropism. indeed, a strong correlation was later found between conservation of the furin cleavage site and fcov pathotype [36] . unfortunately, attempts to target the furin cleavage site at the boundary of s1 and s2 [36] did not yield conclusive results in our study (data not shown). fip pathogenesis includes several key processes, i.e., the establishment of systemic fcov infection; effective and sustainable viral replication in monocytes/macrophages; and activation of infected monocytes [6] . efficient entry into monocytes/macrophages is an essential prerequisite for all these, but there are most likely additional s protein-unrelated factors that allow the increase in viral replication in monocytes/macrophages and their subsequent activation. therefore, s gene mutations likely only represent contributing factors among multiple events, ultimately resulting in fip. in conclusion, despite some limitations mentioned above, using known field viruses and a controlled experimental setting, we were able to establish that fcovs in the tissue and fecal samples of healthy fcov infected cats do not carry the orf 3abc, 7b, and s gene mutations that have previously been linked to the development of fip. 
these findings support the hypothesis that these alterations are linked to fipv pathotype viruses. however, like all other studies so far, they do not answer the question whether their occurrence does directly cause fip. interestingly, the viruses detected in the tissues also lacked mutations seen in association with systemic spread of fcovs [35] , which indicates that they are not essential for this key prerequisite of fip. furthermore, we detected a homologous recombination site in the strain fcov zu1 that has not been described before. further comparative studies, and in particular, those using a ngs approach, are required to ultimately assess whether the development of fip can be attributed to virus characteristics alone or whether it represents an interplay between fipv pathotype viruses and a host prone to a specific ultimately detrimental immune response. the tissue and fecal samples used in this study originated from a study performed between 2000 and 2003 [14] . the study was approved by the swiss ethics committee (tvb66-2000). specific pathogen-free (spf) cats aged between 8 and 16 weeks were orally infected with different fcov type i strains (fcov zu1 [dq256137.1] or fcov zu3 [dq256139.1]) isolated from feces of healthy field cats or from the intestines of cats that had been experimentally infected with the same virus strains. cats d4, y1, and y2 originated from an unpublished study where the cats had been subcutaneously vaccinated with the inactivated homologous fcov strain 10, 7, and 4 weeks before challenge. cats d4 and y1 had been vaccinated with the inactivated strain mixed 1:1 with the adjuvant diluvac forte (intervet ltd., uk). for cat y1, a cpg motif, as described in a previous vaccine study [55] , had been added together with the adjuvant. cat y2 had been mock vaccinated. the challenge virus (1 ml) was administered twice, as previously described [14] . 
all infected cats had remained clinically healthy until euthanasia at 14, 28, 48, or 80 days after infection (table 1). a full postmortem examination was carried out. the gross and histological examination did not reveal any pathological changes apart from lymphatic hyperplasia [14]. challenge stock virus samples of the fcov zu1 and zu3 strains originated from the same material originally prepared and aliquoted for the experimental challenge. all samples used in the current study had been stored at −80 °c since 2004. feces, colon, liver, and thymus or, when the thymus was not fcov positive by rt-qpcr, tonsils or lung, were analyzed. viral rna was isolated with the rneasy mini kit (qiagen, hombrechtikon, switzerland). approximately 25 mg of tissue was taken from the frozen samples and placed in a 2 ml eppendorf tube containing 600 µl of qiagen lysis buffer rlt, 3.5 µl β-mercaptoethanol, and a 3 mm steel bead (schieritz & hauenstein ag, arlesheim, switzerland). samples were homogenized and lysed by mixing at 30 hz for 1 min in a mixer mill 300 device (qiagen). after mixing, samples were briefly centrifuged to reduce the foam that had formed during the homogenization step. the homogenate was loaded on a qiashredder spin-column (qiagen) and centrifuged at 14,000 revolutions per min (rpm) for 2 min; 350 µl of 70% ethanol was added to the flow-through and mixed by pipetting. samples were transferred to rneasy spin columns placed in 2 ml collection tubes. downstream operations were performed according to the manufacturer's instructions and rna was eluted with 50 µl nuclease free h2o by centrifugation at 10,000 rpm for 1 min. fcov loads were determined by a rt-qpcr assay that detects a 102 bp long amplicon of the 7b gene, modified from previously described protocols [56].
briefly, 25 µl reactions contained 12.5 µl 2× reaction buffer (one step rt qpcr mastermix plus low rox, rt-qprt-032xlr, eurogentec, seraing, belgium), 0.375 µl each of forward and reverse primers (20 µm) (microsynth, balgach, switzerland), 0.75 µl probe (microsynth), 0.125 µl euroscript reverse transcriptase & rnase inhibitor (eurogentec), 5 µl template rna, and rnase free water to 25 µl. the reaction underwent reverse transcription (rt) at 48 °c for 30 min, denaturation for 10 min at 95 °c, and 45 cycles of 95 °c for 15 s and 60 °c for 1 min. to test for any inhibition, each rna sample was tested neat and after 10-fold dilution in rnase free water. all qpcr assays were performed using an abi 7500 fast sequence detection system (applied biosystems, thermofisher scientific, reinach, switzerland). positive and negative controls were included in each rt-qpcr run. synthesis of cdna and pcr amplification were performed with the superscript iii one-step rt-pcr system with platinum taq dna polymerase (invitrogen, basel, switzerland) using specific primers. viral rna isolated from a cell culture supernatant infected with the fcov wellcome strain or from the fcov zu1 and fcov zu3 gut homogenates used for infection of the cats was always included as a positive control. also, two negative controls were included, of which one was kept open throughout the manipulation to control for airborne contamination. pcr reactions (25 µl) contained 12.5 µl 2× reaction mix (provided in kit), 1 µl superscript iii rt/platinum taq mix, 0.5 µl each of forward and reverse primers (10 µm) (microsynth), 4 µl template rna, and autoclaved distilled water to 25 µl. three novel assays were developed. one amplified parts of the s gene, the 3abc genes, the e gene, and parts of the m gene of the strain fcov zu1. primers used for this amplicon (s-m, 1845 bp) were fcov_se_f (5'-tgc tgt tta act act ggt tgt tgt gga-3') and fcov_sm_r (5'-gca ccc gct ata cta agg ccg a-3').
the second assay amplified parts of the s gene, the 3abc genes, and parts of the e gene of the strain fcov zu3. primers used for this amplicon (s-e, 1348 bp) were fcov_snsp3b.f (5'-ctt ggt atg tgt ggc tac taa ttg g-3') and fcov_se_r (5'-atc aac agg agc cag aag aag aca ct-3'). the third assay amplified parts of the 7a and the whole 7b gene of both strains. for this amplicon (nsp7ab, 855 bp), already described primers were used, namely 7a-f1 (5'-ctg cga gtg atc ttt cta g-3') [29] and fcov1229r (5'-aac aat cac tag atc cag acg tta gct-3') [56]. all assays were run on a biometra tpersonal thermocycler (labgene, chatel-st-denis, switzerland). cycling conditions were 55 °c for 30 min, 94 °c for 2 min; 40 cycles of 94 °c for 15 s, 60 °c for 30 s, and 68 °c for 2 min; 5 min at 68 °c and then cooling to 10 °c for the first assay (s-m); and 55 °c for 30 min, 94 °c for 2 min; 40 cycles of 94 °c for 15 s, 55 °c for 30 s, and 68 °c for 1 min; 5 min at 68 °c and then cooling to 10 °c for the second and third assay (s-e and nsp7ab). for the amplification of the s gene targeting the mutations m1058l and s1060a, a modified previously published nested rt-pcr assay was used [33]. the fcov-ucd1-s.3022 (5'-caa tat tac aat ggc ata atg g-3') forward and fcov-ucd1-s.3636 (5'-ccc tcg agt ccc gca gaa acc phylogenetic and molecular evolutionary analyses were conducted using geneious prime (biomatters ltd.). nucleotide sequences were edited, assembled, and aligned. alignments were manually adjusted when necessary. bootstrap phylogenetic trees were constructed using the neighbor-joining algorithm and the tamura-nei genetic distance model [57]. four bootstrap phylogenetic trees based on the 3a, 3b, 3c, or 7b genes were constructed with the consensus sequence of all viral rna sequences from the same tissue or feces of each cat and different fipv, fcov, and ccov strains retrieved from genbank.
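the per-sample consensus sequences mentioned above can be pictured with a simple majority rule over aligned clone sequences. the sketch below is not the geneious prime procedure, whose settings the text does not report; it assumes pre-aligned sequences of equal length, a strict majority-vote rule, and "n" for ties, and both the function name and the toy sequences are hypothetical.

```python
from collections import Counter


def majority_consensus(aligned_seqs: list[str], tie_symbol: str = "n") -> str:
    """Column-wise majority consensus of equal-length aligned sequences."""
    if not aligned_seqs:
        raise ValueError("at least one sequence is required")
    length = len(aligned_seqs[0])
    if any(len(seq) != length for seq in aligned_seqs):
        raise ValueError("sequences must be aligned to the same length")

    consensus = []
    for column in zip(*aligned_seqs):
        counts = Counter(column)
        best = max(counts.values())
        winners = [symbol for symbol, n in counts.items() if n == best]
        # A unique most frequent symbol wins; ties are reported as ambiguous.
        consensus.append(winners[0] if len(winners) == 1 else tie_symbol)
    return "".join(consensus)


# Toy example: three hypothetical clone sequences from one sample.
clones = ["atgcta", "atgcta", "atgtta"]
print(majority_consensus(clones))
```

with three clones per sample, a single divergent clone is outvoted at each position, so the consensus reflects the dominant variant rather than every member of the viral population.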
additional neighbour-joining trees were constructed with either all 3abc or 7b sequences of the same cat or with all sequences from the same sample. the consensus sequences of fcov zu1 and fcov zu3 served as outgroups. variable sites and deletions of genes 3a, 3b, 3c, and 7b were analyzed with graph pad prism version 8.4.2 (san diego, ca, usa). data were tested for normal distribution with the kolmogorov-smirnov test. statistical significance was determined using one-way anova analysis of variance with bonferroni correction to compare the mutational frequency of the different genes. mean mutation frequencies in challenge stock viruses and viral sequences from tissues/feces for orf 3abc and orf 7b of fcov zu1 and zu3 were evaluated using the wilcoxon signed-rank test. the time-dependent mutation frequency between genes was analyzed using kruskal-wallis with dunn's multiple comparison posttest. data were expressed as median with boxes and whiskers showing the maximum and minimum values within a group or as means in a bar chart. a p-value < 0.05 was considered statistically significant in all cases. supplementary materials: the following are available online at http://www.mdpi.com/2076-0817/9/8/603/s1, figure s1 : nucleotide alignment of the orf3abc region between fcov zu1 and defined serotype 1 fecv-ucd3 (fj943761)) and serotype ii (fipv 79-1146 (dq010921)) viral sequences. sequences were aligned using the geneious prime software (www.geneious.com, biomatters ltd.) and the muscle method. dots depict identical nucleotides. the grey bars indicating the different orf3abc locations are based on the sequence of fcov zu1. figure s2 : phylogenetic analysis based on the sequences encoding for nonstructural proteins 3abc for cats 3132b, 3138b, and 3312b. (*, day p.i. 
of euthanasia; yellow dot, sequence carries snp that leads to a premature stop codon; bar, mean number of differences per 1000 sites). figure s3: phylogenetic analysis based on the sequences encoding for nonstructural protein 7b for cats 3132b, 3138b, and 3312b. (*, day p.i. of euthanasia; red dot, sequence carries a deletion that leads to a premature stop codon; bar, mean number of differences per 1000 sites). figure s4: phylogenetic analysis based on the sequences encoding for nonstructural proteins 3abc for cats d4, y1, y2, and 3257b. (*, day p.i. of euthanasia; red dot, sequence carries a deletion that leads to a premature stop codon (y1) or a shortened protein without frameshift (y2); green dot, snp causes disappearance of a stop codon; bar, mean number of differences per 1000 sites). figure s5: phylogenetic analysis based on the sequences encoding for nonstructural protein 7b for cats d4, y1, y2, and 3257b. (*, day p.i. of euthanasia; red dot, sequence carries a deletion that leads to a premature stop codon; blue dot, snp causes disappearance of a start codon; bar, mean number of differences per 1000 sites). cycling conditions were 55 °c for 30 min, 94 °c for 2 min; 40 cycles of 94 °c for 15 s, 47 °c for 30 s, and 68 °c for 2 min; 5 min at 68 °c and then cooling to 10 °c. for the second nested pcr step, if needed, two newly designed primers amplifying the region of interest (amplicon 134 bp), fcov-ucd1-s.3027f (5'-aat ggt gct tcc tgg ggt tg-3') and fcov-ucd1-s.3160r (5'-gca cct gca tag caa aag gc-3'), and the phusion hot start ii high-fidelity dna polymerase (thermo scientific) were used. then, 5 µl of one-step rt-pcr cycling conditions were 98 °c for 3 min; 40 cycles of 98 °c for 10 s, 59 °c for 30 s, and 72 °c for 2 min or alternatively, a gene ruler dna ladder mix (thermo scientific) or a 10-kilobase-pair dna ladder (eurogentec), was used for molecular size comparisons.
appropriate bands were cut out with sterile razor blades and weighed cloning and sequencing the pcr amplicons were cloned into the pcr4 plasmid with the topo ta cloning kit for from each cloned rt-pcr amplicon, three colonies for each tissue/fecal sample as well as 23 and 21 colonies of the fcov zu1 and fcov zu3 challenge stock virus strains (for better characterization of the virus pool used for infection), respectively, were picked and cultured overnight at 37 • c in luria-bertani-liquid-medium containing ampicillin. cultures were centrifuged at 7900 rpm for 3 min and the pellets were used for further manipulation fcov zu1 tissue/fecal virus orf3abc fcov zu3 tissue/fecal virus orf3abc fcov zu1 tissue/fecal virus orf7b fcov zu3 tissue/fecal virus orf7b fcov zu1 & zu3 tissue virus partial s-gene: mt551867-mt551871) the genome organization of the nidovirales: similarities and differences between arteri-, toro-, and coronaviruses quasispecies theory and the behavior of rna viruses an overview of feline enteric coronavirus and infectious peritonitis virus infections a review of feline infectious peritonitis virus infection: 1963-2008 feline infectious peritonitis feline infectious peritonitis: abcd guidelines on prevention and management risk of feline infectious peritonitis in cats naturally infected with feline coronavirus feline infectious peritonitis viruses arise by mutation from endemic feline enteric coronaviruses an enteric coronavirus infection of cats and its relationship to feline infectious peritonitis replication of feline coronaviruses in peripheral blood monocytes intrinsic resistance of feline peritoneal macrophages to coronavirus infection correlates with in vivo virulence detection of feline coronaviruses by culture and reverse transcriptase-polymerase chain reaction of blood samples from healthy cats and cats with clinical feline infectious peritonitis high viral loads despite absence of clinical and pathological findings in cats experimentally 
infected with feline coronavirus (fcov) type i and in naturally fcov-infected cats comparison of the amino acid sequence and phylogenetic analysis of the peplomer, integral membrane and nucleocapsid proteins of feline, canine and porcine coronaviruses differentiation of feline coronavirus type i and ii infections by virus neutralization test feline coronavirus type ii strains 79-1683 and 79-1146 originate from a double recombination between feline coronavirus type i and canine coronavirus feline infectious peritonitis: insights into feline coronavirus pathobiogenesis and epidemiology based on genetic analysis of the viral 3c gene comparative in vivo analysis of recombinant type ii feline coronaviruses with truncated and completed orf3 region feline infectious peritonitis: role of the feline coronavirus 3c gene in intestinal tropism and pathogenicity based upon isolates from resident and adopted shelter cats significance of coronavirus mutants in feces and diseased tissues of cats suffering from feline infectious peritonitis mutations of 3c and spike protein genes correlate with the occurrence of feline infectious peritonitis an outbreak of feline infectious peritonitis in a taiwanese shelter: epidemiologic and molecular evidence for horizontal transmission of a novel type ii feline coronavirus the role of accessory proteins in the replication of feline infectious peritonitis virus in peripheral blood monocytes mutation of neutralizing/antibody-dependent enhancing epitope on spike protein and 7b gene of feline infectious peritonitis virus: influences of viral replication in monocytes/macrophages and virulence in cats attenuated coronavirus vaccines through the directed deletion of group-specific genes provide protection against feline infectious peritonitis the molecular genetics of feline coronaviruses: comparative sequence analysis of the orf7a/7b transcription unit of different biotypes deletions in the 7a orf of feline coronavirus associated with an epidemic of 
feline infectious peritonitis genetic diversity and correlation with feline infectious peritonitis of feline coronavirus type i and ii: a 5-year study in taiwan acquisition of macrophage tropism during the pathogenesis of feline infectious peritonitis is determined by mutations in the feline coronavirus spike protein sequence analysis of the nucleocapsid gene of feline coronaviruses circulating in italy mechanisms of coronavirus cell entry mediated by the viral spike protein spike protein fusion peptide and feline coronavirus virulence genotyping coronaviruses associated with feline infectious peritonitis amino acid changes in the spike protein of feline coronavirus correlate with systemic spread of virus from the intestine and not with feline infectious peritonitis mutation in spike protein cleavage site and pathogenesis of feline coronavirus a reverse genetics approach to study feline infectious peritonitis tackling feline infectious peritonitis via reverse genetics genomic rna sequence of feline coronavirus strain fcov c1je feline coronavirus 3c protein: a candidate for a virulence marker? 
prevalence and mutation analysis of the spike protein in feline enteric coronavirus and feline infectious peritonitis detected in household and shelter cats in western canada pathogens 2020 mutation of the s and 3c genes in genomes of feline coronaviruses detection of feline coronavirus mutations in paraffin-embedded tissues in cats with feline infectious peritonitis and controls persistence, and transmission of natural type i feline coronavirus infection unfinished stories on viral quasispecies and darwinian views of evolution quasispecies composition and phylogenetic analysis of feline coronaviruses (fcovs) in naturally infected cats a comparison of the genomes of fecvs and fipvs and what they tell us about the relationships between feline coronaviruses and their evolution direct method for quantitation of extreme polymerase error frequencies at selected single base sites in viral rna full genome analysis of a novel type ii feline coronavirus ntu156 emergence of pathogenic coronaviruses in cats by homologous recombination between feline and canine coronaviruses recombination in large rna viruses differential susceptibility of macrophages to serotype ii feline coronaviruses correlates with differences in the viral spike protein cleavage of group 1 coronavirus spike proteins: how furin cleavage is traded off against heparan sulfate binding upon cell culture adaptation immunization of cats against feline immunodeficiency virus (fiv) infection by using minimalistic immunogenic defined gene expression vector vaccines expressing fiv gp140 alone or with feline interleukin-12 (il-12), il-16, or a cpg motif one-tube fluorogenic reverse transcription-polymerase chain reaction for the quantitation of feline coronaviruses the neighbor-joining method: a new method for reconstructing phylogenetic trees this article is an open access article distributed under the terms and conditions of the creative commons attribution (cc by) license the authors would like to acknowledge enikö 
gönczi and andrea spiri for their excellent technical support. the laboratory work was performed using the logistics of the center for clinical studies at the vetsuisse faculty of the university of zurich. this work was mainly performed by mirjam lutz and aline r. steiner as partial fulfillment of their master's theses. the authors declare no conflict of interest. pathogens 2020, 9, 603 key: cord-325043-vqjhiv7p authors: gorbalenya, alexander e.; blinov, vladimir m.; donchenko, alexei p.; koonin, eugene v. title: an ntp-binding motif is the most conserved sequence in a highly diverged monophyletic group of proteins involved in positive strand rna viral replication date: 1989 journal: j mol evol doi: 10.1007/bf02102483 sha: doc_id: 325043 cord_uid: vqjhiv7p ntp-motif, a consensus sequence previously shown to be characteristic of numerous ntp-utilizing enzymes, was identified in nonstructural proteins of several groups of positive-strand rna viruses. these groups include picorna-, alpha-, and coronaviruses infecting animals and como-, poty-, tobamo-, tricorna-, hordei-, and furoviruses of plants, totalling 21 viruses. it has been demonstrated that the viral ntp-motif-containing proteins constitute three distinct families, the sequences within each family being similar to each other at a statistically highly significant level. a lower, but still valid similarity has also been revealed between the families. an overall alignment has been generated, which includes several highly conserved sequence stretches. the two most prominent of the latter contain the so-called “a” and “b” sites of the ntp-motif, with four of the five invariant amino acid residues observed within these sequences.
these observations, taken together with the results of comparative analysis of the positions occupied by respective proteins (domains) in viral multidomain proteins, suggest that all the ntp-motif-containing proteins of positive-strand rna viruses are homologous, constituting a highly diverged monophyletic group. in this group the “a” and “b” sites of the ntp-motif are the most conserved sequences and, by inference, should play the principal role in the functioning of the proteins. a hypothesis is proposed that all these proteins possess ntp-binding capacity and possibly ntpase activity, performing some ntp-dependent function in viral rna replication. the importance of phylogenetic analysis for the assessment of the significance of the occurrence of the ntp-motif (and of sequence motifs of this sort in general) in proteins is emphasized. structural (sequence) motifs thought to be identifiers of certain protein activities are among the main tools used in the functional and evolutionary interpretation of protein sequence data (doolittle 1986a,b; hodgman 1986). because these motifs are short sequence stretches, and usually include amino acid residues frequent in proteins (i.e., gly, ala, ser, and some others), the presence of such a motif in a protein sequence is, as a rule, in itself not statistically significant. thus, it is important to work out some additional criteria for evaluation of such observations. one of the most widespread sequence motifs is implicated in an activity crucial to the function of a great variety of proteins, namely purine nucleotide binding followed in most cases by hydrolysis of the β-γ phosphate bond. this motif was first recognized by walker and coworkers in several atp- and gtp-utilizing enzymes (walker et al. 1982; gay and walker 1983). it consists of two separate units designated "a" and "b" sites; the "b" site is located in the polypeptide chains c-proximally relative to the "a" site.
for the "a" site the following consensus sequence was proposed: gxxxxgk-(t) xxxxxxi/v, and the "b" consensus was r/k-xxxgxxxl***d, where x stands for any amino acid residue, and * for a hydrophobic residue (walker et al. 1982). results of subsequent analyses of a variety of ntp-utilizing proteins (reviewed by halliday 1984; möller and amons 1985; doolittle 1986a) suggest much more liberal consensus formulas, namely g/axxxxgkt/s for the "a" site, and an asp residue preceded by five residues, three of which are hydrophobic, for the "b" site. hereafter we accept these loose definitions for the "a" and "b" consensus sequences; taken together, they are designated "ntp-motif." in fact, in recent studies, protein sequences were searched for the "a" consensus alone as the "b" consensus in its loosest form is obviously too degenerate to be unequivocally recognized, except in a family of diverged proteins (see below). for adenylate kinase, escherichia coli tu factor, p21 ras oncoprotein, sv40 t antigen, and some other proteins, there is experimental evidence that the ntp-motif, or at least a larger segment of a protein encompassing it, is involved in ntp binding and/or cleavage (clertant and seif 1984; jurnak 1985; la cour et al. 1985; fry et al. 1986). more specifically, the "a" site has been implicated directly in the binding of the pyrophosphate moiety of ntp, whereas the asp residue of the "b" site appears to interact with the magnesium cation complexed with the same phosphate groups (möller and amons 1985; bradley et al. 1987). all these observations make it an attractive idea to search sequences of functionally uncharacterized proteins for the presence of the ntp-motif with the aim of predicting ntpase activity, or at least ntp-binding capacity.
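the loose consensus definitions given above translate directly into a simple pattern search. the sketch below is an illustration, not the srch program used by the authors: the "a" site is written as a regular expression for g/axxxxgkt/s, and "b" candidates are asp residues whose five preceding residues include at least three hydrophobic ones. the reading "at least three", the hydrophobic residue set, the function names, and the test peptide are all assumptions made for this example.

```python
import re

# "a" site, loose form: G or A, any four residues, then G-K-(T or S).
# A lookahead is used so that overlapping candidates are also reported.
A_SITE = re.compile(r"(?=([GA].{4}GK[TS]))")

# Assumed set of hydrophobic residues; the source does not enumerate them.
HYDROPHOBIC = set("AVLIMFWY")


def find_a_sites(protein: str) -> list[tuple[int, str]]:
    """Return (0-based start, matched segment) for each "a" site candidate."""
    return [(m.start(), m.group(1)) for m in A_SITE.finditer(protein.upper())]


def find_b_candidates(protein: str, after: int, min_hydrophobic: int = 3) -> list[tuple[int, str]]:
    """Return (0-based position of Asp, six-residue segment) for "b" candidates.

    Only Asp residues at index >= `after` are considered, reflecting the
    requirement that the "b" site lies C-proximal to the "a" site.
    """
    seq = protein.upper()
    hits = []
    for i in range(max(after, 5), len(seq)):
        if seq[i] == "D":
            preceding = seq[i - 5:i]
            if sum(res in HYDROPHOBIC for res in preceding) >= min_hydrophobic:
                hits.append((i, preceding + "D"))
    return hits


# Hypothetical test peptide constructed for this example.
peptide = "MTEYKLVVVGAGGVGKSALTIQSSSSLLVIADTAG"

a_hits = find_a_sites(peptide)
print("a site candidates:", a_hits)
for start, segment in a_hits:
    print("b candidates downstream:", find_b_candidates(peptide, after=start + len(segment)))
```

because the "b" rule is so permissive, a search like this returns many spurious hits in real proteins, which is the reason the text gives for screening with the "a" consensus alone and confirming "b" candidates only within aligned families.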
following this line, we screened the protein sequences of positive-strand rna viruses (the largest class of viruses, whose single-stranded genomic rna also serves as the mrna for the synthesis of viral proteins) and identified the "a" consensus in nonstructural proteins of several viral families (gorbalenya et al. 1985) . in some of these proteins the presence of this consensus has been independently noticed by other workers too (argos and leberman 1985; doolittle 1986a; dever et al. 1987; domier et al. 1987) . also, the ntp-binding capacity of one of these proteins, p 126 of tobacco mbsaic virus, has been demonstrated experimentally quite independently (evans et al. 1985) . we proposed that such a capacity should be characteristic of all the ntpmotif-containing proteins of positive-strand rna viruses. on the other hand, the validity of such predictions in general has been disputed (argos and leberman 1985; doolittle 1986a) . moreover, doolittle (1986a) identified the "'a'" consensus in several proteins reported to be devoid of ntp-binding properties. in the present study we undertook a more systematic investigation of the primary structures of the proteins of positive-strand rna viruses containing the "a" consensus of the ntp-motif. we demonstrate that the ntp-motif-containing proteins of positive-strand rna viruses (including 8 proteins identified in the previous paper and 13 proteins of viruses whose genomes have been sequenced since then) constitute three monophyletie families that can be brought together into a higher rank taxon. in these homologous proteins the "a" and "b" sites of the ntp-motif are the most strictly conserved sequences and, by inference, should be of principal functional importance, presumably constituting parts of ntpase catalytic centers. protein sequences were extracted from the current literature (for references see table 1 ). 
the initial screening of the sequences of positive-strand rna viral proteins for the presence of the "a" consensus sequence of the ntp-motif was performed with the program srch, designed to screen protein sequences for defined amino acid residue strings (motifs). the selected sequences were further analyzed manually for the presence of candidate "b" sequences c-proximal to the "a" site. sequences of the viral proteins containing the ntp-motif were compared by use of the programs diagon (staden 1982) and optal (pozdnyakov and pankov 1981); the latter program was modified and adapted for multiple sequence alignment as described below. all the programs were written in fortran and run on an es-1060 computer. the program optal, based on the original algorithm of sankoff (1972), performs optimal alignment of pairs of protein sequences or stepwise alignment of multiple sequences. according to the sankoff algorithm, a series of cumulative similarity matrices for the compared sequences is created. in the present work, for the calculation of the elements of these matrices, weights of amino acid residue pairs were taken from the mutation rate scoring matrix mdm78 (staden 1982). to accelerate calculations, only those elements of the matrices enclosed within a diagonal window, the width of which was chosen to be equal to 1/5 of the length of the compared sequences, were computed; it has been shown that this window width is sufficient in most cases for generation of the optimal alignment (pozdnyakov and pankov 1981). at the first step of the optimal alignment generation, a series of locally optimal alignments with q = 0, 1, 2, ..., qmax gaps was obtained. in practice, for sequence lengths of up to 250 residues dealt with in this work, qmax was chosen to be equal to 15. this exceeds the gap number most frequently observed in related protein sequences (about four gaps per 100 residues; see doolittle 1981) and should guarantee the generation of the optimal alignment.
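the idea of filling only a diagonal window of the similarity matrix can be illustrated with a small banded global alignment. this is a simplified sketch under stated assumptions, not a reimplementation of optal or of the sankoff algorithm: it uses a hypothetical match/mismatch score in place of mdm78, a fixed linear gap penalty rather than a series of alignments with q = 0 ... qmax gaps, and the band half-width is left as a free parameter.

```python
def banded_alignment_score(a, b, band, match=1, mismatch=-1, gap=-2):
    """Global alignment score computed only for cells with |i - j| <= band.

    Scoring values are illustrative placeholders, not mdm78 weights.
    """
    n, m = len(a), len(b)
    if abs(n - m) > band:
        raise ValueError("band too narrow to reach the final cell")
    neg_inf = float("-inf")
    # Row 0: leading gaps in a, restricted to the band.
    prev = [j * gap if j <= band else neg_inf for j in range(m + 1)]
    for i in range(1, n + 1):
        cur = [neg_inf] * (m + 1)
        lo, hi = max(0, i - band), min(m, i + band)
        for j in range(lo, hi + 1):
            if j == 0:
                cur[j] = i * gap  # leading gaps in b
                continue
            pair = match if a[i - 1] == b[j - 1] else mismatch
            best = prev[j - 1] + pair          # align a[i-1] with b[j-1]
            best = max(best, prev[j] + gap)    # a[i-1] against a gap
            best = max(best, cur[j - 1] + gap) # b[j-1] against a gap
            cur[j] = best
        prev = cur
    return prev[m]


print(banded_alignment_score("ACGT", "AGT", band=1))
print(banded_alignment_score("ACGT", "AGT", band=10))
```

when the optimal path stays inside the band, the banded and the full computation give the same score, while the number of evaluated cells grows with the band width rather than with the product of the sequence lengths.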
for selection of the best of the alignments with a given q value and concomitant assessment of its statistical significance, the following monte carlo procedure was employed. twenty-five pairs of "random" sequences were generated by scrambling the real compared sequences, and the above alignment procedure was simulated for each pair. the mean score, $S_q^{\mathrm{rand}}$, and the standard deviation, $\sigma_q$, were calculated separately for alignments with different gap numbers. for each of the locally optimal real alignments, the deviation of the observed score from the mean value for the randomized sequences was calculated in sd units: $d_q = (S_q^{\mathrm{real}} - S_q^{\mathrm{rand}})/\sigma_q$. the alignment with the maximal d value was considered optimal for the selected window width. for multiple sequence alignment, a generalization of this procedure was employed. to align two sets of m and n prealigned sequences, cumulative similarity matrices were created as before, but for the calculation of their elements, values $W_{ij} = \sum w$, i.e., the combined weights of all possible pairs of residues (m · n total) in the ith position of the set n and the jth position of the set m, were used instead of the weights of individual pairs of residues. a weight of 10 was ascribed to a pair of two gaps, and a weight of zero to a pair of a gap with any residue. the procedures of alignment generation and the choice of the optimal alignment were as described, but, for the generation of "random" sequence sets, "columns" of residues occupying each position in the real sets of aligned sequences were jumbled.

table 1 (reference column and notes, partial): rezaian et al. 1985; ahlquist et al. 1984; cornelissen et al. 1983; goelet et al. 1982; strauss et al. 1984; takkinen 1986; gustafson and armour 1986; bouzoubaa et al. 1986; bouzoubaa et al. 1987; boursnell et al. 1987. for picorna- and potyviruses, the numbering of complete polyproteins is indicated; for alphaviruses, the numbering of the nonstructural polyproteins is indicated; for cpmv the numbering of the polyprotein encoded by rna b is indicated; and for bnyvv rna 1 the numbering of the entire high-molecular weight product is indicated. ... in this table. question marks indicate that the real size of the respective proteins is not known; the large proteins presented in the table may in fact be processed. in those cases where sequences of several serotypes (strains) of a single virus species were available (specifically, for several picornaviruses and tmv), only one sequence was included. an exception is rhinovirus serotypes 14 and 2, whose sequences are substantially different. pr = product.

fig. 1 (legend, partial): the dendrograms were designed to visualize the procedure of multiple sequence alignment in the order of decreasing similarity between proteins; they cannot be automatically regarded as evolutionary trees. for abbreviations of viruses see table 1. the values refer to the conserved domains, as indicated in table 1, and, for the proteins of the 1st family, to the n-terminal subdomains.

fig. 2 (legend, partial): the ntp-motif-containing segments displaying statistically significant similarity within each family (see table 1) are shown in black; they were aligned by the "a" sites of the ntp-motif (see text). the tricorna- and tobamovirus proteins are multidomain proteins, with the n-terminal domains similar to each other and to alphavirus nsp1 protein (not indicated; ahlquist et al. 1985). at the bottom of a and b the respective patterns of evolutionarily conserved amino acid residues are shown (designated "consensus 1" and "consensus 2"). invariant residues are capitalized. dots stand for variable residues (or gaps) within conserved residue clusters; the lengths of variable regions between these clusters are indicated by bracketed numbers. asterisks denote the amino acid residues constituting the "a" site and the proposed mg2+-binding d residue of the "b" site.

preliminary comparative analysis of the amino acid sequences of the ntp-motif-containing proteins of positive-strand rna viruses by use of the programs diagon and optal (see methods) revealed three distinct families and some additional proteins in whose close relatives the motif was not conserved. within each family, all the proteins contained stretches at least 120 residues long that were similar to each other at a statistically highly significant level. for most pairs, the observed alignment scores exceeded the mean scores for randomized sequences by at least 5 sd (see methods). such a level of sequence similarity between proteins is usually regarded as serious evidence for their monophyletic origin (doolittle 1981, 1986a; dayhoff et al. 1983). to obtain optimal group alignments for each family, the sequences were aligned in order of decreasing similarity (fig. 1). as is evident from the figure, the significance of the multiple sequence alignments was quite high for each of the families. the families were numbered 1st, 2nd, and 3rd in order of decreasing sequence divergence between the presently recognized members. for some of the proteins constituting the 1st and the 2nd families, analogous sequence comparisons (but with no reference to the ntp-motif) were performed previously by other workers, and meaningful similarities have also been observed (goldbach 1986). very recently, domier et al. (1987) compared the sequences of the proteins of the 2nd and the 3rd families and noticed the presence of the "a" consensus of the ntp-motif. the 1st family includes the ntp-motif-containing proteins (domains) of alpha-, tobamo-, tricorna-, furo-, and coronaviruses as well as the putative product of the hordeivirus rna β open reading frame (orf) 2, totalling 10 proteins (table 1). these proteins vary greatly in their size (fig. 2a), genomic positions of the respective coding sequences, and modes of expression.
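the significance criterion used above (an alignment score exceeding the mean for scrambled sequences by some number of sd) can be sketched as follows. this is an illustrative sketch only: the scoring function is a deliberately trivial placeholder standing in for a real alignment score, the function names are hypothetical, and only the number of random pairs (25) and the sd-unit formula are taken from the methods.

```python
import random
import statistics


def shuffled(seq, rng):
    """Return a scrambled copy of seq with the same residue composition."""
    chars = list(seq)
    rng.shuffle(chars)
    return "".join(chars)


def identity_score(a, b):
    """Placeholder score: identical residues at the same ungapped positions."""
    return sum(x == y for x, y in zip(a, b))


def sd_units(seq_a, seq_b, score_fn, n_random=25, seed=0):
    """Deviation of the real score from the scrambled-sequence mean, in sd."""
    rng = random.Random(seed)
    real = score_fn(seq_a, seq_b)
    rand_scores = [
        score_fn(shuffled(seq_a, rng), shuffled(seq_b, rng))
        for _ in range(n_random)
    ]
    mean = statistics.mean(rand_scores)
    sd = statistics.stdev(rand_scores)
    if sd == 0:
        raise ValueError("random scores show no spread; cannot express in sd units")
    return (real - mean) / sd


toy = "ACDEFGHIKLMNPQRSTVWY"  # made-up sequence
print(sd_units(toy, toy, identity_score))
```

scrambling preserves residue composition, so the random scores estimate what similarity would be expected from composition alone; in the procedure described in the methods this would be repeated separately for each gap number q.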
for the proteins of this family, a statistically significant similarity was observed within a fragment of about 250 amino acid residues (fig. 2a). this fragment contains 21 highly conserved residues (of which 14 are invariant) divided between seven clusters of unequal size (or individual residues). the first and third conserved clusters encompass the "a" and "b" sites of the ntp-motif, respectively. the n-proximal five clusters of conserved amino acid residues in these proteins are separated from the sixth and seventh clusters by a variable region of about 60-90 residues. in fact, the conserved domain appears to be further divided into two subdomains, the n-terminal one containing the ntp-motif, and the c-terminal one of totally unknown function. two specific points are worth noting. first and most remarkably, two segments of the furovirus genome encode two ntp-motif-containing proteins only distantly related (as compared to other members of the family) to each other; this is demonstrated by direct pairwise comparison of their sequences (data not shown). second, the inclusion of the coronavirus ntp-motif-containing domain within the 1st family is tentative, as the level of its sequence similarity to the other proteins of this family is not much higher than that between different families (see below). also, the distance between the "a" and "b" sites of the ntp-motif is much longer in the coronavirus protein than in the other proteins of this family. nevertheless, all the amino acid residues invariant in the latter are conserved in the coronavirus protein also (see below), justifying its inclusion in this family. the 2nd family of ntp-motif-containing proteins includes picornaviral proteins 2c and comoviral protein p58, totalling nine proteins (table 1). these proteins are much more uniform in their size (fig. 2b), genomic positions of the respective genes, and mode of expression than those of the 1st family.
the region of the most prominent similarity spans the central domain of about 130 amino acid residues; this domain contains 45 conserved residues (23 invariant), more or less evenly distributed (fig. 2b, consensus 2). the "a" and "b" sites of the ntp-motif are located near the n-terminus and in the middle of the conserved domain, respectively. the 3rd family includes ci proteins of two potyviruses (table 1 and fig. 2c). these proteins are very similar to each other, having more than 50% identical amino acid residues. thus, derivation of a consensus, like those derived for the other two families, made little sense. the "a" and "b" sites of the ntp-motif are located in the n-terminal parts of ci proteins. comparison of the prealigned sequences of the three families of ntp-motif-containing proteins by the multiple alignment version of optal yielded highly significant alignment scores for all three possible pairs (fig. 1). however, the final alignment of the sequences of the three families generated by this program (not shown) was not quite satisfactory because the "b" sites of the ntp-motif, as well as some other clusters of residues that seemed good candidates for the conserved regions, did not coincide (although it must be pointed out that the "a" sites did match). presumably, this might be due to different lengths of spacers separating these regions in the proteins of the three families. thus, an overall alignment has been generated by manual fitting of the computer alignments of the three sets of sequences so as to maximize the coincidence of residues conserved within individual families (fig. 3). in this alignment five amino acid residues are strictly invariant, four additional residues are common to the 2nd and 3rd families, and three residues are conserved in the 1st and 3rd families. in addition, several positions in all, or nearly all, the sequences are occupied by functionally related residues (fig. 3, consensus).
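counting invariant and functionally related positions in an alignment can be sketched as follows, using the residue similarity groups listed in the legend to fig. 3. this is a minimal illustration: the function names and the toy alignment are hypothetical, and the 50% column criterion implemented here is one simplified reading of the legend's definition, which is stated per family.

```python
# Similarity groups from the legend to fig. 3; c and p have no similar residues.
GROUPS = [
    set("AVILMF"),  # hydrophobic
    set("FYW"),     # aromatic
    set("GA"),      # small
    set("ST"),      # hydroxy
    set("KRH"),     # basic
    set("DENQ"),    # acidic and their derivatives
]


def similar(x, y):
    """True if two residues are identical or share a similarity group."""
    if x == "-" or y == "-":
        return False
    return x == y or any(x in g and y in g for g in GROUPS)


def invariant_columns(alignment):
    """1-based positions where every sequence carries the same residue."""
    return [
        i + 1
        for i, col in enumerate(zip(*alignment))
        if col[0] != "-" and len(set(col)) == 1
    ]


def group_conserved(column, threshold=0.5):
    """True if one group (or one residue) covers >= threshold of the column."""
    counts = [sum(r in g for r in column) for g in GROUPS]
    counts += [column.count(r) for r in set(column) if r != "-"]
    return max(counts, default=0) >= threshold * len(column)


toy_alignment = ["GKSAD", "GKTVD", "GKSLE"]  # made-up aligned fragments
print(invariant_columns(toy_alignment))
print([group_conserved(col) for col in zip(*toy_alignment)])
```

note that a and f each belong to two groups, so similarity defined this way is not transitive; the functions above therefore test group membership directly instead of assigning each residue a single class.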
all in all, a certain degree of conservation was observed in about 40% of the positions of the alignment (highlighted in fig. 3 and further characterized in the legend to this figure). strikingly, four of the five invariant residues are located within the "a" and "b" sites of the ntp-motif (fig. 3). these sites and short sequence stretches surrounding them also contain a considerable number of additional coincidences and similar sequence replacements between proteins of different families. thus, the "a" and "b" consensus sequences and short adjacent segments are the most similar portions of the ntp-motif-containing proteins of positive-strand rna viruses. of additional interest is a comparison of the positions of the ntp-motif-containing proteins (domains) in viral multidomain proteins; this approach is illustrated in fig. 4. only two stretches of similar amino acid sequences are common to all viruses analyzed in this study: (1) the conserved region of the rna polymerase (morozov and rupasov 1985; koonin et al. 1987, 1988), and (2) the ntp-motif-containing domain (this paper). viruses whose proteins constitute the 2nd and 3rd families characterized above possess an additional protein sequence of significant similarity, i.e., the proteases of picorna-, como-, and potyviruses (franssen et al. 1984; carrington and dougherty 1987; domier et al. 1987). in all viruses with nonsegmented genomes (with the probable exception of coronaviruses), in cpmv b polyprotein, and in the furovirus rna 1 product (p237), the proteins (domains) containing similar sequence stretches are positioned in the same order within multidomain proteins, namely n-ntp-motif-containing domain-(protease)-polymerase-c (fig. 4). in coronaviruses the polymerase has not yet been identified.
however, the results of our preliminary analysis indicate that the polymerase domain also resides in f2, but its position relative to the ntp-motif-containing one is reversed as compared to the "canonical" array described above (unpublished observations). in any case, this single exception certainly does not invalidate the general trend for the specific positioning of these domains in viral multidomain proteins. comparative analysis of the amino acid sequences of all positive-strand rna virus rna polymerases provides a strong case for their monophyletic origin. we believe that the sequence similarity between the ntp-motif-containing proteins, together with their similar localization in viral multidomain proteins, indicates that they also constitute a monophyletic group.

fig. 3 (legend): an overall alignment of the evolutionarily conserved segments of the viral ntp-motif-containing proteins of the three families. only partial sequences of the conserved regions (table 1) were aligned; they encompass the n-terminal subdomains of the proteins of the 1st family, the sequences of the 2nd family without the five n-terminal amino acid residues, and complete sequences of the 3rd family. the residue numbers shown above the alignment are arbitrary; the numbering begins from the first residues of the aligned stretches and includes gaps. the sets of sequences of the three families aligned by the program optal are separated by blank lines. dots denote conservative positions. these are defined here as positions occupied by similar amino acid residues in at least 50% of the sequences of each of any two of the three families. the upper, middle, and lower rows of dots indicate the conservative positions of the 1st, 2nd, and 3rd families, respectively. thus, if a given position in the alignment contains dots, say, in the upper and lower rows, this indicates the conservation of residues (in the above sense) between the 1st and 3rd families, and so on. similar residues are defined as those belonging to one of the following groups: a, v, i, l, m, and f (hydrophobic); f, y, and w (aromatic); g and a (small); s and t (hydroxy-); k, r, and h (basic); d, e, n, and q (acidic and their derivatives); c and p have no similar residues. the pattern of highly conserved residues is shown under the aligned sequences, designated "cons" for consensus. uppercase letters correspond to invariant residues, and lowercase letters to those conserved in two out of three families; in the latter case, where a similar residue was conserved in the 3rd family, it was also indicated. * = a hydrophobic residue. the "a" (positions 6-13 in the alignment) and "b" (positions 93-98) sites of the ntp-motif are denoted by horizontal bars above and below the alignment. for viruses with segmented genomes, the specific designations of the rna segments encoding the ntp-motif-containing proteins are given in parentheses.

in the course of the present study we screened all the available protein sequences of positive-strand rna viruses for the presence of the ntp-motif. also, some additional searches have been made: (1) domains occupying positions similar to those of the ntp-motif-containing ones in viral multidomain proteins were searched for the possible presence of degenerate forms of the motif; and (2) partially sequenced proteins were tested for similarity to the ntp-motif-containing proteins. the "a" consensus sequence has been found in the c-terminal part of almv rna polymerase, in the capsid protein of yellow fever virus (a flavivirus), in ns1 proteins of four flaviviruses, and in the f1 polyprotein of ibv; also, a second "a" sequence (besides the one included in our alignment) is present in the furovirus p237. in the first three instances the consensus sequence was not conserved in the relatives of the respective proteins, suggesting that its occurrence was most likely fortuitous.
in the last two cases, the absence of other coronavirus and furovirus protein sequences precluded this type of analysis, leaving the significance of these observations uncertain. however, it should be noted that the segments of f1 and of p237 encompassing the consensus sequence bear no significant sequence similarity to the viral ntp-motif-containing domains described above (unpublished observations). flavivirus protein ns3, which occupies a position similar to that of alphavirus nsp2 in the polyproteins of these viruses, contains an "a" consensus sequence with a single deviation and a "b" sequence strikingly similar to those of the 1st and 3rd families of viral ntp-motif-containing proteins. comparison of the three available sequences of ns3, those of yellow fever, west nile, and dengue 2 flaviviruses (castle et al. 1986; yaegashi et al. 1986), demonstrated strict conservation of these sequences. a more detailed analysis that we recently performed revealed statistically significant similarity between the putative ntp-binding domains of ns3 and those of potyviral proteins (unpublished observations). it seems quite plausible that ns3 may have some degree of evolutionary and functional relatedness to the group of viral proteins described in this paper.

fig. 4 (legend, partial): ... similar sequence stretches are also joined by sloped lines. the ntp-motif-containing protein ("ntpase"), the rna-dependent rna polymerase (polymerase), and the protease (the latter identified in picorna-, como-, and potyviruses) are designated. names of specific proteins are given above each rectangle. other designations are: [symbol], sites of proteolytic processing; [symbol], leaky termination codons [of two alphaviruses included, nsp4 is expressed only in snbv via a leaky termination codon (takkinen 1986)]. all the information is given only for the ntp-motif-containing proteins (domains), the polymerases, and the parts of multidomain proteins enclosed between.
a striking similarity has been detected between the c-terminal sequence of the protein p120 encoded by bsmv rna α [for which only a partial sequence has been reported (rupasov et al. 1986)] and the c-terminal subdomain of the 1st family of viral ntp-motif-containing proteins. although the n-terminal part of the p120 sequence is not yet known, in all other proteins of this family the two subdomains are invariably observed together. thus, the hordeivirus genome, like the furovirus genome, probably encodes two ntp-motif-containing proteins in two genomic segments. in all other complete protein sequences of positive-strand rna viruses reported, namely those of black beetle virus (a nodavirus), carnation mottle virus, and rna bacteriophages, the consensus sequences of the ntp-motif have not been observed.

the ntp-motif was first introduced by walker et al. (1982) and was subsequently employed for localization of putative catalytic sites and for prediction of ntp-binding capacity in numerous proteins. however, for reasons mentioned in the introduction, the validity of the whole approach remained rather uncertain. in the present study we demonstrate that in a highly diverged group including similar proteins of positive-strand rna viruses, the consensus sequences of the ntp-motif constitute the most strictly conserved stretches, encompassing four of the five invariant amino acid residues. moreover, the ntp-motif-containing domain is one of the two most conserved domains revealed upon an overall comparison of the sequences of this class of virus proteins. this strongly suggests that this protein domain possesses ntp-binding capacity and possibly ntpase activity, presumably supplying some ntp-dependent function(s) of vital importance for viral reproduction. this hypothesis is in agreement with the available experimental data implicating these proteins in viral rna replication and with the reported ntp-binding capacity of tmv p126 (evans et al.
1985), although direct testing is certainly warranted. in fact, there is experimental evidence that clearly, though indirectly, demonstrates the importance of the ntp-motif in viral rna replication. recently several poliovirus mutants resistant to or dependent on guanidine, a potent inhibitor of rna replication of some picornaviruses, have been thoroughly studied (pincus et al. 1986, 1987). they all mapped to the 2c protein, with the amino acid replacements located in the proximity of the "a" and "b" sites of the ntp-motif, or near the conserved asn residue in the 183rd position of the segments of 2c aligned in this paper (fig. 3). dever et al. (1987) have recently proposed a consensus for gtp-binding domains that includes, in addition to the "a" and "b" sites of the ntp-motif, a third highly conserved sequence element thought to determine the specificity for guanosine. they identified this sequence in the 2c protein of one serotype of fmdv (but not of the other picornaviruses) and suggested that this protein should possess specific gtp-binding capacity, as opposed to other picornaviral 2c proteins. however, it would be unprecedented for proteins so closely related to have different specificities for nucleotides. in our opinion, it is much more likely that, within groups of highly similar ntp-motif-containing proteins such as picornaviral 2c, the substrate specificities and other principal properties should be identical. on the other hand, when considering more distantly related proteins, such as those belonging to the three distinct families described above, one cannot exclude the possibility that such proteins might differ significantly in their activities and functions in viral reproduction. it seems premature to discuss at length the possible significance of the present observations for understanding the evolution of positive-strand rna viruses. two trends, however, are obvious.
first, ntp-motif-containing proteins are nearly ubiquitous among eukaryotic positive-strand rna viruses. the existing classification of these viruses (matthews 1982) includes about 30 families (groups), or somewhat more, taking recent developments into consideration. for 13 of these, complete or nearly complete genomic sequences are available. proteins containing the typical ntp-motif were observed in nine families (table 1); in addition, viruses of one family (flaviviridae) probably possess a functionally related protein with a deviant motif. it is tempting to speculate that an ntp-dependent function supported by the amino acid residues constituting the ntp-motif may be indispensable for positive-strand viral rna replication; in some cases this function may be supplied by cellular proteins. in this context it is notable that the rna replicase of single-stranded rna bacteriophages contains the translation elongation factor tu, an ntp-motif-containing gtpase, as one of its subunits (reviewed by blumenthal 1979). second, it appears that the sequence diversity of the ntp-motif-containing proteins as revealed here does not precisely reflect the "phenotypic" diversity of viruses that forms the basis for the existing classification. of the nine virus families (groups) having ntp-motif-containing proteins, six contribute members to the 1st family of proteins (see above), two to the 2nd, and one to the 3rd family. thus, the 1st family covers a very broad range of viral groups differing greatly in their genomic strategies and biological properties. it is anticipated that sequencing of genomes of new viral groups will add new members to this family. the ntp-motif (or the "a" consensus alone) has been identified in an extremely large class of ntp-binding proteins, mostly ntpases (although it should be noted that the presence of this motif is not an absolute prerequisite for ntp-binding capacity).
the ntp-motif-containing proteins include a large group of gtpases, namely the ras family, g proteins, transducins, and some of the translation initiation and elongation factors (dever et al. 1987 and references therein). also belonging to this class are numerous proteins involved in bacterial dna synthesis, recombination, and repair, and in membrane transport (doolittle et al. 1986; finch et al. 1986a,b; higgins et al. 1986; husain et al. 1986; yin et al. 1986; gilchrist and denhardt 1987), proteins implicated in multidrug resistance in mammalian cells (chen et al. 1986; gros et al. 1986), and several ntp-utilizing enzymes of dna viruses (gorbalenya et al. 1985; anton and lane 1986; doolittle 1986a; astell et al. 1987, and references therein). from this incomplete list it is obvious that the presence of the ntp-motif brings together numerous proteins with extremely diverse functions. it must be emphasized that many ntp-motif-containing proteins do not bear statistically significant similarity to each other (cf. argos and leberman 1985; doolittle 1986a) and the existence of distinct monophyletic groups of such proteins (excluding very closely related ones, such as different ras species) is not obvious a priori. nevertheless, the ntp-motif-containing proteins of positive-strand rna viruses do constitute such a family, whereas the gtpases probably constitute another. although widespread, ntp-motif-containing proteins are not strictly ubiquitous in all biological species. specifically, this motif could not be found upon screening of the protein sequences of two large viral classes, negative-strand rna viruses and retroid viruses (unpublished observations). thus, the presence of proteins of this class in the majority of eukaryotic positive-strand rna viruses appears to be a nontrivial observation, given their small genome size.
as for the value of the ntp-motif as a predictor of protein function, we believe that searching amino acid sequences for this motif (and conceivably for other sequence motifs of this kind) may be a very powerful methodology, if accompanied by phylogenetic analysis.

during preparation and reviewing of this manuscript, important relevant information became available. genome sequences of viruses of three more groups that encode ntp-motif-containing proteins were determined. these are tobacco rattle virus [a tobravirus (hamilton et al. 1987)], white clover mosaic virus and potato virus x [two potexviruses (forster et al. 1988; krayev et al. 1988)], and tomato black ring virus [a nepovirus (c. fritsch, personal communication)]. the presumptive ntp-binding domain of the tobravirus is closely related to that of tmv and clearly belongs to the 1st family of viral ntp-motif-containing proteins described above. the genomes of potexviruses each encode two ntp-motif-containing proteins. these proteins also belong to the 1st family, but their inclusion in the alignment further loosens the consensus. interestingly, in some positions of the potexvirus proteins, residues otherwise invariant in the 1st family are replaced by those characteristic of the 2nd family. the nepovirus ntp-motif-containing protein belongs to the 2nd family. thus, the new data appear to confirm our prediction that sequencing of genomes of viruses belonging to new groups should add members mainly to the 1st family of ntp-motif-containing proteins. also, the genome sequence of southern bean mosaic virus, a sobemovirus, has been determined (wu et al. 1987). the authors claimed that it encoded a presumptive ntp-binding domain. however, a more detailed analysis indicates that this domain probably fulfills an entirely different function, namely that of a protease, with its sequence being strikingly similar to those of picornaviral proteases (gorbalenya et al. 1988a).
thus, sobemoviruses may lack an ntp-motif-containing protein, in which they resemble other positive-strand rna viruses of small genome size (namely nodaviruses, carnation mottle virus, and rna phages). comparison of the sequences of the 1st family of viral ntp-motif-containing proteins with those of several bacterial helicases revealed highly significant similarity, suggesting an rna helicase function for these proteins (gorbalenya et al. 1988b,c; hodgman 1988).

nucleotide sequence of the brome mosaic virus genome and its implications for viral replication sindbis virus proteins nsp1 and nsp2 contain homology to nonstructural proteins from several rna plant viruses the nucleotide sequence of the coding region of tobacco etch virus genomic rna: evidence for the synthesis of a single polyprotein non-structural protein 1 of parvoviruses: homology to purine nucleotide using proteins and early proteins of papovaviruses homologies and anomalies in primary structural patterns of nucleotide binding proteins similarity in gene organization and homology between proteins of animal picornaviruses and a plant comovirus suggest common ancestry of these virus families structural and functional homology of parvovirus and papovavirus polypeptides qβ rna replicase and protein synthesis elongation factors ef-tu and ef-ts completion of the sequence of the genome of the coronavirus avian infectious bronchitis virus nucleotide sequence of beet necrotic yellow vein virus rna-2 nucleotide sequence of beet necrotic yellow vein virus rna-1 consensus topography in the atp binding site of the sv40 and polyomavirus large tumour antigens small nuclear inclusion protein encoded by a plant potyvirus genome is a protease the complete nucleotide sequence of the rna coding for the primary translation product of foot and mouth disease virus primary structure of the west nile flavivirus genome region coding for all nonstructural proteins roninson ib (1986) internal duplication and homology with bacterial
acknowledgments. the authors are deeply grateful to professor v.i. agol for constant interest and encouragement, to dr. k.m. chumakov for help with some of the computer programs, to dr. s.y. morozov for useful discussions, and to drs. c. fritsch and s.y.
morozov for communicating their sequence data prior to publication. key: cord-324021-y1vr1db0 authors: kozak, m. title: determinants of translational fidelity and efficiency in vertebrate mrnas date: 1994-12-31 journal: biochimie doi: 10.1016/0300-9084(94)90182-1 sha: doc_id: 324021 cord_uid: y1vr1db0 abstract this article reviews current knowledge on the mechanisms affecting the fidelity of initiation codon selection, and discusses the effects of structural features in the 5′-non-coding region on the efficiency of translation of messenger rna molecules. two questions about the initiation of protein synthesis in higher eukaryotes are considered here: i) how do ribosomes find the correct aug codon for initiation of translation?; and ii) how is the efficiency of translation modulated by aspects of mrna structure, especially near the 5' end? the suggested answers to both questions are most easily understood by invoking the scanning model for initiation [1], which postulates that the 40s ribosomal subunit binds initially at the capped 5' end of the mrna and migrates linearly until it encounters the first aug codon, at which point the 60s subunit joins and the resulting 80s ribosome is poised to form the first peptide bond. evidence in support of the scanning mechanism has been summarized previously [1, 2]. the fidelity of initiation (ie selection of the correct start site) is determined primarily by the position of the aug codon relative to the 5' end of the mrna, with contributions from the surrounding primary sequence and in some cases from downstream secondary structure. the importance of the fidelity of initiation can be grasped intuitively, but the point is made concrete by reports in which truncated proteins, initiated inappropriately from internal aug codons, have been shown to be unstable, or sorted improperly, or capable of interfering with the function of the full-length protein [3-5].
the dominant role of position in determining the site of initiation has been shown experimentally by introducing aug codons upstream from the normal start site: insertion of a strong, upstream, out-of-frame aug codon dramatically inhibits translation, while a strong, upstream, in-frame aug codon supplants the original site of initiation (reviewed in [1]). a rigorous test of the latter point was carried out by constructing an mrna in which the translational start site, contained within a block of 66 nucleotides derived from the rat preproinsulin gene, was reiterated four times in tandem [6]. although each of the four repeats contained an in-frame aug codon in an identical context, ribosomes initiated exclusively from the first aug codon in the tandem array [6]. this experiment gave a clear demonstration of the 'first-aug rule' because the initiator codon in preproinsulin mrna occurs in (what was later recognized to be) a good context. initiation may not be limited to the first aug codon when the surrounding context is less favorable, as discussed below. systematic mutagenesis of nucleotides in the vicinity of the aug codon revealed that gccaccaugg is the optimal context for initiation of translation in vertebrates [7, 8]. the experimentally determined optimal context matches the consensus sequence derived from inspection of published vertebrate mrna sequences [9]. in experimental tests of context effects, the strongest contributors were a purine (preferably a) in position -3 and a g in position +4. (in the numbering scheme used here, the a of the aug codon is designated +1, with positive and negative integers proceeding 3' and 5', respectively.) the combination of a in position -3 and g in position +4 can increase translation >10-fold in vivo [7] and in vitro [10]; and those are the two most highly conserved positions in the leader sequences of vertebrate mrnas [9].
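the context rule described above can be written down compactly. the following is a minimal python sketch, not a method taken from the article: the function name `kozak_context` and the three labels ('strong', 'adequate', 'weak') are assumptions introduced here for illustration, and only the two positions singled out in the text (a purine at -3, a g at +4) are scored.

```python
def kozak_context(mrna: str, aug_index: int) -> str:
    """Classify the context of the AUG whose A sits at aug_index (0-based).

    Numbering follows the text: the A of AUG is +1 and there is no position 0,
    so position -3 is three nucleotides 5' of the A and position +4 is the
    nucleotide immediately 3' of the G of AUG.
    The labels returned are illustrative assumptions, not terms from the source.
    """
    seq = mrna.upper().replace("T", "U")
    if seq[aug_index:aug_index + 3] != "AUG":
        raise ValueError("no AUG at the given index")

    # Missing flanking nucleotides (AUG too close to either end) count as unfavorable.
    minus3 = seq[aug_index - 3] if aug_index >= 3 else ""
    plus4 = seq[aug_index + 3] if aug_index + 3 < len(seq) else ""

    purine_at_minus3 = minus3 in ("A", "G")
    g_at_plus4 = plus4 == "G"

    if purine_at_minus3 and g_at_plus4:
        return "strong"
    if purine_at_minus3 or g_at_plus4:
        return "adequate"
    return "weak"


if __name__ == "__main__":
    # The optimal vertebrate context quoted in the text, with the A of AUG at index 6.
    print(kozak_context("GCCACCAUGG", 6))
    # Pyrimidines at both -3 and +4, the 'extremely poor' case mentioned in the text.
    print(kozak_context("UUUUUUAUGU", 6))
```

a finer-grained score (for example, weighting a over g at -3, or including positions -1 to -6) would need the quantitative data in [7, 8] and is deliberately left out of this sketch.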
the small number of vertebrate mrnas in which the aug initiator codon occurs in an extremely poor context (eg pyrimidines in positions -3 and +4) includes several growth factor and cytokine genes. in these cases, the poor context might be a deliberate ploy to throttle the expression of potent proteins, the overproduction of which might be deleterious. although context effects have been studied most thoroughly in vertebrate systems, there is some evidence that a in position -3 and g in position +4 augment aug-codon recognition in plant [10-12] and insect [13] translation systems. s cerevisiae is the only organism thus far studied in which context effects seem to contribute only slightly to aug-codon recognition [14-16, 88]. in vertebrates, the deleterious effects of a suboptimal context can be mitigated by downstream secondary structure [17], which has been postulated to slow scanning and thus to provide more time for the 40s ribosomal subunit to recognize the aug codon. by extension, the absence of strong context effects in s cerevisiae might be rationalized by postulating that the rate of scanning by 40s ribosomal subunits is inherently slow in lower eukaryotes, but that idea awaits experimental study. in vertebrate mrnas in which the first aug codon is in the optimal context, the usual outcome is that all 40s ribosomal subunits stop scanning at the first aug and translation initiates uniquely from that site. when the first aug codon is in a suboptimal context vis-a-vis positions -3 and/or +4, some 40s ribosomes bypass that site and initiate instead at the second (or, rarely, even the third) aug. this 'leaky scanning' mechanism thus enables two independently-initiated proteins to be produced from one mrna.
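the first-aug rule and its leaky-scanning exception can be illustrated with a small simulation of the scan. this is a hedged sketch under a deliberately crude assumption: an aug halts every 40s subunit only when it has both a purine at -3 and a g at +4, and any other aug lets some subunits continue downstream. the helper names `is_optimal` and `candidate_start_sites` are hypothetical, and effects of cap proximity, secondary structure and reinitiation are ignored.

```python
def is_optimal(seq: str, i: int) -> bool:
    """True if the AUG at index i has a purine at -3 and a G at +4 (assumed cutoff)."""
    return i >= 3 and i + 3 < len(seq) and seq[i - 3] in "AG" and seq[i + 3] == "G"


def candidate_start_sites(mrna: str) -> list[int]:
    """Walk 5' to 3' and list the AUG positions a scanning subunit could use.

    Every AUG met on the way is recorded; scanning stops at the first AUG in an
    optimal context, because under the model in the text no subunits pass it.
    """
    seq = mrna.upper().replace("T", "U")
    sites = []
    for i in range(len(seq) - 2):
        if seq[i:i + 3] == "AUG":
            sites.append(i)
            if is_optimal(seq, i):
                break
    return sites


if __name__ == "__main__":
    # First AUG (index 5) has pyrimidines at -3 and +4, so scanning is leaky;
    # the second AUG (index 17) sits in GCCACCAUGG and captures the remaining subunits;
    # the third AUG (index 23) is never reached.
    print(candidate_start_sites("GGUUUAUGUCCGCCACCAUGGCCAUGG"))
```

the output lists positions only; it says nothing about how initiation is partitioned between the sites, which the text treats as an experimental question.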
nearly 30 examples of bifunctional mrnas that fit this description have been identified [18] and, in ten cases, the postulated connection between leaky scanning and a suboptimal context has been confirmed by mutational analysis [19-28]. thus, the importance of context for recognition of the aug initiator codon has been confirmed in many laboratories. while most instances of leaky scanning are attributable to a poor context around the first aug codon, in rare cases the second aug codon may be accessible because the first aug lies too close to the cap to be recognized efficiently [16, 29-31]. whereas an unfavorable context around the first aug codon enables some 40s ribosomes to scan past that site and initiate farther downstream, the presence of a highly favorable context (possibly augmented by downstream secondary structure [17]) around certain non-aug codons, such as cug or acg or gug, may encourage 40s ribosomal subunits to pause and initiate at these adventitious, upstream sites in addition to initiating at the first aug codon. initiation of translation from an upstream non-aug codon is growth-regulated or developmentally-regulated in some cases [32, 33]. in a few cases, the n-terminally extended protein initiated from an upstream non-aug codon has been found to function differently from the shorter protein initiated from the first aug codon ([34-36]; see [89], this issue). but it would be wrong to expect a priori that every polypeptide initiated from an alternative upstream site serves a special function. in some cases, initiation from spurious upstream sites may be an inadvertent consequence of the passage of 40s ribosomal subunits across the entire 5' non-coding region en route to the aug start codon. (because initiation at codons other than aug occurs rarely and usually inefficiently in eukaryotes, some genetic diseases are attributable to point mutations in the aug initiator codon [37].
on the other hand, the fact that some non-aug codons can support at least a low level of initiation explains the moderate phenotype of some genetic diseases [38].) initiation at non-aug codons is best evaluated in vivo, since it can be artificially enhanced by choosing inappropriate reaction conditions in vitro [10]. the leaky scanning that occurs when the first aug codon is in an unfavorable context provides one means of escape from the rule that eukaryotic ribosomes are limited to initiating at the first aug codon. the ability of eukaryotic ribosomes to reinitiate (see below) provides a second escape from the first-aug rule. these escape mechanisms mean that the scanning model does not have to be abandoned as the list of cdna sequences with aug-burdened leader sequences grows longer [18]. there are rules (leaky scanning, reinitiation) that allow the first-aug rule to be broken. but breaking the rule usually is paid for by a reduction in translational efficiency, and therefore a cdna sequence in which the presumptive 5' noncoding sequence has many upstream aug codons should be treated sceptically. indeed, follow-up studies have revealed that many cdna sequences with problematical leader sequences do not correspond to functional mrnas. some aug-burdened 5' noncoding sequences have been traced to artifacts during cdna construction and cloning [1, 18]. more interestingly, some cdna sequences have been shown to derive from mrna precursors that still retain a 5' intron: the upstream aug codons reside within the intron and therefore do not compromise translation of the mature mrna [18]. another common solution to the upstream-aug conundrum is promoter switching; ie in certain tissues or under certain growth conditions, activation of a downstream promoter produces a second form of mrna that lacks the long, aug-burdened leader sequence and supports translation more efficiently [18, 39-41].
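the advice to treat aug-burdened cdna leaders sceptically suggests a simple screen: list every atg upstream of the annotated start codon and note whether it is in frame with that start. the sketch below is an illustration only; the function name `upstream_augs` and the tuple layout are assumptions, and the annotated start index must be supplied by the user.

```python
def upstream_augs(cdna: str, start_index: int) -> list[tuple[int, bool]]:
    """Return (index, in_frame) for each AUG lying wholly 5' of the annotated start.

    cdna        -- leader plus coding sequence, DNA or RNA alphabet
    start_index -- 0-based index of the A of the annotated initiator codon
    in_frame    -- True when the upstream AUG shares the reading frame of the start
    """
    seq = cdna.upper().replace("T", "U")
    if seq[start_index:start_index + 3] != "AUG":
        raise ValueError("annotated start is not an AUG/ATG codon")

    hits = []
    # i + 3 <= start_index keeps the whole upstream triplet inside the leader.
    for i in range(max(0, start_index - 2)):
        if seq[i:i + 3] == "AUG":
            hits.append((i, (start_index - i) % 3 == 0))
    return hits


if __name__ == "__main__":
    # Hypothetical 12-nt leader followed by the start codon at index 12.
    leader_and_start = "CATGCCATGAAGATGGCT"
    for index, in_frame in upstream_augs(leader_and_start, 12):
        print(index, "in-frame" if in_frame else "out-of-frame")
```

a leader that returns many hits is not necessarily wrong, but per the text it is a prompt to check for cloning artifacts, a retained 5' intron, or an alternative downstream promoter before assuming the mrna is functional as drawn.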
an important ancillary lesson from these studies is that the rather insensitive northern blotting assay does not always detect all the functionally important transcripts from complex genes. even with the exhaustively studied sv40 system, for example, new mrna species have been found recently by devising more sensitive assays [42]. the importance of not accepting northern blots as the definitive measure of transcription is further discussed elsewhere [2] in connection with the problematical 'internal initiation' hypothesis. the efficiency of initiation of translation (ie the yield of protein per unit of mrna) is affected by five structural elements near the 5' end of the mrna: i) the m7g cap; ii) the primary sequence or context surrounding the aug codon; iii) the position of the aug codon (ie whether or not it is 'first'); iv) secondary structure; and v) leader length. because the interplay of these five features in controlling the translation of synthetic mrnas was reviewed previously [43], i will summarize succinctly what we have learned. the ability of the m7g cap to increase translational efficiency was first shown by dr aaron shatkin and has been confirmed many times since. as far as we know, all cellular mrnas and nearly all viral mrnas are capped, and therefore this nearly universal requirement probably does not underlie common differences in translational efficiency. the second structural feature that affects the initiation of translation is the gccacc...g sequence flanking the aug codon; this was discussed in the preceding section. the third feature, the position of the aug codon, has important effects on both the fidelity and efficiency of initiation. when the first aug codon is in a favorable context, virtually all 40s ribosomes will initiate at that site, usually to the exclusion of downstream sites.
when the first aug codon is followed shortly by an in-frame terminator codon, however, some initiation from downstream site(s) may occur ([44] and references cited therein). the usual interpretation is that, after an 80s ribosome has translated the first small orf, the 60s ribosomal subunit dissociates while the 40s subunit remains bound to the mrna, resumes scanning, and reinitiates at the next aug downstream. while this mechanism gets around the limitation of the first-aug rule, there is a cost in terms of efficiency because, in higher eukaryotes, reinitiation is nearly always inefficient. (in yeast the efficiency of reinitiation can be regulated [45], but in multicellular eukaryotes only low-level constitutive reinitiation has been documented. in all eukaryotes the ability to reinitiate appears to be limited to the 5' end of the mrna, ie reinitiation is possible following the translation of a short orf but not following the translation of a full-length cistron.) the reduction in translational efficiency imposed by an upstream orf, and hence by the need to reinitiate, appears to be important in controlling the developmental [46] or tissue-specific [47] expression of some genes, modulating the synthesis of some proteins that might be harmful if overproduced [48], and regulating the replication [49] or pathogenicity [50, 51] of some viruses. the presence of spurious upstream orfs might also be a means to prevent the translation of unrearranged immunoglobulin genes [52]. the fourth structural feature that we have explored using synthetic mrnas is secondary structure. a base-paired structure near the 5' end of the mrna can affect translation in ways that depend on the stability and position of the structure. a very stable stem-and-loop structure (ΔG = -61 kcal/mol, calculated according to tinoco et al's rules [53]) inhibits translation profoundly by blocking an early step in initiation.
we showed that when a hairpin of this sort was positioned 72 nucleotides from the m7g cap, a 40s ribosomal subunit was able to bind to the mrna and apparently migrate up to, but not through, the base-paired structure [54]. in contrast with the strong inhibition imposed in vitro [54] and in vivo [55] by structures in the range of -50 to -61 kcal/mol, a -30 kcal hairpin structure inhibited translation only when it was close to (eg 12 nucleotides from) the 5' end of the mrna. in that position, the base-paired structure prevented 40s ribosomal subunits from engaging the mrna [54]. when the same -30 kcal structure was repositioned 52 nucleotides from the 5' end, however, it no longer inhibited translation [54]. a reasonable interpretation is that, as long as there is room for a 40s ribosomal subunit to bind at the 5' end of the mrna, the subsequent migration of the 40s ribosome/factor complex (the set of initiation factors associated with 40s ribosomal subunits at this stage has not been defined exactly. as discussed elsewhere [2], the putative (limited) helicase activity of eif-4a has not been directly implicated in scanning) can disrupt base-paired structures that occur downstream. there is a limit to this ability, however, as evidenced by the aforementioned inhibitory -61 kcal hairpin. it is striking that a -30 kcal hairpin, positioned some distance from the cap, did not impair translation even when the base-paired domain included the aug initiator codon [54, 55]. this is profoundly different from the situation in prokaryotes where translation is virtually abolished by marginally stable (ΔG = -10 kcal/mol) base-paired structures that impinge on the aug codon. the explanation, of course, is that prokaryotic ribosomes engage the mrna near the aug codon while eukaryotic ribosomes enter upstream, near the m7g cap.
in contrast with the foregoing predictable inhibitory effects of upstream secondary structure, a modest amount of secondary structure (ΔG = -19 kcal/mol) positioned downstream from the aug codon was found to augment translation, apparently by suppressing the leaky scanning that would otherwise have occurred when the aug codon was in a suboptimal context [17]. this unexpected positive effect of downstream secondary structure was strongest when the base-paired structure was positioned 14 nucleotides from the aug codon [17]. since rnase-protection experiments have shown that a ribosome bound at the aug codon protects 12 to 15 nucleotides 3' of the aug codon [56], we postulated that a hairpin positioned 12 to 15 nucleotides downstream from the aug codon causes the scanning 40s ribosome to pause with its aug-recognition center right over the aug codon, thereby favoring initiation. (if this explanation for the enhancing effect of downstream secondary structure is correct, it would seem that the -30 kcal hairpin positioned shortly upstream from the aug codon must also have slowed scanning, even though the -30 kcal hairpin did not reduce translational efficiency [54, 55]. in other words, pausing during scanning is not necessarily inhibitory, and may even be helpful, as long as the ribosome, after pausing, can eventually move on.) although this phenomenon has been demonstrated so far only with synthetic transcripts, it is interesting that some natural mrnas in which the aug codon occurs in a weak primary sequence context have the potential to form a base-paired structure in an appropriate position downstream [57-61]. the ability of secondary structure to slow scanning, and thus to favor initiation, might also encourage adventitious initiation from upstream non-aug sites, as discussed elsewhere [17, 43]. the fifth structural feature that has been shown to modulate the translation of test transcripts is leader length.
lengthening the 5' non-coding sequence beyond the 20 or so nucleotides (exactly how long the 5' non-coding sequence has to be to ensure recognition of the first aug codon depends on whether the sequence 3' of the aug codon is structured or unstructured [29]) required for the fidelity of initiation can dramatically increase the efficiency of translation in vitro [62] and, under certain conditions, in vivo [63]. the increased efficiency was clearly attributable to leader length, rather than to any particular sequence, inasmuch as insertion of three different synthetic oligonucleotides, each 60 nucleotides long, stimulated translation identically [62]. the only feature common to all three sequences was a paucity of g residues, which ensured against the formation of secondary structure. the ability of long leader sequences to enhance translation irrespective of the particular nucleotide sequence might be explained by the apparent binding of extra 40s ribosomal subunits to such mrnas [62]. it seems likely, although not proven, that this 'early recruiting' of 40s subunits gives mrnas an advantage under conditions of competition. the facilitating effect of leader length has not yet been verified widely in other laboratories, perhaps because a long 5' non-coding sequence helps only if it lacks secondary structure; most random rna sequences do not meet that requirement. the five structural features discussed above were delineated primarily by studying synthetic leader sequences, an approach that engenders clear results because it enables each feature to be studied in isolation. with natural mrnas, in contrast, the effects of flanking nucleotides on aug codon recognition might be underestimated if there happen to be downstream secondary structure(s) that slow scanning and thus compensate for a poor context. and the potential advantage of lengthening the 5' non-coding sequence might be missed if the long leader sequence contains inhibitory secondary structures.
notwithstanding these and other complications encountered with natural mrna sequences, some progress has been made in understanding why certain mrnas are translated more efficiently than others. because probing of various 'good' mrna leader sequences has failed to identify any particular motif capable of enhancing translation [64-67] (in a few cases where a particular sequence has been claimed to facilitate translation [68-71], the possibility that the so-called enhancer sequence works simply by reducing secondary structure and/or increasing leader length was not rigorously evaluated), a reasonable view is that a moderately long, moderately unstructured 5' non-coding sequence may be sufficient to support efficient translation. in the case of β-globin mrna, for example, we found that translational efficiency was not reduced by substitutions introduced in nearly every position of the 53-nucleotide leader sequence [72], suggesting that the 5' non-coding domain of this unusually good mrna contains no special effector motifs. consistent with the view that efficient translation requires only a moderately long leader sequence that is not extensively base-paired, probing of the α- and β-globin mrna leader sequences and derivatives thereof revealed a perfect inverse correlation between 5' secondary-structure content and translational efficiency [72]. the long-noted [73] two-fold difference in translatability between α- and β-globin mrnas may thus be explained. significantly, the secondary structure that apparently restricts translation of wild type α-globin mrna is far less stable than the -30 kcal hairpin structure discussed above in connection with synthetic mrnas. the work with α-globin mrna suggests that even a -10 kcal/mol base-paired structure can limit translation, under some reaction conditions, when the structure occurs very near the 5' end of the mrna. the poor translation of many mrnas that encode oncoproteins, growth factors, transcription factors and other critical regulatory proteins can probably be attributed to the highly structured leader sequences on these mrnas [18]. in some cases, for example, the g+c content of these leader sequences is 80% or greater, which implies an extraordinary potential for base pairing. but secondary structure might also limit the translation of mrnas in which the g+c bias is less extreme. as illustrated in figure 1, for example, the 5' ends of ribosomal protein mrnas might form secondary structures that are comparable in strength to the structure that restricts α-globin translation. with secondary structures in this modest range, inhibition of translation would not be absolute; inhibition should and does depend on the proximity of the structure to the 5' end of the mrna [81] and on the imposition of conditions that force mrnas to compete [82].
[figure 1 caption, partially recovered: [74-80]. the atg (aug) codon shown in boldface is the start of the protein coding domain. participation of the coding domain in some of the proposed structures does not necessarily contradict evidence from leader-shuffling experiments that translational regulation is a function of the 5' non-coding domain [81], since secondary structure could be preserved if a gc-rich sequence in the reporter gene were to substitute for a gc-rich segment of the original ribosomal protein coding sequence.]
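the g+c figure quoted above is easy to compute for any leader sequence. the sketch below is illustrative only: the function names and the use of 0.80 as a flagging threshold are assumptions taken from the '80% or greater' remark in the text, and a high g+c fraction merely indicates potential for base pairing; it does not replace a free-energy calculation such as the one in [53].

```python
def gc_fraction(leader: str) -> float:
    """Fraction of G and C residues in a 5' non-coding sequence (DNA or RNA)."""
    seq = leader.upper()
    if not seq:
        raise ValueError("empty leader sequence")
    return (seq.count("G") + seq.count("C")) / len(seq)


def is_gc_rich(leader: str, threshold: float = 0.80) -> bool:
    """Flag leaders at or above the threshold (default assumed from the text's 80%)."""
    return gc_fraction(leader) >= threshold


if __name__ == "__main__":
    # Hypothetical leader sequences, not taken from any real mRNA.
    for name, leader in [("gc-rich example", "GGCCGGCCAU"), ("g-poor example", "AUUACAAUUA")]:
        print(name, round(gc_fraction(leader), 2), is_gc_rich(leader))
```

because the text notes that even modest structures near the cap can limit translation, a leader that falls below the threshold should not be read as structure-free.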
notwithstanding the failure of mutational analyses to identify cis-acting effector motifs [64-67, 72], some investigators continue to assert vague claims that a particular mrna leader sequence contains a specific motif that facilitates translation [68-71], and some continue to postulate special trans-acting factors to explain the preferential translation [71, 83, 84]. but the fact is that no mrna-specific translational initiation factor has yet been demonstrated. it is easy enough to find proteins that bind to the 5' non-coding sequence of certain mrnas, but binding is not indicative of function, and attempts to show stimulation of translation by such proteins have not been very convincing [85]. (even negative regulation requires more than mere binding of a protein to the mrna leader sequence. to inhibit translation, a potential repressor protein has to bind with very high affinity to a site near the 5' end [86, 87].) in contrast with the lack of compelling evidence for positive-acting, mrna-specific cis- and trans-acting elements involving the 5' non-coding sequence, there is growing evidence for positive-acting elements near the 3' end of eukaryotic mrnas. what is not clear, however, is whether these 3' non-coding elements act directly at the level of translation. while these and other special instances of translational regulation remain to be unraveled, the five recognized structural elements in 5' non-coding sequences (cap, context, position, secondary structure, and leader length) seem capable of explaining many aspects of translational regulation in higher eukaryotes.
the scanning model for translation: an update a consideration of alternative models for the initiation of translation in eukaryotes thyroid hormone receptor transcriptional activity is potentially autoregulated by truncated forms of the receptor tracheal u (1992) n-terminal truncation of salmon calcitonin leads to calcitonin antagonists mutation eliminating mitochondrial leader sequence of methylmalonyl-coa mutase causes mut0 methylmalonic acidemia translation of insulin-related polypeptides from messenger rnas with tandemly reiterated copies of the ribosome binding site point mutations define a sequence flanking the aug initiator codon that modulates translation by eukaryotic ribosomes at least six nucleotides preceding the aug initiator codon enhance translation in mammalian cells an analysis of 5' non-coding sequences from 699 vertebrate messenger rnas context effects and (inefficient) initiation at non-aug codons in eukaryotic cell-free translation systems expression of bacterial chitinase protein in tobacco leaves using two photosynthetic gene promoters cavener dr (1991) translation initiation in drosophila melanogaster is reduced by mutations upstream of the aug initiator codon mutational analysis of the his4 translational initiator region in saccharomyces cerevisiae influence of the three nucleotides upstream of the initiation codon on expression of the e coli lacz gene in s cerevisiae. nucleic acids res 21 hopper ak (1991) mrna leader length and initiation codon context determine alternative aug selection for the yeast gene mod5. downstream secondary structure facilitates recognition of initiator codons by eukaryotic ribosomes an analysis of vertebrate mrna sequences: intimations of translational control ribosomal initiation from an acg codon in the sendai virus p/c mrna a liver-enriched transcriptional activator protein, lap,
and a transcriptional inhibitory protein, lip, are translated from the same mrna control of start codon choice on a plant viral rna encoding overlapping genes translation of bicistronic viral mrna in transfected cells: regulation at the level of elongation nucleotide sequence responsible for the synthesis of a truncated coat protein of brome mosaic virus strain atcc66 ccaat/enhancer-binding protein mrna is translated into multiple proteins with different transcription activation potentials mechanism of translation of monocistronic and multicistronic hiv type 1 mrnas mechanisms of synthesis of virion proteins from the functionally bigenic late mrnas of sv40. j virol 62 preferential ribosomal scanning is involved in the differential synthesis of the hbv surface antigens from subgenomic transcripts processing of rotavirus glycoprotein vp7: implications for the retention of the protein in the endoplasmic reticulum a short leader sequence impairs the fidelity of initiation by eukaryotic ribosomes. gene expression 1,
111-115 a small highly basic protein is encoded in overlapping frame within the p gene of vesicular stomatitis virus two forms of the major barley stripe mosaic virus nonstructural protein are synthesized in vivo from alternative initiation codons spotts gd (1992) translational activation of the non-aug-initiated c-myc 1 protein at high cell densities due to methionine deprivation multiple molecular weight forms of bfgf are developmentally regulated in the central nervous system translation of equine infectious anemia virus bicistronic tat-rev mrna requires leaky ribosome scanning of the tat ctg initiation codon a 31-amino-acid n-terminal extension regulates c-crk binding to tyrosine-phosphorylated proteins site-directed mutagenesis of aav type 2 structural protein initiation codons: effects on regulation of synthesis and biological activity human gene mutations affecting rna processing and translation an initiation codon mutation in cd18 in association with the moderate phenotype of leukocyte adhesion deficiency enhanced translational efficiency of a novel transforming growth factor β3 mrna in human breast cancer cells effect of growth hormone on levels of differentially processed igf-i
mrnas in total and polysomal mrna populations c/s-acting elements in the 5' untranslated region of rat testis proenkephalin mrna regulate translation of the precursor protein deppert w 0993) independent expression of the transforming amino-terminal domain of sv40 large t antigen from an alternatively spliced third sv40 early mrna structural features in eukaryotic mrnas that modulate the initiation of translation effects of intereistmnic length on the efficiency of reinitiation by eocaryotic ribosomes involvement of an initiation factor and protein phosphorylation in translational control ofgcn4 mrna adams tk h993) translational repression of brla expression prevents premature development in aspergillus regulation of tissue-specific expression of the esterase s gene in drosophila virilis biosynthesis of human fibroblast growth factor-5 aiterations of the three short orfs in the rous sarcoma virus leader rna modulate viral replication and gone expression systematic movement of an rna plant virus determined by a point substitution in a 5' leader sequence a translation-attenuating intraleader orf is selected on coronavirus mrnas during persisw, m infection mrna transcripts initiating within the human ig mu heavy chain enhancer region contain a non-translatable exon improved estimation of secondary structure in ribonucleic acids circumstances and mechanisms of inhibition of translation by secondary structure in eucaryotic mrnas influences of mrna secondary structure on initiation by eukaryotic ribosomes sequences of two 5'-terminal ribosome-protected fragments from reovirus messenger rnas the human galactose-l-phosphate uridyitransferase gone 993) interleugin-i3 is a new human lymphokine regu!afing inflammalory and immune responses the amphiregulin gene encodes a novel epidermal growth factor-related protein with tumor-inhibitory activity cdna cloning by pcr of rat transforming growth factor ~-i identitication and characterization of a new mammalian mitogen-activated 
protein kinase kinase, mkk2 effects of long 5' leader sequences on initiation by eukaryolic ribosomes in vitro. gone expression i leader langth and secondary structure modulate mrna function under conditions of stress the preferential translation of drosophila hsp70 mrna requires sequences in the untranslated leader studies on the mechanism of translational enhancement by the 5'-leader sequence of tobacco mosaic virus rna the function of plant heat shock promoter elements in the regulated expression of chimaeric genes in transgenic tobacco lsranslation by the adenovirus tripartite leader: elements which determine independence from cap-binding protein complex effect of mutations and deletions in a bicistronic mrna on the synthesis of influenza b virus nb and na glycoproteins both the 5" untranslated region and the sequences surrounding the start site contribute to efficient initiation of translation in vitro the iron regulatory region of ferritin mrna is also a positive control element for iron-independent translation identification of the motifs within the tobacco mosaic virus 5'-leader responsible for enhancing translation features in the 5' non-coding sequences of rabbit a-and ~-globin mrnas that affect translational efficiency control of haemoglobin synthesis: a difference in the size of the polysomes making a and [3 chains primary structure of rat ribosomal protein 32 the human ribosomal protein $6 gone" isolation, primary structure and location in chromosome 9 cloning and sequencivg of raouse ribosomal protein s12 cdna the primary structure of r:,t ribosomal protein s18 zinc fi~lger-like motifs in rat ribosomal proteins $27 and $29 nucleotide sequence of cloned cdna specific for rat ribosomal protein l31 structure ofxenopus laevis ribosomal protein l32 and its expression during development sequences mediating the translation of mouse s 16 ribosomal protein mrna during myobiast differentiation and in vitro and possible control points for the in vitro 
translation lodish model and regulation of ribosomal protein synthesis by insulin-deficient chick embryo fibroblasts translational coutrol by influenza virus ) 5'-sequences of rubella virus rna stimulate translation of chimeric rnas and specifically interact with two host-encoded proteins sonenberg n (1993) la autoantigen enhances and corrects aberrant translation of poliovirus rna in reticulecyte lysate bacteriophage and spliceosomal proteins function as position-dependent cis/trans repressors of mrna translation in vitro regulation of translation in eukaryotic systems the yeast saccharomyces cerevisiae system: a powerful tool to study the mechanism of protein synthesis initiation in eukaryotes regulation and function of non.aug-initiated proto-oncogenes research in the author's laboratory is supported by national institutes of health grant gm33915. key: cord-304607-td0776wj authors: paszkiewicz, konrad h.; giezen, mark van der title: omics, bioinformatics, and infectious disease research date: 2010-12-24 journal: genetics and evolution of infectious disease doi: 10.1016/b978-0-12-384890-1.00018-2 sha: doc_id: 304607 cord_uid: td0776wj bioinformatics is basically the study of informatic processes in biotic systems. actually what constitutes bioinformatics is not entirely clear and arguably varies depending on who tries to define it. this chapter discusses the considerable progress in infectious diseases research that has been made in recent years using various “omics” case studies. bioinformatics is tasked with making sense of it, mining it, storing it, disseminating it, and ensuring valid biological conclusions can be drawn from it. this chapter discusses the current state of play of bioinformatics related to genomics and transcriptomics, briefs metagenomics that finds use in infectious disease research as well as the random sequencing of genomes from a variety of organisms. 
this chapter explains the various possibilities of the pan-genome, transcriptional reshaping, and the enormous progress in proteomics. bioinformatic algorithms and tools are crucial in analyzing the data. the chapter also attempts to provide some details on the various problems and solutions in bioinformatics that current-day scientists face while concentrating on second-generation sequencing strategies. although bioinformatics is generally perceived to be a modern science, the term was put forward over thirty years ago by paulien hogeweg and ben hesper for "the study of informatic processes in biotic systems" (hogeweg, 1978; hogeweg and hesper, 1978). the term is necessarily nebulous: bioinformatics spans many disciplines and can have many shades of meaning. indeed it can be argued that it is the collation and analysis of data from different disciplines that has provided some of the greatest insights. within genomics and transcriptomics, bioinformatics is incredibly diverse. evolution, epidemiology, ecology, and the response of an organism to its environment are all fields that require bioinformatics to accurately process and place into context various sources of data. at the heart of genomics and transcriptomics is the generation and analysis of vast quantities of sequence data. dna sequencing took off in the late 1980s when applied biosystems developed the first automated sequencing machine. the subsequent development of more efficient ways to sequence resulted in the phenomenal growth of the number of sequences deposited in genbank (figure 18.1). obviously, with over 100 million sequences deposited in genbank, it is not feasible to do any serious manual work with such a large dataset. the volume of data obtained from modern second-generation sequencers is on the order of 1000 times greater than that from capillary-based sequencers. it is now possible to routinely generate many gigabases of sequence data. 
bioinformatics is tasked with making sense of it, mining it, storing it, disseminating it, and ensuring valid biological conclusions can be drawn from it. many of the recent high-throughput functional genomics technologies rely on a bioinformatics component, though bioinformatics is just one part of the process. for example, identification of proteins by mass spectrometry, quantitative analysis of expression data, phylogenetics, and so on all make use of bioinformatics tools, methods, and databases. bioinformatics plays a key role at several steps in genomics, comparative genomics, and functional genomics: sequence alignment, assembly, identification of single nucleotide polymorphisms (snp), gene prediction, quantitative analysis of transcription data, etc. in this chapter, we will discuss the current state of play of bioinformatics related to genomics and transcriptomics and use relevant examples from the field of infectious diseases. the term "metagenomics" was originally used to describe the sequencing of genomes of uncultured microorganisms in order to explore their abilities to produce natural products (handelsman et al., 1998; rondon et al., 2000) and subsequently resulted in novel insights into the ecology and evolution of microorganisms on a scale not imagined possible before (see cardenas and tiedje, 2008; hugenholtz and tyson, 2008 for an overview). however, metagenomics now finds use in infectious disease research as well, as the random sequencing of genomes from a variety of organisms in, for example, patient material could lead to the identification of the cause of disease. in a quite straightforward metagenomics approach to identify pathogens in sputa from cystic fibrosis patients, standard microbiological culture techniques were compared to molecular methods using 16s rdna pcr (bittar et al., 2008). 
the well-known disadvantage of the microbiological methods is that they normally employ "selective" media that are designed to pick up those bacterial pathogens that are thought to be present. emerging pathogens will be missed using traditional culture techniques. indeed, bittar et al. identified 33 bacterial species using cultivation while 53 bacterial species were detected using molecular methods (based on blast comparisons; altschul et al., 1990). interestingly, 30% of the latter were anaerobes, organisms missed in the routine cultivation methods. many bacteria identified using the molecular methods are traditionally not thought to be associated with cystic fibrosis. whether these novel species are associated with the physiopathology of disease remains to be studied. bittar et al. (2008) also noted that the number of bacteria detected increased with increased numbers of clones sequenced, a well-known phenomenon in environmental sequencing that relates to sample depth (huber et al., 2007; huse et al., 2010). however, with the increased use of next-generation sequencing methods in infectious disease research, the lessons learned from environmental studies relating to diversity and relative abundance of different microbes can be put to effective use. an example of the use of second-generation sequencing in a metagenomics approach to patient material is the study by nakamura et al. (2009) to identify viruses in nasal and fecal material. in this study, rna was isolated from patient material obtained during seasonal influenza infections and norovirus outbreaks. this rna was reverse transcribed into cdna, which was subsequently subjected to large-scale parallel pyrosequencing resulting in 25,000 reads on average per sample. although the influenza samples were mainly (>90%) human in origin, it was nonetheless possible to identify the influenza subtypes in each sample (nakamura et al., 2009). 
as the fecal samples were cleared of human and bacterial cells, yields were much better and the complete norovirus gii.4 subtype genome was sequenced with an average cover depth of up to 258×. in addition to the influenza viruses and noroviruses, two recently discovered human viruses were also identified: wu polyomavirus and human coronavirus hku1 (nakamura et al., 2009). major bacterial species normally found in the respiratory tract were also identified. although nakamura et al. suggest that high-throughput sequencing is more sensitive than standard pcr-based analysis and might result in the detection of additional possible pathogens, they also warn that the increased sensitivity might necessitate follow-up work to decide which of the detected pathogens is the actual cause of the disease. important results are expected from the human microbiome project (http://www.hmpdacc.org/), which will obtain metagenomic information from various human microenvironments such as the gastrointestinal, nasooral, and urogenital cavities as well as the skin. understanding the human microbiome is expected to help answer questions such as whether changes in the human microbiome are related to human health. however, large-scale metagenomics projects that include eukaryotic genomes have thus far been quite costly and laborious due to the generally large genomes of eukaryotes. the lowering of sequencing costs may alleviate part of the problem, but sequence data are still accumulating at a faster rate than developments in computational analysis (hugenholtz and tyson, 2008). the organisms that have attracted the most attention from genome centers are those that cause disease, followed by model organisms such as saccharomyces cerevisiae (goffeau et al., 1996) and caenorhabditis elegans (the c. elegans sequencing consortium, 1998). 
indeed, the first bacterial genomes sequenced were those from pathogens (fraser et al., 1995; tomb et al., 1997), and these were preceded by many bacteriophage genomes such as bacteriophage ms2 (fiers et al., 1976) and ϕx174 (sanger et al., 1977) and viral genomes (fiers et al., 1978). currently, pathogen genomes represent at least one third of all sequenced genomes. obviously, comparative genomics requires at least two genomes, and indeed, when the second bacterial pathogen was sequenced (mycoplasma genitalium by fraser et al., 1995), it was immediately compared with the first one (haemophilus influenzae by fleischmann et al., 1995). interestingly, the h. influenzae genome was completed using a "bioinformatics" approach. unlike previous sequencing projects, the shotgun approach used relied on a computational justification that sufficient random sequencing of small fragments would result in complete coverage of the whole genome. comparing the m. genitalium genome with the haemophilus genome suggested that the percentage of the total genome dedicated to genes is similar, although m. genitalium has far fewer genes (fraser et al., 1995). although the genome of m. genitalium is about three times smaller than that of h. influenzae, its smaller genome has not resulted in an increase in gene density or decrease in gene size. detection of several repeats of components of the mycoplasma adhesin, which elicits a strong immune response in humans, suggests that recombination might underlie its ability to evade the human immune response. that this initial genome study was only the tip of the comparative genomics iceberg was already clear from the last sentence of fleischmann et al. (1995): "knowledge of the complete genomes of pathogenic organisms could lead to new vaccines." a whole-genome effort at identifying vaccine candidates appeared some 5 years later when pizza et al. (2000) employed bioinformatics to extract putative surface-exposed antigens by genome analysis. 
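the computational justification for shotgun sequencing mentioned above for the h. influenzae genome is commonly expressed with a poisson model of random read placement. the chapter does not say which model fleischmann et al. used, so the following is only a minimal sketch under that assumption: with n reads of length l drawn at random from a genome of length g, the mean depth is c = n·l/g and the expected fraction of bases hit by no read is roughly exp(-c). the function names and the genome and read sizes are hypothetical and chosen for illustration only.

```python
import math


def coverage(num_reads: int, read_len: int, genome_len: int) -> float:
    """Mean sequencing depth c = N * L / G."""
    return num_reads * read_len / genome_len


def expected_uncovered_fraction(c: float) -> float:
    """Poisson approximation: probability that a base is hit by zero reads."""
    return math.exp(-c)


def reads_needed(genome_len: int, read_len: int, max_uncovered: float) -> int:
    """Smallest N for which exp(-N * L / G) <= max_uncovered."""
    c_required = -math.log(max_uncovered)
    return math.ceil(c_required * genome_len / read_len)


# Hypothetical numbers (not taken from the chapter): a 1.8 Mb genome
# sequenced with 500 bp reads, aiming to leave at most 1% uncovered.
example_reads = reads_needed(1_800_000, 500, 0.01)
example_depth = coverage(example_reads, 500, 1_800_000)
example_gap_fraction = expected_uncovered_fraction(example_depth)
```

the model ignores end effects, repeats, and the minimum overlap needed to join reads, so it gives a lower bound on the sequencing effort rather than a project plan.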
although effective vaccines against neisseria meningitidis, the causative agent of meningococcal meningitis and sepsis, did exist, these vaccines did not cover all pathogenic serogroups. serogroup b had evaded the development of a good vaccine as its capsular polysaccharide (against which the vaccines of the other serogroups were developed) is identical to a human carbohydrate. in order to identify putative candidates for vaccine development, pizza et al. decided to sequence the whole genome of a serogroup b strain. all potential open reading frames (orfs) were analyzed for putative cellular locations using blastx. those orfs likely to be cytosolic were excluded from further analysis. the remaining orfs were analyzed to determine whether they encoded proteins that contained transmembrane domains, leader peptides, and outer membrane anchoring motifs using a variety of databases such as pfam (finn et al., 2010) and prodom (servant et al., 2002). this resulted in 570 orfs encoding putative exposed antigens. these 570 putative genes were cloned in escherichia coli and pizza et al. successfully expressed 350 orfs. these 350 recombinant proteins were used to generate antisera that were tested in enzyme-linked immunosorbent assay (elisa) and fluorescence-activated cell sorter (facs) analyses to test whether they detected proteins on the outer surface of serogroup b meningococcus strains. in addition, the sera were tested for bactericidal activity. of the 350 proteins, 85 reacted positively in at least one assay but only 7 were positive in all three assays. these 7 were subsequently tested on a large variety of strains to analyze their efficacy. a total of 5 seemed able to provide protection against 31 n. meningitidis strains and, in addition, those 5 proteins are 95-99% similar to the homologous n. gonorrhoeae proteins, suggesting they might provide successful protection against that pathogen as well (pizza et al., 2000). 
arguably the most striking aspect of this study is that, using a novel genomics/bioinformatics approach, the authors identified more vaccine candidates in 18 months than had been identified in the preceding 40 years (seib et al., 2009). this study resulted in a vaccine that is currently in phase iii clinical trials (giuliani et al., 2006). protozoan infections are a major burden on developing nations; they account for 8 of the 13 diseases targeted by the world health organization's special program for research and training in tropical diseases (http://www.who.int/tdr). over the last 5 years or so, more than 10 parasitic genomes have been sequenced in the hope that their sequences would reveal weak spots to target these pathogens. the trypanosomatids cause serious disease in africa and south america. trypanosoma brucei causes sleeping sickness in humans and wasting disease in cattle. trypanosoma cruzi is the causative agent of chagas disease and leishmania major leads to skin lesions. the completion of their genomes (el-sayed et al., 2005a; ivens et al., 2005) and the comparative analysis of all three genomes (el-sayed et al., 2005b) may help focus efforts toward obtaining vaccines, as current drugs have serious toxicity issues. although their genomes encode different numbers of protein-encoding genes (around 8100 in t. brucei; 8300 in l. major; 12,000 in t. cruzi), comparative analysis resulted in the identification of about 6200 genes that constitute the trypanosomatid core proteome. all protein-coding genes were compared in a three-way manner using blastp (el-sayed et al., 2005b) and the mutual best hits were grouped as clusters of orthologous genes or cogs (figure 18.2). trypanosomatid-specific proteins from these 6200 might be used in a broad-scale vaccine. the remainder of the protein-encoding genes from each parasite (26% of the genes in t. brucei; 12% in l. major; 32% in t. cruzi) consists of species-specific genes. 
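the three-way comparison described above groups genes that are each other's best blastp hit across the three proteomes. the following is a minimal sketch of that mutual-best-hit idea, not the pipeline of el-sayed et al.: it assumes the best hit of every gene in every other proteome has already been extracted from the blastp output into a dictionary, and it keeps only clusters with one gene from each of the three species. the `best_hit` structure, the gene identifiers, and the function name are hypothetical.

```python
def three_way_mutual_best_hits(best_hit, species):
    """Return triples (a, b, c), one gene per species, in which every
    pair of genes is a reciprocal best hit.

    best_hit[(src, dst)][gene] is the best-scoring hit of `gene`
    (from species `src`) in the proteome of species `dst`.
    """
    s1, s2, s3 = species
    clusters = []
    for a, b in best_hit[(s1, s2)].items():
        # The s1 -> s2 hit must be reciprocated by s2 -> s1.
        if best_hit[(s2, s1)].get(b) != a:
            continue
        c = best_hit[(s1, s3)].get(a)
        if c is None:
            continue
        # Check the remaining directed edges of the triangle.
        if (best_hit[(s3, s1)].get(c) == a
                and best_hit[(s2, s3)].get(b) == c
                and best_hit[(s3, s2)].get(c) == b):
            clusters.append((a, b, c))
    return clusters


# Toy input with invented identifiers: tb2/lm2 is not reciprocal,
# so only the first gene of each species forms a cluster.
example_best_hit = {
    ("tb", "lm"): {"tb1": "lm1", "tb2": "lm2"},
    ("lm", "tb"): {"lm1": "tb1", "lm2": "tb9"},
    ("tb", "tc"): {"tb1": "tc1", "tb2": "tc2"},
    ("tc", "tb"): {"tc1": "tb1", "tc2": "tb2"},
    ("lm", "tc"): {"lm1": "tc1", "lm2": "tc2"},
    ("tc", "lm"): {"tc1": "lm1", "tc2": "lm2"},
}
example_clusters = three_way_mutual_best_hits(example_best_hit, ("tb", "lm", "tc"))
```

requiring reciprocity in every direction is deliberately strict; it reduces the chance of grouping paralogs at the cost of missing genes that have diverged or duplicated in one lineage.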
interestingly, a large proportion of these genes encode surface antigens and this might relate to the different mechanisms these parasites employ to evade the host immune system. in addition, it was noted that many genes encoding surface antigens are found at or near telomeres and that many retroelements seem to be present in these regions as well. this might be related to the enormous antigenic variation observed in both trypanosoma species. the presence of novel genes in these areas might suggest that their products play an unknown role in antigenic variation as well, which warrants further studies into these uncharacterized genes (el-sayed et al., 2005b). detailed knowledge of well-studied pathogens might be successfully used to understand the biology of closely related emerging pathogens. this was the driving force for the sequencing of six candida species (butler et al., 2009). candida species cause the most common opportunistic fungal infections in the world and c. albicans is the most common of all candida species causing infection. however, c. albicans incidence is declining while other species are emerging. comparison of eight candida species indicated that although genome size was variable, gene content was nearly identical across all species. as the analysis included pathogenic and nonpathogenic species, butler et al. (2009) specifically studied differences between these two groups. of the over 9000 gene families analyzed, 21 were significantly enriched in pathogenic species. many gene families known to be involved in pathogenesis were present in these 21 families (e.g., lipases, oligopeptide transporters, and adhesins). more interestingly, several poorly characterized gene families were also identified, suggesting these might play an unexpected role in pathogenesis as well. this comparative study revealed a wealth of new avenues to explore, which, combined with the large body of work performed on c. 
albicans, will aid understanding of the newly emerging pathogenic candida species (butler et al., 2009). although comparative studies using multiple species can reveal hitherto unknown features, as evidenced by the trypanosomatid and candida studies mentioned above, they can also reveal something unexpected. because the definition of a bacterial species has been debated for a long time, tettelin et al. (2005) set out to address this question by sequencing multiple strains of streptococcus agalactiae, the most common cause of illness or death among newborns. unexpectedly, despite the presence of a "core-genome" shared between all 8 genomes, mathematical modeling suggested that each additional sequenced genome would add 33 new genes to the "dispensable genome." an additional analysis using s. pyogenes also suggested that sequencing additional genomes would continue to add new genes to the pool, resulting in a pan-genome that can be defined as the global gene repertoire of a species. this cannot be extrapolated ad infinitum, however, as a similar analysis of bacillus anthracis indicated that after the fourth genome, no additional genes were identified, in agreement with its known limited genetic diversity (keim and smith, 2002). subsequent analyses have confirmed the presence of pan-genomes for many bacterial species (hiller et al., 2007; lefébure and stanhope, 2007; rasko et al., 2008; schoen et al., 2008; lefébure and stanhope, 2009) and the ultimate gene repertoire of a bacterial species is much larger than generally perceived. whether this would be the case for eukaryotes remains to be shown. despite its apparently ever-expanding possibilities, the pan-genome concept has also resulted in a universal vaccine candidate for group b streptococcus (gbs). because various gbs serotypes exist, current vaccines only offer protection against a limited set of serotypes. 
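the core-genome, dispensable genome, and pan-genome defined above reduce to set operations once each strain is represented as a set of gene-family identifiers. the following is a minimal sketch under that assumption; it does not reproduce the mathematical modeling of tettelin et al., and the strain names, gene identifiers, and function names are invented for illustration.

```python
def pan_genome_partition(strain_genes):
    """Split gene families into core (present in every strain) and
    dispensable (present in some but not all strains).

    strain_genes maps a strain name to an iterable of gene-family ids
    and must contain at least one strain.
    """
    gene_sets = [set(genes) for genes in strain_genes.values()]
    pan = set().union(*gene_sets)
    core = set.intersection(*gene_sets)
    dispensable = pan - core
    return core, dispensable, pan


def new_genes_per_genome(strain_genes, order):
    """Number of previously unseen gene families contributed by each
    strain when genomes are added in the given order."""
    seen = set()
    counts = []
    for strain in order:
        genes = set(strain_genes[strain])
        counts.append(len(genes - seen))
        seen |= genes
    return counts


# Toy data with invented identifiers.
example_strains = {
    "s1": {"g1", "g2", "g3"},
    "s2": {"g1", "g2", "g4"},
    "s3": {"g1", "g5"},
}
example_core, example_dispensable, example_pan = pan_genome_partition(example_strains)
example_new = new_genes_per_genome(example_strains, ["s1", "s2", "s3"])
```

the per-genome counts depend on the order in which strains are added, which is one reason published analyses typically average over many orderings before fitting a model.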
eight genomes from six serotypes were compared, resulting in the identification of a core-genome of 1811 genes and a dispensable genome of 765 genes, which were not present in each strain. both genomes were analyzed for the presence of putative surface-associated and secreted proteins. of the 598 identified genes, about one third were part of the dispensable genome (193 genes). the authors subsequently produced recombinant tagged proteins in e. coli that were used to immunize mice. ultimately, a combination of four antigens turned out to be highly effective against all major gbs serotypes. three of these antigens were part of the dispensable genome. in addition, this bioinformatics approach highlights the importance of not dismissing unidentified orfs on genomes (generally up to 50% of sequenced genomes) as all four antigens had no assigned function. because of their identification using this method, it became obvious they were part of a pilus-like structure that had never been seen before in group b streptococcus (lauer et al., 2005). the presence of protective antigens on these pilus-like structures suggests that the structures might play a role in pathogenicity. genomic information is useful as a scaffold. however, in a given environment pathogens and hosts only express a subset of their genes at any one time. the presence of pan-genomes complicates matters even more. to investigate the response of an organism to an environmental or other stress it is necessary to examine the expression pattern of proteins. at present, this is not possible to accomplish directly on a large scale, but a good approximation can be made by sequencing and counting mrna molecules. the process currently involves converting the rna to cdna, which can introduce biases, but sequencing nonetheless has a great many advantages over traditional microarrays (ledford, 2008). 
these include high specificity with little or no background noise, as well as nucleotide-level resolution of expression. despite their drawbacks, microarrays are still extremely powerful tools to understand levels of gene expression, and this is obvious from the study by toledo-arana et al., who discovered novel regulatory mechanisms in listeria (toledo-arana et al., 2009). l. monocytogenes is normally harmless but can lead to serious food-borne infections. environmental change, from the soil through the stomach to the intestinal lumen and ultimately into the bloodstream, is thought to be responsible for the up- and downregulation of a plethora of genes. comparative genomics with the nonpathogenic l. innocua resulted in the identification of a virulence locus (glaser et al., 2001). using microarrays, transcripts of one strain grown at 37 °c in rich medium were compared with those from three different conditions: stationary phase, hypoxia, and low temperature (30 °c). in addition, knockout mutants in three known regulators of listeria virulence gene expression (prfa, sigb, and hfq) were compared to the control strain as well. rna was also extracted from the intestine of inoculated mice and from blood from healthy human donors that were both infected with three different strains (control and prfa and sigb knockouts). this analysis resulted in the discovery of massive transcriptional reshaping under the control of sigb when listeria enters the intestines. however, in the bloodstream, gene expression is under control of prfa. various noncoding rnas were uncovered that show the same expression patterns as virulence genes, suggesting a potential role in virulence (toledo-arana et al., 2009). because microarray data are based on a comparative difference in hybridization, high-throughput next-generation sequencing is seen as more quantitative, as it is based on the number of hits for each sequenced transcript (van vliet, 2010). 
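sequencing-based expression estimates rest on counting the reads ("hits") assigned to each transcript. the following is a minimal sketch of that counting step; it assumes reads have already been assigned to transcript identifiers by an aligner, and it uses reads per kilobase per million mapped reads (rpkm) as one common normalization, which the chapter does not prescribe. the transcript identifiers, lengths, and function names are hypothetical.

```python
from collections import Counter


def count_hits(read_assignments):
    """Count how many reads were assigned to each transcript id."""
    return Counter(read_assignments)


def rpkm(counts, lengths_bp):
    """Reads per kilobase of transcript per million mapped reads.

    counts maps transcript id -> read count; lengths_bp maps
    transcript id -> transcript length in base pairs.
    """
    total = sum(counts.values())
    if total == 0:
        return {tid: 0.0 for tid in lengths_bp}
    return {
        tid: counts.get(tid, 0) * 1e9 / (lengths_bp[tid] * total)
        for tid in lengths_bp
    }


# Toy data with invented identifiers: four mapped reads, three transcripts.
example_counts = count_hits(["a", "a", "a", "b"])
example_rpkm = rpkm(example_counts, {"a": 1000, "b": 2000, "c": 500})
```

normalizing by length and by library size makes counts comparable between transcripts and between samples, but it does not correct for the cdna conversion biases mentioned above.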
however, when making cdna for next-generation sequencing transcriptomics in prokaryotes, there are several difficulties not found in eukaryotes, such as high levels of rrna and trna molecules as well as a lack of poly-a tails, making extraction difficult. nonetheless, it is possible to overcome these either by reducing the amount of rrna and trna using commercially available kits or by bioinformatic removal of such sequences postsequencing (van vliet, 2010). to date, some 20 rna-seq style experiments have been performed on prokaryotes. to give an example of the sort of novel insights that can be gleaned using such technology, passalacqua et al. (2009) sequenced the bacillus anthracis transcriptome using solid and illumina sequencing and clearly showed the polycistronic nature of many transcripts. although known for individual operons, this had never been shown on a genome-wide scale. they were also able to test the current genome annotations and discovered that 36 loci that had been removed as nongenes showed significant transcriptional activity. in addition, 21 nonannotated regions had clear levels of transcription and should therefore be considered as genes (passalacqua et al., 2009). as internal methionines could have incidentally been identified as start codons, they also checked whether upstream regions were included in the transcribed region. in 11 cases this proved to be the case, suggesting the original start codons were incorrectly annotated. reassuringly, when comparing their data with microarray data, a strong correlation was observed. interestingly, because of the very high resolution of sequence-based transcriptomics studies, it is possible to identify novel regulatory elements. 
for example, when comparing expression levels under o2- and co2-rich conditions, the first gene of an eight-gene operon did not show a marked difference in expression level while all the others were significantly upregulated under co2 (passalacqua et al., 2009). indeed, a bioinformatics approach had suggested the presence of a t-box riboswitch between genes 1 and 2 of this operon (griffiths-jones et al., 2005). a similar approach, used to study how burkholderia cenocepacia, an opportunistic cystic fibrosis pathogen, responds to environmental changes, revealed several new potential virulence factors (yoder-himes et al., 2009). as b. cenocepacia is routinely isolated from soil, two strains (one isolated from a cystic fibrosis patient and one from soil) were analyzed for their response to growth in synthetic human sputum medium and soil medium. although their overall nucleotide identity is 99.8%, 179 and 120 homologous genes showed a significant difference in expression between the two strains when grown in synthetic sputum medium and soil medium, respectively. this suggests that despite the high level of relatedness, differential gene expression plays a large role in adaptation to their ecological niche (yoder-himes et al., 2009). interestingly, similar to passalacqua et al. (2009), several expressed noncoding rnas were uncovered with different expression levels depending on environmental condition. the significance of this needs to be investigated but highlights the ability of second-generation sequencing to unearth novel findings. although, under the pan-genome concept, a species' genome could well be larger than the actual genome content of any one member of that species, an organism's proteome is far more complex still. as discussed earlier, transcriptomics will reveal which subset of the genome is expressed under a given condition. 
however, posttranslational modifications of proteins make the actual proteome far more complex than the transcriptome. this is also the strength of proteomics, as can be seen in a study of the obligate intracellular parasite chlamydia pneumoniae. c. pneumoniae is the third-most-common cause of respiratory infections in the world, which, in part, is made possible by the unique bi-phasic life cycle of this bacterial pathogen. chlamydia spreads via a metabolically inert infectious particle called the elementary body. these elementary bodies enter the host cell where they differentiate into reticulate bodies. as the elementary body is the infectious phase, proteins presented on the outer membrane would be ideal candidates for vaccine development, especially as effective vaccines are lacking and treatment is via antibiotic therapy. a large-scale genomics-proteomics study by montigiani et al. (2002) systematically assessed putative exposed antigens for possible use in vaccine development. of the 1073 c. pneumoniae genes, 636 have assigned functions; 72 of the latter are predicted to be peripherally located and were therefore selected for follow-up studies. in addition, the remaining 437 orfs were subjected to a series of search algorithms aimed at identifying putative surface-exposed antigens. in total, 141 orfs were identified as being possibly located on the cell surface. these 141 were subsequently used to produce recombinant proteins in e. coli. because both his-tagged as well as gst-tagged versions were made, a total of 173 recombinant proteins were produced and used for immunizations of mice. all antisera were used in facs analysis to test if they could bind to the c. pneumoniae cell surface. this resulted in the identification of 53 putative surface-exposed antigens. interestingly, apart from well-known antigens, 14 antigens from unidentified orfs were part of this group of potential vaccine candidates. 
all 53 candidates were tested on western blots to determine whether they generated a clean band of the expected size or cross-reacted with other proteins; 33 of the 53 were specific. finally, montigiani et al. conducted a proteomic analysis of total protein from the elementary body phase, identifying spots using mass spectrometry. protein sequencing using maldi-tof identified 28 putative surface-exposed antigens on the c. pneumoniae 2d gels (montigiani et al., 2002). a follow-up study by thorpe et al. (2007) clearly showed that one of the identified candidates, lcre, induced, amongst others, cd4+ and cd8+ t cell activation and completely cleared infection in a murine model. interestingly, lcre is homologous to a protein that is thought to be part of the type iii secretion system of yersinia. the exposed nature of lcre on the c. pneumoniae cell surface suggests that a type iii secretion system plays a role in chlamydia infection (montigiani et al., 2002). the importance of exposed outer membrane proteins as potential vaccine candidates prompted berlanda scorza et al. to assess the complement of outer membrane proteins from an extraintestinal pathogenic e. coli strain (berlanda scorza et al., 2008). extraintestinal pathogenic e. coli is the leading cause of severe sepsis and current increases in drug resistance warrant the search for novel vaccine targets. in addition, current whole-cell vaccines suffer from undesired cross-reactions to commensal e. coli as well. the novel approach by berlanda scorza et al. is based on the observation that some gram-negative bacteria release outer membrane vesicles (omv) into the culture medium, albeit in minute quantities. a tolr mutant appeared to release many more omvs than wild-type cells, and subsequent large-scale mass spectrometric analysis of their protein content resulted in the identification of 100 proteins. the majority of these were outer membrane and periplasmic proteins.
intriguingly, three subunits from the cytolethal distending toxin (cdt) were included. this toxin is unusual in that one of its subunits is targeted to the eukaryotic host cell, where it breaks double-stranded dna resulting in cell death (de rycke and oswald, 2001). to check whether the presence of cdt in the omv was due to the tolr knockout, wild-type extraintestinal pathogenic e. coli was tested using western blotting. indeed, cdt was detected in wild-type omv as well (berlanda scorza et al., 2008). this suggests that toxin delivery via vesicles might well be the key event in pathogenesis. interestingly, 18 of the 100 identified proteins were not predicted to be targeted to the periplasm or outer membrane by psortb (gardy et al., 2005). we see here excellent opportunities to train protein targeting algorithms with new wet-bench data, as these algorithms have generally been trained on a limited set of model organisms that do not reflect the diversity encountered in real life. despite the enormous progress in genomics of infectious diseases, the discovery of new drugs has not kept pace. for example, no candidate drugs have been identified after 70 high-throughput screens using validated bacterial drug targets (payne et al., 2007). although broad-spectrum drugs might be more desirable, there has been a recent trend in targeting specific proteins from specific pathogens using structural biology. several structural genomics initiatives have been set up to target specific groups of pathogens. for example, the seattle structural genomics center for infectious diseases (http://ssgcid.org) and the center for structural genomics of infectious diseases (http://www.csgid.org) work on category a to c agents listed by the national institute of allergy and infectious diseases (niaid). other centers focus on specific organisms such as mycobacterium tuberculosis. examples are the mycobacterium tuberculosis structural proteomics project (http://xmtb.org) and the mycobacterium tuberculosis structural proteomics consortium (http://www.doe-mbi.ucla.edu/tb).
the field of structural genomics aims to solve as many protein structures as possible from human pathogens in order to come up with new drug targets or vaccines (van voorhis et al., 2009). obviously, correct selection of candidates for structural genomics projects is paramount and various criteria have been put forward (anderson, 2009; van voorhis et al., 2009). a protein that is already a validated drug target is an obvious choice. the proteins need to be essential for the pathogen and, ideally, absent in humans. proteins involved in the uptake of essential nutrients are another target. classically, drug design has focused on substrate binding sites. more recently, small molecules interfering with subunit binding have started to attract attention. as eukaryotic and prokaryotic inorganic pyrophosphatases differ in composition (the former are homodimers, while the latter are homohexamers), efforts are aimed at compounds that interfere with the oligomeric state of the enzyme. in contrast, the highly conserved active site of inorganic pyrophosphatase would not have been a good target (van voorhis et al., 2009). the 2003 sars outbreak, which caught the infectious diseases community (if not the whole world) by surprise, is one example where structural genomics has made enormous progress. although coronaviruses were known to cause serious diseases in animals, the fact that they only caused mild disease in humans meant that there was very little knowledge about coronavirus biology. the subsequent effort to understand viral assembly and replication/transcription, for example, has resulted in 12 solved sars-cov protein structures. interestingly, the novel fold-discovery rate was nearly 50%, while it would normally be closer to 6% (bartlam et al., 2007).
in addition, one key protein, the sars-cov main protease, has since been at the center of structure-based drug discovery. because of the nature of the discipline, structural genomics is dependent on various other disciplines such as biochemistry, microbiology, structural biology, computational biology, and bioinformatics and can only flourish in a truly interdisciplinary environment (anderson, 2009). it is now possible to sequence the entire genome of a bacterial pathogen, assemble the raw sequence reads, perform automated annotation, and visualize the results within 3 weeks. at the same time (indeed even on the same sequencer) it is also possible to selectively sequence the transcriptome (rna-seq), regions of dna bound to protein (chip-seq), small rna molecules or, for relevant species, methylated dna to study epigenetic effects. it is also possible to perform the very same sequencing on the host organism at the same time. bioinformatic algorithms and tools are crucial in analyzing such unprecedented volumes of data. these data volumes have emerged as a result of second-generation sequencers such as the roche/454, illumina, and abi/solid systems. although useful information can be extracted by single researchers by targeted analysis of the sequencer output, to gain the most information out of such data, it is becoming increasingly common for multiple researchers or research groups with widely differing areas of expertise to collaborate. this collaboration is absolutely crucial if relevant insights are to be gained from large-scale datasets. as a result a vast array of data is generated, which must be annotated and curated as well as analyzed for information relevant to any particular experiment. in addition, this information needs to be stored, shared, and distributed in a manner that enables reanalysis if and when new hypotheses are generated.
platforms as produced by the gmod consortium (http://gmod.org), such as gbrowse, and underlying databases are excellent web-based tools for visualizing and comparing datasets. however, they currently offer limited scope for collaborative annotation or curation of datasets where relevant expertise can be brought to bear from a variety of different research groups. this problem is magnified with the advent of second-generation sequencers: much smaller groups of researchers tend to be involved, so they have far less expertise to draw on than large collaborations such as the influenza research database (fludb, http://www.fludb.org/) can muster. thus there is a need for integrated annotation and visualization pipelines to enable individual researchers to perform comparative genomics and transcriptomics. the broad institute offers a number of useful visualization tools to the individual researcher such as argo (http://www.broadinstitute.org/annotation/argo/) and the integrative genomics viewer (igv) (http://www.broadinstitute.org/igv/). argo offers the ability to manually annotate and visualize a genome as well as provide a good graphical overview for comparative genomics and transcriptomics. currently, there is no one standard for bioinformatics pipeline development for next-generation sequencing. several efforts are underway or can be adapted from sanger sequencing pipelines. these include the prokaryote annotation pipeline xbase and the isga server (hemmerich et al., 2010). these enable de novo sequenced prokaryote genomes to be annotated automatically and corrected manually at a later date. alternative sanger adaptations such as maker can also be used once an assembly has been generated. a large array of programs is now available to either align reads to a reference genome or to assemble them de novo (miller et al., 2010; paszkiewicz and studholme, 2010).
they will not be listed in detail here as there are many considerations, including the sequencing platform used, the read length, the expected genome size, the length of the longest repetitive elements, gc content, and whether paired-end reads are in use. the proprietary newbler software from roche is the most popular method of de novo assembly of 454 reads (typically 400-500 bp). popular assemblers for short reads (i.e., mostly from illumina or solid platforms) are velvet (http://www.ebi.ac.uk/~zerbino/velvet) for the assembly of genomic dna and oases, from the same group, for the assembly of reads from transcriptomic cdna (http://www.ebi.ac.uk/~zerbino/oases) (zerbino and birney, 2008). other assemblers such as abyss (simpson et al., 2009), allpaths (butler et al., 2008) or soapdenovo (http://soap.genomics.org.cn/soapdenovo.html) are also popular. abyss enables assembly to be parallelized, thus speeding up assembly. allpaths has been shown to offer superior performance when multiple paired-end libraries are used. independent of read length, it is crucial that paired-end libraries are used when constructing de novo assemblies of any genome. note that the use of short-read sequences only can lead to significant gaps being left in the final assembly due to repetitive elements. however, for many analyses (especially for prokaryotic organisms) these gaps are generally not considered to be significant. in cases where closure of these gaps is desirable, the addition of 454, sanger or long-range pcr data can often help. where significant quantities of long- and short-read data are available, a joint assembly can be attempted. a recommended protocol is to assemble the short and long reads separately using their respective packages and then to merge the two assemblies using programs such as minimus (sommer et al., 2007).
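velvet is described in its reference as using de bruijn graphs, and a toy version of that structure shows why repeats longer than the reads leave gaps in a short-read assembly. the sketch below is a minimal python illustration, not velvet's implementation: the genome string, the read length, the value of `k`, and the function names `de_bruijn_graph` and `branching_nodes` are all assumptions made for the example.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Link each (k-1)-mer to the (k-1)-mers that follow it in some read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def branching_nodes(graph):
    """Nodes with several successors: a contig cannot be extended past them unambiguously."""
    return sorted(node for node, successors in graph.items() if len(successors) > 1)


# Hypothetical toy genome: the 7-base repeat GATTACA occurs twice,
# followed by a different base each time.
genome = "CCGATTACAGTGGGATTACAAGC"
read_length = 6  # deliberately shorter than the repeat
reads = [genome[i:i + read_length] for i in range(len(genome) - read_length + 1)]

graph = de_bruijn_graph(reads, k=4)
print(branching_nodes(graph))  # nodes where the assembly path forks
```

in this toy case the only node with two successors is the last (k-1)-mer of the repeat, so contigs built from these reads alone would stop there; reads or read pairs that span the repeat supply the missing link.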
another option is to use a template sequence from a related organism to help guide the assembly (note that this is distinct from remapping, described below). the amoscmp package is useful for this purpose (pop et al., 2004). finally, whatever assembly method is used, it is important to remember that a longer assembly is not necessarily a better one. examining the reads making up a contig (e.g., using the amos package (http://amos.sourceforge.net) or the tablet viewer (http://bioinf.scri.ac.uk/tablet)) and aligning to a core-conserved group of genes should be standard practice to ensure that blatant errors are corrected. remapping of short reads to a reference genome is also a valid method of comparison. although software such as blat (kent, 2002) can be used with longer 454 reads, it is not an ideal tool for shorter read technologies where data volumes are much greater. where a reference genome is available, software such as maq, its successor bwa, bowtie, soap, and others offer a wealth of tools to identify indels, snps, and other variants which may be of interest. crucially, in these cases it is important to have sufficient depth of coverage to ensure snp calls are valid. paired-end data is also valuable for highlighting the presence of indels. after remapping it is also common practice to assemble unmapped reads using de novo assembly software to reveal any novel sequence variants, which may be absent in the reference. in the case where pathogens and hosts are sequenced together, if the sequence of at least one is known, then it is relatively straightforward to separate the two using bioinformatic techniques. to deal with transcriptomic data where a reference sequence is available, software such as erange (http://woldlab.caltech.edu/rnaseq/), tophat (trapnell et al., 2009), and cufflinks (http://cufflinks.cbcb.umd.edu/) is extremely useful.
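the requirement for sufficient depth of coverage can be made concrete with a little arithmetic. the sketch below (python, with hypothetical function names and made-up numbers) computes the average depth expected from a run as reads × read length / genome size, and the per-position depth from alignment intervals, so that positions too shallow for a snp call can be identified. the `min_depth` threshold is an illustrative parameter, not a recommendation from the text.

```python
import math


def expected_coverage(num_reads, read_length, genome_size):
    """Average depth if reads were spread evenly: reads x length / genome size."""
    return num_reads * read_length / genome_size


def reads_needed(target_coverage, read_length, genome_size):
    """Smallest whole number of reads that reaches the target average depth."""
    return math.ceil(target_coverage * genome_size / read_length)


def depth_per_position(alignments, ref_length):
    """Depth at each reference position from half-open (start, end) alignments."""
    diff = [0] * (ref_length + 1)
    for start, end in alignments:
        diff[start] += 1   # a read begins covering here
        diff[end] -= 1     # and stops covering here
    depth, running = [], 0
    for delta in diff[:ref_length]:
        running += delta
        depth.append(running)
    return depth


def callable_positions(depth, min_depth):
    """Positions covered by at least min_depth reads."""
    return [pos for pos, d in enumerate(depth) if d >= min_depth]


# Made-up example: 2 million 100-base reads on a 5 Mb genome.
print(expected_coverage(2_000_000, 100, 5_000_000))
toy_depth = depth_per_position([(0, 5), (2, 8), (4, 10)], ref_length=10)
print(toy_depth, callable_positions(toy_depth, min_depth=2))
```

the average hides local variation, which is why the per-position depth is worth checking before trusting an individual variant call.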
the cufflinks module in particular offers the ability to predict the most likely exon isoform expression pattern using a combination of bayesian statistics and graph-based algorithms. we are aware that our treatment of the use of "omics" and bioinformatics in infectious disease research is not exhaustive. as mentioned in the introduction, what constitutes bioinformatics is not entirely clear and arguably varies depending on who tries to define it. however, we have attempted to show the considerable progress in infectious diseases research that has been made in recent years using various "omics" case studies. in addition, the last section is an attempt to provide a brief overview of the problems and (bioinformatics) solutions faced by current-day scientists who embark on second-generation sequencing strategies. this is a fast-moving field, but the provided references and websites should be a good starting point for those who wish to make further strides toward eradicating infectious diseases from our planet. basic local alignment search tool structural genomics and drug discovery for infectious diseases structural proteomics of the sars coronavirus: a model response to emerging infectious diseases proteomics characterization of outer membrane vesicles from the extraintestinal pathogenic escherichia coli δtolr ihe3034 mutant the genome of the african trypanosome trypanosoma brucei molecular detection of multiple emerging pathogens in sputa from cystic fibrosis patients allpaths: de novo assembly of whole-genome shotgun microreads evolution of pathogenicity and sexual reproduction in eight candida genomes new tools for discovering and characterizing microbial diversity cytolethal distending toxin (cdt): a bacterial weapon to control host cell proliferation? 
the genome sequence of trypanosoma cruzi, etiologic agent of chagas disease comparative genomics of trypanosomatid parasitic protozoa complete nucleotide sequence of bacteriophage ms2 rna: primary and secondary structure of the replicase gene complete nucleotide sequence of sv40 dna the pfam protein families database whole-genome random sequencing and assembly of haemophilus influenzae rd the minimal gene complement of mycoplasma genitalium psortb v.2.0: expanded prediction of bacterial protein subcellular localization and insights gained from comparative proteome analysis a universal vaccine for serogroup b meningococcus comparative genomics of listeria species life with 6000 genes rfam: annotating non-coding rnas in complete genomes molecular biological access to the chemistry of unknown soil microbes: a new frontier for natural products an ergatisbased prokaryotic genome annotation web server comparative genomic analyses of seventeen streptococcus pneumoniae strains: insights into the pneumococcal supragenome interactive instruction on population interactions microbial population structures in the deep marine biosphere microbiology: metagenomics ironing out the wrinkles in the rare biosphere through improved otu clustering the genome of the kinetoplastid parasite, leishmania major bacillus anthracis evolution and epidemiology blat-the blast-like alignment tool genome analysis reveals pili in group b streptococcus the death of microarrays? 
evolution of the core and pan-genome of streptococcus: positive selection, recombination, and genome composition pervasive, genome-wide positive selection leading to functional divergence in the bacterial genus campylobacter identification of a universal group b streptococcus vaccine by multiple genome screen the microbial pangenome assembly algorithms for next-generation sequencing data genomic approach for analysis of surface proteins in chlamydia pneumoniae direct metagenomic detection of viral pathogens in nasal and fecal specimens using an unbiased high-throughput sequencing approach structure and complexity of a bacterial transcriptome de novo assembly of short sequence reads identification of vaccine candidates against serogroup b meningococcus by whole-genome sequencing the pangenome structure of escherichia coli: comparative genomic analysis of e. coli commensal and pathogenic isolates cloning the soil metagenome: a strategy for accessing the genetic and functional diversity of uncultured microorganisms nucleotide sequence of bacteriophage phix174 dna whole-genome comparison of disease and carriage strains provides insights into virulence evolution in neisseria meningitidis the key role of genomics in modern vaccine and drug design for emerging infectious diseases prodom: automated clustering of homologous domains abyss: a parallel assembler for short read sequence data minimus: a fast, lightweight genome assembler genome analysis of multiple pathogenic isolates of streptococcus agalactiae: implications for the microbial "pan-genome genome sequence of the nematode c. 
elegans: a platform for investigating biology discovery of a vaccine antigen that protects mice from chlamydia pneumoniae infection the listeria transcriptional landscape from saprophytism to virulence the complete genome sequence of the gastric pathogen helicobacter pylori tophat: discovering splice junctions with rna-seq next generation sequencing of microbial transcriptomes: challenges and opportunities the role of medical structural genomics in discovering new drugs for infectious diseases mapping the burkholderia cenocepacia niche response via high-throughput sequencing velvet: algorithms for de novo short read assembly using de bruijn graphs we would like to acknowledge our colleague dr. david j. studholme for his suggestions and feedback. key: cord-341564-fvuwick5 authors: qi, zhao-hui; li, ke-cheng; ma, jin-long; yao, yu-hua; liu, ling-yun title: novel method of 3-dimensional graphical representation for proteins and its application date: 2018-06-12 journal: evol bioinform online doi: 10.1177/1176934318777755 sha: doc_id: 341564 cord_uid: fvuwick5 in this article, we propose a 3-dimensional graphical representation of protein sequences based on 10 physicochemical properties of 20 amino acids and the blosum62 matrix. it contains evolutionary information and provides intuitive visualization. to further analyze the similarity of proteins, we extract a specific vector from the graphical representation curve. the vector is used to calculate the similarity distance between 2 protein sequences. to prove the effectiveness of our approach, we apply it to 3 real data sets. the results are consistent with the known evolution fact and show that our method is effective in phylogenetic analysis. with the number of available biological sequences developing rapidly, how to mine essential information from a huge amount of biological sequences effectively and reliably has become a critical problem. 
as a result, many methods in information extraction are proposed by researchers. among them, the graphical representation of dna sequences is an effective method for visualization and similarity analysis. graphical representation is a kind of alignment-free method. it provides intuitive information about the data by visualizing biological sequences. what is more, it is generally applicable because its mathematical description of data facilitates numerical analysis without difficult calculations. therefore, numerous works based on graphical representation have been presented by researchers. [1-8] for example, randić et al 1 proposed a graphical representation of rna secondary structure based on twelve symbols. bielińska-waż et al 5 proposed a 2d-dynamic representation of dna sequences in 2007. after that they proposed more dynamic representations of dna sequences for generalization. 6, 7 however, the graphical representation of protein sequences is much more difficult because there are 20 amino acids instead of 4 nucleotides. various approaches have been proposed by researchers only recently. [9-16] among them, many approaches are based primarily on the physicochemical properties of amino acids. randić 9 proposed an early 2-dimensional graphical representation of proteins based on a pair of physicochemical properties in 2007. after that, yu et al 11 proposed a protein mapping method of protein sequences based on 10 physicochemical properties. wang et al 10 presented a graphical representation of protein sequences based on 9 physicochemical properties. in the works by he et al 15 and hu, 16 the physicochemical properties are also indispensable in information extraction from proteins because they have effects on the rate and pattern of protein evolution.
from these, we can see that physicochemical properties are widely applied in graphical representations of protein sequences, and the reported results are good. in this article, we propose a 3-dimensional (3d) graphical representation of protein sequences based on 10 physicochemical properties [17-21] of amino acids and the blosum62 matrix. 22 the representation can provide good visualization without degeneracy or circuit. then, we extract a specific vector from the graphical curve of a protein sequence. in addition, we propose 2 applications based on the vector to analyze the similarity and evolutionary relationship of 3 data sets, respectively. the results are consistent with the evolution fact and works by other researchers. this shows our approach can be applied to hundreds of sequences with different lengths and perform well. as we know, a protein sequence is usually composed of 20 kinds of amino acids. every amino acid has its own particular physicochemical properties. therefore, to mine essential information from a protein sequence, we propose an effective graphical method combining physicochemical properties of amino acids and the blosum62 matrix. the blosum62 matrix scores how readily one amino acid is replaced by other amino acids. in its scoring scheme, a positive score represents a higher similarity between 2 amino acids and a negative score represents a lower similarity. here, we consider 10 primary physicochemical properties of amino acids, namely the pk1 (-cooh), 17 the pk2 (-nh3), 21 the polar requirement, 21 the isoelectric point, 18 the hydrogenation, 20 the hydroxythiolation, 20 the molecular volume, 19 the aromaticity, 20 the aliphaticity, 20 and the polarity values. 19 the 10 physicochemical properties of 20 amino acids are shown in table 1. for each physicochemical property, we will use k-means clustering method 23 to classify the 20 amino acids into several groups.
k-means clustering is an efficient unsupervised clustering method which is widely used in a diverse range of fields such as data mining, bioinformatics, and natural language processing. 24 however, k-means has some weaknesses; in particular, the number of clusters must be given beforehand. silhouette 25 is a cluster validity index that can be used to determine the number of clusters. it considers 2 factors: cohesion and separation. its value ranges from −1 to 1 and a higher value represents a better clustering. according to this index, we can obtain a valid number of clusters of the given data set. in this way, we can obtain 10 kinds of clustering classification based on the 10 different properties, which are shown in table 2. according to the property pk1 (-cooh), we can divide the 20 amino acids into 7 groups: g1 (a, g, i, l, m, w, v), g2 (h, f), g3 (q, e, k, s, y), g4 (t), g5 (n, p), g6 (c), and g7 (d). if 2 or more amino acids are divided into the same group, it denotes that they are similar to each other by the property pk1 (-cooh). taking all the properties into consideration, we can obtain the number of similar properties between each pair of amino acids. if x denotes an amino acid and y denotes another amino acid, then we define the similarity degree of the 2 amino acids, $s_{xy}$, in terms of $n_{xy}$, the number of similar properties between amino acids x and y, and $b_{xy}$, the value for amino acids x and y in the blosum62 matrix. then, we calculate all the values of $s_{xy}$ and the result is shown in table 3. from table 3, we can find that the similarity degree $s_{xy}$ of different amino acids can be numerically different from each other. to describe the similarity degree graphically, we will use a unitary linear regression to extract characters from every amino acid. here, we take $i = \{1, 2, \ldots, 19\}$ as independent variables of linear regression.
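the choice of cluster number by silhouette can be sketched as follows. this is a minimal pure-python illustration on one-dimensional data, assuming a simple deterministic initialisation and a made-up list of property values; it is not the authors' code, and the function names are hypothetical.

```python
def kmeans_1d(values, k, iterations=100):
    """Lloyd's algorithm on scalars, started from evenly spaced sorted values."""
    ordered = sorted(values)
    centroids = [ordered[(2 * j + 1) * len(ordered) // (2 * k)] for j in range(k)]
    labels = [0] * len(values)
    for _ in range(iterations):
        # assign every value to its nearest centroid
        labels = [min(range(k), key=lambda j: abs(v - centroids[j])) for v in values]
        updated = []
        for j in range(k):
            members = [v for v, lab in zip(values, labels) if lab == j]
            updated.append(sum(members) / len(members) if members else centroids[j])
        if updated == centroids:  # converged
            break
        centroids = updated
    return labels


def silhouette(values, labels):
    """Mean silhouette: (b - a) / max(a, b), with 0 for singleton clusters."""
    clusters = {}
    for v, lab in zip(values, labels):
        clusters.setdefault(lab, []).append(v)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two non-empty clusters")
    scores = []
    for v, lab in zip(values, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)
            continue
        a = sum(abs(v - w) for w in own) / (len(own) - 1)        # cohesion
        b = min(sum(abs(v - w) for w in other) / len(other)      # separation
                for key, other in clusters.items() if key != lab)
        denom = max(a, b)
        scores.append(0.0 if denom == 0 else (b - a) / denom)
    return sum(scores) / len(scores)


def best_cluster_count(values, k_range):
    """Return the k with the highest mean silhouette, plus all scores."""
    scored = {k: silhouette(values, kmeans_1d(values, k)) for k in k_range}
    return max(scored, key=scored.get), scored


# Made-up property values forming three visible groups.
toy_values = [1.0, 1.1, 1.2, 5.0, 5.1, 5.2, 9.0, 9.1, 9.2]
print(best_cluster_count(toy_values, range(2, 6))[0])
```

running the same procedure once per physicochemical property would give one grouping per property, which is the role table 2 plays in the text.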
for amino acid x, we take the 19 values in its corresponding row in table 3 as dependent variables. using unitary linear regression, we can obtain the corresponding slope and intercept of amino acid x. the slope and intercept can describe amino acid x effectively. all the slopes and intercepts of 20 amino acids are given in table 4. we assume that $P = p_1 p_2 \ldots p_n$ is an arbitrary protein sequence composed of n amino acids. if $x_i$, $y_i$, and $z_i$ represent the 3d coordinates of $p_i$ in the protein sequence, then the 3d representation of a protein sequence is constructed from $a_i$ and $b_i$, the slope and intercept of $p_i$, with the initial condition $x_0 = y_0 = 0$. next, we can convert the n points into a graphical curve. to demonstrate the effectiveness of the 3d graphical method, we take 2 protein sequences as an example. both the sequences are taken from yeast saccharomyces cerevisiae. 26 the graphical representations of 2 protein sequences are shown in figure 1. based on the constructed graphical curve, we can get a specific vector from a protein sequence. using this vector we can analyze the similarity between 2 protein sequences effectively. characteristic vector is a common method to calculate the pairwise distance between 2 protein sequences. a good characteristic vector should avoid the problem about different lengths of sequences and complicated calculation. here, we define a 2-tuple of amino acid x for characterization. given a protein sequence $P = p_1 p_2 \ldots p_n$ composed of n amino acids, we can compute the 2-tuple in equation (3) from $a_x$ and $b_x$, which are, respectively, the slope and intercept of amino acid x in table 4, and $n_x$, which is the number of occurrences of amino acid x in the sequence. equation (3) involves the ratio $n_x/n$ together with the values in table 4. taking a short segment of 10 amino acids, aarrarrnnn, as an example, the numbers of amino acid a, r, and n in the segment are 3, 4, and 3.
therefore, we can obtain the 40-dimensional characterizing vector (0.015, −0.738, −0.02, −0.16, −0.042, −0.369, 0, 0, . . ., 0, 0) according to table 4 and equation (3). [table 4: slope, intercept, and linear equation of each amino acid similarity degree sequence.] the similarity/dissimilarity between 2 protein sequences can be represented by a similarity distance. several measures of similarity distance are available, such as euclidean distance and city block (manhattan) distance. here, we use euclidean distance as a measure to represent the similarity/dissimilarity between 2 sequences. we will compute the similarity distance using the 40-dimensional characteristic vector. if the two 40-dimensional characteristic vectors are denoted as $u = (u_1, u_2, \ldots, u_{40})$ and $v = (v_1, v_2, \ldots, v_{40})$, their euclidean distance is calculated as follows: $$d = \sqrt{\sum_{i=1}^{40} (u_i - v_i)^2}$$ the smaller the distance d is, the more similar 2 protein sequences are. to show the effectiveness of the proposed similarity analysis method, we apply it to 9 nd5 protein sequences (provided as supplementary file 1): human, common chimpanzee, pygmy chimpanzee, gorilla, fin whale, blue whale, rat, mouse, and opossum (their accession numbers in ncbi [national center for biotechnology information] are ap_000649, np_008196, np_008209, np_008222, np_006899, np_007066, ap_004902, np_904338, and np_007105, respectively). according to the method given in section "similarity analysis," we can obtain the similarity distance matrix of these protein sequences. the corresponding result is shown in table 5. on the basis of table 5, we can find that the distance between fin whale and blue whale is the smallest. this indicates that they have a high degree of similarity. the distances among human, common chimpanzee, pygmy chimpanzee, and gorilla are relatively small, which means that they are similar to each other.
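the worked example for the segment aarrarrnnn can be reproduced in a few lines of python. the sketch assumes that each component of the 2-tuple is the table 4 slope or intercept weighted by the relative frequency n_x/n, which is consistent with the ratio appearing in equation (3) but is not confirmed by the text as preserved here. the amino acid ordering and the slope/intercept values in `TABLE4_GUESS` are likewise assumptions, chosen only so that the example segment yields the vector quoted above; they are not the published table 4.

```python
import math

# Assumed ordering of the 20 amino acids (one slope/intercept pair each).
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"

# Hypothetical (slope, intercept) values back-calculated from the worked
# example; residues missing from this dict are treated as (0, 0).
TABLE4_GUESS = {"A": (0.05, -2.46), "R": (-0.05, -0.40), "N": (-0.14, -1.23)}


def characteristic_vector(sequence, table4):
    """40-dimensional vector: frequency-weighted slope and intercept per residue."""
    sequence = sequence.upper()
    n = len(sequence)
    vector = []
    for aa in AMINO_ACIDS:
        weight = sequence.count(aa) / n          # n_x / n
        slope, intercept = table4.get(aa, (0.0, 0.0))
        vector.extend([weight * slope, weight * intercept])
    return vector


def euclidean_distance(u, v):
    """Square root of the summed squared component differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


vec = characteristic_vector("aarrarrnnn", TABLE4_GUESS)
print([round(x, 3) for x in vec[:6]])
```

because every component is a frequency-weighted constant, sequences of different lengths map to vectors of the same dimension, which is what allows the distance to be computed without alignment.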
besides, opossum is quite dissimilar to other species because the similarity distances between opossum and other species are large. all these results are consistent with the evolution theory and the recent studies. [14-16] that is to say the proposed method can analyze the similarities of proteins effectively. to further demonstrate the effectiveness of our method, we apply it to another data set which is widely used in many works. 10, 27 this data set consists of 29 spike protein sequences of coronavirus (provided as supplementary file 2). the basic information of the protein sequences is shown in table 6. we construct the phylogenetic tree for the 29 spike protein sequences based on our method using the upgma method in figure 2. from figure 2, we can see that all the sequences are mainly classified into 4 groups by our method. this is consistent with the works 10, 27 and the known biology fact that coronaviruses are classified into 4 groups: the group i (contains pedv, tgev), the group ii (contains bcov, mhv, rtcov), the group iii (contains ibv, tcov), and the sars-covs (severe acute respiratory syndrome coronavirus). in this section, we give an application for the similarity analysis of ha gene sequences of influenza a (h1n1) from march 1, 2009 to april 30, 2009 (available online at https://www.ncbi.nlm.nih.gov). we obtain a data set that consists of 560 gene sequences with full length (provided as supplementary file 3). to further demonstrate the validity of our method, we apply the method to this data set. according to our method, for each virus isolate, we can get a corresponding 20-dimensional vector. thus, we can obtain a vector set of 560 vectors. by computing the similarity distance between pairs of these vectors, we can obtain a similarity distance matrix. next, we construct the phylogenetic tree based on our method in figure 3.
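upgma, the average-linkage method named for the tree in figure 2, repeatedly merges the two closest clusters and averages their distances to the rest, weighted by cluster size. the python sketch below is a minimal illustration under stated assumptions: the distance values are invented for the example (they are not the entries of table 5), branch lengths are omitted, and the function name `upgma` is hypothetical.

```python
def upgma(names, matrix):
    """Build a nested-tuple tree from a symmetric distance matrix by UPGMA."""
    # active clusters: id -> (subtree, number of leaves)
    clusters = {i: (name, 1) for i, name in enumerate(names)}
    dist = {(i, j): matrix[i][j]
            for i in range(len(names)) for j in range(i + 1, len(names))}
    next_id = len(names)
    while len(clusters) > 1:
        i, j = min(dist, key=dist.get)               # closest pair, i < j
        tree_i, size_i = clusters.pop(i)
        tree_j, size_j = clusters.pop(j)
        new_dist = {}
        for k in clusters:
            d_ik = dist[(min(i, k), max(i, k))]
            d_jk = dist[(min(j, k), max(j, k))]
            # size-weighted average = mean over all leaf pairs
            new_dist[(k, next_id)] = (size_i * d_ik + size_j * d_jk) / (size_i + size_j)
        dist = {pair: d for pair, d in dist.items() if i not in pair and j not in pair}
        dist.update(new_dist)
        clusters[next_id] = ((tree_i, tree_j), size_i + size_j)
        next_id += 1
    tree, _ = next(iter(clusters.values()))
    return tree


# Invented distances, for illustration only.
toy_names = ["fin whale", "blue whale", "human", "opossum"]
toy_matrix = [
    [0.0, 0.1, 0.5, 0.9],
    [0.1, 0.0, 0.5, 0.9],
    [0.5, 0.5, 0.0, 0.8],
    [0.9, 0.9, 0.8, 0.0],
]
print(upgma(toy_names, toy_matrix))
```

with these toy distances the two whales join first and opossum joins last, mirroring the qualitative pattern the text reports for the nd5 data; a production analysis would normally rely on an established phylogenetics package rather than a hand-written routine.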
to analyze the results better, we mark 2 typical strains: a/california/07/2009 (h1n1) and a/indiana/08/2009 (h1n1). from figure 3, it is easy to see that all virus isolates are mainly classified into 2 groups. this illustrates that there are 2 different kinds of influenza a (h1n1) virus isolates from march 1, 2009 to april 30, 2009. novel spectral representation of rna secondary structure without loss of information milestones in graphical bioinformatics four-component spectral representation of dna sequences graphical and numerical representations of dna sequences: statistical aspects of similarity 2d-dynamic representation of dna sequences spectral-dynamic representation of dna sequences 3d-dynamic representation of dna sequences a group of 3d graphical representation of dna sequences based on dual nucleotides withdrawn: 2-d graphical representation of proteins based on physico-chemical properties of amino acids adld: a novel graphical representation of protein sequences and its application protein map: an alignment-free sequence comparison method based on various properties of amino acids an efficient numerical method for protein sequences similarity analysis based on a new two-dimensional graphical representation graphical representation of proteins as four-color maps and their numerical characterization a protein mapping method based on physicochemical properties and dimension reduction the graphical representation of protein sequences based on the physicochemical properties and its applications f-curve, a graphical representation of protein sequences for similarity analysis based on physicochemical properties of amino acids analysis of similarity/dissimilarity of protein sequences the genetic code and error transmission amino acid difference formula to help explain protein evolution relations between chemical structure and biological activity on the fundamental nature and evolution of the genetic code amino acid substitution matrices from protein 
blocks automated programming: the next wave of developer power tools recovering the number of clusters in data sets with noise features using feature rescaling factors silhouettes: a graphical aid to the interpretation and validation of cluster analysis novel 2-d graphical representation of proteins structure, function, and evolution of coronavirus spike proteins evolution trends of the 2009 pandemic influenza a (h1n1) viruses in different continents from mega6: molecular evolutionary genetics analysis version 6.0 the phylogenetic tree of the 560 influenza a (h1n1) isolates from march to this paper has not been submitted elsewhere for consideration of publication. z-hq conceived and designed the work that led to the submission. k-cl contributed significantly to analysis and manuscript preparation. j-lm and y-hy helped to perform the analysis with constructive discussions. all the authors reviewed and approved the final manuscript. this result is consistent with the works by qi et al. 14, 28 furthermore, the result is also consistent with the biology fact that a new influenza virus, a/california/07/2009 (h1n1)-like virus, appeared and showed a strong ability to infect human beings in april 2009. 23 the branch length in figure 3 is the similarity distance between 2 virus isolates.clustalw is one of the most widely used multiple sequence alignment method for nucleic acid and protein sequence in molecular biology. we construct the phylogenetic tree of the 560 gene sequences using clustalw method 29 in this article, a new 3d graphical representation of protein sequences is introduced based on 10 physicochemical properties and blosum62 matrix. on the basis of the graphical representation curve, we extract a specific vector and use the vector to calculate the similarity distance between 2 protein sequences. to prove the effectiveness of our method, we apply our method to 3 real data sets. 
the results show the validity of our method in phylogenetic analysis compared with related works and evolution facts. key: cord-031957-df4luh5v authors: dos santos-silva, carlos andré; zupin, luisa; oliveira-lima, marx; vilela, lívia maria batista; bezerra-neto, joão pacifico; ferreira-neto, josé ribamar; ferreira, josé diogo cavalcanti; de oliveira-silva, roberta lane; pires, carolline de jesús; aburjaile, flavia figueira; de oliveira, marianne firmino; kido, ederson akio; crovella, sergio; benko-iseppon, ana maria title: plant antimicrobial peptides: state of the art, in silico prediction and perspectives in the omics era date: 2020-09-02 journal: bioinform biol insights doi: 10.1177/1177932220952739 sha: doc_id: 31957 cord_uid: df4luh5v even before the perception or interaction with pathogens, plants rely on constitutively guardian molecules, often specific to tissue or stage, with further expression after contact with the pathogen. these guardians include small molecules as antimicrobial peptides (amps), generally cysteine-rich, functioning to prevent pathogen establishment. some of these amps are shared among eukaryotes (eg, defensins and cyclotides), others are plant specific (eg, snakins), while some are specific to certain plant families (such as heveins). when compared with other organisms, plants tend to present a higher amount of amp isoforms due to gene duplications or polyploidy, an occurrence possibly also associated with the sessile habit of plants, which prevents them from evading biotic and environmental stresses. therefore, plants arise as a rich resource for new amps. as these molecules are difficult to retrieve from databases using simple sequence alignments, a description of their characteristics and in silico (bioinformatics) approaches used to retrieve them is provided, considering resources and databases available. 
the possibilities and applications based on tools versus database approaches are considerable and have been so far underestimated. proteins and peptides play different roles depending on their amino acid (aa) constitution, which may vary from tens to thousands of residues. 1 peptides are conventionally understood as having fewer than 50 aa. 2 proteins, on the contrary, would be any molecule presenting a higher amino acid content, and both proteins and peptides present a plethora of variations in plants. despite that, plant proteomes have been much more studied than peptidomes. it is well-known that the biochemical machinery necessary for the synthesis and metabolism of peptides is present in every living organism. from the variations of this machinery, a wide structural and functional diversity of peptides was generated, justifying the growing interest in their study. in eukaryotes, peptides are prevalent in intercellular communication, performing as hormones, growth factors, and neuropeptides, but they are also present in the defense system. 3 besides plants and animals, in several pathogenic microorganisms peptides can serve as classical virulence factors, which disrupt the epithelial barrier, damage cells, and activate or modulate host immune responses. an example of this performance is represented by candidalysin, 4 a fungal cytolytic peptide toxin found in the pathogenic fungus candida albicans that damages epithelial membranes, triggers a response signaling pathway, and activates epithelial immunity. there are also reports of defense-related fungal peptides. for example, copsin, a peptide-based fungal antibiotic recently identified in the fungus coprinopsis cinerea, 5 kills bacteria by inhibiting their cell wall synthesis. regarding bacterial peptides, certain species from the gastrointestinal microbial community can release low-molecular-weight peptides, able to trigger immune responses.
6 there are additionally peptides that act like bacterial "hormones" that allow bacterial communities to organize multicellular behavior such as biofilm formation. 7 some peptides are known for their medical importance, such as defensins, which present antibacterial, antiviral, and antifungal activities. for example, human alpha- and beta-defensins present in the saliva may potentially impede virus replication, including sars-cov-2, 8 besides other roles such as protection against intestinal inflammation (colitis). 9 considering the roles of plant peptides, they can also be multifunctional, and have been classified into 2 main categories 10 (supplementary figure s1): (1) peptides with no bioactivity, primarily resulting from the degradation of proteins by proteolytic enzymes, aiming at their recycling, and (2) bioactive peptides, which are encrypted in the structure of the parent proteins and are released mainly by enzymatic processes. the first group is innocuous regarding signaling, regulatory functions, and bioactivity. so far, it has been reported that some of them may play a significant role in nitrogen mobilization across cellular membranes. 11 the second group of bioactive peptides has a substantial impact on plant cell physiology. some peptides of this group can act in plant growth regulation (through cell-to-cell signaling), endurance against pathogens and pests by acting as toxins or elicitors, or even detoxification of heavy metals by ion sequestration. comprising bioactive peptides, additional subcategorization has been proposed regarding their function.
tavormina et al 12 classified them (supplementary figure s1) based on the type of precursor:
• derived from functional precursors: originated from a functional precursor protein;
• derived from nonfunctional precursors: originated from a longer precursor that has no known biological function (as a preprotein, proprotein, or preproprotein);
• not derived from a precursor protein: some sorfs (small open reading frames; usually <100 codons) are considered to represent a potential new source of functional peptides (known as "short peptides encoded by sorfs").
a more intuitive classification of bioactive peptides was further proposed by farrokhi et al 10 [...] receptors in leaves. 13 another example is the pls (polaris) peptide that acts during early embryogenesis but later activates auxin synthesis, also affecting cytokinin synthesis and ethylene response. 14 regarding the second group, it includes peptides with signaling roles in plant defense, comprising at least 4 subgroups, including syst (systemin) (supplementary figure s1). the syst peptides were identified in solanaceae members, like tomato and potato 15 (acting on the signaling response to herbivory). the syst leads to the production of a plant protease inhibitor that suppresses insect proteases. 16 stratmann 17 suggested that in plants, systs act to stimulate the jasmonic acid signaling cascade within vascular tissues to induce a systemic wound response.
• defense peptides or antimicrobial peptides (amps): to be fitted into this class, a plant peptide must fulfill some specific biochemical and genetic prerequisites. regarding biochemical features, in vitro antimicrobial activity is required. concerning the genetic condition, the gene encoding the peptide should be induced in the presence of infectious agents. 18 in practice, this last requirement is not always fulfilled as some amps are tissue-specific and are considered as part of the plant innate immunity, while other isoforms of the same class appear induced after pathogen inoculation.
19 plant amps are the central focus of the present review, comprising information on their structural features (at genomic, gene, and protein levels), resources, and bioinformatic tools available, besides the proposition of an annotation routine. their biotechnological potential is also highlighted in the generation of both transgenic plants resistant to pathogens, and new drugs or bioactive compounds. antimicrobial peptides are ubiquitous host defense weapons against microbial pathogens. the overall plant amp characterization regards the following variables (figure 1): electrical charge, hydrophilicity, secondary and 3-dimensional (3d) structures, and the abundance or spatial pattern of cysteine residues. 20 these features are primarily related to their defensive role(s) as membrane-active antifungal, antibacterial, or antiviral peptides. regarding the nucleotide sequence, plant amps are hypervariable and this genetic variability is considered crucial to provide diversity and the ability to recognize different targets. for their charges, amps can be classified as cationic or anionic (figure 1). most plant amps have positive charges, which is a fundamental feature for the interaction with the membrane lipids of pathogens. 21 concerning hydrophilicity, amps are generally amphipathic, that is, they exhibit molecular conformation with both hydrophilic and hydrophobic domains. 22 with respect to their 3d structure, amps can be either linear or cyclic (figure 1). some linear amps adopt an amphipathic α-helical conformation, whereas non-α-helical linear peptides generally show 1 or 2 predominant amino acids. 23 in turn, cyclic amps, including cysteine-containing peptides, can be divided into 2 subgroups based on the presence of a single or multiple disulfide bonds. a peculiar feature of these peptides is a cationic and amphipathic character, which improves their functioning as membrane-permeabilizing agents.
23 considering the secondary structures, amps may exhibit α-helices, β-chains, β-pleated sheets, and loops (figure 1). wang 24 classified plant amps into 4 families (α, β, αβ, and non-αβ), based on the protein classification of murzin et al, 25 with some modifications. antimicrobial peptides of the "α" family present α-helical structures, 1 whereas amps from the "β" family contain β-sheet structures usually stabilized by disulfide bonds. 26, 27 some plant amps show an α-hairpinin motif formed by antiparallel α-helices that are stabilized by 2 disulfide bridges. 28 such amps present a higher resistance to enzymatic, chemical, or thermal degradation. 29 antimicrobial peptides from the "αβ" family having both "α" and "β" structures are also stabilized by disulfide bridges. an example of amps presenting "αβ" structures are defensins, usually with a cysteine-stabilized αβ motif (csαβ), an α-helix, and a triple-stranded antiparallel β sheet stabilized mostly by 4 disulfide bonds. 30 finally, amps of the "non-αβ" family exhibit no clearly defined "α" or "β" structures. 26 plant amps are also classified into families considering protein sequence similarity, cysteine motifs, and distinctive patterns of disulfide bonds, which determine the folding of the tertiary structure. 31 therefore, plant amps are commonly grouped as thionins, defensins, heveins, knottins (linear and cyclic), lipid transfer proteins (ltp), snakins, and cyclotides. 27, 31 these amp categories will be detailed in the next sections, together with other groups here considered (impatiens-like, macadamia [β-barrelins], puroindoline [pin], and thaumatin-like protein [tlp]) and the recently described α-hairpinin amps. the description includes comments on their structure, pattern for regular expression (regex) analysis (when available), functions, tissue-specificity, and scientific data availability.
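two of the descriptors listed above, electrical charge and the spatial pattern of cysteine residues, can be read directly from a peptide sequence. the sketch below is only an illustration under stated assumptions: the net charge is approximated by counting lysine and arginine as +1 and aspartate and glutamate as -1 (histidine and the termini are ignored), and the function names and the toy sequence are hypothetical, not taken from the review.

```python
import re


def net_charge(seq):
    """Crude net charge near neutral pH.

    Assumption: K and R count +1, D and E count -1;
    histidine and the chain termini are ignored.
    """
    seq = seq.upper()
    positive = sum(seq.count(residue) for residue in "KR")
    negative = sum(seq.count(residue) for residue in "DE")
    return positive - negative


def charge_class(seq):
    """Label a peptide as cationic, anionic, or neutral."""
    charge = net_charge(seq)
    if charge > 0:
        return "cationic"
    if charge < 0:
        return "anionic"
    return "neutral"


def cysteine_pattern(seq):
    """Spacing of cysteines, e.g. 'C-x2-C-x1-C' (xN = N other residues)."""
    positions = [m.start() for m in re.finditer("C", seq.upper())]
    if not positions:
        return ""
    parts = ["C"]
    for previous, current in zip(positions, positions[1:]):
        gap = current - previous - 1
        parts.append(f"x{gap}-C" if gap else "C")
    return "-".join(parts)


# toy peptide (hypothetical, not a real amp)
toy_peptide = "ACKKCDC"
summary = (charge_class(toy_peptide), cysteine_pattern(toy_peptide))
```

a spacing string of this kind is a convenient starting point for writing the family-specific regular expressions mentioned in the text.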
thionins are composed of 45 to 48 amino acid residues with a molecular weight around 5 kda, considering the mature peptide. they are synthesized with a signal peptide together with the mature thionin and the so-called acidic domain. 32 to date, there is no experimental information available about possible functions of the acidic domain, even though it is clearly not dispensable as shown by the high conservation of the cysteine residues. 33 the thionin superfamily comprises 2 distinct groups of plant peptides, α/β-thionins and γ-thionins, with distinct structural features. 34 the α/β thionins have homologous amino acid sequences and similar structures. 35 besides, they are rich in arginine, lysine, and cysteine. 36 in turn, γ-thionins have a greater similarity with defensins, and some authors classify them within this group. 37 however, compared with the defensins, they present a longer conserved amino acid sequence. 31 regarding the cysteine motif, thionins can be divided into 2 subgroups, one with 8 residues connected by 4 disulfide bonds called 8c and the other with 6 residues connected by 3 disulfide bonds called 6c. 38 the general designation of thionins has been proposed as a family of homologous peptides that includes purothionins. the first plant thionin was isolated in 1942 from wheat flour and labeled as purothionin. 39 since then, homologues from various taxa have been also identified, like viscotoxins (viscum album) and crambins (crambe abyssinica). 40 they have also been isolated from different plant tissues like seeds, leaves, and roots. 41, 42 thionins have been tested for different elicitors: gram-positive 43, 44 or gram-negative bacteria, 45, 46 yeast, 38, 43 insect larvae, 47 nematode, 33 and inhibitory proteinase. 48 thionins are hydrophobic in nature, interact with hydrophobic residues, and lyse bacteria cell membrane.
their toxicity is due to an electrostatic interaction with the negatively charged membrane phospholipids, followed by either pore formation or a specific interaction with a membrane. 38 it has been reported that they are able to inhibit other enzymes possibly through covalent attachment mediated by the formation of disulfide bonds, as previously observed for other thionin/enzyme combinations. 48 thionin representatives with known 3d structures determined by x-ray crystallography are crambin (pdb id: 1crn), α1- and β-purothionins (pdb id: 2phn and 1bhp), β-hordothionin (pdb id: 1wuw), and viscotoxin-a3 (pdb id: 1okh). the first to be determined was the mixed form of crambin. 35, 49 it showed a distinct capital γ shape with the n terminus forming the first strand in a β-sheet. the architecture of this sheet is additionally strengthened by 2 disulfide bonds. 50 after a short stretch of extended conformation, there is a helix-turn-helix motif. in crambin, there is a single disulfide involved in stabilizing the helix-to-helix contacts. at the center of this motif, there is a crucial arg10 that forms 5 hydrogen bonds to tie together the first strand, the first helix, and the c terminus. 50 the first plant defensins were isolated from wheat 51 and barley grains, 52 initially called γ-hordothionins. due to some similarities in cysteine content and molecular weight, they were classified as γ-thionins. later, the term "γ-thionin" was replaced by "defensin" based on the higher number of primary and tertiary structures of these proteins and also on their antifungal activities more related to insect and mammalian defensins than to plant thionins. 53 plant defensins belong to a diverse protein superfamily called cis-defensin 54 and exhibit cationic charge, consisting of 45 to 54 aa with 2 to 4 disulfide bonds.
53, 55 plant defensins share similar tertiary structures and typically exhibit a triple-stranded antiparallel β sheet, enveloped by an α-helix and confined by intramolecular disulfide bonds 1 (figure 2a). this motif is called cysteine-stabilized αβ (csαβ). 56 the csαβ defensins were classified into 3 groups based on their sequence, structure, and functional similarity. defensins are known for their antimicrobial activity at low micromolar concentrations against gram-positive and gram-negative bacteria, 57 fungi, 58 viruses, and protozoa. 59 in addition, they present protein inhibition, insecticidal, and antiproliferative activity, acting as an ion-channel blocker, being also associated with the inhibition of pathogen protein synthesis. 60 instead, plant defensins act in the regulation of signal transduction pathways and induce inflammatory processes, in addition to wound healing, proliferation control, and chemotaxis. 61 in general, plant defensins do not present high toxicity to human cells, having in vivo efficacy records, with relevant therapeutic potential, and can be applied in treatments associated with traditional medicine. 62 cools et al 63 reported that a peptide derived from a plant defensin (hsafp1) acted synergistically with caspofungin (an antimycotic) (in vivo and in vitro) against the formation of candida albicans biofilm on polystyrene and catheter substrate, indicating that the hsafp1 variant presented a strong antifungal potential in the proposed treatment. other biotechnological applications of defensins are described, as in the case of ecgdf1, which was isolated from a legume (erythrina crista-galli), heterologously expressed in escherichia coli and purified. ecgdf1 inhibited the growth of various plant and human pathogens (such as candida albicans and aspergillus niger and the plant pathogens clavibacter michiganensis ssp. michiganensis, penicillium expansum, botrytis cinerea and alternaria alternata).
64 due to these features, ecgdf1 is a candidate for the development of antimicrobial products for both agriculture and medicine. 64 non-specific lipid transfer proteins (ns-ltps) were first isolated from potato tubers 65 and are currently identified in diverse terrestrial plant species. they comprise a large gene family, are abundantly expressed in most tissues, but absent in most basal plant groups such as chlorophyte and charophyte green algae. 66 they generally include an n-terminal signal peptide that directs the protein to the apoplastic space. 67 some ltps have a c-terminal sequence that allows their post-translational modification with a glycosylphosphatidylinositol molecule, facilitating the integration of ltp on the extracellular side of the plasma membrane. the ns-ltps are small proteins which were thus named because of their function of transferring lipids between the different membranes carrying lipids (non-specifically, the list includes phospholipids, fatty acids, their acyl-coas, or sterols). they consist of approximately 100 aa and are relatively larger in size than other amps, such as defensins. depending on their sizes, ltps may be classified into 2 subfamilies: ltp1 and ltp2, with relative molecular weight of 9 and 7 kda, respectively. 68, 69 the limited sequence conservation made this classification inadequate. thus, a modified and expanded classification system was proposed, presenting 5 main types (ltp1, ltp2, ltpc, ltpd, and ltpg) and 5 additional types with a smaller number of members (ltpe, ltpf, ltph, ltpj, and ltpk). 66 the new classification system is not based on molecular size but rather on (1) the position of a conserved intron, (2) the identity of the amino acid sequence, and [...] figure s2). although this latter classification system is the most recent, the conventional classification of ltp1 and ltp2 types has been maintained by most working groups.
lipid transfer protein nomenclature has been confusing and without consistent guidelines or standards. there are several examples where specific ltps receive different names in different scientific articles. the lack of a robust terminology sometimes turns it quite difficult, extremely time-consuming, and frustrating to compare ltps with different roles/functions. 67 therefore, an additional nomenclature was also proposed by salminen et al, 67 naming ltps as follows: atltp1.3, osltp2.4, hvltpc6, ppltpd5, and taltpg7, with the first 2 letters indicating the plant species (eg, at = arabidopsis thaliana, pp = physcomitrella patens); ltp1, ltp2, and ltpc indicating the type; while the last digit (here 3-7) regards the specific number given to each gene or protein within a given ltp type. for the sake of clarity, the authors recommend the inclusion of a point between the type specification (ltp1 and ltp2) and the gene number. for ltpc, ltpd, ltpg, and other types of ltp defined with a letter, the punctuation mark was not recommended. this latter classification system is currently recommended as it comprises several features of ltps and is more robust than the previous classification systems. lipid transfer proteins are small cysteine-rich proteins, having 4 to 5 helices in their tertiary structure ( figure 2b ), which is stabilized by several hydrogen bonds. such a folding gives ltps a hydrophobic cavity to bind the lipids through hydrophobic interactions. this structure is stabilized by 4 disulfide bridges formed by 8 conserved cysteines, similar to defensins, although bound by cysteines in different positions. the disulfide bridges promote ltp folding into a very compact structure, which is extremely stable at different temperatures and denaturing agents. 
[70] [71] [72] these foldings provide a different specificity of lipid binding at the ltp binding site, where the ltp2 structure is relatively more flexible and present a lower lipid specificity when compared with ltp1. 34 the first 3d structure of an ltp was established for taltp1.1 based on 2d and 3d data of 1h-nmr, purified from wheat (triticum aestivum) seeds in aqueous solution. 73, 74 currently, several 3d structures of ltps have been determined, either by nuclear magnetic resonance (nmr) or x-ray crystallography; in their free, unbound form or in a complex with ligands. the heveins were first identified in 1960 in the rubber-tree (hevea brasiliensis), but its sequence was determined later, whereas a similarity was detected to the chitin-binding domain of an agglutinin isolated from urtica dioica (l.) 75 with 8 cysteine residues forming a typical cys motif. 76 the primary structure of the hevein consists of 29 to 45 aa, positively charged, with abundant glycine (6) and cysteine (8-10) residues, 76 and aromatic residues. 31, 77 the chitin-binding domain is a determinant component in the identification of hevein-like peptides whose binding site is represented by the amino acid sequence sxfgy/sxygy, where x regards any amino acid. 76, 78 most heveins have a coil-β1-β2-coil-β3 structure that occurs by variations with the secondary structural motif in the presence of turns in 2 long coils in the β3 chain. 31 antiparallel β chains form the central β sheet of the hevein motif with 2 long coils stabilized by disulfide bonds ( figure 2c ). although the presence of chitin has not been identified in plants, there are chitin-like structures present in proteins that exhibit a strong affinity to this polysaccharide isolated from different plant sources. 
79 the presence of 3 aromatic amino acids in the chitin-binding domain favors chitin binding by providing stability to the hydrophobic group c-h and the π electron system through van der waals forces, as well as the hydrogen bonds between serine and n-acetylglucosamine (glcnac) present in the chitin structure. 76, 77 this domain is commonly found in chitinases of classes i to v, in addition to other plant antimicrobial proteins, such as lectins and pr-4 (pathogenesis-related protein 4) members. 80, 81 it may also occur in other proteins that bind to polysaccharide chitin, 80 such as the antimicrobial proteins ac-amp1 and ac-amp2 of amaranthus caudatus (amaranthaceae) seeds which are homologous to hevein but lack the c-terminal glycosylated region. 82 plant chitinases (class i) have the hevein-like domains, called hlds. due to the similar structural epitopes between chitinases and heveins, they are responsible for the cross reactive syndrome (latex-fruit syndrome). 83, 84 among the several classes of proteins mentioned, the proteins with a high degree of similarity to hevein are chitinases i and iv. 76 chitinases are known to play an essential role in plant defense against pathogens, 85 also inhibiting in vitro fungal growth, 86 especially when combined with β-1,3-glucanases. 87 it also interferes with the growth of hyphae, resulting in abnormal ramification, delay, and swelling in their stretching. 81 however, it has been shown that heveins have a higher inhibitory potential than chitinases and that their antifungal effect is not related only to the presence of chitinases 88 ; pn-amp1 and pn-amp2 amps with hevein domains have potent antifungal activities against a broad spectrum of fungi, including those without chitin in their cell walls. 88, 89 modes of action of chitinases usually include degradation and disruption of the fungal cell wall and plasma membrane due to its hydrolytic action, causing extravasation of plasma particles.
21, 89 therefore, heveins have good antifungal activity, and only a few are active against bacteria, most of them with low activity. another role of hevein chitinases regards the antagonistic effect in triggering the aggregation of rubber particles in the latex extraction process in rubber trees. unlike heveins, other chitinases inhibit rubber particle aggregation. however, its action in conjunction with other proteins (β-1,3-glucanase) increases the effect of β-1,3-glucanase on rubber particle aggregation. 90 a study by shi et al 91 found that the interaction of the protein network related to the antipathogenic activity released by lutoids (lysosomal microvacuole in latex) is essential in closing laticiferous cells (cells that produce and store latex), not only providing a physical barrier, but a biochemical barrier used by laticiferous cells affected by pathogen invasion. knottins are part of the cysteine-rich peptides (crps) superfamily, sharing the cysteine-knot motif and therefore resembling other families as defensins, heveins, and cyclotides. 92 their structure was initially identified by crystallography of carboxypeptides isolated from potato, showing the cysteine-knot motif with 39 aa and 6 cysteine residues. 93 they are also called "cysteine-knot peptides," "inhibitor cysteine-knot peptides," or even "cysteine-knot miniproteins" because their mature peptide presents less than 50 aa, forming 3 interconnected disulfide bonds in the cysteine-knot motif, characterizing a particular scaffold. 92 this conformation confers thermal stability at high temperatures. for example, the cysteine-stabilized β-sheet (csb) motif derived from knottins presents stability at approximately 100°c with only 2 disulfide bonds. 94 the knottins may have linear or cyclic conformation. however, both exhibit connectivity between the cysteines at positions 1-4c, 2-5c, and 3-6c, forming a ring at the last bridge 92 ( figure 2d ). 
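two sequence-level rules from the preceding sections lend themselves to a short sketch: the hevein chitin-binding site sxfgy/sxygy (x = any amino acid) and the knottin disulfide connectivity 1-4c, 2-5c, 3-6c. the code below is a minimal illustration, not the review's annotation routine; the function names and the toy sequences are hypothetical, and the connectivity function simply pairs cysteines by their order of appearance.

```python
import re

# sxfgy or sxygy, where x is any residue
HEVEIN_SITE = re.compile(r"S.[FY]GY")


def find_hevein_sites(seq):
    """Return (1-based position, matched motif) for each binding-site hit."""
    return [(m.start() + 1, m.group())
            for m in HEVEIN_SITE.finditer(seq.upper())]


def knottin_bridges(seq):
    """Pair the six cysteines as C1-C4, C2-C5, C3-C6 (1-based positions)."""
    cys = [i + 1 for i, residue in enumerate(seq.upper()) if residue == "C"]
    if len(cys) != 6:
        raise ValueError(f"expected 6 cysteines, found {len(cys)}")
    return [(cys[0], cys[3]), (cys[1], cys[4]), (cys[2], cys[5])]


# toy sequences (hypothetical, for illustration only)
hits = find_hevein_sites("AASQFGYCC")
bridges = knottin_bridges("CAACACAACACAC")
```

a motif hit alone does not establish that a peptide is a hevein; in practice it would be combined with the length, charge, and cysteine-count criteria given in the text.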
knottins have different functions, such as signaling molecules, 95 response against biotic and abiotic stresses, 96 root growth, 97 symbiotic interactions as well as antimicrobial activity against bacteria, 98 fungi, 99 virus, 100 and insecticidal activity, 101 among others. knottins antimicrobial activity has been attributed to the action of functional components of the plasma membrane, leading to alterations of lipids, ion flux, and exposed charge. 99 the accumulation of peptides on the surface of the membrane results in the weakening of the pathogen membrane, 102 resulting in transient and toroidal perforations. 99 in the course of a large-scale survey to identify novel amps from australian plants, 103, 104 an amp with no sequence homology was purified. its complementary dna (cdna) was cloned from macadamia integrifolia (proteaceae) seeds, containing the complete peptide coding region. the peptide was named miamp1, being highly basic with an estimated isoelectric point (pi) of 10 and a mass of 8 kda. the miamp1 is 102 aa long, including a 26 aa signal peptide in the n-terminal region, bound to a 76 aa mature region with 6 cysteine residues. 105 its 3d structure was determined using nmr spectroscopy, 104 revealing a unique conformation among plant amps, with 8 beta-strands arranged in 2 greek key motifs, forming a greek key beta-barrel ( figure 2e ). due to its particularities, miamp1 was classified as a new structural family of plant amps, and the name β-barrelins was proposed for this class. 104 this structural fold resembles a superfamily of proteins called γ-crystallin-like characterized by the precursors βγ-crystallin. 106 this family includes amps from other organisms, for example, wmkt, a toxin produced by the wild yeast williopsis mraki. 
107 the miamp1 exhibited in vitro antimicrobial activity against various phytopathogenic fungi, oomycetes, and gram-positive bacteria 103 with a concentration range of 0.2 to 2 μm generally required for a 50% growth inhibition (ic50). in addition, the transient expression of miamp1 in canola (brassica napus) provided resistance against blackleg disease caused by the fungus leptosphaeria maculans, 108 making miamp1 potentially useful for genetic engineering aiming at disease resistance in crop plants. there are few scientific publications with macadamia-like peptides, maybe because they prevail in primitive plant groups (eg, lycophytes, gymnosperms to early angiosperms as amborella and papaver), being apparently absent in derived angiosperms (eg, asteridae, including brassicaceae as arabidopsis thaliana). on the contrary, they have been identified in some monocots (as zantedeschia, zea, and sorghum). 109 in fact, peptides similar to miamp1 appear to play a role in the defense against pathogens in gymnosperms, including species of economic importance (as pinus and picea) thus deserving attention for their biotechnological potential. 109 four closely related amps (ib-amp1, ib-amp2, ib-amp3, and ib-amp4) were isolated from seeds of impatiens balsamina (balsaminaceae) with antimicrobial activity against a variety of fungi and bacteria, and low toxicity to human cells in culture. these amps are the smallest isolated from plants to date, consisting of only 20 aa in length. the ib-amps are highly basic and contain 4 cysteine residues that form 2 disulfide bonds. interestingly, they have no significant homology with other amps available in public databases. sequencing of cdnas isolated from i. balsamina revealed that all 4 peptides are encoded within a single transcript.
concerning the predicted precursor of ib-amp protein, it consists of a pre-peptide followed by 6 mature peptide domains, each one of them flanked by propeptide domains ranging from 16 to 35 aa in length (supplementary figure s3). this primary structure with repeated domains of alternating basic peptides and acid propeptide domains has, to date, not been reported in other plant species. 110 patel et al 111 conducted an experiment to purify ib-amp1 from seeds of impatiens balsamina. after purification, this peptide had its secondary structure tested by circular dichroism (cd). the results revealed a peptide that may include a β-turn but does not show evidence for either helical or β-sheet structure over a range of temperature and ph. structural information from 2d 1h-nmr was obtained in the form of proton-proton internuclear distances inferred from nuclear overhauser enhancements (noes) and dihedral angle restraints from spin-spin coupling constants, which were used for distance geometry calculations. owing to the difficulty in obtaining the correct disulfide connectivity by chemical methods, the authors built and performed 3 separate calculations: (1) a model with no disulfides; (2) another with predicted disulfide bonds; and (3) a model with alternative disulfide connectivity, as assigned from the nuclear overhauser effect spectroscopy (noesy) nmr spectra. as a result, 2 hydrophilic patches were observed at opposite ends and opposite sides of the models, whereas in between them a large hydrophobic patch was identified. however, the study did not conclude which of the 3 models would be the most likely representative of ib-amp1, reporting only that cysteines are necessary for maintaining the structure.
based on the experiment performed by patel et al, 111 the present work built 3 different models: model 1, without disulfide bonds, and 2 models with different disulfide connections: model 2, following the nmr prediction by patel et al 111 (6-cys;16-cys and 7-cys;20-cys), and model 3, following the disulfide bond partner prediction by dianna (7-cys;16-cys and 6-cys;20-cys). calculations have shown that although the peptide is small, the cysteines constrain part of it to adopt a well-defined main chain conformation. from residue 4 to 20 (except 11), the main chain is well-defined, whereas residues 1 to 3 in the n-terminal region present few restrictions and appear to be more flexible (supplementary figure s4). analyzing the rmsd (root mean square deviation), we observed that all the models lost the initial conformation and, among them, model 3 was the most stable. models 1 and 2 showed a similar pattern (supplementary figure s5), as in the models of patel et al, 111 although model 1 was the most flexible. little is known about the mode of action of impatiens-like amps. lee et al 112 investigated the antifungal mechanism of ib-amp1, noting that the oxidized form (bound by disulfide bridges) shows a 4-fold increase in antifungal activity against aspergillus flavus and candida albicans compared with reduced ib-amp1 (without disulfide bridges). confocal microscopy analyses have shown that ib-amp1 can either bind to the cell surface or penetrate cell membranes, indicating an antifungal activity by inhibiting a distinct cellular process, rather than ion channel or membrane pore formation. fan et al 113 reported that the antimicrobial activity of ib-amp4 depends on a β-sheet configuration that enables insertion into the lipid membrane, thus killing the bacteria through a non-lytic mechanism. 114 current approaches aim to make changes in ib-amp to improve its antimicrobial activity.
as an example, synthetic variants of ib-amp1 were fully active against yeasts and fungi, and the replacement of amino acid residues by arginine or tryptophan more than doubled the antifungal activity. 115 another study involving amp modification generated a synthetic peptide without the disulfide bridges (ie, a linear analog of ib-amp1), which showed an antimicrobial specificity 3.7 to 4.8 times higher than the wild-type ib-amp1. 116

puroindoline

puroindolines are small basic proteins that contain a single domain rich in tryptophan. these proteins were isolated from wheat endosperm, have a molecular mass around 13 kda, and a calculated isoelectric point higher than 10. at least 2 main isoforms (called pin-a and pin-b) are known, which are encoded by pina-d1 and pinb-d1 genes, respectively. these genes share 70.2% identity in their coding regions but exhibit only 53% identity in the 3′ untranslated region. 117 both pin-a and pin-b contain 10 conserved cysteine residues and a tertiary structure similar to ltps, consisting of 4 α-helices separated by loops of varying lengths and joined by 5 disulfide bonds, 4 of which are identical to those of ns-ltps. 117 the conformation of the 2 pin isoforms was studied by infrared and raman spectroscopy. both pin-a and pin-b have similar secondary structures comprising approximately 30% helices, 30% β-sheets, and 40% non-ordered structures at ph 7. it has been proposed that the folding of both pins is highly dependent on the ph of the medium. the reduction of the disulfide bridges results in a decrease of pin solubility in water and in an increase of the β-sheet content by about 15% at the expense of the α-helix content. 118 no high-resolution structure for any of the pin isoforms is available, which makes it challenging to understand the function of their hydrophobic regions, with some evidence coming only from partially homologous peptides.
117 however, wilkinson et al 119 proposed a theoretical model for several sequences of this amp. puroindolines are proposed to be functional components of wheat grain hardness loci, control core texture, besides antifungal activity. [120] [121] [122] [123] although the biological function of pins is unknown, their involvement in lipid binding has been proposed. while ltps bind to hydrophobic molecules in a large cavity, pins interact only with lipid aggregates, that is, micelles or liposomes, through a single stretch of tryptophan residues. this stretch of tryptophan residues is especially significant in the main form, pin-a (wrwwkwwk), while it is truncated in the smaller form, pin-b (wptwwk). [124] [125] [126] puroindolines form protein aggregates in the presence of membrane lipids, and the organization of such aggregates is controlled by the lipid structure. in the absence of lipids, these proteins may aggregate, but there is no accurate information on the relationship between aggregation and interaction with lipids. the antimicrobial activity of pins is targeted to cell membranes. charnet et al 127 indicated that pin is capable of forming ion channels in artificial and biological membranes that exhibit some selectivity over monovalent cations. the stress and ca 2+ ions modulate the formation and/or opening of channels. puroindolines may also be membranotoxins, which may play a role in the plant defense mechanism against microbial pathogens. morris 128 reported that the pin-a and pin-b act through similar but somewhat different modes, which may involve "membrane binding, membrane disruption and ion channel formation" or "intracellular nucleic acid binding and metabolic disruption." natural and synthetic mutants have allowed the identification of pins as key elements for antimicrobial activity. snakins are crps first identified in potato (solanum tuberosum). 
129, 130 due to their sequence similarity to gasa (gibberellic acid stimulated in arabidopsis) proteins, the snakins were classified as members of the snakin/gasa family. 131 the genes that encode these peptides have (1) a signal sequence of approximately 28 aa, (2) a variable region, and (3) a mature peptide of approximately 60 residues, with 12 highly conserved cysteine residues. these cysteine residues maintain the 3d structure of the peptide through disulfide bonds, besides providing stability to the molecule when the plant is under stress 129, 130, 132, 133 (figure 2f; supplementary figure s6). snakins may be expressed in different parts of the plant, like stem, leaves, flowers, seeds, and roots, 134-137 either constitutively or induced by biotic or abiotic stresses. in vitro activity was observed against a variety of fungi, bacteria, and nematodes, acting as a destabilizer of the plasma membrane. 129, 138, 139 moreover, they were reported as essential agents in biological processes such as cell division, elongation, cell growth, flowering, embryogenesis, and signaling pathways. [140] [141] [142] [143]

alpha-hairpinins

as reported by nolde et al, 144 alpha-hairpinins emerged as a new amp group with an unusual motif configuration. these peptides prevail in plants and their structure was resolved based on nmr data obtained from the ecamp-1 peptide isolated from barnyard grass seeds (echinochloa crus-galli). 144 some α-hairpinins comprise trypsin inhibitors with helical hairpin structure and this group was recently proposed as a new plant amp family. 145 similar to other amps, the amino acid sequences of α-hairpinins are variable. they share the conserved cysteine motif (cx3cx1-15cx3c) that forms a helix-loop-helix fold and may have 2 disulfide bridges, c1-c4 and c2-c3. 146 their structural stability is maintained by hydrogen bonds, so that the side chains have a relatively stable spatial orientation.
147 as reviewed by slavokhotova et al, 148 members of alphahairpin family have been described in both mono and dicot groups, including species as echinochloa crus-galli and zea mays (both poaceae, monocot), fagopyrum esculentum (polygonaceae, eudicot), and stellaria media (caryophyllaceae, eudicot). several transcripts with α-hairpinin motif exhibit similarities to snakin/gasa genes and are sometimes positioned within this family. although the α-hairpinins structure has been published, its mechanism of action is still not resolved ( figure 2j , pdb id: 2l2r). however, studies indicate they present a potential dna binding capacity. 149 the term cyclotide was created at the end of the past century to designate a family of plant peptides with approximately 30 aa in size and a structural motif called cyclic cysteine knot (cck). 150 this motif is composed by a head-to-tail cyclization that is stabilized by a knotted arrangement of disulfide bridges, with 6 conserved cysteines, connected as follows: c1-2, c3-6, c4-5. 151 cyclotides are generally divided into 2 subfamilies, mӧbius and bracelets, based on structural aspects. in addition to ccks, 2 loops (between c1-2 and c4-5) have high similarity between both subfamilies, while the other 2 loops (between c2-3 and c3-4) exhibit some conservation within the subfamilies 152,153 (supplementary figure s7) . to date, several cyclotides were identified in eudicot families such as rubiaceae, 154 violaceae, 155 fabaceae, 156 and solanaceae, 157 in addition to some monocots of poaceae family. 158 in general, cyclotides may act in defense against a range of agents like insects, helminths, or mollusks. in addition, they can also act as ecbolic (inducer of uterus contractions), 154 antibacterial, 159 anti-hiv, 100 and anticancer factors. 160 all these characteristics added to the stability conferred by the cck motif turn these peptides into excellent candidates for drug development. 
161, 162

thaumatin-like protein

thaumatins or tlps belong to the pr-5 (pathogen-related protein) family and received this name due to their first isolation from the fruit of thaumatococcus daniellii (marantaceae) from west africa. 163 thaumatin-like proteins are abundant in the plant kingdom, 164 being found in angiosperms, gymnosperms, and bryophytes, 163 and have also been identified in other organisms, including fungi, 165, 166 insects, 167 and nematodes. 168 thaumatin-like proteins are known for their antifungal activity, either by permeating fungal membranes 169 or by binding and hydrolyzing β-1,3-glucans. 170, 171 in addition, they may act by inhibiting fungal enzymes, such as xylanases, 172 α-amylases, or trypsin. 173 besides, the expression of tlps is regulated in response to some stress factors, such as drought, 174 injuries, 175 freezing, 176 and infection by fungi, 177, 178 viruses, and bacteria. 179 as to the tlp structure, this protein presents the characteristic thaumatin signature (ps00316): 180, 181 most of the tlps have a molecular mass ranging from 21 to 26 kda, 163 possessing 16 conserved cysteine residues (supplementary figure s8) involved in the formation of 8 disulfide bonds, 182 which help in the stability of the molecule, allowing a correct folding even under extreme conditions of temperature and ph. 183 thaumatin-like proteins also contain a signal peptide at the n-terminal, which is responsible for targeting the mature protein to a particular secretory pathway. 163 the tertiary structure presents 3 distinct conserved domains. a central cleft, located between domains i and ii, is responsible for the enzymatic activity of the protein. 184 this central cleft may be of an acidic, neutral, or basic nature depending on the binding of the different linkers/receptors.
all plant tlps with antifungal activity have an acidic cleft known as motif reddd due to 5 highly conserved amino acid residues (arginine, glutamic acid, and 3 aspartic acid; supplementary figure s8 ), being very relevant for specific receptor binding and antifungal activity. 169, 185, 186 crystallized structures were determined for some plant tlps, such as thaumatin 187 (figure 2g ), zeamatin 169 ( figure 2h ), tobacco pr-5d 185 and osmotin, 186 the cherry allergen pruav2, 188 and banana allergen ba-tlp, 184 among other tlps. some tlps are known as small tlps (stlps) due to the deletion of peptides in one of their domains, culminating in the absence of the typical central cleft. these stlps exhibit only 10-conserved cysteine residues, forming 5 disulfide bonds, resulting in a molecular weight of approximately 16 to 17 kda. they have been described in monocots, conifers, and fungi, so far. 163, 189, 190 other tlps exhibit an extracellular tlp domain and an intracellular kinase domain, being known as pr5k (pr5-like receptor kinases) 191 and are present in both monocots and dicots. for example, arabidopsis contains 3 pr5k genes, while rice has only 1. 163 with the rapid growth in the number of available sequences, it is unfeasible to handle such amount of data manually. thus, amp sequences (as well as their biological information) have been deposited in large general databases, such as uniprot and trembl, which contain sequences of multiple origins. 192, 193 in this sense, the construction of databases that deal specifically with amps was an important step to organize the data. during the past decade, several databases were built to support the deposition, consultation, and mining of amps. thus, these databases can be classified into 2 groups: general and specific. 
194 the specific databases can be divided into 2 subgroups: those containing only 1 specific group (defensins or cyclotides) and those containing data from a supergroup of peptides (plant, animal, or cyclic peptides) (supplementary table 1 ). in general, both types of databases share some characteristics such as the way that the data are available or the tools to analyze amps. the collection of antimicrobial peptides (campr3) is a database that comprises experimentally validated peptides, sequences experimentally deduced and still those with patent data, besides putative data based on similarity. [195] [196] [197] the current version includes structures and signatures specific to families of prokaryotic and eukaryotic amps. 197 the platform also includes some tools for amp prediction. the antimicrobial peptide database (apd) 198 collects mature amps from natural sources, ranging from protozoa to bacteria, archaea, fungi, plants, and animals, including humans. amps encoded by genes that undergo post-translational modifications are also part of the scope, besides some peptides synthesized by multienzyme systems. the apd provides interactive interfaces for peptide research, prediction, and design, statistical data for a specific group, or for all peptides available in the database. the lamp (database linking antimicrobial peptides) comprises natural and synthetic amps, which can be separated into 3 groups: experimentally validated, predicted, and patented. their data were primarily collected from the scientific literature, including uniprot and other amp-related databases. 199 the database of antimicrobial activity and structure of peptides (dbaasp) 200 contains information about amps from different origins (synthetic or non-synthetic) and complexity levels (monomers and dimers) that were retrieved from pubmed using the following keywords: antimicrobial, antibacterial, antifungal, antiviral, antitumor, anticancer, and antiparasitic peptides. 
this database is manually curated and provides information about peptides that have specific targets validated experimentally. it also includes information on chemical structure, post-translational modifications, modifications in the n/c terminal amino acids, antimicrobial activities, cell target and experimental conditions in which a given activity was observed, besides information about the hemolytic and cytotoxic activities of the peptides. 200 due to the diversity of amps and the need to accommodate the most representative subclasses, several databases were established, focusing on specific types, sources, or features. there are several ways to classify amps, and they can range from biological sources such as bacterial amps (bacteriocins), plants, animals, and so on; biological activity: antibacterial, antiviral, antifungal, and insecticide; and based on molecular properties, pattern of covalent bonds, 3d structure and molecular targets. 201, 202 the "defensins knowledgebase" is a database with manual curation and focused exclusively on defensins. this database contains information about sequence, structure, and activity, with a web-based interface providing access to information and enabling text-based search. in addition, the site presents information on patents, grants, laboratories, researchers, clinical studies, and commercial entities. 203, 205 the cybase is a database dedicated to the study of sequences and 3d structures of cyclized proteins and their synthetic variants, including tools for the analysis of mass spectral fingerprints of cyclic peptides, also assisting in the discovery of new circular proteins. 205 the phytamp is a database designed to be solely dedicated to plant amps based on information collected from the uniprot database and from the scientific literature through pubmed. 206 plantpepdb is a database with manual curation of plantderived peptides, mostly experimentally validated at the protein level. 
it includes data on the physical-chemical properties and tertiary structure of amps, also useful to identify their therapeutic potential. different search options for simple and advanced compositing are provided for users to perform dynamic search and retrieve the desired data. overall, plantpepdb is the first database that comprises detailed analysis and comprehensive information on phyto-peptides from a wide functional range. 207 biological data banks (dbs) are organized collections of data of diverse nature that can be retrieved using different inputs. the management of this information is done through various software and hardware resources, whose retrieval and organization can be performed in a quick and efficient way. 208 considering biological data, information can be classified into (1) primary (sequences), (2) secondary (structure, expression, metabolic pathways, types of drugs, etc), and (3) specialized, for example, containing information on a species or on a class of protein. 209 within this third group, some references to amps can be mentioned, such as campr3 196 and apd 198 that compile sequence data and structure retrieved from diverse sources, and also the defensin knowledgebase 203 and the cybase 205 which are dedicated to specific classes of peptides (defensins and cyclotides, respectively), in addition to phytamp, 206 a specific database of plant amps (supplementary table 2 ). the first step to infer the function of a given sequence (annotation) is to retrieve it in databases. for this purpose, 3 approaches have been used mostly: (1) local alignments, especially by using basic local alignment search tool (blast) 210 and fasta 211 ; by searching for specific patterns using (2) regex or (3) hidden markov model (hmm). 194 the first approach has been widely used, since most of the information is available in databases as sequences, together with tools to align them, whereas the blast is the primary tool for doing so. 
212 this tool splits the query sequence into small pieces (words) and compares them with the database. however, this approach has a limitation: small motifs may not be significantly aligned, as they comprise small portions of the sequences that can be smaller than 20% of the total size. 31, 194 due to the high variability of amps, only a few highly conserved sequences can be identified using this type of inference. to reduce the effects of local alignment limitations, other strategies based on the search for specific patterns were introduced, such as regex 213 (supplementary table 1) and hmm. 214 a regex (regular expression) is a precise way of describing a pattern in a string, where each position of the pattern must be set, although ambiguous characters (or wildcards) can also be used. for example, if we want to find a match for both amino acid sequences caiessk and waiesk, we can use the expression [cw]aies{1,2}k. this expression would find a pattern starting with the letter "c" or "w," followed by an "a," an "i," and an "e," 1 or 2 "s," and ending with a "k." the hmms are well-known for their effectiveness in modeling the correlations between adjacent symbols, domains, or events, and they have been extensively used in various fields of biological analysis, including pairwise and multiple sequence alignment, base-calling, gene prediction, modeling dna sequencing errors, protein secondary structure prediction, noncoding rna (ncrna) identification, protein and rna structural alignments, acceleration of rna folding and alignment, fast noncoding rna annotation, and many others. using hmm, a statistical profile calculated from a sequence alignment is included in the model, and a score is determined site by site, with conserved and variable positions defined a priori. 194, 215

predicting antimicrobial activity

the design of new amps led to the development of methods for the discovery of new peptides, thus allowing new experiments to be done by researchers.
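the regex example given above can be checked with a minimal python sketch. it uses only the standard re module; the translation of the α-hairpinin cysteine motif (cx3cx1-15cx3c) into a regex, the helper name find_motifs, and the toy sequence are illustrative assumptions and are not taken from any of the cited tools or databases.

```python
import re

# Pattern from the text: "c" or "w", then "a", "i", "e", one or two "s", then "k".
EXAMPLE_PATTERN = re.compile(r"[cw]aies{1,2}k")

# Assumed regex translation of the cysteine spacing cx3cx1-15cx3c,
# where "." stands for any residue (illustrative only).
HAIRPININ_PATTERN = re.compile(r"c.{3}c.{1,15}c.{3}c")


def find_motifs(pattern, sequence):
    """Return (start, matched_text) for every non-overlapping match."""
    return [(m.start(), m.group()) for m in pattern.finditer(sequence.lower())]


if __name__ == "__main__":
    # The two sequences from the text match; a third "s" breaks the match.
    for seq in ("caiessk", "waiesk", "caiesssk"):
        print(seq, bool(EXAMPLE_PATTERN.fullmatch(seq)))

    # Hypothetical toy sequence, not a real peptide.
    print(find_motifs(HAIRPININ_PATTERN, "mkcaaacggggcdddcee"))
```

a pattern of this kind only reports where the spacing of conserved residues is satisfied; it carries no score, which is one reason profile-based methods such as hmm are often used alongside it.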
in this sense, the new challenge lies in the construction of new prediction models capable of discovering peptides with desired activities. the apd db has established a prediction interface based on some parameters defined by the entire set of peptides available in this database. these values are calculated from natural amps to consider features like length, net charge, hydrophobicity, amino acid composition, and so on. taking the net charge as an example, the amps deposited in the apd range from -12 to +30. this is the first parameter incorporated into the prediction algorithm. however, most amps have a net charge ranging from -5 to +10, which then becomes the alternative prediction condition. the same method is applied to the remaining parameters. the prediction in apd is performed in 3 main steps. first, the sequence parameters are calculated and compared. second, if defined as an amp, the peptide can be classified into 3 groups: (1) rich in given amino acids, (2) stabilized by disulfide bridges, and (3) linear. finally, sequence alignments are conducted to find the 5 peptides of highest similarity. 198, 216, 217 the advent of machine learning (ml) methods has promoted new possibilities for drug discovery. in ml inferences, both a positive and a negative dataset are usually required to train the predictive models. the positive data, in this case, are preferably experimentally validated amps that can be collected in databases, whereas negative data are randomly selected protein sequences that do not have amp characteristics. 197, 218 machine learning methods based on support vector machine (svm), random forest (rf), and neural networks (nn) have been the most widely used. svm is a supervised ml method that aims to classify data points by maximizing the margin between classes in a high-dimensional space.
219, 220 random forest is a non-parametric tree-based approach that combines the ideas of adaptive neighbors with bagging for efficient adaptive data inference. a neural network is an information processing paradigm inspired by how a biological nervous system processes information. it is composed of highly interconnected processing elements (neurons or nodes) working together to solve specific problems. [221] [222] [223]

evaluating proteomic data

regarding the use of amps in peptide therapeutics, as an alternative to antimicrobial treatment, new efficient and specific antimicrobials are demanded. as aforementioned, amps are naturally occurring across all classes of life, presenting high potential as therapeutic agents against various kinds of bacteria. 224 the identification of novel amps in databases is primarily dependent on knowing about specific amps together with a sufficient sequence similarity. 225 however, orthologs may be divergent in sequence, mainly because they are under strong positive selection for variation in many taxa, 226 leading to remarkably lower similarity, even in closely related species. in this scenario, where alignment tools are of limited use, one strategy to identify amps relies on proteomic approaches. proteins and peptides are biomolecules responsible for various biochemical events in living organisms, from formation and composition to regulation and functioning. thus, understanding the expression, function, and regulation of the proteins encoded by an organism is fundamental, leading to the so-called "proteomic era." the term "proteome" was first used by marc wilkins in 1994 and it represents the set of proteins encoded by the genome of a biological system (cell, tissue, organ, biological fluid, or organism) at a specific time under certain conditions. 227 protein extraction, purification, and identification methods have significantly advanced our capacity to elucidate many biological questions using proteomic approaches.
228, 229 the wide diversity of proteomic analysis methods makes the choice of the correct approach dependent on the type of material and compounds to be analyzed. 213, 230 two main tools are used to isolate proteins: (1) 2-dimensional electrophoresis (2-de) associated with mass spectrometry (ms) and (2) liquid chromatography associated with ms, each one with its own limitations. [230] [231] [232] obtaining native proteins is a challenge in proteomics or peptidomics, due to the high protein complexity in samples and the occurrence of post-translational modifications. alternative strategies applied to extraction, purification, biochemical, and functional analyses of these molecules have been proposed, favoring access to structural and functional information of hard-to-reach proteins and peptides. 233 based on 2d gel, al akeel et al 234 evaluated 14 spots obtained from seeds of foeniculum vulgare (apiaceae) aiming at proteomic analyses and isolation of small peptides. extracted proteins were subjected to 3 kda dialysis, and separation was carried out by deae-ion exchange chromatography, while further proteins were identified by 2d gel electrophoresis. one of the spots showed high antibacterial activity against pseudomonas aeruginosa, pointing to promising antibacterial effects, but requiring further research to authenticate the role of the anticipated proteins. for amps, 2-de is challenging due to the low concentration of the peptide molecules captured by this approach, their small sizes, and their ionic features (strongly cationic). in addition, the limited number of available specific databases and the high variability of amps make their identification through proteolysis techniques and mass spectrometry, such as matrix-assisted laser desorption/ionization (maldi-ms), difficult. moreover, the partial hydrophobicity and surface charges facilitate molecular associations between peptides, making analysis difficult by any known proteomic approach.
232 in addition, peptides are most often cleaved from larger precursors by various releasing or processing enzymes. 235 furthermore, the profiles generated do not represent the integral proteome, as 2-de has limitations in detecting proteins with low concentration, extreme molecular masses or pis, and hydrophobic proteins, including those of membranes. 236 due to these limitations, multidimensional liquid chromatography-high-performance liquid chromatography (mdlc-hplc) has been successfully employed as an alternative to 2d gels. newly developed techniques and equipment for the separation and detection of proteins and peptides, such as nano-hplc and multidimensional hplc, have improved proteomics evaluation. 237 the molecular mass values obtained are used in computational searches in which they are compared with in silico digestion results of proteins in databases. in silico digestion, usually simulating the action of trypsin as a proteolytic agent, generates a set of peptides whose theoretical masses are compared with those determined by ms. 238, 239 these methodologies are widely adopted for large-scale identification of peptides from ms/ms spectra. 240 theoretical spectra are generated using fragmentation patterns known for specific series of amino acids. the first 2 widely used search engines in database searching were sequest 241 and mascot (matrix science, boston, ma; www.matrixscience.com). 242 they rank peptide matches based on a cross-correlation to match the hypothetical spectra to the experimental one. mascot is widely used for peptidomics and proteomics analysis, including amp identification in many organisms, or to evaluate the antibacterial efficacy of new amps. evaluating a new amp against multidrug-resistant (mdr) salmonella enterica, tsai et al 243 used 2d gel electrophoresis and liquid chromatography-electrospray ionization-quadrupole-time-of-flight tandem ms to determine the protein profiles.
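the in silico digestion step mentioned above can be illustrated with a minimal python sketch. it assumes the commonly used trypsin rule (cleavage after k or r unless the next residue is p), standard monoisotopic residue masses, and a hypothetical toy sequence; the function names tryptic_digest and peptide_mass are illustrative and do not correspond to mascot, sequest, or any other search engine.

```python
# Monoisotopic residue masses in Da (standard values).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # one water is added per free peptide


def tryptic_digest(sequence, missed_cleavages=0):
    """Cut after K or R unless followed by P; allow up to N missed cleavages."""
    seq = sequence.upper()
    cuts = [0]
    for i in range(len(seq) - 1):
        if seq[i] in "KR" and seq[i + 1] != "P":
            cuts.append(i + 1)
    cuts.append(len(seq))
    peptides = []
    for i in range(len(cuts) - 1):
        last = min(i + 1 + missed_cleavages, len(cuts) - 1)
        for j in range(i + 1, last + 1):
            peptides.append(seq[cuts[i]:cuts[j]])
    return peptides


def peptide_mass(peptide):
    """Monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


if __name__ == "__main__":
    toy = "AAKRPGGKCC"  # hypothetical sequence, not a real protein
    for pep in tryptic_digest(toy, missed_cleavages=1):
        print(pep, round(peptide_mass(pep), 4))
```

the sketch ignores fixed and variable modifications (for example, carbamidomethylation of cysteine or oxidation of methionine), which real search engines add to the theoretical masses before comparing them with the measured ones.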
the protein identification was performed using mascot with trypsin as the cutting enzyme, whereas ncbi nr protein was set as the reference database. the methodology used in this study indicated that the novel amp might serve as a potential candidate for drug development against mdr strains, confirming the usability of mascot. in a similar way, umadevi et al 244 described the amp profile of black pepper (piper nigrum l.) and its expression upon phytophthora infection using a label-free quantitative proteomics strategy. for protein/peptide identification, ms/ms data were searched against the apd database 245 using an in-house mascot server, set to full tryptic peptides with a maximum of 3 missed cleavage sites, with carbamidomethyl on cysteine and oxidized methionine included as variable modifications. the apd database was used for amp signature identification, 245 together with phytamp 206 and campr3. 197 to enrich the characterization parameters, isoelectric point, aliphatic index, and grand average of hydropathy (gravy) were also used 246 (using the protparam tool), besides the net charge from the phytamp database. based on the label-free proteomics strategy, they established for the first time the black pepper peptidomics associated with the innate immunity against phytophthora, evidencing the usability of proteomics/peptidomics data for amp characterization in any taxa, including plant amps, aiming at the exploitation of these peptides as next-generation molecules against pathogens. 244 other tools use database searching algorithms, such as x!tandem, 247 open mass spectrometry search algorithm (omssa), 248 probid, 249 radars, 250 and so on. these search engines are based on database search but use different scoring schemes to determine the top hit for a peptide match. general information on database search engines, their algorithms, and scoring schemes was reviewed by nesvizhskii et al.
251 despite its efficient ability to identify peptides, database searching presents several drawbacks, like false positive identifications due to overly noisy spectra and lower quality peptide scores (related to the short size of peptides). thus, the identification is strongly influenced by the amount of protein in the sample, the degree of post-translational modification, the quality of automatic searches, and the presence of the protein in the databases. 252, 253 in this scenario, the knowledge about the genome from a specific organism is important to allow the identification of the exact pattern of a given peptide. if an organism has no sequenced genome, it is not searchable using these methods. 235, 240 once the sequences are obtained, bioinformatic tools can be used to predict peptide structures and estimate bioactive peptides. 254 more recently, an interactive and free web software platform, mixprotool, was developed, aiming to process multigroup proteomics data sets. this tool is implemented in r (www.r-project.org), providing an integrated data analysis workflow for quality control assessment, statistics, gene ontology enrichment, and other facilities. the mixprotool is compatible with identification and quantification outputs from other programs, such as maxquant and mascot, and results may be visualized as vector graphs and tables for further analysis, in contrast to existing software, such as giapronto. 255 according to the authors, the web tool can be conveniently operated, even by users without bioinformatics expertise, and it is beneficial for mining the most relevant features among different samples. 24 the central tenet of structural biology is that structure determines function. for proteins, it is often said that "function follows form" and "form defines function." therefore, to understand protein function in detail at the molecular level, it is mandatory to know its tertiary structure.
256 experimental techniques for determining structures, such as x-ray crystallography, nmr, electron paramagnetic resonance, and electron microscopy, require significant effort and investments. 257 all methods mentioned have their own limitations, and the gap between the number of known proteins and the number of known structures is still substantial. thus, there is a need for computational methods to predict protein structures based on knowledge of the sequence. 256 in addition, in recent years, there has been impressive progress in the development of algorithms for protein folding that may aid in the prediction of protein structures from amino acid sequence information. 258 historically, the prediction of a protein structure has been classified into 3 categories: comparative modeling, threading, and ab initio. the first 2 approaches construct protein models by aligning the query sequences with already solved template structures. if suitable templates are absent from the protein data bank (pdb), the models must be constructed from scratch, that is, by ab initio modeling, considered the most challenging way to predict protein structures. 256 in the case of comparative modeling methods, when a target sequence is submitted, the programs identify evolutionarily related templates among solved structures based on sequence or profile comparison, thus constructing structure models supported by these previously resolved structures. 259 this approach comprises 4 main steps: (1) fold assignment, which identifies similarity between the target and the structure of the solved template; (2) alignment of the target sequence to the template; (3) generation of a model based on the alignment with the chosen template; and (4) analysis of errors considering the generated model. 260 there are several servers and programs that automate the comparative modeling process, with swiss-model and modeller being the most used.
261, 262 although automation makes comparative modeling accessible to experts and beginners alike, some adjustments are still needed in most cases to maximize model accuracy, especially for more complex proteins. 262 therefore, some caution must be taken regarding the generated models, considering the resolution and quality of the template used, as well as the homology between the template and the protein of interest.

threading modeling methods are based on the observation that known protein structures appear to comprise a limited set of stable folds, and that similar structural elements are often found in evolutionarily distant or unrelated proteins. the most used servers based on this approach are muster, 263 sparks-x, 264 raptorx, 259 prosa-web, 265 and most notably i-tasser. 266 in some cases, incorporating structural information when matching the query sequence with possible templates allows the detection of fold similarity, even in the absence of an explicit evolutionary relationship.

the prediction of structures from known protein templates is, at first sight, a more straightforward task than the prediction of protein structures from sequence alone. when no solved template is available, another approach is recommended, namely, ab initio modeling. this method is intended to predict the structure only from the sequence information, without any direct assistance from previously known structures. ab initio modeling aims to predict the best model, taken as the minimum of a potential energy function, by sampling the potential energy surface using various search procedures. 267, 268 such approaches make it challenging to produce high-resolution models, which are essential for determining the native protein fold and for its biochemical interpretation.
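the idea of sampling a potential energy surface in search of a minimum can be illustrated with a minimal sketch. the example below is not the procedure used by rosetta, quark, or any other program cited here; it assumes a hypothetical toy energy function over a handful of torsion-like angles and applies a plain metropolis monte carlo search, only to show how random moves are accepted or rejected according to their energy change. all function names and parameter values are illustrative assumptions.

```python
import math
import random


def toy_energy(angles):
    """Hypothetical rugged energy over torsion-like angles (radians).

    This is an illustrative stand-in, not a real force field. The global
    minimum (energy 0) is at all angles equal to 0; the cos(3a) term adds
    local minima that can trap a naive downhill search.
    """
    return sum((1.0 - math.cos(a)) + (1.0 - math.cos(3.0 * a)) for a in angles)


def metropolis_search(n_angles=10, steps=20000, temperature=0.5,
                      step_size=0.3, seed=0):
    """Sample the toy energy surface with a Metropolis Monte Carlo walk.

    Returns the lowest-energy conformation visited and its energy.
    """
    rng = random.Random(seed)
    angles = [rng.uniform(-math.pi, math.pi) for _ in range(n_angles)]
    energy = toy_energy(angles)
    best_angles, best_energy = list(angles), energy

    for _ in range(steps):
        # Propose a small random change to one angle.
        i = rng.randrange(n_angles)
        old_value = angles[i]
        angles[i] = old_value + rng.gauss(0.0, step_size)
        new_energy = toy_energy(angles)
        delta = new_energy - energy

        # Metropolis criterion: always accept downhill moves, accept
        # uphill moves with probability exp(-delta / temperature).
        if delta <= 0.0 or rng.random() < math.exp(-delta / temperature):
            energy = new_energy
            if energy < best_energy:
                best_angles, best_energy = list(angles), energy
        else:
            angles[i] = old_value  # reject the move

    return best_angles, best_energy


if __name__ == "__main__":
    _, lowest = metropolis_search()
    print(f"lowest energy found: {lowest:.4f}")
```

accepting some uphill moves is what lets the walk escape local minima; with real proteins the number of degrees of freedom and the cost of each energy evaluation are far larger, which is one reason high-resolution ab initio models are hard to obtain.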
nevertheless, comparisons of later-solved structures with previously predicted models point to more successful modeling by ab initio methods than by classical or pure energy minimization methods. 256 among the most used servers and programs for ab initio modeling, we highlight rosetta, 257 quark, 269 and touchstone ii. 267 the accuracy of the models calculated by many of these methods is evaluated by cameo (continuous automated model evaluation) 270 and by casp (critical assessment of protein structure prediction). 258 probably the first reasonably accurate ab initio model was built in casp4. since then, sustained progress has been achieved in ab initio prediction, but mainly for small proteins (120 residues or less). in casp11, for the first time, a novel 256-residue protein with less than 5% sequence identity to known structures was modeled with a precision that is high for sequences of this size. 271 in casp12, a significant improvement was reported in 4 areas: contact prediction, free modeling, template-based modeling, and estimation of model accuracy. the authors report that this improvement is due to the accuracy of modeling and alignment methods, as well as increased data availability for both sequence and structure. 258 due to the number of amps deposited in the pdb (to date approximately 1099 structures), comparative modeling is the most used approach. however, when it comes to de novo peptide design, the most recommended choice would be ab initio modeling 272 or a hybrid approach that uses more than 1 modeling method. 273

after the generation of a model, the amp stability should be evaluated using molecular dynamics (md). molecular dynamics comprises computational simulations that predict the changes in the positions and velocities of the constituent atoms of a system over a given time and under given conditions.
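how positions and velocities are propagated in time can be sketched with a velocity verlet integrator, a scheme commonly used in md codes. the example below is a minimal sketch under strong assumptions: a single particle in one dimension, held by a hypothetical harmonic "bond" term with made-up parameters in arbitrary units. it is not taken from any of the packages mentioned in this review.

```python
def harmonic_force(x, k=100.0, x0=1.0):
    """Force from a harmonic 'bond' term, F = -k (x - x0).

    k and x0 are hypothetical parameters in arbitrary units.
    """
    return -k * (x - x0)


def total_energy(x, v, mass=1.0, k=100.0, x0=1.0):
    """Kinetic plus potential energy for the harmonic toy system."""
    return 0.5 * mass * v * v + 0.5 * k * (x - x0) ** 2


def velocity_verlet(x, v, mass=1.0, dt=0.001, steps=1000, force=harmonic_force):
    """Propagate position x and velocity v with the velocity Verlet scheme.

    Returns the trajectory as a list of (position, velocity) tuples,
    including the initial state.
    """
    trajectory = [(x, v)]
    f = force(x)
    for _ in range(steps):
        v_half = v + 0.5 * dt * f / mass   # half-step velocity update
        x = x + dt * v_half                # full-step position update
        f = force(x)                       # force at the new position
        v = v_half + 0.5 * dt * f / mass   # second half-step velocity update
        trajectory.append((x, v))
    return trajectory


if __name__ == "__main__":
    traj = velocity_verlet(x=1.1, v=0.0)
    e_start = total_energy(*traj[0])
    e_end = total_energy(*traj[-1])
    print(f"energy at start: {e_start:.6f}, at end: {e_end:.6f}")
```

for a small enough time step the total energy stays nearly constant, which is a common sanity check for an integrator. production md engines apply the same kind of update to every atom in three dimensions, with forces derived from the full set of interaction terms.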
in md, this calculation is done through a classical approximation based on empirical parameters, called a "force field." 274 while this approximation makes the dynamics of a system containing thousands of atoms numerically accessible, it obviously limits the nature of the processes that can be observed during the simulations. no quantum effect is captured in an md simulation: no chemical bond is broken, and no orbital interactions, resonance, polarization, or charge transfer effects occur. 275 however, the molecules are treated as more than a static system. thus, md is a computational technique that can be used for predicting or refining structures, studying the dynamics of molecular complexes, drug development, and investigating the action of molecular biological systems. 276 molecular dynamics simulation is widely used in protein research, aiming to extract information about the physical properties of individual proteins. the results of such simulations are then compared with experimental results. as these experiments are generally carried out in solvents, it is necessary to simulate molecular systems of protein in water. these simulations have a variety of applications, such as determining the folding of a structure to a native structure and analyzing the dynamic stability of this structure. 277 the use of md to simulate protein folding processes is one of the most challenging applications, and the simulations should be relatively long (on the order of microseconds to milliseconds) to allow a single folding event to be observed. in addition, the force field used must correctly describe the relative energies of a wide variety of conformations, including unfolded and poorly folded conformations that may occur during the simulation. 275 the considerable application potential led to the implementation of md simulation in many software packages, including gromacs, 278-280 amber, 281 namd, 282 charmm, 283 lammps, 284 and desmond.
285 in addition to those mentioned above, there are other simulation types available, such as the monte carlo method, stochastic dynamics, and brownian dynamics. 280 in the last decades, md simulation has become a standard tool in theoretical studies of large biomolecular systems, including dna or proteins, in environments with near realistic solvents. indeed, simulations have proven valuable in deciphering functional mechanisms of proteins and other biomolecules, in uncovering the structural basis for disease, and in the design and optimization of small molecules, peptides, and proteins. 286 historically, the computational cost of this type of calculation has been extremely high, and much research has focused on algorithms to achieve single simulations that are as long or as large as possible. 278

the interplay between a given pathogen (eg, virus, bacteria, fungus) and its host must be studied through a holistic approach. host-pathogen relationships are very complex and occur at diverse conceivable levels, including the cellular/molecular level of both pathogen and host, under given environmental conditions. the closest possible understanding of these interactions at every level is the ultimate goal of "systems biology" (sb). it comprises a holistic approach, integrating distinct disciplines, such as biology, computer science, engineering, bioinformatics, physics, and others, to predict how a given system behaves under given conditions and what the role of its parts is. systems biology stands out because it is capable of correlating omics data for the understanding of plant-pathogen interaction. the construction of a plant-pathogen interaction network includes the reconstruction of metabolic pathways of these organisms and the identification of the degree of pathogenicity, besides the expression of genes and proteins from both plant and pathogen.
the networks can be classified into 5 types: (1) regulatory; (2) metabolic; (3) protein-protein interaction; (4) signaling and regulatory; and (5) signaling, regulatory, and metabolic. 287 each of these networks can be plotted according to computational approaches. also, further studies are required to contemplate the construction of evolutionary in silico models and the characterization of these molecular targets in vitro. 288, 289 studies of protein-protein interactions are essential to understand the regulatory process, 290 and new computational methods with more optimized algorithms are necessary for this purpose, also to remove potential false positives. thus, in-depth studies on the orientation of molecules and their linkages in the formation of a stable complex are of great importance for understanding plant-pathogen interactions and also for developing new drugs. 291

the understanding of the regulatory principles by which protein receptors recognize, interact, and associate with molecular substrates or inhibitors is of paramount importance to generate new therapeutic strategies. 292 in modern drug discovery, docking plays an important role in predicting the orientation of the ligand when it is bound to a protein receptor or enzyme, using shape and electrostatic interactions, van der waals, coulombic, and hydrogen bond terms as parameters to quantify or predict a given interaction. 293, 294 molecular docking aims at exploring the predominant mode(s) of binding of a molecule (protein or ligand) when it binds to a protein with a known 3d structure, based on a scoring function that serves 3 main purposes: the first is to determine the binding mode and the binding site of a protein, the second is to predict the absolute binding affinity between protein and ligand (or other protein) in lead optimization, and the third is virtual screening, which can identify potential drug leads for a given protein target by searching a large ligand or protein database.
295 protein-protein interactions are essential for cellular and immune function. in many cases, due to the absence of an experimentally determined structure of the complex, these interactions must be modeled to obtain an understanding of their structure and molecular basis. 296 few studies on plant-pathogen interactions include docking approaches, and most studies focus on drug development for medical purposes. structure-based drug research is a powerful technique for the rapid identification of small molecules against the 3d structure of available macromolecular targets, usually obtained by x-ray crystallography, nmr, or homology modeling. due to abundant information on protein sequences and structures, structural information on specific proteins and their interactions has become crucial for current pharmacological research. 297 a variety of algorithms have been developed for docking over the past 2 decades, even for cases in which the binding site is unknown and backbone movements are limited. although zdock, 296 rdock, 298 and hex 299 have provided results with high docking precision, the complexes provided are not very useful for designing inhibitors for protein interfaces due to the constraints of rigid body docking. 294 in this context, more flexible approaches have been developed, which generally examine very limited conformations compared with rigid body methods. these docking methods predict that binding is more likely to occur in broad surface regions and then define the high-affinity sites in complex structures. 300 the best example is the haddock software, 297 which has been successful in solving a large number of precise models for protein-protein complexes. a good example of its use is the study of the complex formed between plectasin, a member of the innate immune system, and lipid ii, a precursor of the bacterial cell wall.
the study identified the residues involved in the binding site between the 2 molecules, providing valuable information for planning new antibiotics. 301 however, the absolute energies associated with intermolecular interactions are not estimated with satisfactory accuracy by the current algorithms. some significant issues, such as solvent effects, entropic effects, and receptor flexibility, still need to be addressed. even so, some methods that deal with side-chain flexibility, such as moe-dock, 302 gold, 303 glide, 304 flexx, 305 and surflex, 306 have proven to be effective and adequate in most cases. realistic interactions between small molecules and receptors still depend on experimental wet-lab validation. 294, 307 despite the current difficulties, there is a growing interest in the mechanisms and prediction of the binding of small molecules such as peptides, as they bind to proteins in a highly selective and conserved manner, making them promising as new medicinal and biological agents. 308 while both "small molecule docking methods" and "custom protocols" can be used, short peptides are challenging targets because of their high torsional flexibility. 307 protein-peptide docking is generally more challenging than docking of other small molecules, and a variety of methods have been applied so far. however, few of these approaches have been published in a way that can be reproduced with ease. 309-311 although peptide docking is difficult to use, a recent focus of basic and pharmacological research has been the use of computational tools with modified peptides to predict the selective disruption of protein-protein interactions. these studies are based on the involvement of some critical amino acid residues that contribute most to the binding affinity of a given interaction, also called hot-spots. 312, 313 despite the number of docking programs, existing algorithms still demand improvements.
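the role of a scoring function in ranking binding poses can be sketched as follows. the example assumes a deliberately simplified pairwise score made of a lennard-jones (van der waals) term and a coulombic term, with one set of hypothetical parameters for all atoms; it is not the scoring function of any docking program named in this review, and the coordinates and charges are invented for illustration.

```python
import math

# Approximate Coulomb constant in kcal*angstrom/(mol*e^2), as commonly
# used with distances in angstroms and charges in elementary charge units.
COULOMB_CONSTANT = 332.06


def pair_energy(r, q1, q2, epsilon=0.2, sigma=3.5, dielectric=4.0):
    """Toy interaction energy of two atoms at distance r (angstroms).

    epsilon, sigma, and dielectric are hypothetical values shared by all
    atom pairs; real force fields use atom-type-specific parameters.
    """
    lennard_jones = 4.0 * epsilon * ((sigma / r) ** 12 - (sigma / r) ** 6)
    coulomb = COULOMB_CONSTANT * q1 * q2 / (dielectric * r)
    return lennard_jones + coulomb


def score_pose(receptor, ligand):
    """Sum pairwise energies between receptor and ligand atoms.

    Each atom is a tuple (x, y, z, charge). Lower scores are better.
    """
    total = 0.0
    for rx, ry, rz, rq in receptor:
        for lx, ly, lz, lq in ligand:
            r = math.dist((rx, ry, rz), (lx, ly, lz))
            total += pair_energy(r, rq, lq)
    return total


def best_pose(receptor, poses):
    """Return the ligand pose with the lowest (most favorable) score."""
    return min(poses, key=lambda pose: score_pose(receptor, pose))


if __name__ == "__main__":
    receptor = [(0.0, 0.0, 0.0, -0.5)]
    poses = [
        [(2.0, 0.0, 0.0, 0.5)],   # too close: steric clash
        [(4.0, 0.0, 0.0, 0.5)],   # near-contact distance
        [(10.0, 0.0, 0.0, 0.5)],  # far away: weak attraction
    ]
    for pose in poses:
        print(pose[0][0], round(score_pose(receptor, pose), 3))
```

in this toy case the pose at contact distance scores best, while the clashing pose is heavily penalized by the repulsive term. a score of this kind ignores solvent effects, entropic effects, and receptor flexibility, which are the limitations discussed above.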
approaches are being developed to address issues related to scoring, protein flexibility, and interaction with explicit water, among others. 314 in this context, capri (critical assessment of predicted interactions) is a community-wide initiative that provides a quality assessment of different docking approaches. it started in 2001 and since then has aided the development and improvement of the methodologies applied for docking. 315 an evaluation carried out for capri in 2016 reported improvement in the integration of different modeling tools with docking procedures, as well as the use of more sophisticated evolutionary information to rank models. however, adequate modeling of conformational flexibility in interacting proteins remains an essential demand with a crucial need for improvement. 314 different docking programs are currently available, 294 and new alternatives continue to appear. some of these alternatives will disappear, just as others will become the top choices among users in the field.

the molecular docking technique is not often used for amps, due to their standard mechanism of action based on the classical association with the external membrane of the pathogen. despite that, some amps have the ability to bind other proteins and/or enzymes, a feature still scarcely studied. in such cases, molecular docking can be useful. a successful example is the study performed by melo et al, 47 who showed the specific binding of a trypsin to a cowpea (vigna unguiculata) thionin, revealing that this interaction occurs in a canonical manner with lys11, located in an extended exposed loop. therefore, further application of docking may bring new evidence about the antimicrobial mechanisms, revealing other molecular targets of interest.
it is clear that the combination of data bank information with bioinformatic tools (especially those allowing the identification of patterns, rather than sequence order) is able to revolutionize the identification of amps and the prediction of their activity. the data may come from genomic, transcriptomic, or proteomic databases, or a combination of different information sources (eg, genomics and transcriptomics, transcriptomics and proteomics). supplementary figure s9 brings a schematic flowchart describing the steps for mining, annotation, and structural/functional analysis of amps, in addition to some wet-lab analyses that can be integrated to assess/confirm candidate amps. similar bioinformatic approaches have actually been used to identify potential peptide candidates with anti-sars-cov-2 activity, especially those potentially able to interact with the spike protein and proteases involved in viral penetration. 316, 317

as emphasized, plant amps show greater diversity and abundance when compared with those of other kingdoms. it can be speculated that plants shelter many yet undescribed amp classes, given their vast abundance and isoform diversity. the genomic and peptidic structure of amps can be variable, with few key residues conserved, which makes their identification, classification, and comparison challenging even in the omics age. nevertheless, advances in the generation of new bioinformatics tools and specialized databases have led to new and more efficient approaches for both the identification of primary sequences and molecular modeling, besides the analysis of the stability of the generated models. despite the large availability of omics data and bioinformatics tools, most new plant peptides have been discovered by wet-lab approaches focused on single candidates. high throughput in silico methods have the potential to transform this scenario, revealing many new candidates, including some new or "non-canonical" peptides.
it may also be speculated that a myriad of new peptides may exist considering even smaller peptides, which are still less studied and more difficult to identify. finally, in silico approaches will in future studies be mandatory to define the design of wet-lab studies, making identification more efficient and requiring less time to track, identify, and confirm new candidate amps. considering the current pandemic scenario of covid-19, plant amps may be regarded as an important source of antiviral drug candidates, especially considering that some amp categories present not only antiviral effects but also wide-spectrum antimicrobial activity, act as anti-inflammatory agents, and also induce the immune response.

of higher education personnel, biocomputational program), cnpq (brazilian national council for scientific and technological development), and facepe (fundação de amparo à ciência e tecnologia de pernambuco) for fellowships. the project is supported by the interreg italia-slovenia, ise-emh 07/2019 and rc 03/20 from irccs burlo garofolo/italian ministry of health.

cass performed the literature review and wrote the manuscript. lz, mol, lmbv, jpbn, jrfn, jdcf, rlos, cjp, ffa, and mfo wrote specific chapters. eak and sc critically revised the text and included relevant suggestions. ambi conceived the review, wrote the introduction and concluding remarks, besides critically revising the manuscript. all authors have read the manuscript and agree to its content.

supplemental material for this article is available online.
prediction of protein function from protein sequence and structure plant peptides in defense and signaling plant bioactive peptides: an expanding class of signaling molecules candidalysin is a fungal peptide toxin critical for mucosal infection copsin, a novel peptide-based fungal antibiotic interfering with the peptidoglycan synthesis innate and specific gut-associated immunity and microbial interference the wide world of ribosomally encoded bacterial peptides oral saliva and covid-19 human β-defensin 2 mediated immune modulation as treatment for experimental colitis plant peptides and peptidomics nucleic acids and proteins in plants i the plant peptidome: an expanding repertoire of structural features and biological functions a small peptide modulates stomatal control via abscisic acid in long-distance signalling interaction of pls and pin and hormonal crosstalk in arabidopsis root development peptide signals for plant defense display a more universal role protease inhibitors in plants: genes for improving defenses against insects and pathogens long distance run in the wound response-jasmonic acid is pulling ahead rodríguez-palenzuéla p. 
plant defense peptides overview on plant antimicrobial peptides ethnobotanical bioprospection of candidates for potential antimicrobial drugs from brazilian plants: state of art and perspectives conopeptide characterization and classifications: an analysis using conoserver adaptive hydrophobic and hydrophilic interactions of mussel foot proteins with organic thin films cathelicidins, multifunctional peptides of the innate immunity antimicrobial peptides: discovery, design and novel therapeutic strategies scop: a structural classification of proteins database for the investigation of sequences and structures antimicrobial peptides from plants cyclotides insert into lipid bilayers to form membrane pores and destabilize the membrane through hydrophobic and phosphoethanolamine-specific interactions analysis of two novel classes of plant antifungal proteins from radish (raphanus sativus l.) seeds antifungal plant defensins: mechanisms of action and production h-nmr studies on the structure of a new thionin from barley endosperm: structure of a new thionin antimicrobial peptides from plants arabidopsis thionin-like genes are involved in resistance against the beet-cyst nematode (heterodera schachtii) host defense peptides and their potential as therapeutic agents plant thionins-the structural perspective plant antimicrobial peptides de smet i. plant peptides-taking them to the next level antimicrobial peptides from plants and their mode of action the inhibitory effect of a protamine from wheat flour on the fermentation of wheat mashes characterization and analysis of thionin genes thionin genes specifically expressed in barley leaves antimicrobial peptides as effective tools for enhanced disease resistance in plants identification of a cowpea γ-thionin with bactericidal activity novel thionins from black seed (nigella sativa l.) 
demonstrate antimicrobial activity synthetic and structural studies on pyrularia pubera thionin: a single-residue mutation enhances activity against gram-negative bacteria antimicrobial activity of γ-thionin-like soybean se60 in e. coli and tobacco plants inhibition of trypsin by cowpea thionin: characterization, molecular modeling, and docking toxicity of purothionin and its homologues to the tobacco hornworm, manduca sexta (l.) (lepidoptera:sphingidae) studies on purothionin by chemical modifications full-matrix refinement of the protein crambin at 0.83 å and 130 k γ-purothionins: amino acid sequence of two polypeptides of a new family of thionins from wheat endosperm primary structure and inhibition of protein synthesis in eukaryotic cell-free system of a novel thionin, gammahordothionin, from barley endosperm plant defensins: novel antimicrobial peptides as components of the host defense system the evolution, function and mechanisms of action for plant defensins plant γ-thionins: novel insights on the mechanism of action of a multi-functional class of defense proteins disulfide bridges in defensins comparative analysis of the antimicrobial activities of plant defensin-like and ultrashort peptides against food-spoiling bacteria isolation, purification, and characterization of a stable defensin-like antifungal peptide from trigonella foenum-graecum (fenugreek) seeds antimicrobial peptides: pore formers or metabolic inhibitors in bacteria? plant defensins-prospects for the biological functions and biotechnological properties defensins and paneth cells in inflammatory bowel disease plant defensins: types, mechanism of action and prospects of genetic engineering for enhanced disease resistance in plants benko-iseppon am, cecchetto g. 
gene isolation and structural characterization of a legume tree defensin with a broad spectrum of antimicrobial activity recent advances in the chemistry and biochemistry of plant lipids evolutionary history of the non-specific lipid transfer proteins lipid transfer proteins: classification, nomenclature, structure, and function lipid-transfer proteins in plants purification and characterization of a small (7.3 kda) putative lipid transfer protein from maize seeds surprisingly high stability of barley lipid transfer protein, ltp1, towards denaturant, heat and proteases structural stability and surface activity of sunflower 2s albumins and nonspecific lipid transfer protein involvement of gpi-anchored lipid transfer proteins in the development of seed coats and pollen in arabidopsis thaliana two-and three-dimensional proton nmr studies of a wheat phospholipid transfer protein: sequential resonance assignments and secondary structure three-dimensional structure in solution of a wheat lipid-transfer protein from multidimensional 1h-nmr data. 
a new folding for lipid carriers an unusual lectin from stinging nettle (urtica dioica) rhizomes hevein-like antimicrobial peptides of plants structural basis for chitin recognition by defense proteins: glcnac residues are bound in a multivalent fashion by extended binding sites in hevein domains ginkgotides: proline-rich hevein-like peptides from gymnosperm ginkgo biloba structure and function of chitin-binding proteins structural features of plant chitinases and chitin-binding proteins a novel antifungal peptide from leaves of the weed stellaria media l antimicrobial peptides from amaranthus caudatus seeds with sequence homology to the cysteine/glycinerich domain of chitin-binding proteins overview of plant chitinases identified as food allergens the latex-fruit syndrome the n-terminal cysteine-rich domain of tobacco class i chitinase is essential for chitin binding but not for catalytic or antifungal activity a chitin-binding lectin from stinging nettle rhizomes with antifungal properties biochemical and molecular characterization of three barley seed proteins with antifungal properties hevein: an antifungalprotein from rubber-tree (hevea brasiliensis) latex two hevein homologs isolated from the seed of pharbitis nil l. 
exhibit potent antifungal activity comparative proteomics of primary and secondary lutoids reveals that chitinase and glucanase play a crucial combined role in rubber particle aggregation in hevea brasiliensis the formation and accumulation of protein-networks by physical interactions in the rapid occlusion of laticifer cells in rubber tree undergoing successive mechanical wounding plant cystineknot peptides: pharmacological perspectives: plant cystine-knot proteins in pharmacology refined crystal structure of the potato inhibitor complex of carboxypeptidase a at 2.5 å resolution squash inhibitors: from structural motifs to macrocyclic knottins small signaling peptides in arabidopsis development: how cells communicate over a short distance use of scots pine seedling roots as an experimental model to investigate gene expression during interaction with the conifer pathogen heterobasidion annosum (p-type) tying the knot: the cystine signature and molecular-recognition processes of the vascular endothelial growth factor family of angiogenic cytokines a cactus-derived toxin-like cystine knot peptide with selective antimicrobial activity circular proteins from plants and fungi circulins a b. novel human immunodeficiency virus (hiv)-inhibitory macrocyclic peptides from the tropical tree chassalia parvifolia isolation, solution structure, and insecticidal activity of kalata b2, a circular protein with a twist: do möbius strips exist in nature? purification, characterisation and cdna cloning of an antimicrobial peptide from macadamia integrifolia miamp1, a novel protein from macadamia integrifolia adopts a greek key β-barrel fold unique amongst plant antimicrobial proteins peptides of the innate immune system of plants. part ii. 
the authors are very grateful to capes (coordination for the improvement

key: cord-334127-wjf8t8vp authors: brister, j. rodney; ako-adjei, danso; bao, yiming; blinkova, olga title: ncbi viral genomes resource date: 2015-01-28 journal: nucleic acids res doi: 10.1093/nar/gku1207 sha: doc_id: 334127 cord_uid: wjf8t8vp recent technological innovations have ignited an explosion in virus genome sequencing that promises to fundamentally alter our understanding of viral biology and profoundly impact public health policy. yet, any potential benefits from the billowing cloud of next generation sequence data hinge upon well implemented reference resources that facilitate the identification of sequences, aid in the assembly of sequence reads and provide reference annotation sources. the ncbi viral genomes resource is a reference resource designed to bring order to this sequence shockwave and improve usability of viral sequence data. the resource can be accessed at http://www.ncbi.nlm.nih.gov/genome/viruses/ and catalogs all publicly available virus genome sequences and curates reference genome sequences. as the number of genome sequences has grown, so too have the difficulties in annotating and maintaining reference sequences.
the rapid expansion of the viral sequence universe has forced a recalibration of the data model to better provide extant sequence representation and enhanced reference sequence products to serve the needs of the various viral communities. this, in turn, has placed increased emphasis on leveraging the knowledge of individual scientific communities to identify important viral sequences and develop well annotated reference virus genome sets.

recent outbreaks of ebolavirus (1, 2) and middle east respiratory syndrome coronavirus (mers-cov) (3, 4) clearly demonstrate the power of sequence analysis in viral surveillance, host reservoir identification and public health policy debate. as these viruses have filled media headlines, their genome sequences have spilled into international public databases. such real time analysis promises to fundamentally alter our understanding of viral biology and significantly impact public health responses to viral disease, but it also places renewed emphasis on the public research infrastructure that is necessary to support the storage and analysis of sequence data. this infrastructure includes the primary databases that together comprise the international nucleotide sequence database collaboration (insdc) (5), namely genbank (6), the european molecular biology laboratory's european bioinformatics institute (embl-ebi) (7) and the dna data bank of japan (ddbj) (8), as well as reference databases like the viralzone resource at the swiss institute of bioinformatics (http://viralzone.expasy.org) (9) and the viral genomes resource at the national center for biotechnology information (ncbi) (http://www.ncbi.nlm.nih.gov/genome/viruses/) (10).
whereas primary databases are archival repositories of sequence data, reference databases provide curated datasets that enable a number of activities, among them the transfer of annotation to related genomes (11-13), sequence assembly and virus discovery (14-17), the study of viral dynamics and evolution (18-20) and pathogen detection (14, 21-23). the ncbi viral genomes project was established in response to the growing need for a public, virus-specific, reference sequence resource (24). the project catalogs all complete viral genomes deposited in insdc databases and creates so-called refseq records for each viral species. each refseq is derived from an insdc sequence record, but may include additional annotation and/or other information. accessions for refseq genome records include the prefix 'nc_', allowing them to be easily differentiated from insdc records. for example, the refseq genome record for enterobacteria phage t4 has the accession nc_000866 but was derived from the insdc record af158101. typically, the first genome submitted for a particular species is selected as a refseq, and once a refseq is created, other validated genomes for that species are indexed as 'genome neighbors'. as such, the viral refseq data model is taxonomy centric, or more specifically, species centric, and all refseq records and genome neighbors are indexed at the species level. this model requires both the demarcation of individual viral species and the grouping of genome sequences into defined species.

table 1
virus genome type | refseq genome segments | total genome segments | total insdc sequences
dsdna viruses, no rna stage | 1755 | 3023 | 115 911
dsrna viruses | 919 | 17 929 | 56 699
ssdna viruses | 669 | 6692 | 40 337
ssrna negative-strand viruses | 187 | 4384 | 478 791
ssrna positive-strand viruses, no dna stage | 917 | 14 441 | 415 664
retro-transcribing viruses | 123 | 8614 | 727 762
note: the table does not include influenza virus sequences. these sequences are stored in a specialized database (11, 25).

there are now 71 628 validated viral and viroid genome segments deposited within insdc databases, not including influenza sequences, which are stored in a specialized database (11, 25). this figure represents a nearly 9-fold increase since 2000 (figure 1), and this rise reflects both steady increases in the number of novel viruses sequenced, as measured by the number of refseq genome segments, and a large increase in the number of genome neighbors, i.e. genome sequences belonging to viral species already represented by a refseq (figure 1). as shown in table 1, refseq genome segments are distributed among all viruses, but genome neighbor segments are concentrated among the smaller ssdna, rna and retro-transcribing viruses. although many of these neighbor genomes are concentrated among human pathogens, there are also several viruses of agricultural importance with high numbers of sequenced genomes (table 2). while most of the viruses in table 2 are well studied in the laboratory, many other sequenced viruses are not.

the refseq data model for most organisms underscores the importance of very well annotated reference sequence records (26). unfortunately, only a minority of viral systems are experimentally well defined, so there is often little primary data on which to base genome annotations. in some cases, sequence homologies allow the transfer of annotation from experimentally defined to poorly characterized genomes (11-13). yet, genomes are often annotated by purely ab initio processes (27-29). given the difficulty of representing viral genome sequences solely with well annotated records, the viral refseq model has evolved into a more flexible approach that includes both reference and representative sequences. reference refseq records provide sources of well annotated sequence features, whereas representative records provide coverage of extant sequence variation.
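the species-centric model described above can be illustrated with a small sketch. it is not the ncbi implementation: the class names, the `role` field and the placeholder neighbor accession `XX000001` are hypothetical and serve only to show how refseq records, their source insdc records and genome neighbors might be filed under a single species, with the 'nc_' prefix used to tell refseq genome accessions apart from insdc accessions.

```python
from dataclasses import dataclass, field
from typing import Dict, List


def is_refseq_genome(accession: str) -> bool:
    """True when an accession carries the RefSeq genome prefix 'NC_'."""
    return accession.upper().startswith("NC_")


@dataclass
class RefSeqRecord:
    accession: str      # RefSeq accession, e.g. "NC_000866"
    derived_from: str   # INSDC record the RefSeq was derived from
    role: str           # "reference" or "representative" (hypothetical field)


@dataclass
class SpeciesEntry:
    name: str
    refseqs: List[RefSeqRecord] = field(default_factory=list)
    neighbors: List[str] = field(default_factory=list)  # INSDC accessions


class SpeciesIndex:
    """Species-centric index: every record is filed under one species."""

    def __init__(self) -> None:
        self.species: Dict[str, SpeciesEntry] = {}

    def _entry(self, name: str) -> SpeciesEntry:
        return self.species.setdefault(name, SpeciesEntry(name))

    def add_refseq(self, species: str, record: RefSeqRecord) -> None:
        if not is_refseq_genome(record.accession):
            raise ValueError(f"{record.accession} lacks the NC_ prefix")
        if record.role not in ("reference", "representative"):
            raise ValueError(f"unknown role: {record.role}")
        self._entry(species).refseqs.append(record)

    def add_neighbor(self, species: str, insdc_accession: str) -> None:
        if is_refseq_genome(insdc_accession):
            raise ValueError("genome neighbors are INSDC records, not RefSeqs")
        self._entry(species).neighbors.append(insdc_accession)


index = SpeciesIndex()
index.add_refseq(
    "enterobacteria phage t4",
    RefSeqRecord("NC_000866", "AF158101", "reference"),
)
# "XX000001" is a made-up placeholder, not a real INSDC accession.
index.add_neighbor("enterobacteria phage t4", "XX000001")
```

keeping the species name as the only index key mirrors the statement that all refseq records and genome neighbors are indexed at the species level; a production system would more plausibly key on a taxonomy identifier.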
the comment 'reviewed refseq' is added to refseq records to highlight those that include additional annotation, and as of this writing, there are 747 reviewed viral refseq records, including references for several human pathogens, such as human immunodeficiency virus 1 (nc_001802), measles virus (nc_001498) and poliovirus (nc_002058), and for several other important viral systems, such as enterobacteria phage t4 (nc_000866), enterobacteria phage t7 (nc_001604) and tobacco mosaic virus (29, 30). moreover, some viral communities are developing well defined subspecies classifications, such as the genotyping schemes for hepatitis b virus and hepatitis c virus (31-33). these genotyping schemes can provide an important framework for the interpretation of genome sequence data (34), and more communities are expected to develop genotyping schemes in the coming years. finally, there are cases when the best characterized viral isolate is a laboratory variant, and it may be important to create multiple refseq records in order to provide both experimentally annotated references and sufficient sequence representation of circulating isolates. together these cases highlight the need for both reference genome sequences that capture the best possible annotation and representative genome sequences that capture important intraspecies variation or define subspecies categories. therefore, the viral refseq model has expanded to include both reference and representative genome sequences to better serve community needs.

the rising pace of viral discovery has a number of implications for data processing by the viral genomes group. viral taxonomy within the ncbi taxonomy database is based on the list of valid species names and classifications provided by the international committee on taxonomy of viruses (ictv) (35, 36). when the viral genomes project was initiated, there were many more viral species recognized by the ictv than viral refseq genome sequence records (figure 2).
however, as the rate of viral genome sequencing has increased over the past decade, so too has the pace of viral discovery. as a result, many refseqs are made from viruses that are clearly distinct from existing ones but lack an official taxonomic designation. taxonomy also affects the interpretation of genome sequence data, and technical difficulties encountered when sequencing the termini of some ssrna and ssdna viruses often lead to differing community standards for 'complete genomes' (37). this means that some difficult to sequence genomes are considered complete if they include the entire coding region but are missing some terminal sequence. improved methods may eventually resolve this issue (38), but in the meantime it would be useful for communities to define completeness standards with regard to current technology. in addition to manual selection based on genome length, the taxonomy of both refseq genome records and insdc genome neighbor records is validated. indeed, given that many novel virus genome sequences are submitted before analysis by the ictv (see figure 2), validation of taxonomy assignment is a major facet of curation. taxonomy is important to the overall usability of ncbi viral genome resources and, when properly implemented, creates a framework for groups of related sequences. using standards established by individual ictv study sections (36) and published reports, the taxonomy of each viral genome is validated and updated as necessary. newly submitted viral genomes without official ictv assignment are placed within 'uncharacterized' taxonomy bins that are easily distinguished from those recognized by the ictv. often little information is included in the insdc sequence record, and a growing number of sequences do not include any linked publications. using sequence analysis and comparative genomics, every attempt is made to place new genomes into a family (i.e. the 'uncharacterized' bin associated with a specific family) or lower order classification bin. however, some genomes are very distinct from previously characterized ones and only higher order classification is possible.

reference viral refseq records are generally curated by biologists using in-house annotation tools and the scientific literature as guides. a panel of viral genome advisors from outside ncbi bolsters curation efforts by offering expert guidance or taking responsibility for specific refseq records themselves. this approach is used for the maintenance of adenovirus and herpesvirus refseq records (39) and could be extended to other virus genomes (29). even with these efforts, the growing number of viral genomes submitted to insdc databases and the rapid pace of scientific discovery make maintenance of up-to-date references difficult. therefore, collaboration with scientific communities is critical to providing accurate annotation. sometimes these collaborative efforts are directed at curating a single refseq record, and all of the reviewed refseq records mentioned in the previous section were curated in collaboration with experts from individual viral communities. other times these collaborations are more extensive and touch many sequence records. for example, overlapping gene annotations were corrected on refseq records from 14 virus families (arteriviridae, arteriviridae, bunyaviridae, caliciviridae, circoviridae, dicistroviridae, flaviviridae, luteoviridae, paramyxoviridae, parvoviridae, picornaviridae, potyviridae, reoviridae, togaviridae) as directed by experimental or predictive analysis (40, 41). a new emphasis has been placed on initiating annotation collaborations at the beginning of a large genome sequencing program so that reference annotations, isolate naming schemes and other standards can be established prior to sequence submission (42-44).
these collaborations often include members of the uniprot viral protein annotation program (9, 45) and/or curators from sequencing centers and other databases (46), in addition to members of the relevant viral communities, and effectively ensure both well annotated references and consistently annotated insdc sequence records. such arrangements underscore the extensive impact of viral genome annotation issues, from public databases to sequencing centers to individual research communities, and were formalized within the viral genome annotation working group, which brings together stakeholders and provides a forum for the discussion of annotation issues (29, 47). in addition to protein annotation and isolate naming issues, this group is working to define standards for viral genome sequence data.

as the number of viral sequences has risen, so has the demand for curated metadata describing sequences. the viral genomes group has implemented two models designed to capture and standardize metadata. in the first model, exemplified by the virus variation resource, host, isolation country and other important metadata are parsed from individual sequence records, mapped against vocabulary lists and standardized (25, 48). sequences can then be searched using these standardized metadata terms. currently, only a small subset of viral sequences is included in the virus variation resource, including those for influenza, dengue and west nile viruses, but the ultimate goal is to expand this semi-automated model to include more viruses. the second model captures and standardizes host information for all viruses: whenever a new refseq record is created, a manually curated 'viral host' property is assigned to the relevant species within the ncbi taxonomy database.
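the first, semi-automated model (parse, map against vocabulary lists, standardize) can be sketched as follows. this is only an illustration under stated assumptions: the two vocabularies are tiny hypothetical stand-ins for curated lists, the record keys `host` and `country` are modeled loosely on sequence record source qualifiers, and unmapped values are simply returned as `None` so that a curator could review them.

```python
from typing import Dict, Optional

# Hypothetical controlled vocabularies; curated lists would be far larger.
HOST_VOCAB: Dict[str, str] = {
    "homo sapiens": "human",
    "human": "human",
    "gallus gallus": "chicken",
    "chicken": "chicken",
}
COUNTRY_VOCAB: Dict[str, str] = {
    "usa": "USA",
    "united states": "USA",
    "viet nam": "Vietnam",
    "vietnam": "Vietnam",
}


def standardize(raw: Optional[str], vocab: Dict[str, str]) -> Optional[str]:
    """Map a free-text value onto a vocabulary term, or None if unmapped."""
    if raw is None:
        return None
    key = " ".join(raw.strip().lower().split())  # normalize case and spacing
    return vocab.get(key)


def parse_record(record: Dict[str, str]) -> Dict[str, Optional[str]]:
    """Extract and standardize host and country from one sequence record."""
    # Country values may carry a region after a colon, e.g. "USA: Texas".
    country = record.get("country", "").split(":")[0]
    return {
        "host": standardize(record.get("host"), HOST_VOCAB),
        "country": standardize(country, COUNTRY_VOCAB),
    }


print(parse_record({"host": "Homo  sapiens", "country": "USA: Texas"}))
```

returning `None` for unmapped values, rather than guessing, keeps the automated step conservative and leaves the judgment call to manual curation, which matches the semi-automated character of the model.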
the property defines higher order, biologically relevant taxonomic host groups (algae, archaea, bacteria, diatom, environment, fungi, human, invertebrates, plants, protozoa and vertebrates) and enables sorting and selection of sequences within the ncbi taxonomy database (http://www.ncbi.nlm.nih.gov/taxonomy) and the viral genomes resource. for example, searching the ncbi taxonomy database with the term 'vhost fungi'[properties] (quotes included) will return a list of taxonomy groups composed of viruses that infect fungi. users can then select the 'genome' database from the 'find related data' link on the taxonomy search page to view all viral genomes associated with the viruses retrieved by the search. in cases where a virus infects multiple types of organisms, multiple terms are assigned, for example 'invertebrates, plants'. to search ncbi taxonomy for viruses that infect multiple hosts, simply include 'and' between search terms, for example 'vhost invertebrates'[properties] and 'vhost plants'[properties] (quotes included). the current distribution of assigned viral host terms is shown in figure 3.

the ncbi viral genomes resource can be accessed at www.ncbi.nlm.nih.gov/genome/viruses/. on this home page, users will find ftp links for downloading the accession list of all viral and viroid genomes (refseq and genome neighbors) and the complete viral and viroid refseq dataset. perhaps the central features of the resource are the viral and viroid genome browsers. these tables list all viral and viroid species represented by a reference sequence and include links to genome neighbor sequences. users can navigate to specific taxonomic groups and sort the table by viral host type. once a dataset has been defined by taxonomy and host types, users can download the resultant table, the list of refseq accessions in the table, or a list that includes refseq and genome neighbor accessions as well as taxonomy and viral host information.
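the 'viral host' search terms described above can also be composed programmatically. the sketch below only builds the query string and a request url; it does not contact ncbi. it assumes that the property search works through the e-utilities `esearch` endpoint in the same way as in the taxonomy web interface, and that phrases are wrapped in double quotes as in standard entrez syntax. both points, and the helper names, are assumptions rather than documented behavior of the resource.

```python
from urllib.parse import urlencode

# Host groups listed in the text for the 'viral host' property.
VIRAL_HOST_TERMS = {
    "algae", "archaea", "bacteria", "diatom", "environment", "fungi",
    "human", "invertebrates", "plants", "protozoa", "vertebrates",
}

ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def vhost_term(*hosts: str) -> str:
    """Build a taxonomy query for viruses that infect all of the given hosts."""
    if not hosts:
        raise ValueError("at least one host group is required")
    clauses = []
    for host in hosts:
        if host not in VIRAL_HOST_TERMS:
            raise ValueError(f"unknown viral host group: {host}")
        clauses.append(f'"vhost {host}"[properties]')
    return " AND ".join(clauses)


def taxonomy_search_url(term: str) -> str:
    """Return an esearch URL for the taxonomy database (no request is sent)."""
    return ESEARCH_URL + "?" + urlencode({"db": "taxonomy", "term": term})


single = vhost_term("fungi")
multi = vhost_term("invertebrates", "plants")
print(single)
print(multi)
print(taxonomy_search_url(multi))
```

validating host names against the published list before building the term catches typos early, since a misspelled property would otherwise just return an empty result set.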
several specialized viral resources and tools are also linked through the viral genomes resource home page. these include specialized resources for influenza, dengue, west nile and other viruses that are part of the virus variation resource (http://www.ncbi.nlm.nih.gov/genome/viruses/variation/) (25, 48, 49). the link to the retrovirus resource (http://www.ncbi.nlm.nih.gov/genome/viruses/retroviruses) provides access to the retrovirus genotyping tool and the hiv-1, human interaction database (50, 51). these tools are designed to assist retroviral researchers in the identification and classification of sequences and to document hiv-1 and human protein and replication interactions through a searchable interface. finally, there is a link to the pairwise sequence comparison tool (pasc) (http://www.ncbi.nlm.nih.gov/sutils/pasc), a blast-based tool with graphical output that can be used to establish taxonomic classification criteria for some viruses and to classify viruses with newly sequenced genomes (52, 53).

both refseq records and other genomes for species are linked throughout ncbi resources and can be used in a variety of operations. among these, the refseq dataset can be used to reduce the redundancy of blast searches (http://blast.ncbi.nlm.nih.gov/blast.cgi) (54), providing fewer, higher quality sequences within search results. to restrict nucleotide blast searches to viral refseq genomes, use the 'choose search set' options in the blast search interface (55): select 'reference genomic sequences (refseq_genomic)' in the database field and enter 'viruses' in the 'organism' field text box. for protein blast searches, the viral refseq protein set can be used by selecting 'reference proteins (refseq_protein)' in the database field and entering 'viruses' in the 'organism' field text box. data derived from viral refseqs are also used to support a number of other databases, including gene (56) and protein clusters (57).
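the same restriction of blast searches to viral refseq sequences could in principle be expressed as request parameters instead of web form selections. the sketch below only assembles a parameter dictionary and does not submit anything. the parameter names follow the blast url api as commonly used (`CMD`, `PROGRAM`, `DATABASE`, `QUERY`, `ENTREZ_QUERY`); the database identifiers and the organism filter are assumptions taken from the form labels in the text and should be checked against current blast documentation before use.

```python
from typing import Dict
from urllib.parse import urlencode

BLAST_URL = "https://blast.ncbi.nlm.nih.gov/Blast.cgi"


def viral_refseq_blast_params(query: str, protein: bool = False) -> Dict[str, str]:
    """Assemble parameters for a BLAST search limited to viral RefSeq records."""
    if not query.strip():
        raise ValueError("query sequence must not be empty")
    return {
        "CMD": "Put",  # submit a new search
        "PROGRAM": "blastp" if protein else "blastn",
        # Assumed database identifiers, mirroring the web form labels.
        "DATABASE": "refseq_protein" if protein else "refseq_genomic",
        # Assumed Entrez filter equivalent to typing 'viruses' as organism.
        "ENTREZ_QUERY": "viruses[organism]",
        "QUERY": query,
    }


params = viral_refseq_blast_params("ACGTACGTACGT")
print(BLAST_URL + "?" + urlencode(params))
```

building the parameters in one function keeps the nucleotide and protein variants consistent, so that only the program and database change between them.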
each species that includes a refseq can be found in the genome database (http://www.ncbi.nlm.nih.gov/genome) (56) . this resource can be searched by taxonomy names, and retrieved genome records include links to all refseqs for that species. each individual genome record also includes links to neighbor sequences for that species under 'related information', and these can be viewed by selecting the 'other genomes for species' option. these links display all genome neighbor records in the nucleotide database where they can be viewed and/or downloaded. genome neighbor records can also be retrieved from multiple genome records using the 'find related data' options, allowing the user to search for an entire viral family or similar and then retrieve all genome neighbor records defined by the original search criteria. simply select 'nucleotide' in 'database' pull down menu and 'other genomes for species' from the 'option' pull down menu to return all genome neighbors for all the species listed in the search results. as the sequencing revolution continues to gather steam, and the rate of viral genome sequencing increases, reference databases will be pressed to serve growing community needs. meeting these will require further collaboration with individual viral communities and across public databases. data models will also need to shift to better represent the extant sequence universe and provide better standardized sequence annotation. once annotated, large-scale genome sequence data will need to be presented in ways that facilitate human data sorting and discovery operations. this will require semiautomated metadata capture and standardization, as well as innovative interfaces and tools that leverage metadata in discovery operations. 
many of these approaches and processes are currently being tested within the ncbi virus variation resource (25) where users can readily find sequences based on specific, standardized sequence descriptors, greatly improving the accessibility and utility of viral sequence data. while currently limited to a handful of human pathogens, our intent is to expand the virus variation data model to include more viruses from more viral communities. this should open up a number of possibilities and will support the aggregation and retrieval of sequences based on community-defined criteria like genotypes or complete genome sets as is currently possible for influenza virus sequences (11, 25). the growing cloud of viral genome sequences also poses significant barriers to the maintenance of reference genome records. the pace of experimental discovery and the number and breadth of viral genomes make it increasingly difficult to provide well annotated, up-to-date reference sequences. to counter, we must leverage community knowledge and activities against the goal of better refseq viral resources and must collaborate with viral communities to maintain well annotated reference sequences, develop community-accepted gene and protein naming standards and define community-established subspecies classification schemes. though collaborations have been initiated within some communities (29, 42-44, 47), they need to be scaled to include more groups. as a public resource, we serve a range of communities--from the public health to the basic research--and rely on them to both better inform our mission and help support it. only by engaging our stakeholders and working together on shared goals can we provide the rigorous resources necessary to support viral sequence data activities.
d576 nucleic acids research, 2015, vol. 43, database issue
emergence of zaire ebola virus disease in guinea--preliminary report
genomic surveillance elucidates ebola virus origin and transmission during the 2014 outbreak
middle east respiratory syndrome coronavirus in dromedary camels: an outbreak investigation
transmission and evolution of the middle east respiratory syndrome coronavirus in saudi arabia: a descriptive genomic study
the international nucleotide sequence database collaboration
the european bioinformatics institute's data resources 2014
ddbj progress report: a new submission system for leading to a correct annotation
viralzone: recent updates to the virus knowledge resource
ncbi reference sequences (refseq): a curated non-redundant sequence database of genomes, transcripts and proteins
flan: a web server for influenza virus genome annotation
vigor extended to annotate genomes for additional 12 different viruses
vigor, an annotation program for small viral genomes
evaluation of alignment algorithms for discovery and identification of pathogens using rna-seq
identification of a novel polyomavirus from patients with acute respiratory tract infections
klassevirus 1, a previously undescribed member of the family picornaviridae, is globally widespread
a highly abundant bacteriophage discovered in the unknown sequences of human faecal metagenomes
deep sequencing of norovirus genomes defines evolutionary patterns in an urban tropical setting
molecular epidemiology of contemporary g2p[4] human rotaviruses cocirculating in a single u.s. community: footprints of a globally transitioning genotype
going viral: next-generation sequencing applied to phage populations in the human gut
pathseq: software to identify or discover microbes by deep sequencing of human tissue
virusfinder: software for efficient and accurate detection of viruses and their integration sites in host genomes through next generation sequencing data
a cloud-compatible bioinformatics pipeline for ultrarapid pathogen identification from next-generation sequencing of clinical samples
national center for biotechnology information viral genomes project
virus variation resource--recent updates and future directions
ncbi reference sequences (refseq): current status, new features and genome annotation policy
improving gene annotation of complete viral genomes
identification of proteins associated with murine cytomegalovirus virions
microbial virus genome annotation-mustering the troops to fight the sequence onslaught
imbroglios of viral taxonomy: genetic exchange and failings of phenetic approaches
molecular identification of hepatitis b virus genotypes/subgenotypes: revised classification hurdles and updated resolutions
consensus proposals for a unified system of nomenclature of hepatitis c virus genotypes
expanded classification of hepatitis c virus into 7 genotypes and 67 subtypes: updated criteria and genotype assignment web resource
is there any value to hepatitis b virus genotype analysis?
the ncbi taxonomy database
virus taxonomy: classification and nomenclature of viruses: ninth report of the international committee on taxonomy of viruses
rapid cdna synthesis and sequencing techniques for the genetic study of bluetongue and other dsrna viruses
a new approach to determining whole viral genomic sequences including termini using a single deep sequencing run
herpesvirus systematics
evolution of viral proteins originated de novo by overprinting
overlapping genes produce proteins with unusual sequence properties and offer insight into de novo protein creation
uniformity of rotavirus strain nomenclature proposed by the rotavirus classification working group (rcwg)
virus nomenclature below the species level: a standardized nomenclature for natural variants of viruses assigned to the family filoviridae
virus nomenclature below the species level: a standardized nomenclature for laboratory animal-adapted strains and variants of viruses assigned to the family filoviridae
the universal protein resource (uniprot) in 2010
vipr: an open bioinformatics database and analysis resource for virology research
towards viral genome annotation standards
virus variation resources at the national center for biotechnology information: dengue virus
the influenza virus resource at the national center for biotechnology information
a web-based genotyping resource for viral sequences
human immunodeficiency virus type 1, human protein interaction database at ncbi
pairwise sequence comparison (pasc) and its application in the classification of filoviruses
improvements to pairwise sequence comparison (pasc): a genome-based web tool for virus classification
blast: a more efficient report with usability improvements
ncbi blast: a better web interface
database resources of the national center for biotechnology information
the national center for biotechnology information's protein clusters database
we would like to thank vyacheslav chetvernin, boris fedorov, sergey resenchuck, igor
tolstoy, tatiana tatusova and jim ostell for their development and support. key: cord-328644-odtue60a authors: comandatore, francesco; chiodi, alice; gabrieli, paolo; biffignandi, gherard batisti; perini, matteo; ricagno, stefano; mascolo, elia; petazzoni, greta; ramazzotti, matteo; rimoldi, sara giordana; gismondo, maria rita; micheli, valeria; sassera, davide; gaiarsa, stefano; bandi, claudio; brilli, matteo title: insurgence and worldwide diffusion of genomic variants in sars-cov-2 genomes date: 2020-05-28 journal: biorxiv doi: 10.1101/2020.04.30.071027 sha: doc_id: 328644 cord_uid: odtue60a the sars-cov-2 pandemic that we are currently experiencing is exerting a massive toll both in human lives and economic impact. one of the challenges we must face is to try to understand if and how different variants of the virus emerge and change their frequency in time. such information can be extremely valuable as it may indicate shifts in aggressiveness, and it could provide useful information to trace the spread of the virus in the population. in this work we identified and traced over time 7 amino acid variants that are present with high frequency in italy and europe, but that were absent or present at very low frequencies during the first stages of the epidemic in china and the initial reports in europe. the analysis of these variants helps define 6 phylogenetic clades that are currently spreading throughout the world with changes in frequency that are sometimes very fast and dramatic. in the absence of conclusive data at the time of writing, we discuss whether the spread of the variants may be due to a prominent founder effect or if it indicates an adaptive advantage. the worldwide fast spread of sars-cov-2 virus during the first months of this year has caused 316,169 deaths with more than 4,731,458 confirmed cases since the first reports of novel pneumonia in wuhan, hubei province, china (zhou et al. 2020; wu et al. 2020) up to may 19, 2020 (who 2020b).
the virus belongs to the beta-coronaviruses, and it is the seventh coronavirus known to infect humans, causing severe respiratory and systemic disorders (rothan and byrareddy 2020), with a basic r0 index estimated to range from 1.4 to over 6. the closest known relatives of sars-cov-2 circulate in animals, specifically bats or pangolins (zhang, wu, and zhang 2020), suggesting that an animal virus crossed species boundaries to efficiently infect humans, possibly through multiple passages in intermediate animal hosts, even though the transmission route has not yet been identified (andersen et al. 2020). traces of the history of the spread are present in the viral genome, and comparative genomics approaches can therefore be used to understand how viruses can adapt to multiple hosts, uncovering key signatures of this adaptation (andersen et al. 2020; wan et al. 2020), and to trace the infection routes of the virus. at the same time, genomic studies can help trace viral variants that may be geographically restricted and/or may account for different levels of infectivity and mortality in humans. these variants might arise during the spread of the epidemic, as viruses are known for their high frequency of mutation, particularly single-stranded rna viruses such as sars-cov-2 (sanjuán and domingo-calap 2016), which has a single, positive-strand rna genome. randomly generated variants can then spread in the population for stochastic reasons (e.g. founder effect, drift) or as a consequence of positive selection exerted by intrinsic biological features (such as the level of virus infectivity and its transmission rate), or extrinsic factors such as the use of antivirals or reactions by the immune system or other defence mechanisms put in place by the host (di giorgio et al. 2020).
therefore, the haplotype(s) present at the beginning of the epidemic spread can change in time; sometimes novel variants can overcome ancestral ones, and this can be a consequence of different levels of aggressiveness but also of mechanisms beyond selection. if the different variants are identical in terms of their ability to infect and replicate in the host, country-specific switches in haplotype frequency with respect to the most common haplotypes can depend on the very first haplotype(s) arriving in the country (provine 2004). instead, when haplotype frequency changes globally, the hypothesis of differential aggressiveness becomes more probable, indicating that the novel variants may be better adapted to infect human hosts; however, even in this case, the complexity of global human mobility, or sampling bias, may give rise to patterns that seem causal but are not. here, we present a comprehensive study of the coding sequences from sars-cov-2 genome sequences isolated since the beginning of the epidemic. italy was the first european country to register non-imported covid-19 cases requiring hospitalization. the first registered case of covid-19, on february 20, 2020, was a young man in the lombardy region in northern italy, diagnosed with atypical pneumonia (livingston and bucher 2020). in the next 24 hours 36 more cases were detected, and as of april 27, 2020, italy had registered 197,675 cases with 26,644 deaths, one of the highest tolls in europe and in the world. who classifies the spread occurring in italy as community transmission, indicating that the country is experiencing large outbreaks of local transmission with no possibility of tracing transmission chains between the cases, and with multiple and unrelated clusters of transmission. the high mortality rate observed in italy, especially at the beginning of the epidemic, raised the question of whether italian strain(s) might have increased aggressiveness.
while the increased mortality observed in italy could be explained by the fact that, at the beginning of the epidemic, swabs were performed only on individuals showing up at a hospital, distorting the sampling toward a group of symptomatic individuals devoid of healthy carriers, we cannot exclude that italian haplotype frequencies are partially different, at least genetically. to gain better insight into the history and spread of the covid-19 pandemic in italy, and thanks to the sequences deposited in the gisaid database, we identified 7 non-synonymous mutations that are differentially frequent in italian sars-cov-2 strains with respect to strains circulating globally. these mutations are enriched in italy but are present in strains from other countries as well, as shown by tracing their relative frequency in time both globally and in different countries; we therefore traced their distribution worldwide and complemented this with a phylogenetic analysis to understand how the variants are related. genomes were downloaded from the gisaid.org repository on april 10, 2020, and a second time on april 28, 2020, and are listed, together with reference to the submitting laboratories, in supplementary table 1. we extracted coding sequences using a strategy based on tblastn (camacho et al. 2009) comparisons, using as queries all the proteins from the sars-cov-2 reference sequence deposited in ncbi (accession mn908947). after the comparison, the coordinates of the blast alignments on each genome were used to extract the coding sequence. nucleotide sequences were then aligned and translated, and the alignments were manually checked for the presence of frameshifts and manually edited to remove partial and poorly aligned sequences, resulting in a variable number of sequences per alignment.
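The coordinate-based extraction step described above can be illustrated with a short sketch. It assumes the usual BLAST tabular convention of 1-based, inclusive subject coordinates, with `sstart > send` marking a hit on the reverse strand; the function names are hypothetical and this is not the authors' pipeline.

```python
# Translation table for complementing nucleotides (N is kept as N).
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a nucleotide string."""
    return seq.translate(_COMPLEMENT)[::-1]


def extract_coding_sequence(genome, sstart, send):
    """Slice a coding region out of a genome using alignment coordinates.

    sstart and send are 1-based and inclusive (assumed tblastn tabular
    convention); sstart > send is treated as a reverse-strand hit.
    """
    low, high = min(sstart, send), max(sstart, send)
    if low < 1 or high > len(genome):
        raise ValueError("alignment coordinates fall outside the genome")
    fragment = genome[low - 1:high]
    return fragment if sstart <= send else reverse_complement(fragment)


def is_in_frame(cds):
    """A crude frameshift screen: coding length must be a multiple of 3."""
    return len(cds) % 3 == 0
```

For example, `extract_coding_sequence("AAATGGCCTAAGG", 3, 11)` returns `"ATGGCCTAA"`. The length check only flags gross problems; the manual inspection of translated alignments described in the text would still be needed.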
this manual curation resulted in alignments containing a minimum number of 2262 sequences for orf3a and a maximum of 5585 sequences for the nucleocapsid protein, with an average of 4222 sequences per alignment. we are aware that by removing sequences with gaps we are likely removing part of the genuine variability present (i.e. indels); however, we observed indels at such a low frequency that we attributed them mostly to sequencing/assembly errors and decided that the clear advantage of a stronger signal outweighs the possible disadvantages deriving from information loss. additional analysis with high quality genome sequences will be necessary to evaluate whether indels represent an important source of variation. regardless, this does not change the results of our analysis, as the targets of this work are point mutations. sequences from the alignments were used to build amino acid frequency profiles by using the r-package seqinr (charif and jr 2007). for each protein encoded in the sars-cov-2 genome we obtained two profiles, one for italian strains and one for the entire set of sequences. as the positions in the alignments for the two groups under examination are congruent, we can then calculate $s_i = \max_a \log\left(f^{it}_{i,a} / f^{all}_{i,a}\right)$, that is, the maximum log ratio of the frequency (f) of all amino acids at a certain position i in the alignment in italian (it) and total (all) sequences. we then identify positions with s>0.5, which corresponds to the identification of positions where there is a frequency change in one of the residues of at least 2x. variability in multi-alignments was quantified by calculating the entropy $h = -\sum_a p_a \log p_a$, where $p_a$ is the frequency of residue $a$, at each position of each manually curated protein multi-alignment. mutual information was calculated based on (buck and atchley 2005) to assess whether variants tend to co-occur frequently.
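As a hedged illustration of the two per-position statistics described above, the sketch below computes the maximum log frequency ratio between a subset and the full set of aligned sequences, and the Shannon entropy of an alignment column. Several details are assumptions rather than the authors' choices: the natural logarithm is used, the pseudocount added to each frequency is taken as one over the number of sequences in the corresponding group, and all function names are hypothetical. The original analysis used the R package seqinr; this is a plain Python stand-in.

```python
import math
from collections import Counter


def column_frequencies(seqs, pos):
    """Relative frequency of each residue at 0-based column pos."""
    counts = Counter(seq[pos] for seq in seqs)
    total = len(seqs)
    return {residue: count / total for residue, count in counts.items()}


def s_score(subset, everything, pos):
    """Maximum over residues of log(f_subset / f_all) at one column.

    A group-specific pseudocount (assumed here to be 1 / group size)
    is added to every frequency so that zero frequencies can be logged.
    """
    eps_sub = 1.0 / len(subset)
    eps_all = 1.0 / len(everything)
    f_sub = column_frequencies(subset, pos)
    f_all = column_frequencies(everything, pos)
    residues = set(f_sub) | set(f_all)
    return max(
        math.log((f_sub.get(r, 0.0) + eps_sub) / (f_all.get(r, 0.0) + eps_all))
        for r in residues
    )


def column_entropy(seqs, pos):
    """Shannon entropy (natural log) of the residues at one column."""
    return -sum(p * math.log(p) for p in column_frequencies(seqs, pos).values())


def candidate_positions(subset, everything, threshold=0.5):
    """Columns whose score exceeds the threshold used in the text."""
    length = len(everything[0])
    return [i for i in range(length) if s_score(subset, everything, i) > threshold]
```

Because the pseudocount differs between the two groups, a column with identical frequencies in both does not score exactly zero, which matches the behaviour noted in the figure legend.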
to build updated time profiles of the identified variants, sequences were downloaded a second time from gisaid.org on april 28, 2020 (9075 sequences, only high quality and high coverage). sampling dates were retrieved from the database and can be found in the gisaid sequence acknowledgment table (supplementary table 1). sequences for which the geographic origin and/or the date were not correctly defined (supplementary table 1) were removed, to obtain a database of about 8500 sequences, with some per-site variability due to the variable presence of ns in the sequences. we calculated two different time series, one for the entire set of sequences available up to the moment of sequence download, and one by considering only sequences taken from predefined groups of countries (hereinafter the time series by country). in the latter, we merged nearby countries to increase the precision of the estimation of the relative frequency of variants within predefined time intervals. the total time range since the first sequence available (133 days) was split into 13 (9 for the analysis by country, as the number of sequences per interval is smaller) non-overlapping intervals of about 10 days (~13 days for the time series by country). interval duration was selected as a compromise between the number of sequences available on average per interval in the different groups and the time resolution of the time series. for each interval we calculated the frequency of the observed variants and the shannon diversity index considering all concatenated coding sequences (not limited to the positions considered in this work). in supplementary table 2 we show the number of sequences used for each country for each interval. we performed a phylogenetic analysis by concatenating nucleotide sequences aligned on the basis of the manually curated amino acid alignments. we selected the best phylogenetic model for our alignment using modeltest-ng (darriba et al.
2020) (gtr+i+g model) and then we performed maximum likelihood phylogenetic estimation in raxml8 (stamatakis 2014). the obtained phylogenetic tree was then visualized and integrated with additional information about variants and geographical origin using figtree (https://github.com/rambaut/figtree/). our analysis allowed us to identify 7 positions in four proteins that present drastic changes in amino acid frequencies when comparing italian sequences with worldwide sequences available on gisaid.org on april 10, 2020 (figure 1). however, we discovered that residues found at these positions are not peculiar to italy and, as a matter of fact, they are the most variable genomic positions across available sars-cov-2 sequences (supplementary figure 1). therefore, we decided to proceed with a worldwide analysis. the sites that we identified are: 1. position 614 in the spike protein, where an aspartate residue is found at high frequency in sequences obtained at the beginning of the pandemic. a variant with a glycine residue at the same position is however now very common in europe, italy in particular. during the preparation of this manuscript, we realized that some of the variants are also the topic of other papers and preprints predating this work, e.g. (korber et al. 2020; banerjee et al. 2020; somasundaram, mondal, and lawarde 2020; begum et al. 2020; bai et al. 2020; yang et al. 2020; pachetti et al. 2020). the fact that we identify some of the previously known mutations by using a different approach indicates that they represent a genuine signal rather than artefacts; however, this does not necessarily mean that their increase in time (see below) is a consequence of a correspondingly increased transmissibility rate or aggressiveness in general, for which at the moment there are no conclusive data, as stated in most of the available preprints.
in the following, we summarize a thorough bibliographical analysis looking for functional information associated with these positions or the domains they belong to. however, we stress once more that virus spreading is characterized by haplotype shifts that are not necessarily related to the performance of the haplotypes themselves (for instance genetic drift or founder effect); therefore, without targeted functional studies on the variants it is difficult to ascertain whether one or the other explanation is more probable. figure 1 for a confirmation of the variability of these positions in the entire set of sequences. note that to cope with zeros when taking the logarithm of the ratio of amino acid frequencies, we add to every frequency a small number that depends on the number of sequences in the two groups (italian and world). as a result, positions with identical amino acid frequencies in the two groups do not get an s exactly equal to zero. d614 is located in the sd1 domain of the spike protein. it is positioned in a loop right after a beta-strand and close to a completely solvent exposed disordered region outside and downstream of the ace2 interaction domain (wrapp et al. 2020; walls et al. 2020). the resolution of the three available structures is not high enough to thoroughly discuss the h-bond network. it is likely that the d614 side chain does not establish close interactions with neighboring residues. conformationally, d614 lies in the region of left-handed helices. thus, from the structural point of view, the mutation d614g is likely to be neutral (no particularly strong interactions are lost with the missing carboxylate group). on the other hand, the mutation might even be beneficial for overall protein stability: given the increased conformational freedom of glycine, it is indeed a perfect residue in turns following beta structures. other authors suggest a positive effect of d614g on the virus efficiency in binding the ace2 receptor.
korber and colleagues (unpublished, korber et al. 2020) suggest the possibility of structural changes in the protein or even an improved antibody-dependent enhancement effect. in coronaviruses, orf1 is translated to yield the polyprotein that is processed by proteolysis with the production of intermediate and mature nonstructural proteins (nsps). in figure 2 we highlight the position of the variant sites on the polyprotein with respect to known domains, as defined by the conserved domain database. position 265 corresponds to residue 85 of the non-structural protein 2 (nsp2), which is characterized by four transmembrane helices (th) (angeletti et al. 2020), whose role is still unknown. in more detail, position 85 lies at the n-terminus, which protrudes from the external face of the membrane. interaction analysis of the corresponding protein from sars virus shows that nsp2 can form dimeric or multimeric complexes that can also involve additional viral proteins (nsp3, nsp4, nsp6, nsp8, nsp11, nsp16, orf3a) (von brunn et al. 2007), suggesting that nsp2 might be involved in the viral life cycle. sars nsp2 also interacts with host proteins, such as prohibitin 1 and 2 (cornillez-ty et al. 2009), that are involved in cell cycle progression, migration, differentiation, apoptosis and mitochondrial biogenesis (fusaro et al. 2003; merkwirth and langer 2009; rajalingam et al. 2005; sun et al. 2004), suggesting it might be important for manipulating prominent host functions. murine hepatitis virus and sars-cov strains deleted in this portion of the polyprotein (graham et al. 2005) have strongly reduced viral growth and rna synthesis, but are not affected at the level of protein processing. ectopic expression of the nucleotide sequence coding for nsp2 in murine cells infected by strains missing nsp2 allowed its recruitment into viral complexes to be detected.
it is therefore possible that nsp2 is involved in global rna synthesis, interaction with host proteins and pathogenesis (von brunn et al. 2007). in infected cells, nsp2 seems to be present in small vesicular foci, but in the absence of additional viral proteins it localizes in cytoplasmic or nuclear membranes, without a specific target (prentice et al. 2004; von brunn et al. 2007). unfortunately, mutagenesis experiments are still not available for sars-cov-2, and therefore we still do not know whether the variant observed at this site has any effect on in vivo virus properties. position 3606 belongs to nsp6, a protein that induces the formation of autophagosomes, which is a sign of starvation in uninfected cells (cottam, whelband, and wileman 2014). autophagosomes can act as an innate defense against viral infection but they can be hijacked and support the assembly of coronavirus replicase proteins (orvedahl et al. 2007; suhy, giddings, and kirkegaard 2000; wileman 2006). moreover, together with nsp3 and nsp4, nsp6 promotes the formation of the double membrane vesicles typically observed in sars disease (angelini et al. 2013). nsp6 is a protein with 7 transmembrane domains (oostra et al. 2008) and position 3606 lies in the luminal loop between the first and second hydrophobic transmembrane domains. in the wild type sars-cov-2 sequence, we find 3 phenylalanine residues just before l3606. instead, some of the variants in this paper have a stretch of 4 phenylalanines, keeping this region highly hydrophobic. position 4715 belongs instead to the n-terminal domain of the rna-directed rna polymerase (rdrp, also known as nsp12). as its three-dimensional structure shows, sars-cov-2 rdrp displays an n-terminal nidovirus-like rdrp-associated nucleotidyltransferase domain (niran) followed by an interface domain and then by the canonical palm and finger domain structure (gao et al. 2020).
p4715 (p323 according to nsp12 numbering) is located in the alpha-beta interface domain which bridges niran with the finger domain. specifically, p4715 sits on a solvent exposed loop region in the groove formed between niran and the finger domain. the mutation p4715l does not change the non-polar nature of the side chain, and from the conformational point of view the substitution from p to l should not result in a specific structural adjustment of the loop. thus, despite the lack of information concerning mutations at the specific sites identified in this work, the three variant sites in the polyprotein belong to important functional regions of the sequence. orf8 codes for a secreted accessory protein not directly involved in viral replication (dediego et al. 2008; yount et al. 2005; tan et al., 2020). homologs, named orf7a, have been found in some beta-coronaviruses (tan et al., 2020). orf7a of sars codes for a protein with a structure similar to immunoglobulin superfamily proteins, specifically the metazoan ig involved in adhesion (nelson et al. 2005; hänel et al. 2006). the function of this protein is not entirely clear, but could be related to the modulation of the immune system. studies have demonstrated that sars-orf7a protein localizes mainly in the perinuclear region of the host cell, where it interacts with bst-2 (bone marrow stromal antigen 2), preventing the glycosylation of bst-2 needed for its function (taylor et al. 2015). bst-2 inhibits the release of the virus, as observed in hiv-1 (taylor et al. 2015), likely at the level of the endoplasmic reticulum-golgi where it binds budding virions. orf8 has a structure composed of a beta-sandwich fold with seven beta-strands, which highly resembles orf7a's structure, but it lacks the c-terminal transmembrane region and has an additional long insert between strands 3 and 4 which is supposed to be involved in peptide binding (tan et al., 2020).
orf8 is a fast-evolving gene which, together with the absence of the transmembrane region and the presence of the insertion, may suggest some functional divergence with respect to its ancestor (tan et al., 2020). it has been proposed that orf8 may have acquired a function similar to the adenoviral cr1 protein, which interferes with mhc molecules to attenuate antigen presentation and therefore the capability of the host immune system to detect the virus. in this context, we notice that l84 lies within the orf8 insert, indicating it may represent a further adaptation. because direct functional information on this site is lacking at the moment, it is not possible to ascribe a functional adaptation to these variants; however, variability at this site might counteract the mhc interference function, as a fast-evolving region in the insert may have been selected positively to facilitate the interaction with a fast-evolving host molecule (tan et al., 2020). after having identified variant sites, we explored the time profile of the seven variable sites across all sequences (figure 3), revealing haplotype frequency changes for all of them between february and march, with five changes being moderate (nucleocapsid r203g and g204r, orf8 l84s, polyprotein t265i and l3606f) and two more drastic, now representing the most common haplotypes of recent non-chinese sequences (spike d614g and polyprotein p4715l). the spike variant, in particular, was sequenced only a few times in china since the beginning of the epidemic, the first time in zhejiang on january 24; however, in this country it never reached a significant frequency. conversely, after the first sequencing of this variant in germany on january 28, it started at very low frequency and then became the most common haplotype at this position outside china. a similar situation holds for the haplotype with a leucine in position 4715 of the polyprotein, which has rapidly increased since the beginning of march.
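The time profiles discussed above rest on assigning each sequence to a fixed-width, non-overlapping interval by sampling date and computing residue frequencies within each interval. A minimal sketch of that binning follows, assuming the roughly 10-day intervals mentioned in the methods. The function names and the input layout (pairs of sampling date and residue observed at one site) are hypothetical, and this is not the authors' code.

```python
from collections import Counter, defaultdict
from datetime import date


def interval_index(sample_date, first_date, interval_days):
    """0-based index of the interval a sampling date falls into."""
    return (sample_date - first_date).days // interval_days


def variant_time_series(records, interval_days=10):
    """Per-interval residue frequencies for one site.

    records: iterable of (sampling_date, residue) pairs.
    Returns {interval index: {residue: relative frequency}}; intervals
    without any sequence are simply absent from the result.
    """
    records = list(records)
    first_date = min(sample_date for sample_date, _ in records)
    counts = defaultdict(Counter)
    for sample_date, residue in records:
        counts[interval_index(sample_date, first_date, interval_days)][residue] += 1
    series = {}
    for index in sorted(counts):
        total = sum(counts[index].values())
        series[index] = {res: n / total for res, n in counts[index].items()}
    return series


if __name__ == "__main__":
    toy = [
        (date(2020, 1, 1), "D"),
        (date(2020, 1, 5), "D"),
        (date(2020, 1, 15), "G"),
    ]
    print(variant_time_series(toy))
```

Running the same function on records restricted to a group of countries would give the by-country series; a wider `interval_days` trades time resolution for more sequences per interval, the compromise the methods describe.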
these patterns seem indicative of a functional role for at least some of the variants that undergo an increase in frequency, but in the absence of any functional test or experimental data, we cannot rule out that the observed frequency changes are a consequence of a founder effect in europe followed by a spreading wave from europe to countries where the epidemic started later. a founder effect, however, implies that the founder arrives first, and this is not always the case, at least for some of the variants and some of the countries. at the same time, some data, both reviewed and unreviewed, are becoming available, suggesting that d614g on the spike might provide some advantage to the virus. bai (bai et al. 2020) and brufsky (brufsky 2020) observed a correlation of g614 with increased mortality, while korber and coworkers (korber et al. 2020) found a correlation between the presence of the mutation and a higher viral load in patients. in the legend, we indicate the reference residue with an asterisk, and the variant on which we are focusing with an exclamation mark. additional variants were identified in the sequence dataset downloaded on april 28, but they never reach significant percentages, at least at the moment. position 614 on the spike and position 4715 on the polyprotein covary in a significant way (data not shown). the most likely explanation for this is the rapid sequential fixation of both mutations in the same strain, together with the absence of recombination between the two. the two adjacent sites in the nucleocapsid sequences show perfect agreement in the time profiles, strongly suggesting that the two mutations occurred together or within a short time. these two mutations respectively remove and add an arginine (passing from rg to kr); by considering the overall frequencies of the possible amino acid pairs at these positions, we suggest that the second arginine may complement the loss of the first one.
indeed, rr is never observed, kg is extremely rare (relative frequency, r.f. = 0.00036), while most genomes present either the original pair rg (r.f. = 0.87464) or kr (r.f. = 0.12500). we speculate that this could indeed be a consequence of the non-neutrality of configurations with no arginine at either position, although this gives no hint about the fitness of the fixed variant. next, we reasoned that grouping all the sequences uploaded from different countries may provide a picture averaged over variable and more complicated situations that might characterize the evolution of the virus within different countries or geographical areas. we therefore explored the time profiles of the variants in different geographical macro-regions (figure 4 and figure 5), highlighting a highly heterogeneous situation. we also provide a movie illustrating the changes taking place in variant frequencies across overlapping time windows in macro-regions (supplementary movie 1), which provides a dynamical view of the time profiles of the variants. figure 4 shows that d614g in china never reaches significant frequency, while it increased quite rapidly in several areas where it arrived and where it often started at very low frequency with respect to the original haplotype. if a functional role for this mutation is demonstrated, this pattern would indicate that different variants might have different fitness when interacting with different host haplotypes, for example if asian and european populations have different haplotypes for some of the proteins interacting with the spike, such as furin. to conclude, it is clear that since the first appearance of d614g and others of the above variants, their relative frequency underwent a significant increase in several countries, most of the time overcoming in prevalence the original variant(s), except for china, south asia, south america and africa.
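the relative frequencies quoted above for the residue pairs at nucleocapsid positions 203 and 204 come down to counting pairs across sequences. a minimal python sketch follows; the function name `pair_frequencies` and the toy input are illustrative assumptions, not the authors' pipeline, and the sequences are assumed to be ungapped protein strings so that positions can be indexed directly.

```python
from collections import Counter

def pair_frequencies(sequences, pos_a, pos_b):
    """Relative frequency of the residue pair found at two 1-based positions."""
    pairs = Counter(
        seq[pos_a - 1] + seq[pos_b - 1]
        for seq in sequences
        if len(seq) >= max(pos_a, pos_b)  # skip sequences too short to cover both sites
    )
    total = sum(pairs.values())
    if total == 0:
        return {}
    return {pair: count / total for pair, count in pairs.items()}

# Toy fragments: positions 1 and 2 stand in for nucleocapsid sites 203 and 204.
toy_sequences = ["RG"] * 7 + ["KR"] * 1
freqs = pair_frequencies(toy_sequences, 1, 2)
```

with real data, the same call would use `pos_a=203, pos_b=204` on full-length nucleocapsid sequences; pairs that never occur (such as rr in the text) are simply absent from the returned dictionary.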
in figure 6 we summarize the most recent situation, by using sequences from the interval april 10 to 20, 2020. the best phylogenetic model selected according to aic and bic was gtr+i+g, for which the estimated substitution rate matrix contains the following rates, relative to the g<->t rate, taken as unity: a<->c=0.34, a<->g=0.77, a<->t=0.32, c<->t=100. the extremely high rate for c to t (or better u, considering we are dealing with an rna virus) is in agreement with the involvement of host apobec-like editing mechanisms, as proposed in recent works (di giorgio et al. 2020) . besides not being the focus of this paper, these rates provide further evidence that host's mutagen systems may play an important role in the evolution of sars-cov-2 and its detection by the immune system. the phylogenetic tree integrated with additional information is reported in figure 7 , with different coloring schemes and together with an evolutionary model summarizing how the different variants configurations are likely related to each other. when considering the information about the residues found at the seven positions on which we are focusing (hereinafter variants configurations, vcs), as in figure 7a , we find that the phylogenetic clades correspond to the six most frequent ones (accounting for over 96.3% of the sequences), that is rgtlpdl, rgtlpds, rgtfpdl, krtllgl, rgtllgl, rgillgl -obtained by linking the variant residues in the order: protein nucleocapsid sites 203, and 204, polyprotein sites 265, 3606 and 4715, spike site 614 and orf8 site 84; for this reason hereinafter we will use clades a to f interchangeably with the vcs written above. we are aware that the correlation is not perfect, as can be seen by the presence of additional but low frequency variants within each clade, or the presence of vc misplaced with respect to their major clade. misplaced sequences can be a consequence of insufficient phylogenetic signal or of convergent evolution. 
for instance, as anticipated above, one of the chinese sequences carrying spike variant 614g (vc: rgtlpgs) is indeed contained within clade b (rgtlpds), indicating convergent evolution with the spike d614g variant in clades d, e, f, or maybe an artifact. however, the limited number of similar cases supports our simplification and the correspondence between clades in the tree and vcs. our multi-alignment contains one sequence annotated as bat coronavirus (epi_isl_402131), which provides a rooting of the phylogenetic tree that falls in clade b and indeed contains most of the sequences obtained in china at the beginning of the epidemic (figure 6, panel b). the tree has two main "radiations", one corresponding to variants configurations rgtlpdl, rgtlpds, rgtfpdl and one to krtllgl, rgtllgl, rgillgl; therefore the main partition of the tree corresponds to the identity of residues at position 614 of the spike and 4715 of the polyprotein. the two radiations are present at different frequencies in the same countries, with the notable exception of china, whose sequences are absent from radiation 2 apart from epi_isl_422425, the only sequence with a rgtllgl pattern sequenced in china that is also present in the phylogenetic tree (indicated in figure 7a). this suggests that while most of the diversification of radiation 1 took place in china and was then exported outside, the ancestor of radiation 2 travelled outside china early to start a diversification in the countries where it arrived. this hypothesis is in agreement with the fact that epi_isl_422425 is placed very close to the root of the branch leading to radiation 2, indicating it may indeed represent a strain closely related to the true radiation 2 ancestor.
these observations raise important questions (1) about the identity and movements of the first individual carrying the ancestral radiation 2 variant outside china (2) on the timing of the epidemic, but most importantly (3) the reason why it did not increase in frequency in china but elsewhere. following the topology of the tree we propose an evolutionary model (figure 7c ) whereby the ancestral bat variants configuration (rgtvpds or any other present in the presently unknown animal host) evolved into rgtlpds and from this to rgtlpdl. we stress that this likely took place through unobserved states/hosts as the root branch length is close to the total length of the tree and therefore we do not consider mutation v3606l in the polyprotein full sequence as an adaptation to the human host. the latter originated rgtfpdl on one side, and the ancestor of radiation 2 (rgtllgl) that originated both krtllgl (through kg unfit strains?) and later rgillgl. given the almost perfect agreement of the tree with the variants configurations, we analysed the geographical distribution and the time profiles of the seven sites at once, similar to what is done for mlst-based classification of pathogens. in figure 8a we show the time profiles of the relative abundance of sequences belonging to the 6 clades, clearly showing not only the appearance and increase of the novel clades, but most importantly the gradual disappearance of sequences belonging to clade a (see also supplementary movies 2 and 3). when focusing on single clades across all macro-regions previously defined, we find a heterogeneous situation with different variants increasing in time in different countries. 
however, we can see that all vcs with a proline at position 4715 of the polyprotein and an aspartate at position 614 of the spike almost disappear in time, remaining abundant only in south asia and oceania; in the rest of the world, only vcs with a leucine and a glycine at those positions remain in the last interval of our time range, after replacing the existing vcs. this is indeed clear from figure 8c showing that vc for clade a (rgtlpdl) is no more present in the interval april 10 to 20, 2020. south asia is particular because it is the only area where vcs of radiation 2 (rgtllgl,krtllgl,rgillgl) are present for some time but then disappear with a re-increase of clade c (rgtfpdl). this may suggest these variants have equal phenotypic characteristics, but as written above, we cannot exclude differential fitness depending on host's haplotypes. by calculating the shannon index for sequences within the same intervals of times used to track the changes in frequency of the variants, we were able to trace the evolution of diversity in different places. peaks indicate the emergence (from outside or by evolution of pre-existing strains) of variants and their increase in frequency. values maintaining a high shannon index correspond to situations where different variants coexist at comparable frequencies, while a decrease after a peak means that after the appearance of one or more new variants / clades, one of them (novel or old) takes over -reducing the variability. we also indicate the first sampled sequence for each of the six clades. variability increases when a new variant appears, then it stays more or less stable when the existing variants maintain their relative frequencies. however, if one of the variants becomes dominant, then variability decreases again after a peak. increasing variability in time means that the arrival of novel variants continues steadily, or that the existing ones become more homogeneous in frequency. 
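the diversity measure described above can be sketched as follows. the sketch assumes the shannon index is computed as h = -Σ p_i ln p_i over the relative frequencies p_i of the variants configurations observed in one time window; the natural logarithm, the function name and the toy windows are assumptions, since the text does not give these details.

```python
import math
from collections import Counter

def shannon_index(labels):
    """Shannon index H = sum(p * ln(1/p)) over the label frequencies."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum((n / total) * math.log(total / n) for n in counts.values())

# Hypothetical VC labels sampled in three consecutive time windows.
windows = {
    "window 1": ["RGTLPDL"] * 8,                      # a single variant
    "window 2": ["RGTLPDL"] * 4 + ["RGTLLGL"] * 4,    # two variants coexist
    "window 3": ["RGTLPDL"] * 1 + ["RGTLLGL"] * 7,    # the new variant takes over
}
profile = {name: shannon_index(vcs) for name, vcs in windows.items()}
```

in this toy series the index rises from zero when the second variant appears and falls again once one variant dominates, which is the peak-then-decrease pattern described in the text.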
in this work, we identified seven positions in coding sequences of the sars-cov-2 genome that are characterized by a different pattern of amino acids when comparing italian sequences with the global trends around the world. further analysis revealed that these sites are not peculiar to italian strains, and that different combinations of these variants are present at varying relative frequencies in different geographic areas. we found that the combination of these residues identifies six abundant configurations that correspond to six phylogenetic clades and that cover over 96% of all sequences. this suggests that the characterization of these positions can represent a fast and portable method for sars-cov-2 typing, but novel variants are emerging that might eventually take over the old ones through mutations at additional sites. using this approach we were able to follow the evolution of the virus over time among continents, showing that the different clades evolved at different times and that their frequencies vary among continents. these sites are also the most variable among all available sars-cov-2 sequences, raising intriguing questions about their functional effects. variants with a leucine at position 4715 of the polyprotein together with a glycine at position 614 of the spike underwent an increase in frequency since the end of january in most countries, overcoming the original haplotypes. mutations that might affect the structure of the spike protein are of primary interest, since many vaccine candidates and serological tests rely on the conformation of this protein (who 2020a). this and other works also explore the hypothesis that the variants may indeed provide a selective advantage to the virus.
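the typing scheme summarised above, in which the residues at the seven sites are joined into a variants configuration (vc) and mapped to a clade, can be sketched in python as follows. this is an illustration under stated assumptions, not the authors' code: the protein names used as dictionary keys, the helper `toy_proteins`, and the assignment of letters d, e and f (taken from the order in which the vcs are listed in the text) are assumptions; clades a, b and c and the two radiations follow the text.

```python
# Seven typing sites, in the order used in the text (1-based positions).
SITES = [
    ("nucleocapsid", 203), ("nucleocapsid", 204),
    ("polyprotein", 265), ("polyprotein", 3606), ("polyprotein", 4715),
    ("spike", 614), ("orf8", 84),
]

# VC -> clade letter. A, B and C follow the text; D, E and F are ASSUMED
# to follow the order in which the VCs are listed there.
CLADES = {
    "RGTLPDL": "A", "RGTLPDS": "B", "RGTFPDL": "C",
    "KRTLLGL": "D", "RGTLLGL": "E", "RGILLGL": "F",
}

def variant_configuration(proteins):
    """Join the residues found at the seven sites into one VC string."""
    return "".join(proteins[name][pos - 1] for name, pos in SITES)

def radiation(vc):
    """1 for P4715 + D614, 2 for L4715 + G614, None for any other combination."""
    return {"PD": 1, "LG": 2}.get(vc[4] + vc[5])

def toy_proteins(vc):
    """Hypothetical helper: placeholder proteins (all 'X') carrying the residues of `vc`."""
    proteins = {}
    for (name, pos), residue in zip(SITES, vc):
        chars = proteins.setdefault(name, [])
        chars.extend("X" * (pos - len(chars)))  # pad up to the site position
        chars[pos - 1] = residue
    return {name: "".join(chars) for name, chars in proteins.items()}

vc = variant_configuration(toy_proteins("KRTLLGL"))
clade = CLADES.get(vc, "unassigned")
```

a configuration outside the six listed ones, such as the rgtlpgs sequence mentioned in the text, falls through to `"unassigned"` and to `None` for the radiation, which keeps rare or convergent variants visible instead of forcing them into a clade.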
clade prevalence in different countries could be used to check for mortality rate differences and association with variants, but as the rates depend on many other factors (different screening strategies, different ways to define an infected individual and so on), we feel it is premature to discuss such correlations. once the numbers are standardized across countries, associations of this kind, if present, should emerge clearly. moreover, to really clarify these issues, experimental data are required, for instance in the form of transposon (tn) insertion mutagenesis, as performed on other viruses in the past (fulton et al. 2017), followed by competition experiments in in vitro cultures, or the design of genomes carrying well-defined changes. this would make it possible to understand how the virus tolerates mutations at different sites and might provide information on the importance of different genomic regions for different stages of the infection.
the proximal origin of sars-cov-2
covid-2019: the role of the nsp2 and nsp3 in its pathogenesis
severe acute respiratory syndrome coronavirus nonstructural proteins 3, 4, and 6 induce double-membrane vesicles
evolution and molecular characteristics of sars-cov-2 genome
mutational spectra of sars-cov-2 orf1ab polyprotein and signature mutations in the united states of america
analyses of spike protein from first deposited sequences of sars-cov2 from west bengal, india
distinct viral clades of sars-cov-2: implications for modeling of viral spread
analysis of intraviral protein-protein interactions of the sars coronavirus orfeome
networks of coevolving sites in structural and functional domains of serpin proteins
blast+: architecture and applications
seqinr 1.0-2: a contributed package to the r project for statistical computing devoted to biological sequences retrieval and analysis
severe acute respiratory syndrome coronavirus nonstructural protein 2 interacts with a host protein complex involved in mitochondrial biogenesis and intracellular signaling
coronavirus nsp6 restricts autophagosome expansion
modeltest-ng: a new and scalable tool for the selection of dna and protein evolutionary models
pathogenicity of severe acute respiratory coronavirus deletion mutants in hace-2 transgenic mice
evidence for host-dependent rna editing in the transcriptome of sars-cov-2
transposon mutagenesis of the zika virus genome highlights regions essential for rna replication and restricted for immune evasion
prohibitin induces the transcriptional activity of p53 and is exported from the nucleus upon apoptotic signaling
structure of the rna-dependent rna polymerase from covid-19 virus
the nsp2 replicase proteins of murine hepatitis virus and severe acute respiratory syndrome coronavirus are dispensable for viral replication
solution structure of the x4 protein coded by the sars related coronavirus reveals an immunoglobulin like fold and suggests a binding activity to integrin i domains
spike mutation pipeline reveals the emergence of a more transmissible form of sars-cov-2
coronavirus disease 2019 (covid-19) in italy
genomic characterisation and epidemiology of 2019 novel coronavirus: implications for virus origins and receptor binding
cdd/sparcle: the conserved domain database in 2020
the coronavirus nucleocapsid is a multifunctional protein
prohibitin function within mitochondria: essential roles for cell proliferation and cristae morphogenesis
structure and intracellular targeting of the sars-coronavirus orf7a accessory protein
topology and membrane anchoring of the coronavirus replication complex: not all hydrophobic domains of nsp3 and nsp6 are membrane spanning
hsv-1 icp34.5 confers neurovirulence by targeting the beclin 1 autophagy protein
emerging sars-cov-2 mutation hot spots include a novel rna-dependent-rna polymerase variant
identification and characterization of severe acute respiratory syndrome coronavirus replicase proteins
ernst mayr: genetics and speciation
prohibitin is required for ras-induced raf-mek-erk activation and epithelial cell migration
the epidemiology and pathogenesis of coronavirus disease (covid-19) outbreak
mechanisms of viral mutation
genomics of indian sars-cov-2: implications in genetic diversity, possible origin and spread of virus
raxml version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies
remodeling the endoplasmic reticulum by poliovirus infection and by individual viral proteins: an autophagy-like origin for virus-induced vesicles
akt binds prohibitin 2 and relieves its repression of myod and muscle differentiation
novel immunoglobulin domain proteins provide insights into evolution and pathogenesis mechanisms of sars-related coronaviruses
severe acute respiratory syndrome coronavirus orf7a inhibits bone marrow stromal antigen 2 virion tethering through a novel mechanism of glycosylation interference
structure, function, and antigenicity of the sars-cov-2 spike glycoprotein
receptor recognition by the novel coronavirus from wuhan: an analysis based on decade-long structural studies of sars coronavirus
who, coronavirus disease 2019 (covid-19) situation report 120
aggresomes and autophagy generate sites for virus replication
cryo-em structure of the 2019-ncov spike in the prefusion conformation
a new coronavirus associated with human respiratory disease in china
genomic, geographic and temporal distributions of sars-cov-2 mutations
severe acute respiratory syndrome coronavirus group-specific open reading frames encode nonessential functions for replication in cell cultures and mice
probable pangolin origin of sars-cov-2 associated with the covid-19 outbreak
a pneumonia outbreak associated with a new coronavirus of probable bat origin
all authors wish to thank www.gisaid.org and all the researchers who contributed their sequences to the database for sharing fundamental data for research.
we think collaboration is the only approach to counteract the spread of sars-cov-2 and other similar endeavors. references for all sequences used in this work are in supplementary
key: cord-345552-h6fwi0qn authors: li, q.-g.; lindman, k.; wadell, g. title: hydropathic characteristics of adenovirus hexons date: 1997-07-01 journal: arch virol doi: 10.1007/s007050050162 sha: doc_id: 345552 cord_uid: h6fwi0qn the complete nucleotide sequence and the predicted amino acid sequence of the adenovirus type 7 hexon gene were determined. the hydropathy of the hexon proteins from human adenovirus types 2, 3, 4, 5, 7, 12, 16, 40, 41, and 48, bovine adenovirus type 3, murine adenovirus type 1, and avian adenovirus types 1 and 10 was analysed. the presence of purines and pyrimidines in the second position of the codons was correlated to hydrophilicity and hydrophobicity, respectively. comparison of the hydrophilicity plots of eight hexons showed seven hypervariable regions to be distributed on the surface. a large portion of the hypervariable regions manifests hydrophilicity. the strength of the surface charge accumulated on the hydrophilic and hydrophobic regions correlated to the tissue tropism of the different adenovirus types. analysis of codon usage for adenovirus hexons showed that among synonymous codons those with cytidine in the third position were preferably used to a great extent. analysis of the nucleotide and amino acid sequence pair distances and the phylogenetic tree of 14 hexon proteins showed members of subgenera b, d and e to be closely related, especially ad4 and ad16, and subgenus a to be closely related to subgenus f. the family adenoviridae comprises two genera, mastadenovirus and aviadenovirus. the genus mastadenovirus consists of 101 species that have been isolated from nine different host species [24]. the most extensively studied group is that of the human adenoviruses.
so far, 49 human adenovirus (ad) serotypes have been identified [27] and divided into six different subgenera, a to f, which differ from each other in various characteristics such as tropism. adenovirus type 7 (ad7) is the serotype most frequently associated with severe diseases [17]. adenoviruses are non-enveloped icosahedral viruses. the virion contains at least eleven different structural polypeptides. the hexon is the most abundantly produced protein. each virion contains 240 hexons which form the facets of the icosahedron. the nine central hexon capsomers in each facet are cemented together by polypeptide ix to form a group-of-nine [7]. the hexon is a homotrimer consisting of three identical polypeptide chains. the monomer is an unusually large structural protein with an m_r ranging from 102 to 109 k (table 2). as several entire or partial hexon sequences of different serotypes have been published (references cited in table 2), the hexons of several different serotypes have been compared at the nucleotide and predicted amino acid sequence levels [9, 14, 31, 36]. the biochemical and immunological properties and even the three-dimensional structure of adenoviruses have been extensively studied. the trimeric hexon molecule has a pseudo-hexagonal base with a large central cavity and a triangular top. the base contains two pedestal domains, p1 and p2. the top contains three long loops l1, l2 and l4 [3]. however, the hydropathic character of the adenovirus hexons and the codon usage have not been analysed. the hydropathy of a protein is very important for predicting putative antigenic regions of the protein. the antigenic determinants can be deduced by searching the amino acid sequence for the areas of greatest local hydrophilicity. generally, the highest peak of hydrophilicity correctly predicts an antigenic site [11].
codon usage bias for codons with thymidine (t) or cytidine (c) at second positions in mitochondrial dna has been shown to be correlated with hydrophobicity [21]. having analysed nine adenovirus hexon genes, we found the presence of purine and pyrimidine at the second codon position to be correlated to hydrophilicity and hydrophobicity, respectively [19]. this finding was recently confirmed by analysis of further data obtained from genbank. here, we report the hydropathy analysis of 14 adenovirus hexon sequences predicted from a newly determined ad7 hexon dna sequence and thirteen published hexon sequences of ad2, ad3, ad4, ad5, ad12, ad16, ad40, ad41, ad48, bav3, mav1, fav1 and fav10. there is strong correlation between hydrophilicity and the codons with purine in the second position (cpuss), and between hydrophobicity and the codons with pyrimidine in the second position (cpyss) in adenovirus hexons. comparison of the surface charge on these 14 hexons suggests the strength of the surface charge accumulated in hydrophilic regions or hydrophobic regions to be correlated with the tissue tropism of the different adenovirus types. the ad7 prototype strain gomen, originally from the american type culture collection (atcc), was obtained from dr. g. von zeipel, stockholm county council central microbiological laboratory, stockholm, sweden [34]. this strain was propagated in a549 cells. the viral dna of ad7 strain gomen was prepared as described [18]. the purified dna was used as the template for pcr to amplify five fragments which covered the entire hexon gene. the pcr products were cloned into pt7 blue vector. ligation, transformation and rapid screening were performed using the pt7 blue t-vector kit protocols (novagen, madison, wi, usa). the recombinant pt7-blue plasmid dna was prepared for sequencing.
as the pcr method occasionally gave rise to errors when the complementary dna strand was synthesised using taq polymerase, the sequence so obtained was confirmed with the following procedure: the ad7 hexon dna bamhi restriction fragments a and e, which contain the hexon gene, were cloned into plasmid vector pbr322 and multiplied in e. coli strain hb101 using conventional methods described previously [25]. these recombinant plasmid dnas were used as the sequencing templates. the nucleotide sequences were determined for both strands with the dideoxynucleotide chain termination method, following procedure c of the autoread sequencing kit (pharmacia lkb biotech, sweden) and the manufacturer's instructions for the alf dna sequencer (pharmacia). all the sequence data were analysed with the lasergene software (dnastar inc., madison, wi) programs editseq, megalign and protean. every hexon dna sequence was translated to protein sequence by using program editseq-translation. the codon usage data could be obtained when a protein file was opened by editseq. the hydrophilicity plot (kyte-doolittle method) could be shown by program protean. all eight hydrophilicity plots could be moved together by the program microsoft powerpoint (fig. 1). the isoelectric point and the net charge of a protein could be obtained by program protean-composition (table 2). tabular data of hydrophilicity and hydrophobicity for each amino acid could be obtained by program protean-tabulardata. the accumulated charge of all hydrophilic regions of a protein could be obtained by deleting manually all the hydrophobic regions according to the tabular data. an easier way was, first, to delete the hydrophobic regions within the dna sequence; second, to translate this dna sequence to an amino acid sequence; and then to obtain the accumulated charge of hydrophilic regions by program protean-composition. the accumulated charge of hydrophobic regions could be obtained in the same way (table 2).
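the separation of a protein into hydrophilic and hydrophobic parts described above can be illustrated with a short python sketch. it is not the protean program used in the study: it assumes the standard kyte-doolittle scale (positive values are hydrophobic), a sliding-window average with an assumed window of 9 residues truncated at the sequence ends, and a simple sign threshold; the function names are hypothetical.

```python
# Kyte-Doolittle hydropathy values (positive = hydrophobic).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def hydropathy_profile(protein, window=9):
    """Mean Kyte-Doolittle score in a window centred on each residue."""
    scores = [KD[aa] for aa in protein.upper()]
    half = window // 2
    profile = []
    for i in range(len(scores)):
        lo, hi = max(0, i - half), min(len(scores), i + half + 1)
        profile.append(sum(scores[lo:hi]) / (hi - lo))  # window is truncated at the ends
    return profile

def split_by_hydropathy(protein, window=9):
    """Return (hydrophilic residues, hydrophobic residues) as two strings."""
    hydrophilic, hydrophobic = [], []
    for aa, score in zip(protein.upper(), hydropathy_profile(protein, window)):
        (hydrophilic if score < 0 else hydrophobic).append(aa)
    return "".join(hydrophilic), "".join(hydrophobic)

philic, phobic = split_by_hydropathy("IIIIRRRR", window=1)
```

the two returned strings correspond to the "accumulated" hydrophilic and hydrophobic regions of the text; a composition or charge calculation could then be run on each part separately. residues outside the 20 standard amino acids raise a `KeyError` in this sketch.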
the data of codon usage of accumulated hydrophilic and hydrophobic regions from 14 different serotypes were analysed by the computer program microsoft excel (tables 1 and 5). the complete hexon gene dna sequence of ad7 prototype strain, gomen, was determined. it consists of 2814 nucleotides. this dna sequence, together with two short flanking regions at each end, has been entered in the european molecular biology laboratory (embl, accession number: z48571). the 5' flanking region has a splice acceptor site. the 3' flanking region ends just before the start codon of proteinase. the sequence of the predicted protein, consisting of 937 amino acids, was obtained with the lasergene software program editseq.
fig. 3. the figures around the boxes denote the amino acid numbers at the start and the end of each hypervariable region
the hydropathy data of hexon proteins from human adenovirus types 2, 3, 4, 5, 7, 12, 16, 40, 41, and 48, bovine adenovirus type 3, murine adenovirus type 1, and avian adenovirus types 1 and 10 were derived using the prediction method of kyte-doolittle in the lasergene computer program protean. analysis of the codon usage for the codons corresponding to hydrophilic and hydrophobic regions showed the presence of cpuss and cpyss to be strongly correlated with hydrophilicity and hydrophobicity, respectively. statistical analysis using the chi square test showed a highly significant difference (χ² = 192.79; p < 0.001) for the total numbers of cpuss and cpyss in hydrophilic and hydrophobic regions (table 1). the multiple sequence alignments of the 14 complete nucleotide sequences and the amino acid sequences of adenovirus hexons were determined by the program megalign (data not shown). the nucleotide and amino acid sequence pair distances and the phylogenetic tree of 14 hexon proteins showed serotypes of subgenera b, d and e to be closely related (table 3 and fig. 2). they manifest 82.0-93.9% amino acid sequence homology.
ad4 (subgenus e) is very similar to ad16 (subgenus b), the amino acid sequence homology between these two serotypes reaching 93.4%. this is in agreement with the cross neutralisation seen between antihexon sera specific for ad4 and ad16 [22]. the alignment data also showed ad12 (subgenus a) to be closely related to ad40 and ad41 (subgenus f), the level of sequence homology being 81.5-87.1%. in contrast, the sequence similarity of e3a proteins between ad12 and ad41 was very low, only 30% in e3 rl1 and 34% in e3 rl2 [37]. the alignment of the 14 complete sequences of adenovirus hexons showed nucleotide sequence homology to be 77-94% within a subgenus of human adenoviruses, 65-77% between members of different subgenera with the exception of ad4 and ad16 (86.2%); 64-94% between adenoviruses from the same host species and 35-59% between adenoviruses from different host species; and 35-43% between the two adenovirus genera mastadenovirus and aviadenovirus. the alignment of the 14 complete amino acid sequences of adenovirus hexons revealed the existence of seven hypervariable regions (fig. 3). hydrophilicity plots derived from the hexon sequences of eight serotypes representing eight subgenera of mastadenovirus were compared (fig. 1). seven hypervariable regions were demonstrated (in boxes, corresponding to the boxes in fig. 3). the first four hypervariable regions, a1 to a4, covered most of the external loop 1 (l1) of the hexon. they were separated by three short relatively conserved sequences which may stabilise the outer shell structure of the hexon. the l2 contains two hypervariable regions, b1 and b2. all of the hypervariable regions in l1 and l2 contained longer or shorter deletions/insertions. the amino acid sequences of these regions were serotype specific. the hypervariable region d of mastadenovirus, consisting of 13-14 amino acids, was located in l4. amino acid sequence homology showed the region d to be subgenus specific.
sequence homology was 93-100% for pairs of serotypes within a subgenus, and less than 79% between serotypes belonging to different subgenera. interestingly, the major portion of the hypervariable regions manifests hydrophilicity. according to the ribbon representation of the ad2 hexon subunit [3], the major portion of the hydrophilic regions in hypervariable areas was located at the surface of the hexon molecule.
codons with cytidine in the third position are highly preferred
analysis of codon usage for the 14 serotypes showed that among the synonymous codons those with cytidine in the third position (nnc) were highly preferred (table 1). among cpuss the bias for nnc codons was strong. however, although the codon preference for nnc among cpyss was generally manifest, this was not the case for the nnc codons for the hydrophobic amino acids ile, leu and val. the isoelectric point and the surface charge for hexons of the 14 serotypes were deduced with the program lasergene-protean-composition (table 2). at ph 7, the charges of the hexon proteins from different subgenera varied widely. the hexons of ad2 and ad5 belonging to subgenus c carried the highest negative charge, -26.73 and -22.57, respectively. these charges were about 2.5 times greater than those of hexons of subgenus f, bav3, mav1 and fav10. more interestingly, the strength of the accumulated charge of hydrophilic or hydrophobic regions correlated with tissue tropism. the prototypes of ad3, ad7, ad2 and ad5 were isolated from respiratory tract specimens. the prototype of ad3 was isolated from nasal washing, ad7 from throat washing, and both ad2 and ad5 from adenoid tissue cultures [6]. these adenoviruses were frequently isolated from patients with respiratory diseases [26]. the hexons of these four serotypes were characterised by strong accumulated charges of -12.3 to -20.97 in hydrophilic regions, but manifestly weaker charges of only -3.62 to -5.85 in hydrophobic regions.
the prototypes of ad16 and bav3 were isolated from conjunctival scrapings and conjunctiva, respectively [6]; the hydrophobic regions of their hexons manifest charges of -11.62 and -10.51. ad4 can be isolated both from the eye and the respiratory tract [16]. the negative charges of the hydrophilic and the hydrophobic regions of the ad4 hexon were similar, -8.16 and -9.48, respectively. the prototypes of ad12, ad40, ad41 and fav10 were isolated from faeces or rectal swabs [6, 8, 32]. all four of these enteric adenoviruses were characterised by lower negative charges of -3.81 to -7.78 in both hydrophilic and hydrophobic regions of their hexons. mav1 isolated from spleen and fav1 isolated from allantois [6] manifested individually unique patterns of surface charge. the hydropathic type and value of an amino acid is dependent on its codon usage and its position in a protein structure. scales for evaluating the hydropathic characteristics of amino acids have been developed by many different research groups. however, the hydrophilicity and hydrophobicity values obtained differ substantially [10, 11, 15]. the most frequently used scales, introduced by kyte and doolittle, were used in the present study. as ranked on hydropathy scales [15], the 20 common amino acids can be divided into three different classes. the first is the hydrophobic class including ile, val, leu, phe and cys. of these five amino acids, only cys is encoded by cpuss, the other four being encoded by cpyss. the second is the hydrophilic class consisting of his, glu, gln, asp, asn, lys and arg, all of [...] in general, an a+g-rich region (a, adenosine; g, guanine) in a nucleotide sequence contains more cpus codons, and a c+t-rich region contains more cpys codons. therefore, the part of a protein which is encoded by an a+g-rich region usually manifests hydrophilicity, and a peptide encoded by a c+t-rich region usually shows hydrophobic characteristics.
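the classification of codons by the base at their second position, on which the argument above rests, can be written down directly. the python sketch below is illustrative only: the labels follow the abbreviations cpuss and cpyss used in the text, while the function names and the example sequence are assumptions.

```python
def second_position_class(codon):
    """Classify a codon as purine (A/G) or pyrimidine (C/T) at its second position."""
    base = codon.upper().replace("U", "T")[1]
    if base in "AG":
        return "CPuSS"
    if base in "CT":
        return "CPySS"
    raise ValueError(f"unexpected base in codon {codon!r}")

def count_classes(dna):
    """Count CPuSS and CPySS codons in an in-frame coding sequence."""
    counts = {"CPuSS": 0, "CPySS": 0}
    usable = len(dna) - len(dna) % 3  # ignore a trailing partial codon
    for i in range(0, usable, 3):
        counts[second_position_class(dna[i:i + 3])] += 1
    return counts

# ATT (Ile) and GTT (Val) are CPySS; AAA (Lys) and GAA (Glu) are CPuSS.
example_counts = count_classes("ATTAAAGTTGAA")
```

as the text notes for the hydrophobic class, cys is the exception: its codons (tgt, tgc) carry a purine at the second position and are therefore counted as cpuss by this rule.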
the hydropathy value of an amino acid in a protein chain is dependent on the protein conformation. the residue side-chains can protrude at the interior or exterior portion of a protein chain. the hydropathy value of an amino acid was determined by averaging over a window which contains several consecutive residues surrounding the amino acid in question [15]. therefore, the strength and even the type of the hydropathy of an individual amino acid may vary according to its location within a peptide chain. in particular, the hydropathy of the amino acids in the intermediate class was found to vary more frequently and in a more pronounced way (table 1). the major portion of hydrophobic sequences in a protein will be found in the interior of the native structure, and the major portion of hydrophilic sequences will be found on the exterior [15]. branden and tooze [5] found that the hydropathy plots (kyte-doolittle method) agree with the crystal structure data in a polypeptide of r. sphaeroides. in this study, we found the major portion of the hypervariable areas to manifest hydrophilicity. the major portion of the hydrophilic regions in hypervariable areas is located at the surface of the crystal structure of the hexon. the nnc codons in adenovirus hexons manifested high usage. the result is compatible with that derived from analysis of 28 different genes of ad2 (except for phe) [33]. among the synonymous codons for phe, usage of uuc was greater than that of uuu in the 14 hexon genes (table 1). however, the reverse was true of the 28 different ad2 genes, with uuc accounting for 1.44% and uuu for 2.07%. phe with its aromatic side chain is highly hydrophobic, nonpolar and chemically stable. the variability of the phe residue is one of the lowest during divergence of homologous proteins [10]. the patterns of codon usage for phe in hexon genes were shown to be subgenus specific (table 4).
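the comparison of uuc and uuu usage above comes down to counting in-frame codons. the python sketch below shows one way such counts could be obtained; the function names and the short test sequence are hypothetical, and the nnc fraction is taken over all codons rather than within each synonymous family, which is a simplification of the analysis described here.

```python
from collections import Counter


def codon_counts(cds):
    """Count in-frame codons of a coding sequence given as DNA or RNA letters."""
    cds = cds.upper().replace("T", "U")
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    return Counter(cds[i:i + 3] for i in range(0, usable, 3))


def third_position_c_fraction(counts):
    """Fraction of all counted codons that end in C (the NNC codons)."""
    total = sum(counts.values())
    nnc = sum(n for codon, n in counts.items() if codon.endswith("C"))
    return nnc / total if total else 0.0


# Hypothetical 5-codon fragment: AUG UUC UUU UUC GGC
counts = codon_counts("ATGTTCTTTTTCGGC")
print(counts["UUC"], counts["UUU"])       # 2 1
print(third_position_c_fraction(counts))  # 0.6
```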
the codon usage for phe in the hexon genes of the species of each subgenus is highly similar within subgenera b, c and f. the codon preferably used for phe in hexon genes of members of subgenera a, c and f is uuu, whereas uuc is preferably used in hexon genes of the species of subgenera b, d and e. these findings corroborate the relatedness of the va rna 1 genes [13], and also the relatedness of the hexon genes of human adenoviruses [4]. we found the nnc codons in adenovirus hexon genes to be highly preferred. analysis of the data obtained from genbank [33] ... the c+g content of different organisms is used as one of the criteria in taxonomy. among various eubacteria the c+g content of the genome dna ranges from 25% to 75% [20]. in adenoviridae the dna c+g content ranges from 48% to 61% for mastadenoviruses and from 54% to 55% for aviadenoviruses [24]. in the six different subgenera of human adenoviruses the c+g content of genome dna ranges from 48% to 58% [35]. the c+g content of all 14 serotypes reveals the existence of two groups: a higher c+g content (54.59–58.23%) group including ad48, ad4, bav3, fav1 and fav10, and a lower c+g content (47.48–50.78%) group including the remaining nine serotypes (table 2). the c+g content of human adenovirus hexons was consistent with the level of c+g content of whole adenovirus genome dna, with the exception of subgenus c (ad2 and ad5), which has a higher c+g content (58%) in genome dna. phe codon usage in the 14 hexon genes (serotype column labels not preserved): uuc 12 30 28 30 20 19 38 36 21 17 18 5 31 40; uuu 37 17 20 19 25 28 13 11 29 30 29 46 15 9. there are 15 amino acids encoded by synonymous codons which contain nnc (table 5). of these 15 amino acids, nine manifested high nnc usage, three (asp, ile and pro) showed nnc usage at the same level, and only three (cys, leu and val) showed low nnc usage. cys is a rare amino acid, with only four cysteines in the ad7 hexon. among the synonymous codons for val, seven gucs appeared in the hydrophobic region.
therefore, the conclusion is that the nnc codons in ad7 hexon were frequently used although ad7 is one of the serotypes which contain a lower level of c+g content in hexon dnas.
[1] fav1 (celo) genes encoding for major core, hexon-associated and hexon proteins
[2] the gene for the adenovirus 2 hexon polypeptide
[3] the refined crystal structure of hexon, the major coat protein of adenovirus type 2, at 2.9 Å resolution
[4] phylogenetic relationships among adenovirus serotypes
[5] introduction to protein structure
[6] adenoviruses. in: american type culture collection (atcc) catalogue of animal & antisera, chlamydiae & rickettsie, 6th ed. american type culture collection
[7] the structure of the adenovirus capsid ii. the packing symmetry of hexon and its implications for viral architecture
[8] adenoviruses of chickens: serologic groups
[9] analysis of 15 adenovirus hexon proteins reveals the location and structure of seven hypervariable regions containing serotype-specific residues
[10] reconstructing evolution from contemporary sequences (chapter 3.3) and the properties of liquid water and the characteristics of noncovalent interactions in this solvent
[11] prediction of protein antigenic determinants from amino acid sequences
[12] sequence homology between bovine and human adenoviruses
[13] human and simian adenoviruses: phylogenetic inferences from analysis of va rna genes
[14] adenovirus hexon: sequence comparison of subgroup c serotypes 2 and 5
[15] a simple method for displaying the hydropathic character of a protein
[16] nosocomial conjunctivitis caused by adenovirus type 4
[17] analysis of 15 different genome types of adenovirus type 7 isolated on five continents
[18] genetic relationship between thirteen genome types of adenovirus 11, 34 and 35 with different tropisms
[19] hydropathic character analysis of nine adenovirus hexon sequences
[20] the guanine and cytosine content of genomic dna and bacterial evolution
[21] hydrophobicity and phylogeny
[22] immunological relationships between hexons of certain human adenoviruses
[23] sequence characterisation and comparison of human adenovirus subgenus b and e hexons
[24] adenoviridae
[25] molecular cloning
[26] world-wide epidemiology of human adenovirus infections
[27] two new candidate adenovirus serotypes
[28] genomic mapping and sequence analysis of the fowl adenovirus serotype 10 hexon gene
[29] nucleotide sequence of human adenovirus type 12 dna: comparative functional analysis
[30] dna sequence of the adenovirus type 41 hexon gene and predicted structure of the protein
[31] the adenovirus type 40 hexon: sequence, predicted structure and relationship to other adenovirus hexons
[32] isolation and classification of avian enteric cytopathogenic agents
[33] codon usage tabulated from the genbank genetic sequence data
[34] demonstration of three different subtypes of adenovirus type 7 by dna restriction site mapping
[35] adenoviruses
[36] sequence and structural analysis of murine adenovirus type 1 hexon
[37] genetic organization, size, and complete sequence of eight region 3 genes of human adenovirus type 41
authors' address: dr. g. wadell, department of virology, umeå university, s-901 85 umeå, sweden. received october 25, 1996
key: cord-018963-2lia97db authors: xu, ying; liu, zhijie; cai, liming; xu, dong title: protein structure prediction by protein threading date: 2010-04-29 journal: computational methods for protein structure prediction and modeling doi: 10.1007/978-0-387-68825-1_1 sha: doc_id: 18963 cord_uid: 2lia97db the seminal work of bowie, lüthy, and eisenberg (bowie et al., 1991) on “the inverse protein folding problem” laid the foundation of protein structure prediction by protein threading. by using simple measures for fitness of different amino acid types to local structural environments defined in terms of solvent accessibility and protein secondary structure, the authors derived a simple and yet profoundly novel approach to assessing if a protein sequence fits well with a given protein structural fold.
their follow-up work (elofsson et al., 1996; fischer and eisenberg, 1996; fischer et al., 1996a,b) and the work by jones, taylor, and thornton (jones et al., 1992) on protein fold recognition led to the development of a new brand of powerful tools for protein structure prediction, which we now term “protein threading.” these computational tools have played a key role in extending the utility of all the experimentally solved structures by x-ray crystallography and nuclear magnetic resonance (nmr), providing structural models and functional predictions for many of the proteins encoded in the hundreds of genomes that have been sequenced up to now.
what has made protein threading particularly attractive as a protein structure prediction tool is the observation that the number of unique structural folds in nature is a few orders of magnitude smaller than the number of proteins in nature (finkelstein and ptitsyn, 1987; bairoch et al., 2005). although this is still not a fully resolved issue, both theoretical and statistical studies (murzin et al., 1995; brenner et al., 1996; li et al., 1996, 1998, 2002; wang, 1996; orengo et al., 1997; holm and sander, 1996a; zhang and delisi, 1998) suggest that the number of unique structural folds in nature ranges from a few hundred to a few thousand. clearly this is a significantly smaller number than the number of proteins in nature: as we understand now, the number of different living organisms on earth could range from millions to hundreds of millions (may, 1988). since each organism often has at least thousands of different proteins encoded in its genome, the total number of different proteins in nature is at least in the tens of billions or possibly significantly higher, even without considering protein variants such as alternatively spliced proteins. this disparity suggests an effective paradigm for possibly solving all protein structures through combining experimental and computational approaches, that is, to solve structures of proteins with unique structural folds using the expensive and time-consuming experimental techniques and computationally model the rest of the proteins using the experimental structures as templates. this is the key strategy currently being employed by the worldwide structural genomics efforts (gaasterland, 1998; skolnick et al., 2000; baker and sali, 2001). the basic idea of protein threading is to place (align or thread) the amino acids of a query protein sequence, following their sequential order and allowing gaps, into structural positions of a template structure in an optimal way measured by fitness scores outlined above.
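the placement step just described can be illustrated with a small dynamic program. the python sketch below is a deliberately simplified version, assuming that fitness depends only on a residue and the environment of the position it occupies and that gaps carry a fixed penalty; the names `thread` and `toy_energy`, the gap value, and the two-environment alphabet are hypothetical. scores that couple pairs of residues are not handled by this simple recursion.

```python
def thread(seq, template_envs, energy, gap=1.0):
    """Globally align seq to template positions, minimising total energy.

    energy(aa, env) is a residue-environment score where lower is better.
    Returns (total_energy, alignment); alignment is a list of pairs
    (sequence_index or None, template_index or None).
    """
    n, m = len(seq), len(template_envs)
    # best[i][j]: minimum energy for seq[:i] threaded onto template_envs[:j]
    best = [[0.0] * (m + 1) for _ in range(n + 1)]
    move = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        best[i][0], move[i][0] = i * gap, "skip_residue"
    for j in range(1, m + 1):
        best[0][j], move[0][j] = j * gap, "skip_position"
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            placed = best[i - 1][j - 1] + energy(seq[i - 1], template_envs[j - 1])
            options = [
                (placed, "place"),
                (best[i - 1][j] + gap, "skip_residue"),
                (best[i][j - 1] + gap, "skip_position"),
            ]
            best[i][j], move[i][j] = min(options)
    # Trace back from the full problem to recover the placements.
    alignment, i, j = [], n, m
    while i > 0 or j > 0:
        step = move[i][j]
        if step == "place":
            alignment.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif step == "skip_residue":
            alignment.append((i - 1, None))
            i -= 1
        else:
            alignment.append((None, j - 1))
            j -= 1
    return best[n][m], alignment[::-1]


def toy_energy(aa, env):
    """Toy score: hydrophobic residues like 'buried', others like 'exposed'."""
    hydrophobic = aa in "AILMFVW"
    return -1.0 if hydrophobic == (env == "buried") else 1.0


total, placement = thread("LKV", ["buried", "exposed", "buried"], toy_energy)
print(total, placement)  # -3.0 [(0, 0), (1, 1), (2, 2)]
```

repeating this call over a library of templates and keeping the lowest-energy result gives a toy version of fold recognition.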
this procedure will be repeated against a collection of previously solved protein structures for a given query protein. these sequence-structure alignments, i.e., the query sequence against different template structures, will be assessed using statistical or energetic measures for the overall likelihood of the query protein adopting each of the structural folds. the "best" sequence-structure alignment provides a prediction of the backbone atoms of the query protein, based on their placements in the template structure. currently, protein threading is being widely used in molecular biology and biochemistry labs, often for initial studies of target proteins, as it may quickly provide structural and functional information, which could be used to guide further experimental design and investigation. as a prediction technique, protein threading has a number of highly challenging computational and modeling problems. these include (a) how to effectively and accurately measure the fitness of a sequence placed in a template structure, (b) how to accurately and efficiently find the best alignment between a query sequence and a template structure based on a given set of fitness measures, (c) how to assess which sequence-structure alignment among the ones against different template structures represents a correct fold recognition and an accurate (backbone) structure prediction, and (d) how to identify which parts of a predicted structure are accurate and which parts are not. as researchers find more effective solutions to these and other challenging problems, we expect that protein threading will play an increasingly significant role in structural and functional studies of proteins. as of now, over a million protein sequences have been determined (bairoch et al., 2005), among which ~30,000 have had their tertiary structures experimentally solved (dutta and berman, 2005).
given that there could be at least tens of billions of different proteins in nature as discussed above, one interesting question, particularly relevant to the idea of protein threading, is how many unique protein structures or structural folds these proteins might have adopted. to answer this question, we need to first look at the basic structural units of proteins, called protein domains (wetlaufer, 1973; richardson, 1981). protein domains are extensively discussed in chapter 4 of this book. here we describe briefly the concept of a domain from the perspective of threading. a structural domain is a distinct and compact structural unit that could fold independently of other domains. while many proteins are single-domain proteins, there are proteins with two, three, or even more structural domains (ekman et al., 2005). our study shows that in the fssp (fold classification based on structure-structure alignment of proteins) nonredundant set (holm and sander, 1996b) of the pdb, 67% of the proteins have single domains, 21% have two domains, 7% have three domains, and the remaining 5% have four or more domains (xu et al., 2000a). this distribution may not necessarily reflect the actual domain distribution for all the proteins in nature as the set of known protein structures in pdb might have overrepresented small proteins due to the relative ease in solving these structures. for eukaryotic organisms, it was estimated that at least two-thirds of their proteins are multidomain proteins (gerstein and hegyi, 1998; gerstein, 1997, 1998; doolittle, 1995; apic et al., 2001a,b). this number is somewhat smaller for bacterial and archaeal organisms but still represents a significant percentage of all their proteins. previously domain partition of multidomain proteins was typically done manually.
to keep up with the rate at which protein structures are being solved, there is a clear need for more automated domain-partitioning methods to process the newly solved structures. currently computer programs are being used for partitioning a protein structure into individual domains. popular programs for this purpose include dali (dietmann and holm, 2001), domainparser (xu et al., 2000a), and pdp (alexandrov and shindyalov, 2003). a protein domain could be part of different protein structures, through combining with other domains. figure 12.1 shows an example of a domain in different proteins. both proteins, rna 3'-terminal phosphate cyclase and glutathione s-transferase, have the thioredoxin fold domain, which has two layers, one with two α-helices and one with four antiparallel β-strands, although some details differ in the two proteins. the parts other than the thioredoxin fold domain in the two proteins have no structural relationship. since domains are the basic structural units of proteins, current studies on the number of unique structural folds in nature have been carried out on protein domains rather than whole proteins (hereafter, the term "proteins" will refer to single-domain proteins for simplicity of discussion). to estimate the number of unique folds of proteins, one popular approach is through examining all protein families and the relationships between protein families and unique structural folds. using the definition of the scop classification scheme (murzin et al., 1995; brenner et al., 1996), a protein family represents a group of orthologous proteins (makarova et al., 1999; gerlt and babbitt, 2000; tatusov et al., 2000; gelfand et al., 2000). the number of protein families in nature could possibly be estimated through finding orthologous gene groups covering all the genes in the genomes that have been sequenced. one such estimate suggests that there are 23,100 such protein families (orengo et al., 1994).
this number has been used in later studies on estimating the number of unique structural folds in nature. other studies estimate this number to be in the tens of thousands (koonin et al., 2002). whether it is 23,100 or some other number ranging from 10,000 to 50,000, the idea is that all proteins in nature fall into one of these families. the estimation on the number of unique structural folds is obtained through estimating the number of families that each structural fold covers and studying the coverage distribution by all the known structural folds. one of the estimates by zhang and delisi (1998) suggests that the number n of unique structural folds is probably around 700. this estimation is based on the observation that the number of structural folds covering x number of protein families follows a power-law distribution, with x being a variable. in essence it says that a few structural folds each cover many families (e.g., tim barrel fold covers 31 protein families) while many structural folds each cover only a small number of families; more generally, the number of structural folds decreases as their coverage of protein families increases. specifically, zhang and delisi proposed a formula which matches well with the known protein families and structural folds. let m and n represent the number of families and the number of unique folds in nature, respectively. the probability that a fold covers exactly x families is given by ... let $m_s$ and $n_s$ be the numbers of protein families and unique structural folds currently having solved structures, respectively. through a simple algebraic transformation, zhang and delisi showed that ... which is used to estimate the number of unique structural folds. using the known numbers of $m_s = 736$ and $n_s = 361$ at the time of the estimation, they estimated that n is roughly about 700, which many researchers will argue to be too small (see following for more discussion).
similar estimations have been made by other researchers, estimating the size of n ranging from a few hundred to a few thousand (orengo et al., 1994; wang, 1996), based on somewhat different assumptions. coulson and moult (2002) recently developed a new model for estimating n, based on the work of zhang and delisi. using the more recent data on the numbers of genes, gene families, and structural folds, they argued that there are two "special" cases that have not been treated well by previous estimation models. based on their argument, they consider that there are three classes of structural folds, which are termed unifolds, mesofolds, and superfolds. unifolds represent structural folds that each cover only one family of proteins, superfolds represent structural folds each of which covers many families, and mesofolds represent structural folds in between. for example, tim barrel covers 31 families, while many unifolds exist in scop. based on their observation, they argued that previous models such as zhang and delisi did not fit well with the data of unifolds and superfolds. so a new piecewise model was then developed which treats unifolds, mesofolds, and superfolds separately. using this new model, coulson and moult (2002) estimated that less than 20% of the protein families belong to unifolds, while 20% of the families belong to a few dozen superfolds and the rest of the protein families belong to mesofolds. considering that the estimated number of protein families ranges from 10,000 to 50,000 (or 23,100 as one of the popular estimates suggests), we can infer that the number of unifolds ranges from 2,000 to 10,000. the number of mesofolds could be estimated using the zhang and delisi model, based on the sizes of $m_s$ and $n_s$, after excluding the unifolds and superfolds. hence, coulson and moult concluded that the most probable size for the number of mesofolds is about 400.
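the unifold range quoted above follows from simple arithmetic on the family estimates, since each unifold covers exactly one family. the short python sketch below reproduces that step; the function name is hypothetical, and the 20% share is treated as an upper bound because the text states "less than 20%".

```python
def unifold_upper_bound(families, percent=20):
    """Upper bound on unifolds if at most `percent` of families are unifolds.

    Each unifold covers exactly one family, so the count of unifold families
    is also the count of unifolds. Integer arithmetic avoids rounding noise.
    """
    return families * percent // 100


print(unifold_upper_bound(10_000), unifold_upper_bound(50_000))  # 2000 10000
print(unifold_upper_bound(23_100))                               # 4620
```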
the number of superfolds is believed to be very small, possibly in the range of low dozens. overall this model suggests that over 80% of the protein families fold into a little over 400 structural folds, the majority of which are already known, while the rest of the protein families each belongs to a unique unifold. the implication of this estimation is that about 80% of the protein families are amenable for structural modeling using protein threading techniques, assuming that at least one protein in each of the meso- and superfolds has its structure solved. if experimental facilities for structure solution strategically select their solution targets to maximally cover all the meso- and superfolds, we could expect that at least 80% of the protein families will be structurally modelable in the near future. this is exactly the strategy that the national institutes of health (nih) is using in its protein structure initiative (http://www.nigms.nih.gov/psi/). for the remaining 20% of protein families, it might take some time to have at least one solved structure in each of the unifolds. hence, the threading technique will be less applicable for this class of proteins, at least in the near future. there are a number of popular schemes and associated databases for classification of proteins into structural folds, including scop (murzin et al., 1995), cath (orengo et al., 1997), and fssp (holm and sander, 1996b). these classification schemes classify all solved protein structures into different structural folds and subclasses of structural folds. the classification of protein structures is essentially achieved through grouping protein structures into clusters of similar structures, which can be done computationally through structure-structure alignments (holm and sander, 1996a). scop (murzin et al., 1995; brenner et al., 1996; andreeva et al., 2004) groups all protein structures essentially into a three-level classification tree.
at the top level, scop (scop1.65) currently consists of about 800 structural folds, each of which is further divided into superfamilies and then into families. while a family represents a group of orthologous proteins, a superfamily represents a group of homologous proteins, possibly made of multiple families. currently scop consists of about 1300 superfamilies and about 2400 families. among the 800 structural folds, 489 have only one family each, which might represent unifolds; and 9 cover a large number of families each, which are considered as superfolds by coulson and moult. one thing worth noting is that among the 800 scop folds, only 36 represent membrane proteins. this is a reflection of the fact that only 2% of all the solved protein structures are membrane proteins (http://blanco.biomol.uci.edu/membraneyroteins...xtal.html). this suggests that threading is generally not applicable to structure prediction of membrane proteins, at the present time. scop's hierarchical classification of structural folds provides a convenient tool for applications of threading methods, as query proteins falling into a scop protein family are generally expected to have accurate structure predictions, while proteins with structural homologues in a scop superfamily will have a good chance to have the correct structural folds identified and some portions of their backbone structures predicted accurately. in general, it still represents a challenge to correctly identify the structural fold by a threading method if a query protein only has a structural analogue (i.e., similar structure but not homologous) in scop. the realization that protein structures are clustered into structural folds in the structure space and the number of such clusters is possibly quite small has led to a new way of predicting protein structures in a more efficient and effective manner.
the general belief is that different proteins fold into similar 3d shapes because at some level, they share similar interaction patterns among their residues and between the residues and the environments. it has been shown that these interaction patterns could possibly be captured using simple statistics-based energy models as exemplified by the earlier work of eisenberg and colleagues (bowie et al., 1991; fischer et al., 1996a,b; fischer and eisenberg, 1996), the work of sippl and colleagues (sippl, 1990), and others (jones et al., 1992; rost et al., 1997). these simple statistics-based energy functions have been used, for many cases, to distinguish the correct structural folds from the incorrect ones and to distinguish the correct placements of the residues in a query protein into the structural positions of a correct structural template from the incorrect ones. placing the (backbone atoms of) residues of a query protein into the correct structural positions in a correct structural fold gives a prediction of the backbone structure of the query protein. to accomplish this, one would need two capabilities: (a) an energy function whose global minimum will correspond to the correct placement of residues into the correct structural template, and (b) a computational algorithm that can find the global minimum of the given energy function. we explain the basic idea of developing such energy functions in this section and leave the algorithmic issues to the next one. in their earlier work (bowie et al., 1991; fischer et al., 1996a,b; fischer and eisenberg, 1996), eisenberg and colleagues demonstrated that simple residue-based, instead of atom-based, energy functions could provide substantial discriminating power in separating good from poor placements of individual residue types into different structural environments, justifying the usage of residue-based energy functions.
in their work, structural environments are simply defined in terms of two parameters, solvent accessibility sol and secondary structure ss. specifically the quantity sol of solvent accessibility is discretized into a number of intervals, say 30–40% exposed to the solvent. a secondary structure could be a helix, a beta-strand, or a loop, or it could be defined in terms of more refined categories of secondary structures, say including different types of turns. then a structural environment for each residue in a template structure could be defined using (sol, ss), say (0-10% exposed, alpha-helix). statistics could be collected from a collection of solved protein structures about how frequently a particular type of amino acid appears in a particular structural environment as we just defined. this can be done by going through all protein structures under consideration to count the number of occurrences of each amino acid type in each encountered structural environment. if we consider, say, three levels of solvent accessibility, {exposed, intermediately exposed, buried}, and three types of secondary structures, we will have nine types of structural environment. under this assumption, the result of counting the numbers of occurrences above will be a 20 by 9 table, with each of the 20 rows representing an amino acid type and each of the 9 columns representing a structural environment. based on the collected statistics, we can build a simple preference model to measure how preferred a particular amino acid type is to a particular structural environment. this can be done using the following measure: $-\ln(o_{i,j}/e_{i,j})$, where $o_{i,j}$ represents the observed frequency of amino acid type i in structural environment j and $e_{i,j}$ represents the expected frequency of amino acid type i in structural environment j. in the work of eisenberg and colleagues, $e_{i,j}$ is estimated using the frequency of amino acid type i in all proteins under consideration.
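the counting procedure and the resulting preference score can be sketched in a few lines of python. this is a minimal illustration under stated assumptions: the observed frequency is taken as the share of amino acid type i among all residues seen in environment j, the expected frequency is the overall share of type i, and residue-environment pairs that never occur are simply left out (a real implementation would need some treatment of zero counts). the function name and the toy data are hypothetical.

```python
import math
from collections import Counter


def singleton_energy(observations):
    """Build e[(aa, env)] = -ln(observed / expected) from (aa, env) pairs.

    observed: frequency of aa among residues found in env (an assumption
    about how the frequency is normalised).
    expected: frequency of aa among all residues, as described above.
    """
    pair_counts = Counter(observations)
    env_counts = Counter(env for _, env in observations)
    aa_counts = Counter(aa for aa, _ in observations)
    total = len(observations)
    energy = {}
    for (aa, env), n in pair_counts.items():
        observed = n / env_counts[env]
        expected = aa_counts[aa] / total
        energy[(aa, env)] = -math.log(observed / expected)
    return energy


# Toy statistics: leucine mostly buried, lysine mostly exposed.
toy = ([("L", "buried")] * 3 + [("L", "exposed")]
       + [("K", "buried")] + [("K", "exposed")] * 3)
table = singleton_energy(toy)
print(table[("L", "buried")] < 0 < table[("L", "exposed")])  # True
```

with three accessibility levels and three secondary-structure types, the same dictionary would hold up to 20 x 9 entries, matching the table described above.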
hence, if an amino acid type i has a higher frequency in a particular structural environment j than its frequency overall, it will be assigned a negative score $-\ln(o_{i,j}/e_{i,j})$; otherwise it will get a positive score or zero (when $o_{i,j} = e_{i,j}$). the higher $o_{i,j}$ is compared to $e_{i,j}$, the more negative the corresponding energy is. a popular name for this type of energy function is singleton energy. by performing statistical analysis on a database, one can get the scoring function using the above formulation for the 20 amino acid types appearing in the nine structural environments. when building such statistics-based energy functions, one needs to be careful in selecting the data set for statistics collection. for example, some proteins in scop (or in pdb) have more homologous structures than the others, which could possibly lead to biased statistics. to remove this type of statistical bias in our data set, one needs to remove homologous structures in the data set for statistics collection. there are a number of databases for this purpose, such as the nonredundant sequence representatives in fssp, pdb-select (hobohm et al., 1992), and pisces (wang and dunbrack, 2003), where no two proteins have higher than a certain level of sequence similarity. another statistics-based energy function widely used in threading programs is often called pairwise interaction energy. it measures the preference of having two particular types of amino acids that are spatially close. one particular form of such an energy function was developed by sippl (1990). the basic idea of this energy function is to compare the observed frequency of a pair of amino acids within a certain distance in solved protein structures with the expected frequency of this pair of amino acid types in a protein structure.
the basic idea of such an energy function comes from statistical mechanics where the probability $p_{ij}$ of having a pairwise interaction between residues i and j has the boltzmann correlation to its energy $g_{ij}$ (gibbs free energy), defined as $p_{ij} = \frac{1}{z}\exp(-g_{ij}/(kt))$, where k is the boltzmann constant, t is the temperature, and z is a partition function. when using a residue-averaged state as a reference state g(p), a knowledge-based potential can be derived using ... specifically, if $n_o(i, j)$ and $n_e(i, j)$ represent the observed and expected numbers of pairs of amino acid types i and j within a certain distance, respectively, we can use the following to measure the preference of having these two types of amino acids close to each other: ... while $n_o(i, j)$ can be collected by going through the protein structures in the sample set, an accurate estimation of $n_e(i, j)$ represents a challenge. there have been a number of proposed models for estimating this quantity. among these models, the simplest one is the "independent reference state" model (xu et al., 1998), in which $n_e(i, j)$ is estimated as follows: ... table 12.2 shows a scoring function for the preference between 20 types of amino acids using the above formulation (xu et al., 1998). there are more sophisticated models for defining the reference state, one of which is the uniform distribution model (sippl, 1990; dewitte and shakhnovich, 1996; lu and skolnick, 2001; samudrala and moult, 1998), as discussed later. these more complex models take more factors into consideration in building the reference state, hence making the energy models more likely to be accurate. in addition to using physics- or statistics-based energy functions, researchers have incorporated evolutionary information into energy model building. one of the earlier major improvements in energy function modeling is the incorporation of sequence profile information (panchenko et al., 2000; zhou and zhou, 2005) derived from homologous proteins into the energy models outlined above.
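the observed-versus-expected comparison behind a pairwise interaction energy can be sketched as follows. the python code below is a hedged illustration, not the formulation of xu et al. (1998): it assumes a contact is any residue pair within a fixed cutoff, and it assumes a composition-based reference in which the expected count for a type pair is the total number of contacts times the product of the two residue-type frequencies (doubled for unlike types). the function name, the 7.0 cutoff, and the toy coordinates are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def contact_energy(structures, cutoff=7.0):
    """structures: list of structures, each a list of (type, (x, y, z)).

    Returns e[(a, b)] = -ln(n_obs / n_exp) for unordered type pairs a <= b
    that were observed in contact at least once.
    """
    observed = Counter()
    composition = Counter()
    for residues in structures:
        composition.update(t for t, _ in residues)
        for (ta, pa), (tb, pb) in combinations(residues, 2):
            if math.dist(pa, pb) <= cutoff:
                observed[tuple(sorted((ta, tb)))] += 1
    n_contacts = sum(observed.values())
    n_residues = sum(composition.values())
    energy = {}
    for (a, b), n_obs in observed.items():
        fa = composition[a] / n_residues
        fb = composition[b] / n_residues
        # Assumed reference state: contacts drawn independently by composition.
        n_exp = n_contacts * fa * fb * (1 if a == b else 2)
        energy[(a, b)] = -math.log(n_obs / n_exp)
    return energy


# Toy structure: two leucines close together, two lysines far away.
toy = [[("L", (0.0, 0.0, 0.0)), ("L", (1.0, 0.0, 0.0)),
        ("K", (10.0, 0.0, 0.0)), ("K", (20.0, 0.0, 0.0))]]
print(contact_energy(toy))  # only the ("L", "L") pair is in contact
```

a production version would normally skip residues that are adjacent in sequence, since their proximity is dictated by the chain rather than by any preference.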
it was noticed that when using a sequence profile of a protein family (or superfamily) rather than a single (query) protein sequence, threading accuracy could be significantly improved (panchenko et al., 2000; zhou and zhou, 2005). the basic idea of this generalized approach is that rather than asking the question "will protein sequence a fit well with structural fold b?" we ask the more general question "will the whole family of protein a fit well with structural fold b?" clearly, if done properly, this approach could iron out some of the spurious predictions caused by the peculiarities of specific individual sequences. now a threading problem becomes a fitting problem between a sequence profile and a structural fold. a sequence profile is defined in terms of a multiple sequence alignment of the members of the query protein's family (or superfamily), with each element being the frequency distribution of the 20 amino acid types in this aligned position rather than a specific amino acid. to generalize the aforementioned energy functions to take into account the profile information, we can simply use the relative frequency of each amino acid in the position-dependent distribution as a weight factor when calculating the energy values for each amino acid or amino acid pair, and then sum over all possible amino acids or amino acid pairs. specifically, let $p_i$ be the relative frequency of amino acid type i in a particular aligned position, with $\sum_i p_i = 1.0$. then the eisenberg type of energy for a position placed in structural environment j can be calculated as $$e_{singleton} = -\sum_i p_i \ln \frac{o_{i,j}}{e_{i,j}}.$$ similarly, pairwise interaction energy could be generalized as follows: $$e_{pair} = \sum_i \sum_j p_i \, p'_j \, e(i, j),$$ where $p_i$ and $p'_j$ are the relative frequencies of amino acid types i and j in the two interacting aligned positions. other types of energy functions have also been used in existing threading programs, including fitness scores between specific amino acids and the secondary structures in the template structure, and threading alignment gap penalties. typically these energy scores are combined using a simple weighted sum, with scaling factors that are often determined empirically on some training data.
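the profile-weighted generalization described above can be sketched as follows. the sketch assumes a profile column is a dictionary mapping amino acid types to relative frequencies, and that singleton and pairwise energies are already available as lookup tables; all names (`profile_singleton_energy`, `profile_pair_energy`, the table layouts) are hypothetical and chosen only for illustration.

```python
def profile_singleton_energy(column, environment, singleton):
    """Frequency-weighted singleton energy of one profile column.

    column: {amino_acid: relative frequency} for one aligned position.
    singleton: {(amino_acid, environment): energy} lookup table (assumed given).
    """
    total = sum(column.values())  # renormalise in case frequencies do not sum to 1
    return sum((freq / total) * singleton[(aa, environment)]
               for aa, freq in column.items())


def profile_pair_energy(column_a, column_b, pair):
    """Frequency-weighted pairwise energy between two profile columns.

    pair: {frozenset of one or two amino acid types: energy}, so the
    lookup is symmetric in the two residues (an assumed table layout).
    """
    total_a = sum(column_a.values())
    total_b = sum(column_b.values())
    energy = 0.0
    for aa_a, freq_a in column_a.items():
        for aa_b, freq_b in column_b.items():
            weight = (freq_a / total_a) * (freq_b / total_b)
            energy += weight * pair[frozenset((aa_a, aa_b))]
    return energy


# Toy tables, values chosen only for illustration.
singleton = {("L", "buried"): -1.0, ("V", "buried"): -0.5}
pair = {frozenset(("L", "K")): 0.2, frozenset(("L", "E")): 0.4}

e_single = profile_singleton_energy({"L": 0.5, "V": 0.5}, "buried", singleton)
e_pair = profile_pair_energy({"L": 1.0}, {"K": 0.5, "E": 0.5}, pair)
```

with a single-sequence "profile" (one amino acid with frequency 1.0 per column), both functions reduce to the plain table lookups of the non-profile energy terms.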
it has been observed that distance-dependent pairwise interaction energy could provide more accurate threading results than distance-independent models as outlined above. a distance-dependent energy could be estimated as follows: $$u(i, j, r) = -kt \ln \frac{n_o(i, j, r)}{n_e(i, j, r)},$$ where r is the distance between residues i and j (possibly measured between their c-beta atoms), $n_o(i, j, r)$ is the observed number of pairs of residues (i, j) within a distance bin from $r - \Delta r/2$ to $r + \Delta r/2$ in a database of folded structures for some bin width $\Delta r$, and $n_e(i, j, r)$ is the expected number of pairs (i, j) within the same distance bin. the challenging issue in accurately estimating the interaction energy u(i, j, r) is how to estimate $n_e(i, j, r)$. under the assumption that we are dealing with an ideal infinite liquid-state system within a volume v and that residues are distributed uniformly (sippl, 1990; dewitte and shakhnovich, 1996; lu and skolnick, 2001; samudrala and moult, 1998), $n_e(i, j, r)$ can be estimated using $$n_e(i, j, r) = \frac{n_i \, n_j}{v} \, 4\pi r^2 \Delta r,$$ where $n_i$ and $n_j$ are the numbers of amino acid types i and j in the protein database, respectively. researchers have realized that this model needs to be corrected when dealing with finite systems like a protein structure, to make the model more accurate when used in threading programs. two particular corrections are made in the dfire (distance-scaled ideal gas reference state) energy model (zhou and zhou, 2002), a popular energy function for threading. in the first correction, dfire uses $r^{\alpha}$ instead of $r^2$, considering that the number of interaction pairs in a finite system could not actually reach the level of $r^2$ as in an infinite system; here $\alpha < 2$ is determined by minimizing the distribution fluctuation of interaction distance on a set of training data. in the second correction, dfire assumes that only short-range interactions need to be considered. that is, interaction energy becomes zero when the distance between the interacting pairs is beyond a cutoff distance $r_{cut}$.
after these corrections, the interaction energy could be estimated using the following formula: $$u(i, j, r) = \begin{cases} -\eta \ln \dfrac{n_o(i, j, r)}{(r/r_{cut})^{\alpha} \, (\Delta r/\Delta r_{cut}) \, n_o(i, j, r_{cut})}, & r < r_{cut}, \\ 0, & r \geq r_{cut}, \end{cases}$$ where $\Delta r_{cut}$ is the bin width at $r_{cut}$, and the constant $\eta$ is related to the system temperature and can be determined empirically. these simple energy models have played key roles in making threading programs as popular and as useful as they are today. while they have been used to help solve many structure prediction problems, the limitations of these simple models have also become quite clear from the recent casp prediction results: the improvement in prediction accuracy has been only incremental in the past few casps (kinch et al., 2003). one of the key reasons for the incremental improvement is the crudeness of the threading energy functions. currently, the algorithmic techniques for protein threading have advanced to a stage where they should be able to handle more sophisticated energy models in the threading framework, which could lead to more accurate predicted structures. we can expect that more detailed and more physics-based energy functions will emerge in the near future, as the field is clearly in need of more accurate energy models for protein threading. the general form of a threading energy function could be written as follows: $$e = e_s + \alpha \, e_p + \beta \, e_{gap},$$ where $e_s$ measures the overall fitness of putting individual residue types into specific structural environments, $e_p$ measures the total interaction energy among pairs within the cutoff distance, and $e_{gap}$ represents the total penalty for the gaps in a sequence-structure alignment. the scaling factors $\alpha$ and $\beta$ are typically determined empirically by optimizing the performance of a threading program on a representative set of protein pairs. with the optimized $\alpha$ and $\beta$ values, the goal of threading is to find an alignment (or placement) between a query protein sequence and a template structure that optimizes the energy function.
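as a small illustration of how the three energy terms are combined, the following sketch evaluates the total energy of one given placement. it assumes the weighted-sum form e_s + alpha * e_p + beta * e_gap, a precomputed list of template position pairs lying within the cutoff distance, and a uniform per-gap penalty; the function name, argument layout, and parameter values are hypothetical.

```python
def threading_energy(assignment, environments, contacts, singleton, pair,
                     n_gaps, alpha=1.0, beta=1.0, gap_penalty=1.0):
    """Total energy of one sequence-structure placement.

    assignment: {template position: amino acid placed there}.
    environments: {template position: structural environment label}.
    contacts: template position pairs already known to be within the cutoff.
    pair: {frozenset of amino acid types: energy} (assumed table layout).
    alpha, beta: scaling factors, normally tuned empirically (assumed values).
    """
    # Singleton term: fitness of each residue in its structural environment.
    e_s = sum(singleton[(aa, environments[pos])]
              for pos, aa in assignment.items())
    # Pairwise term: only contacts whose two positions are both occupied.
    e_p = sum(pair[frozenset((assignment[p], assignment[q]))]
              for p, q in contacts
              if p in assignment and q in assignment)
    # Gap term: a uniform penalty per gap, a simplifying assumption.
    e_gap = gap_penalty * n_gaps
    return e_s + alpha * e_p + beta * e_gap


singleton = {("L", "b"): -1.0, ("K", "e"): -0.5}
pair = {frozenset(("L",)): -0.3, frozenset(("L", "K")): 0.2}
total = threading_energy(
    assignment={0: "L", 1: "K", 2: "L"},
    environments={0: "b", 1: "e", 2: "b"},
    contacts=[(0, 2), (0, 1)],
    singleton=singleton, pair=pair,
    n_gaps=1, alpha=0.5, beta=0.1, gap_penalty=2.0,
)
```

evaluating a fixed placement is cheap; the difficulty discussed in the text lies in searching over all placements for the one that minimizes this sum.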
in a sense, protein threading is like sequence-sequence alignment, as it finds an alignment between a sequence of amino acids and a sequence of structural positions in a 3d structure. what makes threading more difficult than a sequence alignment problem is the pairwise interaction energy term $e_p$. for a sequence alignment problem, a simple dynamic programming algorithm is guaranteed to find the global optimal alignment between two sequences, as the problem formulation follows the principle of optimality (brassard and bratley, 1996). this type of simple dynamic programming algorithm does not work for a threading problem, as the global optimal threading alignment cannot be easily reduced to a small number of optimal threading alignments for the partial problems as in a sequence alignment problem. intuitively, a simple dynamic programming approach, like the one used for sequence alignment that goes from the starts of the sequences to their ends and extends partial optimal alignments to include more elements at each step, will not work for protein threading because at each current point we do not know what residues will be available to be assigned to future structural positions, which might have interactions with the previously aligned positions. such interactions will have a global impact on a sequence-structure alignment. it is this global nature of the problem that makes the threading problem more challenging from the algorithmic point of view. there have been a number of studies attempting to understand the "intrinsic" difficulty, or computational complexity, of the threading problem. under the assumption that all pairwise interactions need to be considered, the threading problem was proved to be an np-hard problem by a number of authors (lathrop, 1994; calland, 2003).
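for contrast with threading, the sequence alignment case mentioned above can be sketched in a few lines. this is a minimal global alignment (needleman-wunsch style) scorer; the scoring values `match`, `mismatch`, and `gap` are arbitrary illustrative assumptions. each cell depends only on its three already-computed neighbors, which is exactly the principle of optimality that fails once pairwise interactions with not-yet-assigned positions enter the score.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best global alignment score of sequences a and b (maximisation)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]

    # Aligning a prefix against the empty sequence costs one gap per symbol.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    for i in range(1, rows):
        for j in range(1, cols):
            substitution = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + substitution,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,               # gap in b
                score[i][j - 1] + gap,               # gap in a
            )
    return score[-1][-1]


identical = global_alignment_score("ACGT", "ACGT")
one_gap = global_alignment_score("ACGT", "AGT")
```

the table has (len(a) + 1) x (len(b) + 1) cells and each is filled in constant time, so the running time is quadratic in the sequence lengths.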
while these mathematical proofs provide some evidence that the problem is computationally difficult, they might not be particularly relevant to the true difficulty of a threading problem, as previous studies have shown that pairwise interactions beyond certain cutoff distances (e.g., 10-12 Å between c-beta atoms) do not contribute to fold recognition and threading alignment (lund et al., 1997; melo and feytmans, 1997; xu et al., 1998; zhang and skolnick, 2004), and hence need not be considered. as of now, it remains an open problem whether the threading problem, using a distance cutoff for pairwise interactions, is polynomial-time solvable. the challenging issue in studying the "intrinsic" computational complexity of a threading problem with a cutoff for pairwise interactions is that we do not have a good and realistic characterization of the overall interaction patterns for such a threading problem. because of the algorithmic challenges, earlier threading programs have employed various heuristic strategies for solving the optimal sequence-structure alignment problem. one particular strategy is called "frozen approximation" (westhead et al., 1995). the basic idea is that it uses a dynamic programming approach to find a sequence-structure alignment and uses an approximation scheme to calculate the interaction energy. when the algorithm assigns an amino acid to a specific structural position, proceeding from the beginning to the end of the query sequence during the dynamic programming procedure, it calculates the relevant interaction energy using the amino acids in the template structure rather than assigned amino acids from the query protein, for all the structural positions within a certain cutoff distance that are still unassigned at the current point of the dynamic programming procedure.
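the frozen approximation can be sketched as follows. this is a simplified illustration, not the implementation of any cited program: it takes the interaction partner at every neighboring position from the template's own residues (the scheme described above does so only for positions not yet assigned), which turns the pairwise term into a position-specific score so that ordinary dynamic programming applies. the names, table layouts, and the linear gap penalty are assumptions.

```python
def frozen_threading(query, template_residues, environments, neighbors,
                     singleton, pair, gap=1.0):
    """Minimum energy of a global query-to-template alignment.

    template_residues[p]: native amino acid at template position p.
    environments[p]: structural environment label of position p.
    neighbors[p]: template positions within the interaction cutoff of p.
    pair: {frozenset of amino acid types: energy} (assumed table layout).
    """
    def position_energy(aa, p):
        # Singleton term plus pair terms against *template* residues (frozen).
        energy = singleton[(aa, environments[p])]
        for q in neighbors[p]:
            energy += pair[frozenset((aa, template_residues[q]))]
        return energy

    n, m = len(query), len(template_residues)
    best = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        best[i][0] = i * gap
    for j in range(1, m + 1):
        best[0][j] = j * gap

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best[i][j] = min(
                best[i - 1][j - 1] + position_energy(query[i - 1], j - 1),
                best[i - 1][j] + gap,   # query residue left unaligned
                best[i][j - 1] + gap,   # template position skipped
            )
    return best[n][m]


singleton = {("L", "b"): -1.0, ("L", "e"): 1.0,
             ("K", "b"): 1.0, ("K", "e"): -1.0}
pair = {frozenset(("L", "K")): -0.5,
        frozenset(("L",)): 0.5, frozenset(("K",)): 0.5}
energy = frozen_threading("LK", "LK", ["b", "e"], {0: [1], 1: [0]},
                          singleton, pair)
```

because the partner residues are fixed in advance, the result is optimal only with respect to the approximated energy, not the true pairwise energy of the query placed on the template.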
intuitively, the algorithm should work to some degree in capturing some of the interaction "patterns" encoded in the query protein sequence, as some of the position-equivalent residues between the native structure and the native-like template structure should have similar physicochemical properties, suggesting the validity of the frozen approximation scheme. practical applications have also confirmed that the frozen approximation, while not guaranteeing global optimal threading alignment, does have an advantage over threading programs that do not consider pairwise interactions (westhead et al., 1995; skolnick and kihara, 2001; zhang et al., 1997). a number of rigorous threading algorithms have been developed that are guaranteed to find the global optimal threading alignments, measured in terms of energy functions that consider pairwise interactions. these include a divide-and-conquer algorithm employed in the prospect threading program and an integer programming algorithm used in the raptor program (xu et al., 2003a,b). it was convincingly demonstrated, through applications of these programs at the casp contests (xu et al., 2001; xu and li, 2003), that threading programs with guaranteed global optimality do have an advantage over programs without this property. one particular advantage of such programs is that they can be used to rigorously benchmark a proposed energy function. when using programs without such a guarantee to assess an energy function, it will be difficult to decide whether it is the energy function or the lack of rigor in the threading algorithm that has resulted in a subpar performance by a particular energy function. the following provides some detailed information about three types of threading algorithms. the basic idea of the divide-and-conquer algorithm in prospect can be outlined as follows.
the algorithm first divides the query protein sequence into two subsequences and also divides the template structure into two substructures by cutting at one of its loop regions. then it tries to find the globally optimal threading alignments between the first subsequence/substructure pair and between the second subsequence/substructure pair, respectively. when calculating pairwise interaction energy for each "half" of the sequence-structure alignment problem, we might need information about which amino acids are assigned to which structural positions in the other "half" of the problem. to facilitate this calculation, the algorithm uses a simple data structure for each structural position l in the current substructure that keeps a list of structural positions in the other substructure that are close enough to l to have interactions with the amino acid to be assigned to it. figure 12.2 depicts schematically the situation where each of the two substructures has a number of structural positions that are close enough to structural positions in the other substructure that their alignments with amino acids need to be considered when calculating the interaction energy in the other substructure. these structural positions can be considered as extended parts of the other substructure (shown as extended arms for each substructure in fig. 12.2). the difference between these extended parts and the original positions of a substructure is that when doing alignment between the substructure and the corresponding subsequence, we do not have any knowledge about which amino acids are assigned to these positions in the other substructure. hence, the optimal threading alignment for each substructure/subsequence pair depends on the optimal threading alignment for the other substructure/subsequence pair. this codependence relationship makes the problem challenging.
in prospect, this problem is overcome using the following strategy: consider all possible combinations of amino acids possibly assigned to these extended positions (in the other "half" of the problem), and then solve an optimal threading alignment problem for each substructure/subsequence pair under each possible combination of amino acid assignments to these extended structural positions. assume that we can solve the optimal threading problem for each pair of (extended) substructure/subsequence and for each combination of such an assignment. then it can be checked that the global optimal threading alignment for the whole structure and sequence must be the union of two optimal threading alignments for the two extended subproblems, under one specific combination of the amino acid assignment to the extended part of each subproblem. this realization lays the foundation for the divide-and-conquer algorithm of prospect, as it allows reducing a whole threading problem to two smaller threading problems. if we can solve the smaller threading problems, the whole threading problem can be solved by simply going through all combinations of the amino acid assignments to the extended parts for each subproblem to find the one that gives the overall best combined score. during the "conquer" step, the algorithm needs to make sure that the two optimal solutions to the subproblems that are to be combined are consistent with each other. to solve a smaller threading problem, we can apply the same divide-and-conquer strategy to reduce it to even smaller problems. this procedure can continue until the size of the problem is small enough that it can be solved using a brute-force exhaustive search strategy. the trick is how to make this algorithm run efficiently. note that if not done carefully, the number of possible combinations of assignments that need to be considered could be very large.
in prospect, (sub)structures are divided in a way that minimizes this number of combinations, by cutting at the "weakest" link with the fewest interactions between two substructures (xu et al., 1999). the overall computational complexity of the divide-and-conquer algorithm is dominantly determined by the thickest link among all weakest links throughout the bipartitioning of a protein structure into a series of small structures during the whole divide-and-conquer scheme. intriguingly, we found that this thickest link is generally a small number for the vast majority of the solved protein structures (xu et al., 1999), making the actual computing time of prospect threading practically acceptable. this observation has also raised an interesting question: "is there something special about the topology of protein structures, which could be further and more rigorously exploited for efficient threading algorithm development?"

12.4.2 raptor

raptor uses a more general framework than prospect to rigorously solve the threading problem (xu et al., 2003a,b). it formulates a threading problem as a linear integer programming (lip) problem. it uses an integer variable ("0" or "1") to represent whether a particular residue in the query protein sequence is assigned to a particular structural position. then the set of all feasible solutions to a threading problem can be defined in terms of a set of equalities or inequalities (called constraints), each of which is defined in terms of the above and other integer variables. the global optimal threading problem is then to find a feasible threading alignment that optimizes a given energy function. in general, a linear integer programming problem requires exponential computing time to find an optimal solution, and is hence intractable for large-size problems. branch-and-bound represents a popular technique for solving linear integer programming problems.
typically, an integer programming problem is first relaxed to a linear programming problem, i.e., variables are allowed to take real values. there are efficient algorithms for solving linear programming problems, as they are polynomial-time solvable (papadimitriou and christos, 1998). if by chance the solution to the relaxed linear programming problem is all integral, a solution to the original linear integer programming problem is found. otherwise the linear solution will be used to constrain the search space, by fixing one variable to "0" or "1", and the algorithm iterates this process until all variables in the solution have integral values. an interesting observation is that for the vast majority of threading problems, this relaxation procedure stops after a few iterations, solving the threading problem efficiently and also indicating that threading problems seem to have a special structure in terms of the integer programming formulation. such special characteristics could possibly be utilized for developing more efficient threading algorithms, using more specialized algorithmic techniques. one particularly interesting technique, which is being actively investigated by a number of researchers, is based on the idea of tree decomposition of an interaction graph representing possible alignments between a query sequence and a template structure (song et al., 2005; xu et al., 2005). in a sense this type of technique is a generalization of the divide-and-conquer approach outlined above. the tree decomposition technique has been widely used for various graph-related optimization problems, for example finding the maximum independent set and the dominating set (arnborg and proskurowski, 1989). we now provide a detailed description of one such algorithm for solving the protein threading problem.
using a tree decomposition algorithm, both the template structure and the query sequence are represented as graphs; vertices denote core secondary structures (or simply cores) and edges represent interactions between cores (two cores are considered to be in interaction if their shortest distance is within a predefined cutoff distance). a sequence-structure alignment problem essentially corresponds to finding an isomorphic mapping from the structure graph to a subgraph of the sequence graph. the efficiency of the alignment hinges on the tree width of the structure graph. intuitively, the tree width of a graph measures how much the graph is "treelike." a graph can be represented as a "tree having thick trunks," where the "trunk thickness" is quantified by the tree width of the graph. this technique of "treelike" representation for graphs is called tree decomposition. in a tree decomposition of a graph, vertices of the graph are grouped into possibly overlapping subsets, each of which is associated with a node in the tree. the maximum size of such a subset corresponds to the tree width of the tree decomposition. given a tree decomposition of a structure graph with tree width t, a dynamic programming algorithm can be employed to find the optimal sequence-structure alignment in time $O(k^t n^2)$, for some small integer parameter k and n being the number of amino acids in the template structure. the alignment algorithm is very efficient since the tree width for such structure graphs is small in general (by the nature of protein structures). for example, among 3890 protein tertiary structure templates compiled using pisces (wang and dunbrack, 2003), only 0.8% have tree width t > 10 and 92% have t < 6, when using a 7.5-Å c-beta to c-beta distance cutoff for defining pairwise interactions [see fig. 12.5(a)]. we now provide the details of a tree-decomposition-based threading algorithm.
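the two notions introduced above, a structure graph built from a distance cutoff and a tree decomposition of that graph, can be sketched as follows. the coordinate layout, function names, and bag numbering are hypothetical. the sketch only checks a given decomposition; it does not search for a good one. it also assumes that `tree_edges` already forms a tree, and it follows the standard graph-theoretic convention in which the width is the largest bag size minus one.

```python
import math
from itertools import combinations


def structure_graph(core_coords, cutoff):
    """Edges between cores whose shortest atom-to-atom distance is <= cutoff.

    core_coords: {core name: list of (x, y, z) points} (assumed layout).
    """
    edges = set()
    for u, v in combinations(sorted(core_coords), 2):
        shortest = min(math.dist(a, b)
                       for a in core_coords[u] for b in core_coords[v])
        if shortest <= cutoff:
            edges.add((u, v))
    return edges


def is_tree_decomposition(vertices, edges, bags, tree_edges):
    """Check the three defining properties; tree_edges is assumed to be a tree."""
    adjacency = {bag_id: set() for bag_id in bags}
    for a, b in tree_edges:
        adjacency[a].add(b)
        adjacency[b].add(a)

    # 1. Every vertex appears in at least one bag.
    if any(not any(v in bag for bag in bags.values()) for v in vertices):
        return False
    # 2. Both endpoints of every edge share at least one bag.
    if any(not any(u in bag and v in bag for bag in bags.values())
           for u, v in edges):
        return False
    # 3. The bags containing a given vertex form a connected subtree.
    for v in vertices:
        holding = {bag_id for bag_id, bag in bags.items() if v in bag}
        start = next(iter(holding))
        seen, stack = {start}, [start]
        while stack:
            node = stack.pop()
            for nxt in adjacency[node] & holding:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        if seen != holding:
            return False
    return True


def tree_width(bags):
    """Width of a decomposition: largest bag size minus one."""
    return max(len(bag) for bag in bags.values()) - 1


cores = {"c1": [(0, 0, 0)], "c2": [(3, 0, 0)],
         "c3": [(6, 0, 0)], "c4": [(6, 3, 0)]}
edges = structure_graph(cores, cutoff=4.0)
bags = {0: {"c1", "c2"}, 1: {"c2", "c3"}, 2: {"c3", "c4"}}
valid = is_tree_decomposition(cores, edges, bags, [(0, 1), (1, 2)])
width = tree_width(bags)
```

in this toy example the four cores form a simple chain, so a decomposition with bags of two cores each suffices.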
a sequence-structure alignment can be formulated as a generalized subgraph isomorphism optimization problem, for which both the template structure and the query sequence are represented as mixed graphs that contain both directed and undirected edges. we use v(g), e(g), and a(g) to denote the vertex set, the undirected edge set, and the directed edge (arc) set of a mixed graph g, respectively. the graph h for the template structure is constructed as follows: each vertex in v(h) represents a core, each undirected edge in e(h) represents the interaction between two cores, and each directed edge (arc) in a(h) represents the loop between two consecutive cores (from the n-terminal to the c-terminal). for technical convenience, both n- and c-terminals are represented as vertices. figure 12.3 gives a protein tertiary structure and its corresponding structure graph representation. a query sequence is preprocessed so that for each core in the template, all substrings (called candidates) of the query sequence that align well with the core are identified (xu et al., 2000). by representing each candidate as a vertex, a query sequence can also be represented as a mixed graph g. that is, each edge in e(g) connects a pair of candidates that may possibly interact but do not overlap in the sequence, and each arc in a(g) connects two candidates (from the n-terminal to the c-terminal) that do not overlap. as in a structure graph, both n- and c-terminals are represented as vertices in the sequence graph. figure 12.4 illustrates the sequence graph with a simple example. the relationship between a core v in the template and its candidates in the query sequence can be constrained using a mapping function m such that m(v) contains all possible candidates of v. the less restricted m is, the more accurate the alignment is expected to be and the more time it may take to compute. xu et al.
(1998) used a similar approach in their divide-and-conquer threading algorithm, which can find all suitable candidates for each core. the maximum size $k = \max_v |m(v)|$ over all cores v is called the map width of m, an important parameter for the alignment algorithm. now a sequence-structure alignment problem can be formulated as a problem of finding an isomorphism mapping f between the structure graph h and a subgraph of the sequence graph g such that the following sum of the alignment energy functions $$\sum_{u \in v(h)} e_{core}(u, f(u)) + \sum_{(u, v) \in e(h)} e_{pair}(f(u), f(v)) + \sum_{\langle u, v \rangle \in a(h)} e_{loop}(f(u), f(v))$$ achieves the minimum, where $e_{core}$ is the alignment energy score (singleton type of energy) between a core u in the template and its candidate f(u) in the sequence, $e_{pair}$ represents interaction energy (pairwise interaction energy) between residues (f(u), f(v)) assigned to cores (u, v), and $e_{loop}$ is the alignment score between the loop $\langle u, v \rangle$ in the template and the corresponding 1, especially when e > 1.2, there was a significant energy gap between the optimal alignment and the decoy alignments. hence, this quality could be used as a measure for assessing the significance of a threading alignment. a good method for assessing the statistical significance of a threading score should not only allow comparing threading results on the same footing but also provide a way to indicate whether a particular fold is possibly the correct fold for a query sequence, without using other reference information. a typical threading program consists of four key components from the implementation perspective: (a) a database of template protein structures, (b) an energy function, often residue-based, (c) a threading algorithm that can find the optimal threading alignment between a query sequence and a template structure, and (d) a method for calculating "significance" scores of threading alignments. from the prediction accuracy point of view, the larger a template structure database is, the more accurate we can expect the threading prediction to be.
however, it might not always be realistic to use the whole pdb database as the template database due to the amount of time required to thread a query sequence against each pdb structure. often a template structure database consists of a representative subset of all the structures in pdb, say pdb-select, which consists of pdb structures with the "redundant" structures removed. here "redundant" refers to structures that have high sequence similarities to other structures in the database. to make prediction more accurate, some threading programs employ a two-stage strategy: (a) thread a query sequence against a representative structure database to identify a few possible native-like structures, and (b) thread the query sequence against all family/superfamily members of the structures identified in (a). certain preprocessing of the template database might be needed for some threading programs. for example, as we discussed in section 12.4, protein structures need to be represented as structure graphs, as required by the threading algorithm. the majority of the current threading programs employ the types of energy functions outlined in section 12.3 or their variations. these energy functions are statistics based, rather than physics based. they are used to distinguish correct structural folds from the incorrect ones and to distinguish accurate alignments against the correct structural folds from inaccurate alignments. threading programs have been using these simple energy forms mainly because of the consideration of computational efficiency, and also partly because only limited structural data were available for more sophisticated energy forms when such statistics-based energy functions were first developed. as those simple energies began to reach their limits, we began to see more physics-based energy functions developed, as we discussed in section 12.3. we expect that as the threading algorithms become more efficient, we will see more physics-based energy functions.
we expect that one particular type of extension to the existing energy functions is to consider multibody interactions, which have been mostly ignored by existing threading programs. recent studies have shown that multibody interactions could help to improve the performance of threading programs (munson and singh, 1997; li and liang, 2005), and hence should be considered. existing threading programs employ various algorithmic techniques for solving the sequence-structure alignment problem, including dynamic programming with enhanced heuristics (westhead et al., 1995; skolnick and kihara, 2001; zhang et al., 1997), divide-and-conquer algorithm , and integer programming (xu and li, 2003; xu et al., 2003a,b). in this chapter, we presented a new class of threading algorithm based on a tree decomposition of sequence and structure graphs. while integer programming might represent the most general framework for handling sequence-structure alignments, particularly so for threading problems considering multibody interactions, tree-decomposition-based algorithms could prove to be more popular down the road because of their conceptual simplicity and computational efficiency. we expect that a class of more general threading algorithms will begin to emerge to deal with more complex threading problems as the existing threading algorithms become faster and faster. this general class of threading algorithms should be able to handle simultaneous backbone threading and side-chain packing problems, leading to significantly more accurate capabilities in fold recognition and protein structure prediction. existing threading programs use various ideas and techniques to assess the "significance" of threading results. these methods include z-score calculation (sommer et al., 2002), normalized threading scores using techniques such as support vector machines or neural networks (xu et al., 2002; ding and dubchak, 2001), and (panchenko et al., 2000; bryant and altschul, 1995).
while useful to some degree, none of these methods has reached the level of performance comparable to p-value calculations for blast sequence alignments (altschul and gish, 1996). this is possibly due to a combination of the inadequacy of the existing threading energy functions for accurate threading prediction and the lack of general understanding about distinguishing characteristics between correct and incorrect native folds and between correct and incorrect placements of amino acids into structural positions. overall, compared to other areas of protein threading, this is a somewhat underdeveloped area. new ideas and techniques are clearly needed to fill the holes in this area. because of the importance of solving protein structures for functional studies and the power of threading techniques, many protein threading programs have been developed. using these programs, a large number of protein structures have been predicted prior to the solution of their experimental structures, providing highly useful information for guiding experimental design in the investigation of these proteins. examples of such predictions include an obese gene (madej et al., 1995), vitronectin (xu et al., 2001), and a sars protein (von grotthuss et al., 2003). table 12.3 provides a list of popular threading programs and urls for accessing these prediction tools. we now summarize the highlights of some of these threading programs, which use different energy functions and different computational techniques, each of which has its strengths and limitations. prospect (kim et al., 2003): the prospect program employs a divide-and-conquer algorithm for rigorously solving the global optimal threading problem, with a somewhat standard threading energy function that includes a singleton energy term and a pairwise interaction energy term plus a secondary structure fitness score and a gap penalty score.
for a typical threading problem, it can find the best alignments against a template structure database of 2000+ structures within a couple of days on a single cpu, and it achieves a virtually linear speed-up using multiple cpus when the number of cpus is smaller than the number of structures in the template structure database. it achieves its computational efficiency by taking advantage of the fact that protein structures generally have small topological complexities (xu et al., 1998) and by using a filtering procedure to filter out "improbable" alignment positions for each core secondary structure in the template structure. while this heuristic filtering works well for the vast majority of the threading cases, it might filter out the correct alignment positions in some cases. prospect normalizes the threading scores, along with various parameters of the template structure and the query sequence, using a support vector machine. then a z-score is calculated based on the "normalized" threading scores. raptor (xu and li, 2003; xu et al., 2003a,b): the raptor program formulates a threading problem as a linear integer programming problem, and solves the problem using a branch-and-bound method plus a standard integer programming solver. raptor employs the same energy functions as prospect and uses a similar approach for assessing the "statistical significance" of threading results. a unique feature of the program is that its threading algorithm is more rigorous than that of prospect, as it does not use a heuristic filter to filter out "improbable" alignment positions. for a typical threading problem, it takes minutes to hours to thread the query sequence against 2000 structures in its structure database. since the program is data-parallelizable, its speed-up is virtually linear when running on multiprocessor computers.
genthreader (jones, 1999b) and its improved version mgenthreader (mcguffin and jones, 2003) use psi-blast profiles (altschul et al., 1997) and secondary structures predicted by psipred (jones, 1999a) for threading. genthreader employs a double dynamic programming strategy (jones et al., 1992) in its threading program. the algorithm does not treat pairwise interactions rigorously, but its performance has been among the best of the threading programs, indicating the effectiveness of this strategy. a web server for this program has been set up at http://bioinf.cs.ucl.ac.uk/psipred/. a user can generally expect a threading prediction to be returned in minutes. the prediction program runs fast enough that it can be used for genome-scale protein structure predictions.
prospector (skolnick and kihara, 2001) recognizes native-like structural folds using a hierarchical strategy for obtaining sequence profiles. it uses two types of sequence profiles, one type derived using close homologous sequences whose sequence identity lies between 35% and 90%, and another type constructed using more remote homologous sequences with a fasta e-value less than 10. both types of sequence profiles are incorporated into a typical threading energy as described in section 12.3 to screen a structural database. the program uses a dynamic programming algorithm to find the best threading alignment, and employs z-scores for assessing the significance of each threading alignment.
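the dynamic programming step mentioned above can be sketched as a standard global alignment between query residues and template positions. the sketch assumes a precomputed table `score[i][j]` (higher is better) for placing query residue i at template position j, and a linear gap penalty; it ignores pairwise interaction terms and is not the algorithm of any specific program named here. the names `thread_dp`, `score`, and `gap` are hypothetical.

```python
def thread_dp(score, gap=-1.0):
    """Best global alignment score of a query against a template.

    score[i][j]: assumed fitness of query residue i at template position j.
    gap: assumed linear penalty (negative) for skipping one residue
         or one template position.
    """
    n = len(score)
    m = len(score[0]) if n else 0
    # best[i][j] = best score aligning the first i residues
    # with the first j template positions.
    best = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        best[i][0] = i * gap
    for j in range(1, m + 1):
        best[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best[i][j] = max(
                best[i - 1][j - 1] + score[i - 1][j - 1],  # place i at j
                best[i - 1][j] + gap,                      # skip residue i
                best[i][j - 1] + gap,                      # skip position j
            )
    return best[n][m]


if __name__ == "__main__":
    # Hypothetical 2x2 fitness table: the diagonal placement is favoured.
    print(thread_dp([[2.0, -1.0], [-1.0, 2.0]]))
```

because each cell depends only on three neighbours, the run time grows with the product of query length and template length, which is why programs built on this scheme are fast enough for genome-scale use.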
fugue (shi et al., 2001) uses a typical threading energy function as described in section 12.3, with some unique features: (1) its structural environment singleton term includes a term for hydrogen bonding status, and the singleton term was derived from structural alignments in the homstrad database (de bakker et al., 2001; http://www-cryst.bioc.cam.ac.uk/homstrad/); (2) its gap penalties are structure-dependent, based on solvent accessibility, the position of the gap relative to the secondary structure elements, and the conservation of the secondary structure elements; and (3) its alignment is based on multiple sequences against multiple structures to enrich the conservation/variation information. fugue uses dynamic programming as its threading algorithm and employs z-scores to assess the statistical significance of a threading result.
since each of these threading programs has its own strengths and limitations, a popular strategy for predicting a protein structure is to use multiple prediction programs and combine their prediction results. further discussion on this topic is given in chapter 17.
we now use prospect as an example to illustrate how to use a threading program for predicting structures of sars-cov proteins (wan et al., 2005), which play a role in the development of the sars disease. we used the prospect pipeline to survey all of the 11 open reading frames (orfs) in sars-cov strain urbani (genbank id: 30027617), one of the first sars-cov genomes. among the 11 orfs, the s and m proteins play a key role in the virus infection process. interestingly, both the m protein and the s2 domain in the s protein are predicted to adopt the fold of an ig-like beta sandwich. the structural similarity suggests that the s2 domain and the m protein may be evolutionarily related through gene fusion and duplication, although their sequences do not have significant similarity after a long period of evolution.
the threading results might explain how the m protein interacts with the s2 domain for the virus assembly: since the s2 domain with the fold of an ig-like beta sandwich can interact with the s1 domain, the m protein with the same fold could possibly interact with the s1 domain. this suggests that the s1 domain may act as an on/off switch between the s2 domain and the m protein. such a mechanism may suggest that the m protein could also be involved in the virus-host cell interaction. this hypothesis was supported by a recent study of the murine hepatitis coronavirus, which showed that glycosylation of the m protein affected the interferogenic capacity of the virus (de haan et al., 2004).
threading programs have been used for genome-scale applications. a recent study (guo et al., 2004) performed structure prediction for all of the orfs of pyrococcus furiosus, which is found in the marine sand surrounding sulfurous volcanoes and can grow at temperatures above 100 °c. the microbe utilizes peptides, proteins, and some carbohydrates as carbon sources. its entire genome is about 2 mb in length with 2195 annotated orfs. out of a total of 2195 orfs, 540 are predicted to be membrane proteins, and structures can be predicted with high confidence for 753 proteins, among which 190 orfs cannot be detected using psi-blast.
recent prediction results in casps indicate that even when the correct structural folds are identified, the threading alignments are often inaccurate. from the same statistics, we found that the overall alignment accuracy has not improved over the past few casps. in addition, the alignment accuracy of the best casp models using templates with <30% sequence identity ranges from 60% to 90% (venclovas et al., 2003). all these statistics suggest that there is significant room for improvement in threading alignments.
the prediction accuracy of the current threading programs is mainly limited by the inaccuracy of statistics-based and residue-based threading energy functions. while improving the threading energy functions represents one direction to take for improving threading performance, other approaches may also help, which include (a) application of partial experimental data as constraints in the threading process and (b) refinement of threaded structures using molecular dynamics and energy minimization. we refer the reader to chapter 11 for related discussions.
often partial structural data is available for specific proteins before the determination of the detailed structure of a protein. these partial structural data might be in the form of (a) residue-residue distances such as disulfides between specific cysteines, (b) specific residues involved in particular active sites, binding sites, or other functionally important sites, (c) specific residues known to be on the surface of a protein structure, or any other information providing geometric information about specific residues in a protein structure. in addition to such information about specific residues, there are experimental techniques that can be used to generate geometric information in a systematic manner. nmr represents one such technique. nmr methods solve a protein structure by generating either distance restraints [called noe, or nuclear overhauser effect, distances (prestegard, 1998)] between different residues (or more specifically different atoms) or orientations of certain chemical bonds in a protein, called residual dipolar couplings (tolman et al., 1995). a protein structure is then solved by finding structural models that are consistent with the collected geometric constraints and have their energy minimized. accurately solving a protein structure using nmr typically requires 15-20 distance restraints per residue (clore et al., 1993), which will require multiple nmr experiments.
partial distance restraints could possibly be collected using fewer nmr experiments, possibly involving labeling of specific amino acid types. the same can be said about orientation information collected through residual dipolar coupling experiments. while these partial nmr data might not be sufficient for solving a protein structure accurately, they provide highly useful constraints for protein fold recognition and backbone structure prediction by a threading method.
chemical cross-linking experiments provide another systematic approach to generating partial structural information for a protein. in such experiments, chemical cross-linkers, with customized arm lengths, are designed to link specific types of amino acids within a certain distance range (cohen and sternberg, 1980). such experiments, followed by tandem mass spectrometry experiments and data interpretation, could provide distance information between certain amino acids. such an approach has been used for structural data collection for both soluble and membrane proteins (yan et al., 2005), and the resulting data have then been used for protein structure prediction (young et al., 2000).
one way to use such structural information, such as distances or structural locations, in a threading program is to add an energy term to the threading energy function, which measures the consistency between the collected structural data and a threading alignment. for example, if residues a and b are known to be within a certain distance, then an energy term could be specifically designed to penalize threading alignments which violate this particular geometric constraint. the energy term could be designed so that the bigger the violation, the larger the penalty.
similarly, if a residue x is known to be on the surface of a protein structure, an energy term could be designed so that it penalizes threading alignments that do not put x into a surface position in the template structure, with the amount of penalty reflecting the degree of violation. to deal with all geometric constraints, we can design a new energy term e_g, which is the sum of the individual penalty functions for all the specific geometric constraints. the overall scaling factor for this new energy term in a threading energy function could be empirically determined based on a training data set for which actual partial experimental data is available. the effectiveness of applying such partial structural data in a threading program has been documented in a number of studies (xu et al., 2000b,c; young et al., 2000; qu et al., 2004a,b). table 12.4 shows a systematic study on improving the threading performance by incrementally increasing the number of nmr/noe distance constraints used in a threading process. it is seen that distance constraints help both fold recognition and threading alignment accuracy.
[table note: "query" and "temp" represent the pdb codes of the query and template proteins, respectively. "rmsd/rank vs. percentage of assigned ss" are the ca-rmsd between the experimental structure and the structure predicted using modeller based on threading alignments, for the alignable portions in the structure-structure alignment between the query protein and the template, and the rank of the correct template structure among 667 templates. 0%, 20%, 40%, ..., 100% represent the percentage of residues with secondary structure assignment. the highlighted numbers show improvement between using no secondary structure information and using full secondary structure assignments.]
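a geometric penalty term of the kind described above can be sketched in a few lines. the sketch assumes that a threading alignment has already assigned each query residue a template coordinate and a relative solvent accessibility; the quadratic distance penalty, the linear surface penalty, the 0.25 accessibility threshold, and all function names are illustrative assumptions rather than the form used in any published threading program.

```python
import math


def distance_penalty(coords, restraints):
    """Sum of squared violations of upper-bound distance restraints.

    coords: dict residue -> (x, y, z) taken from the aligned template position.
    restraints: iterable of (res_a, res_b, max_dist) tuples, e.g. from
                NOE or cross-linking data (hypothetical input format).
    """
    total = 0.0
    for a, b, max_dist in restraints:
        violation = math.dist(coords[a], coords[b]) - max_dist
        if violation > 0:
            total += violation ** 2  # bigger violation, larger penalty
    return total


def surface_penalty(accessibility, surface_residues, min_access=0.25):
    """Penalty for known surface residues placed at buried positions.

    accessibility: dict residue -> relative solvent accessibility (0..1)
                   of the template position it is aligned to.
    min_access: assumed threshold below which a position counts as buried.
    """
    return sum(max(0.0, min_access - accessibility[r]) for r in surface_residues)


def constraint_energy(coords, restraints, accessibility, surface_residues):
    """Combined geometric term: sum of the individual penalty functions."""
    return (distance_penalty(coords, restraints)
            + surface_penalty(accessibility, surface_residues))


if __name__ == "__main__":
    coords = {1: (0.0, 0.0, 0.0), 2: (3.0, 4.0, 0.0)}  # 5 units apart
    print(constraint_energy(coords, [(1, 2, 4.0)], {1: 0.05, 2: 0.5}, [1, 2]))
```

in a full energy function this term would be multiplied by a scaling factor fitted on training data, as the text notes, before being added to the singleton, pairwise, and gap terms.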
in the best scenario, a threading program can provide an accurate prediction of the backbone atoms of a protein structure, which is still a long way from having a detailed all-atom structure. in the most general situation, a threading program could provide a somewhat accurate structure for the backbone atoms in the core secondary structures, while predictions for the loop regions are often not accurate. the reason is that threading predicts a structure based on a known template structure. while the core secondary structures among homologous proteins are generally well conserved, loops are often not. hence, template-based loop predictions are generally not accurate. fortunately, existing methods for short loop prediction (<14 residues) have reached a level at which the predicted loops could be as accurate as the predicted core structure. for example, the recent work by jacobson et al. (2004) has achieved prediction accuracy of 0.43 å for 5-residue loops, 0.84 å for 8-residue loops, and 1.63 å for 11-residue loops, using an accurate all-atom energy function and a hierarchical refinement protocol.
potentially the predicted full backbone structure, after adding loop structures, could be refined using an energy-based approach. to do this, one needs to put all the atoms, backbone and side chains, into a structural model. one can use alignments with the selected templates in fold recognition to produce a 3d atomic model through homology modeling tools, such as modeller (sali and blundell, 1990), which runs a protocol of energy minimization and molecular dynamics simulation to refine a structural model. after a structure model is generated, one can apply structure assessment tools such as whatif (vriend, 1990) and procheck (laskowski et al., 1993) to evaluate the packing and backbone conformations, the inside/outside occupancies of hydrophobic and hydrophilic residues, and the stereochemical quality of a predicted structure.
based on this assessment, a user can pick the best among the multiple structures derived from an alignment.
while widely used, the potential of protein threading as a protein structure and function prediction technique is far from being fully realized. there are a number of factors that have limited its wider range of applications. first, fold recognition for structural analogues and some remote homologues is still challenging (kinch et al., 2003; sippl et al., 2001). such proteins might account for about 40% of all proteins encoded in a typical genome according to our studies (xu et al., 2003; guo et al., 2004). these structures are theoretically modelable using comparative modeling techniques such as protein threading, but the predictions typically have a low confidence level and the results may be wrong. second, even when a correct fold is identified, the accuracy of threading alignment has been about 60-90% for proteins with less than 30% sequence identity with their template structures (venclovas et al., 2003). novel ideas and new techniques are clearly needed now to make a major jump in improving the prediction capability of the existing threading methods; this has become quite clear based on the slow and incremental improvements in threading performance in the past few casp contests (venclovas et al., 2003).
the current energy functions are generally coarse-grained, mainly to achieve fast predictions. given the significant advances in computer hardware and algorithm development, it may be time to use more sophisticated energy functions. for example, multibody interactions and more physical energy functions may help improve the threading prediction accuracy. although many theoretical studies have been carried out for threading algorithms, there is still significant room for further improving the computational efficiency of threading programs.
better search methods against structural templates using advanced database techniques have not been explored thoroughly. more work can be done at the implementation level in a similar way to the implementation of blast, where many low-level operations were implemented in a highly efficient manner. in addition, algorithmic development needs to address new types of energy functions, such as new threading algorithms that could handle simultaneous backbone threading and side-chain packing to fully take advantage of more detailed energy function forms. threading algorithms that could handle multibody interactions and energy functions capturing more global properties (e.g., compactness) of proteins are clearly underdeveloped.
existing confidence assessments are either too time-consuming in computation or not sufficiently accurate. more rigorous and faster assessment techniques for threading are clearly needed to achieve performance comparable to that of blast. assessments of different alignments using the same fold have essentially not been studied. furthermore, identification of "reliable" versus "unreliable" parts of a threaded structure, and quantitative assessment of the structural deviations in terms of rmsd for regions of predicted structures, have not been achieved.
it has been found that using multiple fold recognition programs to build a consensus of structural templates is an effective way to increase the prediction accuracy (lundstrom et al., 2001). furthermore, one can thread subdomain structures and use these substructures from different templates to build a new structure through a shotgun approach (fischer, 2000, 2003). currently, a consensus is built by a simple majority-vote scheme. more rigorous statistical methods could make consensus building more scientifically sound. in addition, how to piece different substructures together to form a global protein structure is another challenging issue.
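the simple majority-vote scheme mentioned above can be sketched as follows. the sketch assumes each fold recognition program reports a single top-ranked template identifier; the program names, template codes, and the function name `consensus_template` are hypothetical, and ties are broken by first appearance, which is an arbitrary choice.

```python
from collections import Counter


def consensus_template(predictions):
    """Pick the template chosen by the most programs.

    predictions: dict program name -> top-ranked template identifier.
    Returns (template, votes). Ties go to the template seen first,
    because Counter keeps insertion order among equal counts.
    """
    if not predictions:
        raise ValueError("no predictions to combine")
    votes = Counter(predictions.values())
    template, count = votes.most_common(1)[0]
    return template, count


if __name__ == "__main__":
    # Hypothetical top hits from three threading programs.
    top_hits = {"program_a": "1abc", "program_b": "1abc", "program_c": "2xyz"}
    print(consensus_template(top_hits))
```

a more statistically grounded variant would weight each vote by the confidence score of the program that cast it, which is the kind of refinement the text calls for.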
further discussion on consensus building and subdomain threading can be found in chapter 17.
as a structure prediction technique, threading potentially applies to at least 80% of all protein families. however, the application of threading to membrane proteins has been very limited due to the lack of available structural templates. threading techniques have been widely used for various purposes in biological studies, including (a) functional studies of proteins and experimental design (e.g., targeted mutagenesis) (madej et al., 1995; xu et al., 2001; von grotthuss et al., 2003), (b) genome annotation (xu et al., 2003; mcguffin et al., 2004), (c) helping solve experimental structures (ye et al., 2004), (d) modeling protein complex structures (lu et al., 2003), (e) prediction of misfolded structures (see chapter 9), and (f) protein design (sorenson and head-gordon, 1999). to further increase the utility of threading techniques, so that genome-scale protein structure prediction can keep up with the rate of genome sequencing and gene prediction, we clearly need a new generation of threading energy functions, threading algorithms, methods for assessing the statistical significance of threading results, and methods for refining threaded structures.
there are several comprehensive reviews and books on various aspects of threading. recent reviews related to threading include fetrow et al. (2002) and godzik (2003). a number of books also provide some general coverage of threading and protein structure prediction (tsigelny, 2002; jiang et al., 2002; bourne and weissig, 2003). for scoring functions, readers can find more information in chapters 2 and 3 of this book. for more information about general protein structure and structure-function relationships, we recommend branden and tooze (1999) and lesk (2001).
pdp: protein domain parser local alignment statistics gapped blast and psi-blast: a new generation of protein database search programs scop database in 2004: refinements integrate structure and sequence family data domain combinations in archaeal, eubacterial and eukaryotic proteomes an insight into domain combinations linear time algorithms for np-hard problems restricted to partial k-tree the universal protein resource (uniprot) protein structure prediction and structural genomics a strategy for the rapid multiple alignment of protein sequences. confidence levels from tertiary structure comparisons flexible protein sequence patterns. a sensitive method to detect weak structural similarities a linear time algorithm for finding tree-decompositions of small treewidth structural bioinformatics a method to identify protein sequences that fold into a known three-dimensional structure introduction to protein structure fundamentals of algorithmics understanding protein structure: using scop for fold interpretation statistics of sequence-structure threading on the structural complexity ofa protein fold recognition with minimal gaps exploring the limits of precision and accuracy of protein structures determined by nuclear magnetic resonance spectroscopy on the use of chemically derived distance constraints in the prediction ofprotein structure with myoglobin as an example a unifold, mesofold, and superfold model of protein fold use homstrad: adding sequence information to structure-based alignments of homologous protein families cleavage inhibition ofthe murine coronavirus spike protein by a furin-like enzyme affects cell-eell but not virus-eell fusion smog: de novo design method based on simple, fast, and accurate free energy estimates .1. 
methodology and supporting evidence identification of homology in protein structure classification multi-class protein fold recognition using support vector machines and neural networks the multiplicity of domains in proteins large macromolecular complexes in the protein data bank: a status report multi-domain proteins in the three kingdoms of life: orphan domains and other unassigned regions a study of combined structure/sequence profiles the protein folding problem: a biophysical enigma why do globular proteins fit the limited set of folding patterns? hybrid fold recognition: combining sequence derived properties with evolutionary information 3d-shotgun: a novel, cooperative, fold-recognition metapredictor protein fold recognition using sequence-derived predictions assessing the performance of fold recognition methods by means of a comprehensive benchmark assigning amino acid sequences to 3-dimensional protein folds planar graph decomposition and all pairs shortest paths structural genomics: bioinformatics in the driver's seat prediction of transcription regulatory sites in archaea by a comparative genomic approach can sequence determine function? a structural census ofgenomes: comparing bacterial, eukaryotic, and archaeal genomes in terms of protein structure how representative are the known structures of the proteins in a complete genome? 
a comprehensive structural census comparing genomes in terms ofprotein structure: surveys of a finite parts list fold recognition methods prospect-pspp: an automatic computational pipeline for protein structure prediction selection of representative protein data sets mapping the protein universe the fssp database: fold classification based on structure-structure alignment of proteins a hierarchical approach to all-atom protein loop prediction current topics in computational molecular biology protein secondary structure prediction based on position-specific scoring matrices genthreader: an efficient, and reliable protein fold recognition method for genomic sequences a new approach to protein fold recognition prospect ii: protein structure prediction program for genome-scale applications casp5 assessment of fold recognition target predictions the structure of the protein universe and genome evolution procheck: a program to check the stereochemical quality of protein structures the protein threading problem with sequence amino acid interaction preferences is np-complete introduction to proteinarchitecture: the structuralbiology ofproteins a unified statistical framework for sequence comparison and structure comparison emergence of preferred structures in a simple model of protein folding are protein folds atypical? 
designability of protein structures: a lattice-model study using the miyazawa-jernigan matrix a distance-dependent atomic knowledge-based potential for improved protein structure selection geometric cooperativity and anti-cooperativity of threebody interactions in native proteins multimeric threading-based prediction of protein-protein interactions on a genomic scale: application to the saccharomyces cerevisiae proteome protein distance constraints predicted by neural networks and probability density functions peons: a neuralnetwork-based consensus predictor that improves fold recognition threading analysis suggests that the obese gene product may be a helical cytokine comparative genomics ofthe archaea (euryarchaeota): evolution of conserved protein families, the stable core, and the variable shell how many species are there on earth improvement ofthe genthreader method for genomic fold recognition protein structure prediction by protein threading the genomic threading database: a comprehensive resource for structural annotations of the genomes from key organisms novel knowledge-based mean force potential at atomic level statistical significance of protein structure prediction by threading statistical significance of hierarchical multibody potentials based on delaunay tessellation and their application in sequence-structure alignment scop: a structural classification of proteins database for the investigation of sequences and structures protein superfamilies and domain superfolds cath-a hierarchic classification of protein domain structures a local alignment method for protein structure motifs threading with explicit models for evolutionary conservation ofstructure and sequence combination ofthreading potentials and sequence profiles improves fold recognition combinatorial optimization: algorithms and complexity new techniques in structural nmr-anisotropic interactions protein fold recognition through application of residual dipolar coupling data protein structure 
prediction using sparse dipolar coupling data the anatomy and taxonomy ofprotein structure graph minors .2. algorithmic aspects of tree-width protein fold recognition by predictionbased threading definition of general topological equivalence in protein structures. a procedure involving comparison of properties and relationships through simulated annealing and dynamic programming an all-atom distance-dependent conditional probability discriminatory function for protein structure prediction fugue: sequence-structure homology recognition using environment-specific substitution tables and structuredependent gap penalties calculation ofconformational ensembles from potentials ofmean force. an approach to the knowledge-based prediction of local structures in globular proteins assessment of the casp4 fold recognition category structural genomics and its importance for gene function analysis defrosting the frozen approximation: prospec-tor: a new approach to threading confidence measures for protein fold recognition tree decomposition based protein threading redesigning the hydrophobic core of a model beta-sheet protein: destabilizing traps through a threading approach the cog database: a tool for genome-scale analysis of protein functions and evolution protein structure alignment nuclear magnetic dipole interactions in field-oriented proteins: information for structure determination in solution protein structure prediction: bioinformatic approach assessment of progress over the casp experiments mrna cap-l methyltransferase in the sars genome what if: a molecular modelling and drug design program application of computational biology in understanding emerging infectious diseases: inferring the biological function for s-m complex ofsars-co~in progress in bioinformatics pisces: a protein sequence culling server how many fold types of protein are there in nature protein fold recognition by threading: comparison of algorithms and analysis of results nucleation, rapid folding, 
and globular intrachain regions in proteins model for the three-dimensional structure ofvitroneetin: predictions for the multi-domain protein from threading and docking characterization of protein structure and function at genome scale with a computational prediction pipeline sequence-structure specificity of a knowledge based energy function at the secondary structure level a tree decomposition approach to protein structure prediction assessment of raptor's linear programming approach in cafasp3 raptor: optimal protein threading by linear programming. 1 bioinform protein threading by linear programming a polynomial-time algorithm for a class of protein threading problems protein threading using prospect: design and evaluation a computational method for nmr-constrained protein threading protein threading by prospect: a prediction experiment in casp3 protein structure determination using protein threading and sparse nmr data protein domain decomposition using a graph-theoretic approach a practical method for interpretation ofthreading scores: an application of neural network an efficient computational method for globally optimal threading a graph-theoretic approach for the separation ofb and y ions in tandem mass spectra probabilistic cross-link analysis and experiment planning for high-throughput elucidation of protein structure high throughput protein fold identification by using experimental constraints derived from intramolecular cross-links and mass spectrometry similarities and differences between nonhomologous proteins with similar folds: evaluation of threading strategies estimating the number ofprotein folds scoring function for automated assessment of protein structure template quality distance-scaled, finite ideal-gas reference state improves structure-derived potentials of mean force for structure selection and stability prediction fold recognition by combining sequence profiles derived from evolution and from depth-dependent structural alignment 
of fragments
this research was sponsored in part by the u.s. department of energy's genomes to life program (www.doegenomestolife.org) under project "carbon sequestration in synechococcus sp.: from molecular machines to hierarchical modeling" (www.genomes2life.org). yx and zjl's work was also supported in part by nsf/dbi-0354771, nsf/itr-iis-0407204, and a "distinguished cancer scholar" grant from the georgia cancer coalition. dx's work was also partially funded by nsf/eia-0325386.
key: cord-339209-oe8onyr9 authors: vasilakis, nikos; guzman, hilda; firth, cadhla; forrester, naomi l; widen, steven g; wood, thomas g; rossi, shannan l; ghedin, elodie; popov, vsevolov; blasdell, kim r; walker, peter j; tesh, robert b title: mesoniviruses are mosquito-specific viruses with extensive geographic distribution and host range date: 2014-05-20 journal: virol j doi: 10.1186/1743-422x-11-97 sha: doc_id: 339209 cord_uid: oe8onyr9
background: the family mesoniviridae (order nidovirales) comprises a group of positive-sense, single-stranded rna ([+]ssrna) viruses isolated from mosquitoes. findings: thirteen novel insect-specific virus isolates were obtained from mosquitoes collected in indonesia, thailand and the usa. by electron microscopy, the virions appeared as spherical particles with a diameter of ~50 nm. their 20,129 nt to 20,777 nt genomes consist of positive-sense, single-stranded rna with a poly-a tail. four isolates from houston, texas, and one isolate from java, indonesia, were identified as variants of the species alphamesonivirus-1, which also includes nam dinh virus (ndiv) from vietnam and cavally virus (cavv) from côte d’ivoire. the eight other isolates were identified as variants of three new mesoniviruses, based on genome organization and pairwise evolutionary distances: karang sari virus (ksav) from java, bontag baru virus (bbav) from java and kalimantan, and kamphaeng phet virus (kphv) from thailand.
in comparison with ndiv, the three new mesoniviruses each contained a long insertion (180 – 588 nt) of unknown function in the 5’ region of orf1a, which accounted for much of the difference in genome size. the insertions contained various short imperfect repeats and may have arisen by recombination or sequence duplication. conclusions: in summary, based on their genome organizations and phylogenetic relationships, thirteen new viruses were identified as members of the family mesoniviridae, order nidovirales. species demarcation criteria employed previously for mesoniviruses would place five of these isolates in the same species as ndiv and cavv (alphamesonivirus-1), and the other eight isolates would represent three new mesonivirus species (alphamesonivirus-5, alphamesonivirus-6 and alphamesonivirus-7). the observed spatiotemporal distribution over widespread geographic regions and broad species host range in mosquitoes suggest that mesoniviruses may be common in mosquito populations worldwide.
the recently established virus family mesoniviridae (order nidovirales) comprises a group of positive-sense, single-stranded rna ([+]ssrna) insect viruses [1]. to date, the six described mesoniviruses, cavally (cavv), daknong (dkng), hana (hanav), meno (menov), nam dinh (ndiv) and nse (nsev), have all been isolated from naturally infected mosquitoes collected in just two countries: côte d'ivoire (west africa) and vietnam (southeast asia) [2] [3] [4] [5]. although these mosquito-associated viruses do not appear to infect vertebrates or to cause illness in humans or livestock, they are nonetheless of interest because of their structural and genetic similarities to other members of the order nidovirales, namely viruses in the families coronaviridae, arteriviridae, and roniviridae.
furthermore, the basal phylogenetic position of the mesoniviruses in relation to the coronaviridae has led some authors to suggest that the coronaviruses, and possibly other viruses in the order nidovirales, may have evolved in arthropods [3] [4] [5] . indeed, members of the roniviridae naturally infect marine shrimp and can cause severe pathology in these economically important arthropods [6] . in this communication, we report the isolation and characterization of 13 additional mesoniviruses from mosquitoes collected in thailand, indonesia and the united states of america (usa). these 13 viruses appear to represent four distinct species in the family mesoniviridae, three of which are novel. based on their wide geographic distribution, the limited sampling that has been done to date for these mosquito-specific viruses, and their broad species host range in mosquitoes, it seems likely that mesoniviruses are common in mosquito populations worldwide. the potential biological significance and effect of mesoniviruses on mosquito vector competence is also discussed. all isolates had similar ultrastructure. mature virions 50 nm in diameter were located at the surface of infected c6/36 cells, either as individual particles or in groups ( figure 1b,d) , similar to what has been reported recently [7] . they displayed a dense nucleocapsid core~40 nm in diameter and surrounding envelope ( figure 1a ,b). mature virions of the same size could also be found inside intracytoplasmic vacuoles ( figure 1a ,b,d,e,f,h). some infected cells had paracrystalline arrays consisting of empty and full virus particles but with less electron-density than mature virions ( figure 1c ). at the periphery of these arrays, mature virions could be observed either free in the cytosol or inside vacuoles ( figure 1c ). 
moreover, a recent report [8] suggests that the mesonivirus virion diameter of ~40-50 nm observed by us and others [7] may be significantly lower than its actual size, a discrepancy attributed to the method of preparation for alternative imaging technologies. employing cryoelectron tomography, warrilow et al. [8] demonstrated on the surface of some virions the presence of spike-like projections displaying a globular head attached to the virion surface through a low-density stalk, similar to what has been observed in other members of the order nidovirales [9].

the complete nucleotide sequences of all 13 isolates were determined by high-throughput illumina sequencing with end-finishing by 5'- and 3'-race. excluding the 3'-poly[a] tail, the (+)ssrna genomes ranged in size from 20,127 nt (v3872) to 20,777 nt (jkt-10701). the organization of each genome was similar to that described previously for the mesoniviruses (ndiv, cavv, hanav, nsev and menov), featuring a long 5'-untranslated region (5'-utr) of 359 to 370 nt, six major long open reading frames (orfs), and a long terminal region of 1780 to 1804 nt preceding the poly[a] tail (figure 2). an alignment that also included the five previously described mesoniviruses revealed block insertions in three groups of isolates: i) kp84-0156, kp84-0192 and kp84-0344 had a 180 nt insertion, ii) jkt-9853, jkt-9876, jkt-9891 and jkt-7774 had a 573 nt insertion, and iii) jkt-10701 had a 588 nt insertion. these insertions accounted for most of the observed difference in genome size in the mesoniviruses (figure 2). excluding this region, which commenced ~1300 nt from the 5'-terminus, the mesoniviruses shared global nucleotide sequence identity of ~43.7%. maximum pairwise divergence was between menov and all other viruses (65.5%-67.6% identity).
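the identity figures quoted above can be illustrated with a minimal sketch. it assumes the sequences are already aligned to equal length, that gapped columns are excluded from pairwise comparisons, and that "global identity" means the fraction of alignment columns in which every sequence carries the same residue; the paper does not state its exact convention, so these definitions and the function names `percent_identity` and `global_identity` are illustrative assumptions.

```python
def percent_identity(a, b, gap="-"):
    """Pairwise identity (%) between two aligned sequences.

    Assumption: columns where either sequence has a gap are skipped.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    compared = matches = 0
    for x, y in zip(a.upper(), b.upper()):
        if x == gap or y == gap:
            continue  # ignore gapped columns
        compared += 1
        if x == y:
            matches += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return 100.0 * matches / compared


def global_identity(alignment, gap="-"):
    """Share (%) of alignment columns identical across all sequences.

    Assumption: a column containing any gap counts as non-identical.
    """
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all sequences must have the same aligned length")
    identical = 0
    for column in zip(*(seq.upper() for seq in alignment)):
        if gap not in column and len(set(column)) == 1:
            identical += 1
    return 100.0 * identical / length


if __name__ == "__main__":
    # toy sequences, not real mesonivirus data
    print(percent_identity("ACGU-A", "ACGAUA"))
    print(global_identity(["ACGU", "ACGA", "ACGU"]))
```

other conventions (for example, counting gaps as mismatches) give different percentages, so values computed this way are only comparable when the same rule is applied throughout.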
figure 2. genome organisation of the mesoniviruses. in orf1a and orf1b the putative ribosomal frame-shift (rfs) element and the locations of conserved functional domains, including the 3c-like serine protease (3clp), rna-dependent rna polymerase (rdrp), zinc-binding (z), helicase (hel), exoribonuclease (exon), n7-methyltransferase (nmt) and 2'-o-methyltransferase (omt) domains, are shown. sequence insertions in orf1a are shown as black blocks. in orf2a encoding the s glycoprotein the sites of proteolytic cleavage to generate s1, s2 and the uncharacterised n-terminal fragment are shown as black inverted triangles. putative orf4 that is relatively conserved in the 3'-terminal region of the genome is shown as black rectangles and other small orfs >180 nt that occur variously in the 3'-terminal domains and elsewhere in the genome are shown as grey rectangles.

to determine the phylogenetic relationships of the newly identified insect viruses, maximum likelihood (ml) phylogenetic trees were constructed based on the amino acid alignments of orf2a (unprocessed s protein) and a concatenated region of the highly conserved domains within orf1ab (3clpro, rdrp and znhel1). the phylogenies exhibited highly similar topologies and strongly indicated that the viruses identified in this study cluster within the previously identified insect-specific nidoviruses in the genus mesonivirus (ndiv, cavv, hanav, nsev and menov) (figure 3). a multiple sequence alignment of the 13 new genomes indicated high identity between multiple isolates, which was also confirmed by our phylogenetic analysis. isolates kp84-0156, kp84-0192 and kp84-0344 differed only by ten nucleotide substitutions and a single 6-nt indel and formed a well-supported monophyletic clade in both phylogenies (figure 3). these should be considered isolates of the same virus (designated kamphaeng phet virus, kphv).
likewise, jkt-9853, jkt-9876 and jkt-9891 differed by only single nucleotide substitutions and jkt-7774 differed by 68 nucleotide substitutions and two indels of 3 and 6 nucleotides, and these four viruses formed a strongly supported monophyletic clade in the ml trees, suggesting that they can be considered isolates of the same virus (designated botang baru virus, bbav). isolates 16740, 16757, v3872 and v3982 were also closely related, differing by only 18-57 nucleotide substitutions (~0.1-0.3%) across the entire genome. the clade formed by these isolates in each ml tree clustered within a closely related monophyletic group that also included ndiv and isolate jkt-9982 (figure 3). these can all be considered strains of ndiv. finally, isolate jkt-10701 was clearly distinct from all other mesoniviruses in our phylogenetic analysis, and was considered a unique virus, karang sari (ksav).

it has recently been suggested that pairwise evolutionary distances (ped) can be used as a species demarcation criterion in the genus mesonivirus [6]. we calculated the ped between the 13 new isolates described in this study and those of five previously described mesoniviruses (cavv, hanav, ndiv, nsev and menov) using the conserved protein domains of orf1ab (3clpro, rdrp and znhel1). based on this analysis, ndiv (including the ngewontan and houston strains) and cavv can be considered a single species, which has previously been designated as alphamesonivirus-1. the ped between each of these strains was substantially less than the suggested cutoff of 0.032 for species demarcation [6]. by this method, each of the other previously described mesoniviruses (hanav, menov and nsev) would also be considered a unique species, as the ped values between these viruses and all other mesoniviruses were considerably greater than 0.032. these viruses could be assigned to the species alphamesonivirus-2, alphamesonivirus-3 and alphamesonivirus-4, respectively.
similarly, novel viruses bbav, ksav and kphv would each also be assigned to separate mesonovirus species: alphamesonivirus-5, alphamesonivirus-6 and alphamesonivirus-7, respectively. strikingly, the vast majority of the genetic diversity that was measured within the genus mesonivirus was present at the inter-species level (additional file 1: figure s1 ). this pattern strongly supports the assignment of the current diversity of mesoniviruses into seven distinct species. in mesoniviruses, orf1a features a 3c-like protease (3clpro) domain and flanking transmembrane domains, which are highly conserved amongst nidoviruses; these domains were identified in each of the new viruses. however, the block nucleotide sequence insertions resulted in a highly variable region near the n-terminus of orf1a with insertions of various lengths ranging from 13 aa (menv) to 196 aa (ksav) (additional file 2: figure s2 ). the inserted sequences (which apparently are unique to the mesoniviruses) featured variations on a common sequence (skrkgk) at the terminal points. there were also various short stretches of this sequence and other imperfectly repeated sequences within each insertion region suggesting possible origins through recombination or sequence duplication events. as has been reported previously for other mesoniviruses, orf1a is overlapped by orf1b which contains highly conserved regions associated with the replicase complex including rna-dependent rna polymerase (rdrp), multinucleate zinc-binding module and associated helicase (znhel1), exoribonuclease (exon), n 7 -methyltransferase (nmt) and 2'-o-methyltransferase (omt) domains. the orf1a/orf1b overlap includes a putative 'slippery' sequence (ggauuuu), allowing orf1b expression as a pp1ab polyprotein through a −1 ribosomal frame shift. orf2a encodes the s glycoprotein. 
in cavv, the s protein (p77) has been shown to be generated from the full-length orf2a polyprotein by internal signalase cleavage at the site 220 nahc|strid and is further processed to generate cleavage products s1 (p23) and s2 (p57), each of which is a structural component of virions [5]. a muscle alignment of all available mesonivirus orf2a polyprotein sequences (additional file 3: figure s3) indicated that the signalase cleavage site is relatively variable, conforming only to the sequence motif [c/s]|[l/a/s]tridl, and that the s1-s2 cleavage site (r|wdssyv) is highly conserved. the s protein, a class i transmembrane glycoprotein with a c-terminal transmembrane domain, features a set of 12 conserved cysteine residues which are likely to form 6 disulphide bridges, and 7 to 11 potential n-glycosylation sites, only 4 of which are conserved across all mesoniviruses. surprisingly, an n-terminal signal peptide is strongly predicted for the menov s protein (between amino acids a20 and s21) but not for those of any of the other mesoniviruses (http://www.cbs.dtu.dk/services/signalp). there is a relatively high level of amino acid sequence conservation in the s1 and s2 proteins (47.2% and 51.7% global identity, respectively). however, the n-terminal product of signalase cleavage of the orf2a polyprotein displays very low sequence conservation amongst the mesoniviruses (7.4% global identity). this product, which is predicted to be positioned on the inside of the er membrane (tmhmm server; www.cbs.dtu.dk/services/tmhmm-2.0), has not yet been identified in virions or infected cells. in each of the mesoniviruses, the precise region of the genome encoding the highly variable n-terminal signalase cleavage product of the orf2a polyprotein also contains an alternative open reading frame (orf2b) which has been shown in cavv to encode the putative nucleoprotein p25 [5].
our analysis indicated that the predicted molecular weights of the unmodified mesonivirus n proteins range from 23.69 kda (ngev) to 25.38 kda (ksav) and all are highly basic (pi >10). they display a moderate level of overall sequence conservation (34.1% global identity) due primarily to two highly conserved domains corresponding in ksav to g95 to a176 (64.3% global sequence identity) and l195 to f210 (75.0% global sequence identity).

the third coding region, commencing immediately downstream of orf2a, contains two overlapping long open reading frames (orf3a and orf3b), each encoding small hydrophobic proteins (figure 4). a clustal x alignment of the mesonivirus orf3a proteins and individual structural analyses using signalp, tmhmm and netnglyc (www.expasy.org) indicated that each is a class i transmembrane glycoprotein with a predicted n-terminal signal peptide, an ectodomain containing a conserved set of 6 cysteine residues and a single conserved n-glycosylation site, a transmembrane domain and a c-terminal cytoplasmic domain (figure 4a, 4d). the cysteine-rich region of the ectodomain displays moderately high sequence conservation, including a stretch of 13 totally conserved amino acids adjacent to the transmembrane domain that is unusually rich in large aromatic residues (y, w, f). the cytoplasmic domain is highly variable in sequence, except for the completely conserved motif (h/s)yipllpr. no similar motif has been detected in a search of available eukaryote sequences. the orf3b proteins are each predicted to be class ii transmembrane proteins with an n-terminal cytoplasmic domain, a transmembrane domain and a short c-terminal ectodomain (figure 4b, 4d). a single n-glycosylation site was predicted in the ectodomain of all the mesoniviruses, except nsev. the orf3b proteins display poor overall sequence conservation and have few features that suggest a biological function.
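potential n-glycosylation sites of the kind counted above are conventionally flagged by the n-x-[s/t] sequon, where x is any residue except proline. the sketch below scans a protein sequence for that sequon only; it is not the netnglyc method used in the paper (which applies a trained predictor on top of the sequon), and the function name `find_sequons` and the toy sequence are illustrative assumptions.

```python
import re

# N-X-[S/T] sequon with X != P; the lookahead lets overlapping sequons be found.
SEQUON = re.compile(r"(?=N[^P][ST])")


def find_sequons(protein):
    """Return 1-based positions of the asparagine in each N-X-[S/T] sequon."""
    sequence = protein.upper()
    return [match.start() + 1 for match in SEQUON.finditer(sequence)]


if __name__ == "__main__":
    # toy peptide, not a real mesonivirus protein
    toy = "MNGTANPSNNTS"
    print(find_sequons(toy))
```

a sequon match only marks a candidate site; whether it is actually glycosylated depends on topology and accessibility, which is why the text combines it with signal peptide and transmembrane predictions.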
the long 3'-terminal region of all mesoniviruses, except menov, contained a small open reading frame, which was previously designated as orf4. a clustal × multiple sequence alignment of the putative proteins indicated that, although they varied in size (45 aa to 61 aa), there was a remarkable level of sequence identity, particularly in the n-terminal portion of the protein (additional file 4: figure s4 ). however, there are no obvious structural various other short orfs (<150 nt) were detected in the 3'-terminal regions of each mesonivirus but none was obviously conserved. alternative orfs of >180 nt (i.e., 60 amino acids) were also detected within other orfs, including in orf2a in the original strain of ndiv (72 aa) and the ngewontan (72 aa) and houston (61 aa) strains, in which the 5' region is missing (figure 2) . these orfs share a relatively high degree of nucleotide sequence conservation (92%) but a lower level of amino acid sequence conservation (79%) and no obvious structural characteristics indicate the possible function of the encoded proteins. another short orf that displayed some degree of sequence conservation was found in the 3' region of orf1a for all viruses except menov and kphv (figure 2 ). mesoniviruses have been shown previously to express a 3'-nested set of polyadenylated sub-genomic mrnas produced by copy-choice mediated leader-body fusion at sites located immediately upstream of alternative transcription regulatory sequence (trs) elements which occur in the 5'-utr and in the regions preceding orf2a/ orf2b and orf3a/orf3b [5] . each of the previously identified leader-body fusion sites and trs elements [auxxuacuacuacua and agax(x)acucuccca] were completely conserved across all new mesoniviruses examined in this study. in nidoviruses, translation of orf1b characteristically occurs through a ribosomal frame-shift at a 'slippery' sequence in the orf1a/orf1b overlap region to allow read-through synthesis of a polyprotein (pp1ab). 
this is usually facilitated by an rna pseudoknot in the sequence immediately downstream of the frame-shift site. however, a previous analysis of the ndiv sequence failed to identify a predicted pseudoknot structure, suggesting that a stem-loop structure (predicted using pknotsrg; http://bibiserv.techfak.uni-bielefeld.de/pknotsrg/) in the same region of the genome may facilitate the frame-shift [3] . however, our analysis using pknotsrg of the 13 new mesonivirus sequences and those of cavv, hanav, ndiv, nsev and menov indicated that, although the slippery sequence (ggauuuu) is completely conserved, the stem-loop structure predicted for ndiv is not predicted as the minimum free energy structure for any of the other viruses. indeed, an alignment of the corresponding region of the genome of all mesoniviruses (additional file 5: figure s5 ) indicated that there was poor conservation of this sequence and few compensatory mutations that would preserve the stem-loop structure ( figure 5a ). as functional pseudoknot structures are usually conserved, an alignment of all mesonivirus sequences extending 150 nt from the start of the 'slippery sequence' was analyzed using the ipknot server (http:// rna.naist.jp/ipknot) which predicts the consensus secondary structure from a multiple sequence alignment (by using integer programming to select the maximum expected accuracy (mea) structure) [10] . as shown in figure 5b , the ipknot algorithm predicted a conserved pseudoknot structure, which conformed to all nucleotide substitutions in the available mesoniviruses. analysis of the region of the orf3a/orf3b overlap indicated that it has a similar format to that of the orf1a/orf1b overlap region, featuring a long overlap region and potential slippery sequence (cacuuuu) that could result in read-through by a −1 ribosomal frameshift. 
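locating the candidate slippery heptamers and cutting out the region submitted to structure prediction can be sketched as follows. this is only an illustration of the bookkeeping, assuming a plain string genome in which t and u are interchangeable; the function names `find_sites` and `downstream_windows` are hypothetical, the default window of 150 nt follows the length quoted in the text, and the actual structure prediction (pknotsrg, ipknot) is not reproduced here.

```python
def find_sites(genome, motif="GGAUUUU"):
    """Return 0-based start positions of every (possibly overlapping) motif hit."""
    rna = genome.upper().replace("T", "U")
    motif = motif.upper().replace("T", "U")
    sites = []
    start = rna.find(motif)
    while start != -1:
        sites.append(start)
        start = rna.find(motif, start + 1)
    return sites


def downstream_windows(genome, motif="GGAUUUU", length=150):
    """Return (position, window) pairs, each window starting at the motif.

    Windows near the 3' end are simply truncated.
    """
    rna = genome.upper().replace("T", "U")
    return [(pos, rna[pos:pos + length]) for pos in find_sites(rna, motif)]


if __name__ == "__main__":
    # toy sequence, not a real mesonivirus genome
    toy = "AAAGGATTTTCCCCGGAUUUUGG"
    print(find_sites(toy))
    print(downstream_windows(toy, length=10))
    print(find_sites(toy, motif="CACUUUU"))
```

in a genome-scale scan most heptamer hits will lie outside an orf overlap, so hits would still need to be filtered by position and reading frame before being treated as candidate frame-shift sites.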
structural analyses using signalp, tmhmm and netnglyc (www.expasy.org) indicated that a frame-shift at this site would result in a double-membrane-spanning glycoprotein (p3ab) of approximately 30-33 kda (depending on whether one or two n-glycosylation sites are occupied) with both the n-terminal and c-terminal domains located in the extracellular lumen (figure 4c, d). a previous analysis of cavv proteins by mass spectrometry identified peptide sequences from proteins migrating in gels at 17, 18, 19 and 20 kda that corresponded to the 3a protein (designated the m protein) and these were considered to be variously glycosylated forms; no protein corresponding to the orf3b gene product was detected [5]. as for the orf1a/orf1b region, analysis of the genome region downstream of the putative ribosomal frame-shift (rfs) element using pknotsrg identified various potential stem-loops but no commonly predicted minimum free energy structure. analysis of a multiple sequence alignment of the region using the ipknot server also failed to predict a convincing pseudoknot structure.

in this study we have identified 13 mesonivirus isolates and characterized their genome organization and phylogenetic relationships. based on species demarcation criteria employed previously for mesoniviruses [1], five of these new isolates would be assigned to the same species as ndiv and cavv (alphamesonivirus-1), the other previously described mesoniviruses (hanav, menov and nsev) would be assigned to three new species (alphamesonivirus-2, alphamesonivirus-3 and alphamesonivirus-4, respectively), and eight of the new isolates would represent three new species (alphamesonivirus-5, alphamesonivirus-6 and alphamesonivirus-7) [6].
however, we consider this basis for species demarcation, which employs only a genetic standard of pairwise sequence divergence to assign viruses to a species, should be re-evaluated following further assessment of the ecology of these viruses and their potential for genetic recombination which may provide a more informed analysis of suitable species demarcation criteria. the isolates of viruses assigned to the species alphamesonivirus-1 illustrate the wide geographic distribution and mosquito host range of some mesoniviruses. the original four isolates of ndiv were made in northern and central vietnam from culex vishnui and cx. tritaeniorhynchus mosquitoes collected indoors during a surveillance program for japanese encephalitis virus (jev) [3] . our ndiv isolate from java, indonesia (ngewontan strain) was also obtained from a pool of cx. vishnui mosquitoes in 1981. the four ndiv isolates from houston, texas (houston strain) were made from cx. quinquefasciatus and aedes albopictus collected outdoors within the houston metropolitan area during west nile virus (wnv) surveillance in 2004 and 2010, respectively. interestingly, all of these isolates were from mosquitoes captured in or near human dwellings and from species that feed on humans. the close similarity of isolates from houston and vietnam may suggest a recent translocation, possibly during the vietnam conflict when houston hosted a major air base for embarkation/disembarkation. isolates of kampaeng phet, botang baru and karang sari viruses, which may be considered three new mesonivirus species, have a geographic distribution extending at least from central thailand to kalimantan and java in indonesia, and have been isolated from at least two species of culex mosquitoes. another consideration is the potential effect of mesonivirus infection on the susceptibility and vector competence of mosquitoes for viral pathogens of vertebrates. for example, both cx. vishnui and cx. 
tritaeniorhynchus are important vectors of jev in asia, and cx. quinquefasciatus is the major vector of wnv in houston. recent experimental studies with ae. aegypti mosquitoes infected with certain strains of wolbachia indicate that the presence of the symbiont bacterium interferes with dengue virus replication and decreases vector competence, possibly by upregulating or priming the mosquito's innate immune system [11, 12]. similar results have been reported for wolbachia-infected ae. aegypti and chikungunya virus [11] and with wolbachia-infected cx. quinquefasciatus and wnv [13]. if a bacterial endosymbiont can alter a mosquito's vector competence for arboviruses, it seems plausible that a viral symbiont could have a similar effect [14]. this is an important area for future investigation. due to the continuous efforts of the virus discovery program of the world reference center for emerging viruses and arboviruses (wrceva), we have continued to isolate mesoniviruses from various insect vectors collected from widespread geographic locations (e.g., nepal, colombia and south florida), suggesting that these viruses are more common than previously thought. zirkel et al. [4] suggested that these viruses may have their origins in pristine rainforests and that their emergence may have been facilitated through anthropogenic modifications (e.g., altered land use, deforestation).

figure 5. (a) stem-loop structure predicted previously for ndiv [3] using pseudoknotsrg software, illustrating nucleotides that are substituted in various viruses (red circles). nucleotide substitutions (which are primarily non-compensatory) are shown in the boxes. the corresponding sequence alignment is shown in additional file 4: figure s4. (b) an alternative conserved pseudoknot structure predicted from the multiple sequence alignment using ipknot software. the putative 'slippery' sequence (ggauuuu) at the ribosomal frame-shift site is shaded in grey.
the detailed analysis and comparison of mesonivirus genome architecture conducted here has revealed some unexpected characteristics. firstly, the presence of block insertions of up to 588 nt in the 5' terminal quadrant of orf1a of several mesoniviruses has not been reported previously. the function of this region is presently unknown in mesoniviruses and other nidoviruses and so the structural and functional consequences of these insertions, which contain various imperfect repeats is unclear. although sharing similar genome architecture, nidoviruses vary greatly in genome size. a previous analysis of the evolution of nidovirus genomes concluded that genome expansion has occurred in a wave-like fashion in which the three major coding regions (orf1b, orf1a and the 3'orfs) expanded consecutively in a hierarchy that reflects the roles of their encoded proteins in the virus replication cycle [15] . this implies that nidoviruses have an inherent capacity for genome expansion, most likely associated with the transitional retention of sequences that serve as a resource for the evolution of new functions. the block insertions detected in orf1a appear to be functionally redundant and their potential role in such evolutionary processes is presently unclear. the comparative analysis also revealed that the stemloop structure which had previously been identified at the rfs site of ndiv is not conserved in the other mesoniviruses and so may not be responsible for activating the −1 ribosomal frame shift. secondary structure predictions using ipknot on the aligned sequences downstream of the conserved 'slippery' sequence site revealed a conserved pseudoknot structure that conformed to all nucleotide substitutions. however, the predicted structure featured only four relatively short regions of complementarity and no estimations of minimum free energy for the represented structures are available through this algorithm. 
a possible 'slippery' sequence (cacuuuu) was also detected in the orf3a/ orf3b overlap region but no conserved stem-loop or pseudoknot structure was predicted by ipknot in the downstream sequence. it is, therefore, unclear whether orf3b is expressed by internal initiation or as a read-through extension of orf3a, which would generate a double-membrane spanning protein. functional analysis of each of the regions corresponding to the rfs in orf1a/orf1b and the putative rfs in orf3a/orf3b would help resolve the mechanisms of mesonivirus gene expression. in conclusion, we have identified and characterized several new mesoniviruses from mosquitoes of human medical importance sampled over time from widespread geographic regions. several important questions related to their transmission, maintenance in insect hosts in nature, their potential impact of infection on the insect's behavior, fertility, fecundity and survival, their evolution, their mechanisms of gene expression and their potential to be developed as biological control agents, warrant further investigation. all viruses used in this study were obtained from the wrceva at the university of texas medical branch. some were isolated by the authors (rbt and hg), during arbovirus field studies; the remainder were isolated by other investigators and sent to the wrceva for identification and further characterization. all isolations were originally made in mosquito cell cultures (c6/36 or ap-61). the proposed names, original sources and geographic origins and genbank accession numbers of the sequences obtained for the 13 viruses included in our study are listed below and in table 1 . jkt-10701 was isolated from a pool of culex vishnui mosquitoes collected on 11/26/1981 at karang sari, cilacap (central java) indonesia. the initial isolation was made at the naval medical research unit #2 (namru-2) in jakarta. 
jkt-7774 was isolated at namru-2 from a pool of 50 culex vishnui collected at bontag baru, east kalimantan, indonesia in february 1981. strains jkt-9876, jkt-9891 and jkt-9853 were also isolated at namru-2 and appear to be almost identical to isolate jkt-7774 in the phylogenetic tree. since jkt-7774 was the first virus in this group to be isolated, it should be the prototype. jkt-9876 was isolated at namru-2 from a pool of
tvp16740 and tvp16757: these two viruses were also isolated at utmb from pools of aedes albopictus mosquitoes collected in june 2010 as part of the harris county arbovirus surveillance program in houston, texas.

before sequencing, all virus stocks were grown in cultures of the c6/36 clone of ae. albopictus cells [16], obtained from the american type culture collection (atcc), manassas, va. infection was characterized by detachment of cells and cell lysis.

for ultrastructural analysis in ultrathin sections, infected cells were fixed for at least 1 h in a mixture of 2.5% formaldehyde prepared from paraformaldehyde powder, and 0.1% glutaraldehyde in 0.05 m cacodylate buffer ph 7.3 to which 0.03% picric acid and 0.03% cacl2 were added. the monolayers were washed in 0.1 m cacodylate buffer, and the cells were scraped off and processed further as a pellet. the pellets were post-fixed in 1% oso4 in 0.1 m cacodylate buffer ph 7.3 for 1 h, washed with distilled water and en bloc stained with 2% aqueous uranyl acetate for 20 min at 60°c. the pellets were dehydrated in ethanol, processed through propylene oxide and embedded in poly/bed 812 (polysciences, warrington, pa). ultrathin sections were cut on a leica em uc7 ultramicrotome (leica microsystems, buffalo grove, il), stained with lead citrate and examined in a philips 201 transmission electron microscope at 60 kv.

viral rna (0.05-1.7 μg) was fragmented by incubation at 94°c for 8 min in 19.5 μl of fragmentation buffer (illumina 15016648).
first and second strand synthesis, adapter ligation and amplification of the library were performed using the illumina truseq rna sample preparation kit under conditions prescribed by the manufacturer (illumina). samples were tracked using the "index tags" incorporated into the adapters as defined by the manufacturer.

[1] mesoniviridae: a proposed new family in the order nidovirales formed by a single species of mosquito-borne viruses
[2] examining landscape factors influencing relative distribution of mosquito genera and frequency of virus infection
[3] discovery of the first insect nidovirus, a missing evolutionary link in the emergence of the largest rna virus genomes
[4] an insect nidovirus emerging from a primary tropical rainforest
[5] identification and characterization of genetically divergent members of the newly established family mesoniviridae
[6] molecular biology and pathogenesis of roniviruses
[7] a new nidovirus (namdinh virus ndiv): its ultrastructural characterization in the c6/36 mosquito cell line
[8] a new species of mesonivirus from the northern territory, australia
[9] supramolecular architecture of severe acute respiratory syndrome coronavirus revealed by electron cryomicroscopy
[10] rtips: fast and accurate tools for rna 2d structure prediction using integer programming
[11] a wolbachia symbiont in aedes aegypti limits infection with dengue, chikungunya, and plasmodium
[12] the relative importance of innate immune priming in wolbachia-mediated dengue interference
[13] the native wolbachia endosymbionts of drosophila melanogaster and culex quinquefasciatus increase host resistance to west nile virus infection
[14] negevirus: a proposed new taxon of insect-specific viruses with wide geographic distribution
[15] the footprint of genome architecture in the largest genome expansion in rna viruses
[16] isolation of a singh's aedes albopictus cell clone sensitive to dengue and chikungunya viruses
[17] smart 7: recent updates to the protein domain annotation resource
[18] smart, a simple modular architecture research tool: identification of signaling domains
[19] muscle: multiple sequence alignment with high accuracy and high throughput
[20] new algorithms and methods to estimate maximum-likelihood phylogenies: assessing the performance of phyml 3.0
[21] tree-puzzle: maximum likelihood phylogenetic analysis using quartets and parallel computing
[22] sse: a nucleotide and amino acid sequence analysis platform

additional file 5: figure s5. a clustal x multiple sequence alignment of the region immediately downstream of the orf1a/orf1b rfs that has been predicted previously to adopt a stem-loop structure in ndiv. the alignment illustrates sequence variations in nucleotides predicted in ndiv to be involved in base pairs that sustain the structure.

cluster formation of the library dna templates was performed using the truseq pe cluster kit v3 (illumina) and the illumina cbot workstation using conditions recommended by the manufacturer. paired-end 50 base sequencing by synthesis was performed using the truseq sbs kit v3 (illumina) on an illumina hiseq 1000 using protocols defined by the manufacturer. cluster density per lane was 645-980 k/mm² and post-filter reads ranged from 148-178 million per lane. base call conversion to sequence reads was performed using casava-1.8.2. virus sequences were edited and assembled using the seqman and nextgen modules of the dnastar lasergene 7 program (bioinformatics pioneer dnastar, inc., madison, wi). in certain cases, prefiltering of reads to remove host sequence enhanced the assembly process. assembly was carried out using a fasta file of aedes albopictus sequences to remove host dna from the assembly, thus reducing the number of contigs present. the compiled sequences had their relationship to other viruses determined by a blastx search. the open reading frames were determined using enzyme x (nucleobytes, inc., aalsmeer).
relationships to the other two insect-specific nidoviruses were determined with the dna identity matrix software in macvector (cary, nc). the presence of conserved protein domains was determined using the smart webserver [17, 18]. a region corresponding to the full-length product encoded by orf2a (unprocessed s proteins) was used to determine the relationships within the family mesoniviridae along with a concatenated region of the highly conserved domains within orf1ab (3clpro, rdrp and znhel1). orf2ab alignments were determined using the muscle algorithm [19] as amino acids before being toggled back to the nucleotides while maintaining the alignment. concatenated orf1ab alignments were performed at the protein level. a maximum likelihood (ml) tree for orf2a was constructed in mega 5.2 using the jones-taylor-thornton substitution model of nucleotide substitution, uniform substitution rates among sites, the nearest-neighbor-interchange heuristic method and a very strong branch swap filter. an optimal ml tree was then estimated using the appropriate model and a heuristic search with tree-bisection-reconnection branch swapping and 10 replicates, estimating variable parameters from the data where necessary. a maximum likelihood phylogeny of the conserved domains of orf1ab was constructed using phyml 3.0 [20], the wag + gamma model of amino acid substitution, and the nni + spr method of branch swapping. one thousand bootstrap replicates were calculated for each dataset under the same models and expressed as a percentage. the pairwise evolutionary distance (ped) between the highly conserved protein domains of orf1ab (3clpro, rdrp and znhel1) of the thirteen new isolates described in this study and those of five previously described mesoniviruses (cavv, hanav, ndiv, nsev and menov) was calculated using the ml method in the program treepuzzle [21] and the wag model of amino acid substitution.
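as a rough illustration of what a pairwise distance between aligned protein sequences measures, the sketch below computes the uncorrected p-distance (the proportion of differing sites), both over a whole alignment and in windows along it. this is a deliberately simplified stand-in and not the ml/wag procedure of treepuzzle used in the study; the function names, the gap handling and the toy sequences are assumptions made for the example.

```python
def p_distance(a, b, gap="-"):
    """Uncorrected proportion of differing sites between two aligned sequences.

    Columns where either sequence has a gap are skipped (an assumption;
    other gap treatments are possible).
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    compared = differing = 0
    for x, y in zip(a, b):
        if x == gap or y == gap:
            continue
        compared += 1
        if x != y:
            differing += 1
    return differing / compared if compared else float("nan")


def sliding_p_distance(a, b, window, step):
    """p-distance in successive windows; returns (window start, distance) pairs."""
    return [
        (start, p_distance(a[start:start + window], b[start:start + window]))
        for start in range(0, len(a) - window + 1, step)
    ]


# Toy aligned amino acid sequences (hypothetical, for illustration only).
seq_a = "MKTLLVAG-S"
seq_b = "MKSLLVAGQS"
whole = p_distance(seq_a, seq_b)                 # 1 difference in 9 compared sites
windows = sliding_p_distance(seq_a, seq_b, 5, 5)
```

model-based distances such as those from the wag model additionally correct for multiple substitutions at the same site, so they are generally larger than the p-distance for divergent sequences.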
sliding window analysis was used to calculate the mean amino acid divergences within and between the seven mesonivirus species described in this study. percent divergences were calculated for each of the three largest orfs in the genome (orf1ab, orf2a, orf2b) using the program sse [22].
additional file 1: figure s1. sliding window analysis of the pairwise amino acid distances within and between the seven putatively designated mesonivirus species for orf1ab (replicase proteins), orf2a (s) and orf2b (n).
additional file 2: figure s2. a clustal x multiple sequence alignment of mesonivirus pp1ab polyproteins illustrating the region containing the block insertions (yellow shading) and various imperfectly repeated sequences that occur at the boundary and within the blocks of inserted sequence.
additional file 3: figure s3. a clustal x multiple sequence alignment of the polypeptides encoded in orf2a (s proteins) of the mesoniviruses. a predicted signal peptide in menov and predicted transmembrane domains in all mesoniviruses are shaded in aqua, predicted n-glycosylation sites are shaded in green, cysteine residues are shaded in yellow and the sites of proteolytic cleavage to generate glycoproteins s1 and s2 and the unidentified n-terminal fragment are shaded in purple.
additional file 4: figure s4. a clustal x multiple alignment of the sequences of putative polypeptides encoded on orf4, which occurs in the 3'-terminal regions of all mesoniviruses except menov. to emphasize the alignment, the ksav orf4 protein has been shown to commence at the next available methionine residue located 34 amino acids downstream of the predicted initiation codon.
the authors do hereby declare that they have no competing interests in this scientific work.
key: cord-325750-x7jpsnxg authors: mokili, john l; rohwer, forest; dutilh, bas e title: metagenomics and future perspectives in virus discovery date: 2012-01-20 journal: curr opin virol doi: 10.1016/j.coviro.2011.12.004 sha: doc_id: 325750 cord_uid: x7jpsnxg monitoring the emergence and re-emergence of viral diseases with the goal of containing the spread of viral agents requires both adequate preparedness and quick response. identifying the causative agent of a new epidemic is one of the most important steps for effective response to disease outbreaks. traditionally, virus discovery required propagation of the virus in cell culture, a proven technique responsible for the identification of the vast majority of viruses known to date. however, many viruses cannot be easily propagated in cell culture, thus limiting our knowledge of viruses. viral metagenomic analyses of environmental samples suggest that the field of virology has explored less than 1% of the extant viral diversity. in the last decade, the culture-independent and sequence-independent metagenomic approach has permitted the discovery of many viruses in a wide range of samples. phylogenetically, some of these viruses are distantly related to previously discovered viruses. in addition, 60–99% of the sequences generated in different viral metagenomic studies are not homologous to known viruses. in this review, we discuss the advances in the area of viral metagenomics during the last decade and their relevance to virus discovery, clinical microbiology and public health. we discuss the potential of metagenomics for characterization of the normal viral population in a healthy community and identification of viruses that could pose a threat to humans through zoonosis. in addition, we propose a new model of the koch's postulates named the ‘metagenomic koch's postulates’. 
unlike the original koch's postulates and the molecular koch's postulates as formulated by falkow, the metagenomic koch's postulates focus on the identification of metagenomic traits in disease cases. these are metagenomic traits that can be traced after healthy individuals have been exposed to the source of the suspected pathogen.
direct-count epifluorescence and transmission electron microscopy have shown that viruses are highly abundant in most environments. bergh et al. demonstrated that 1 l of seawater can contain as many as 10^10 virus-like particles (vlps) [1], approximately 10 times more than the number of prokaryotes. terrestrial environments often have 10^9 vlps per gram. by extrapolation from the estimated number of prokaryotes in different environments [2], viruses are the most abundant entities in the biosphere, totaling an estimated 1.2 × 10^30, 2.6 × 10^30, 3.5 × 10^31, and 0.25-2.5 × 10^31 in the open ocean, in soil and in oceanic and terrestrial subsurfaces, respectively. in the human holobiont, the 10^13 human cells are outnumbered 10-fold by bacteria and 100-fold by viruses. viral acquisition starts early in life, in utero or perinatally during the first few weeks after birth, as demonstrated by studies of the gut viral communities in infants. while no vlps could be detected in the earliest infant stool samples, there were approximately 10^8 virus particles per gram wet weight of feces by the end of the first week [2]. the majority of these vlps appear to be bacteriophages, the bacteria-infecting viruses [2] [3] [4]. culture techniques have been the gold standard for the detection of viruses for over a century. despite the knowledge gained using the cultivation of viruses in cell culture, the consensus is that we have barely begun to chart the viral world, which is the 'dark matter' of the biological universe and a rich source of future discoveries [3].
since the vast majority of viruses are not easily cultivatable, exploration of this dark matter requires culture-independent methods with larger detection coverage than culture. while the sequencing of the 16s fragment of the small subunit of the ribosomal rna (rrna) gene has a proven track record for the detection of known and novel cellular organisms [4] [5] [6] [7] [8] [9] [10] , this technique is not applicable to viruses because they lack the gene. indeed, viruses do not share any common gene that could similarly qualify as a unified phylogenetic marker [11] . metagenomics is an alternative culture-independent and sequence-independent approach that does not rely on the presence of any particular gene in all the subject entities. this approach was originally developed as a tool for 'functional and sequence-based analysis of collective microbial genomes contained in environmental samples' [12, 13] . early metagenomic studies analyzing the genetic content of environmental samples yielded the identification of metabolic traits, the characterization of organisms and the discovery of new antibiotics and enzymes [12] [13] [14] [15] [16] . metagenomic studies now encompass a wide scope of research fields including marine environmental research, plant and agricultural biotechnology, human genetics and diagnostics of human diseases. accordingly, the number of metagenomics papers in peer-reviewed journals has increased greatly since 2002 ( figure 1a ). the scope of applications for metagenomics will likely widen from environmental microbiome studies to routine clinical diagnostics for palliative care of patients, public health, industry and beyond. the first application of metagenomics to the field of virology was in the analysis of the viral communities sampled at two near-shore marine locations in san diego [17 ] . since then, it has been used to survey viruses in numerous environments including freshwater, marine sediment, soil and the human gut. 
figure 1b shows an overview of diverse areas where the metagenomic approach has been applied for virus discovery since 2002. the success of these studies relied upon the advances observed in the past decade in the area of sequencing technology and in bioinformatics. although the fundamental concept of metagenomics has not changed, several technical advances have proven valuable for the discovery of previously unidentified, uncultured viruses. while metagenomics originally depended upon cloning for the analysis of double-stranded dna genomes [17,18,19,20], high-throughput sequencing technologies can now be applied to all types of genomes, including single-stranded dna and rna [21].
figure 1 legend (partial). references: 62, 66, 70, 71, 74, 84-86, 88, 104-123, 148, 161, 162. m: main characterization method used: 454ngs, 454 high-throughput sequencing using gs flx or gs titanium platform; sg-sanger, shotgun library with sanger sequencing method. s: sample, symbols used for sample type: ip, insect pool; sb, skunk brain; int, intestine; panc, pancreas; hf, human feces; se, sewer effluent; ms, marine sediment; npa, nasopharyngeal aspirates.
historically, diseases caused by viruses have been known before the discovery of their causative agents. the acquired immunodeficiency syndrome (aids), poliomyelitis, cervical cancers, and burkitt's lymphoma were identified before their causative agents. whereas poliomyelitis was documented in ancient egyptian literature as early as approximately 3700 bc [22], poliomyelitis virus was not discovered by landsteiner and popper until 1909 [23]. descriptions of clinical conditions likely to have been smallpox have been found in ancient literature from egypt (1100-1580 bc), china (1122 bc) and india (1500 bc), long before both jenner's discovery of smallpox vaccination and the later isolation of variola virus [24] [25] [26].
looking ahead, the metagenomic approach will generate a plethora of genetic information from unknown and potentially infectious agents, some of which could be associated with human diseases. the discovery of viruses will start to precede the characterization of the diseases they cause, well before the pathogenicity of these agents is defined. at this turning point in history, important questions need to be answered. for example, how far has this new viral metagenomics discipline evolved in its first decade? what has been learned so far that can be applied to viral discovery and the forecasting of future viral outbreaks? in this article, we review virus discovery techniques with a focus on metagenomic approaches that employ high-throughput sequencing technologies to characterize novel viruses. before the advent of molecular methods, many techniques including filtration, tissue culture, electron microscopy (em), serology and vaccination were used for the detection of viruses. in 1892, ivanovski demonstrated the presence of infectious agents, coined 'virus' by beijerinck in 1898, in the filtrate of infected leaves passed through a chamberland filter. this marks the discovery of the tobacco mosaic virus [27] and the birth of a new era in virology. until then, the field of virology was not clearly defined. the instrumentation, from the discovery of tissue culture to modern molecular biology methods, has shaped the field and helped to discover many viruses. since the invention of the technique of tissue culture in 1907 and the propagation of poliovirus in animal cells in 1909, cultivation of viruses has remained the gold standard for virus discovery for over a century [28] [29] [30]. despite the achievements made by the culture technique, several limitations have hindered the discovery and detection of viruses in routine laboratory settings.
virus propagation requires the development of controlled conditions that mimic the natural ecosystem shared between viruses and their hosts [31]. the invention of the electron microscope in 1933 provided the first visual proof of a virus. however, this technique is relatively expensive, tedious and lacks both sensitivity and specificity. alternatively, serology can provide a hint of the acquisition of novel viruses (as was the case for hepatitis c virus [32, 33]) before the viral agents have been cultured or viewed by electron microscopy. the immune sera method has shown little value for virus discovery. the inoculation method, however, not only helped to identify novel viruses, but also was used as an immunization method to confer cross-protection against closely related viruses. indeed, the cowpox-based inoculation developed by jenner in 1796 was the first effective vaccine against an infectious disease. nearly two centuries later, this strategy was used to eradicate smallpox. however, it is unlikely that jenner's method would pass the scrutiny of modern ethical review boards for vaccine or virus discovery [34]. the trends in clinical virology practices show gradual substitution of the traditional virus discovery methods with novel molecular biology technology. nevertheless, the traditional and the newer molecular biology techniques to isolate, identify, and characterize viruses play complementary roles in the viral discovery effort. for a comprehensive list and detailed description of molecular methods used for virus discovery, readers are referred to reviews by delwart [31] and tang [35]. here, we focus on the viruses discovered using these methods and their future applications in clinical microbiology and public health settings. two types of molecular methods have been used for the virus discovery effort: sequence-dependent and sequence-independent methods.
sequence-dependent methods, including pcr using consensus primers and hybridization methods such as microarrays, require prior knowledge of the nucleic acid sequence for the detection of novel viruses. indeed, consensus sequences of previously known viruses have been used to identify novel viruses including highly divergent clades of human immunodeficiency virus [36], simian retroviruses [37] [38] [39] [40], and hepatitis e virus [41]. however, pcr using consensus primers based on previously characterized viruses has little or no value in detecting completely novel viruses. microarray techniques were first introduced in 1995 to monitor the expression of multiple genes simultaneously [42]. for virus discovery, microarrays can be prepared with probes that hybridize to known viral sequences and to potentially novel viruses with sufficient sequence similarity. the method has been applied to detect a wide range of known viruses as well as novel highly divergent viral taxa [43]. microarray screening has led to the identification and characterization of a novel gammaretrovirus, xenotropic murine leukemia virus-related virus (xmrv), in prostate tumors [43, 44]. subsequent studies did not confirm these initial findings [45, 46], which points to potential limitations of the method. another example of a well-known virus discovered with microarrays is sars-cov, a highly divergent coronavirus discovered amid a worldwide outbreak of the severe acute respiratory syndrome (sars) in 2003 [43]. reproducibility of results between microarray tests is frequently poor [47]. unlike pcr and microarrays, the sequence-independent viral metagenomic approaches do not rely on prior knowledge of viruses in the samples. suppression subtractive hybridization (ssh) and representational difference analysis (rda) are examples of sequence-independent virus discovery methods.
ssh was used first to study gene expression [48] and was later applied to investigate the etiology of diseases of unknown origin [49]. by hybridizing dna obtained from patients and control subjects, nucleic acid from an unknown pathogen(s) can be detected [49] [50] [51]. use of rda led to the discovery of human herpesvirus 8 (hhv8) [52], torque teno virus (ttv) [53], the gbv-a and gbv-b viruses [54] and a novel highly divergent murine norovirus [55]. this method lacks sufficient sensitivity to detect viruses when the viral burden is low or when the dna sequence of the suspected etiological agent is not clearly distinguishable from the control sample [31]. sequence-independent single-primer amplification (sispa) circumvents the viral load limitation of ssh. although there are several variations to the original protocol published by reyes et al. [56], the main strategy of sispa is to exploit the sensitivity and the specificity of pcr amplification using primers that bind oligonucleotide fragments ligated to any putative viral dna materials in the sample. sispa has been modified to allow the detection of both dna and rna viruses after the removal of genomic and contaminating nucleic acids [57]. the sispa method was used successfully for the discovery of hepatitis e virus [58, 59], norwalk virus [60], human astrovirus [61, 62], and parvoviruses 2 and 3 [63]. another sequence-independent technique, viral metagenomics (described in detail below), is better able to detect known and unknown viruses than the traditional methods and the molecular sequence-dependent and sequence-independent methods described above. compared to the virus discovery approaches outlined above, viral metagenomics is less biased. potentially, any virus in the samples, culturable or unculturable, known or novel, can be readily detected with the viral metagenomic approach. viral metagenomic methods have evolved significantly since they were first developed.
in early studies [17,18,19,20], preliminary sample preparation involved shearing of dna and cloning. these steps were required in order to obtain sufficient dna given the low amount of viral dna in environmental samples (approximately 10 mg/100 l of sea water). because viral dna often contains modified nucleotides and because some viral genes (e.g. holins and lysozymes) are toxic to cells, the dna was randomly sheared to produce small fragments before cloning [17,18,19,20]. the process of sample preparation has since been streamlined and the sequencing speed increased with the advent of high-throughput sequencing technologies. the replacement of cloning with high-throughput methods has revolutionized metagenomics. there are several high-throughput sequencing platforms commercially available that vary by the sequencing principle, the sequencing speed, the cost and read length. an overview of a typical viral metagenomic protocol that can be used in a virus discovery study is provided in figure 2. essentially, a metagenomic analysis involves three main steps: (1) sample preparation, (2) high-throughput sequencing and (3) bioinformatic analysis. below we provide an outline of each of these steps. more detailed descriptions have been previously published [64]. sample preparation. theoretically, any type of sample can be analyzed using the metagenomic approach, including seawater [65], blood [66], horse feces [67], stool [20, [68] [69] [70] [71], marine sediments [18], coral tissues [72, 73], and hot springs [74]. because viral genomes are relatively short, bacterial or eukaryotic nucleic acids can severely interfere with the isolation and detection of viral dna or rna, which typically represents only a small fraction of the total. thus, removal of non-viral nucleic acid is necessary [64,75].
figure 2: flow chart for the generation of a viral metagenome using high-throughput sequencing.
homogenization, filtration and ultracentrifugation are often necessary to concentrate the viral particles present in the sample ( figure 2 ). to ensure that viruses are not lost during the virus preparation, epifluorescence microscopy with sybr-gold staining is used on aliquots of samples obtained after the homogenization, filtration, and chloroform treatments to monitor the presence of vlps [64 ] . chloroform treatment followed by dnase digestion is used to remove contaminating dna. the chloroform disrupts mitochondrial, bacterial and eukaryotic membranes, thereby exposing non-viral dna to the subsequent nuclease treatment [76, 77] . unfortunately, chloroform treatment may also cause enveloped viruses to lose their protective lipid membrane, thereby rendering their dna subject to dnase digestion [66] . moreover, dnase treatment does not always completely eliminate non-viral dna in the sample [63, 64 ] . after extraction, dna may need to be amplified with random primers [78, 79] . the whole transcriptome amplification (wta) kit can be used for the synthesis of cdna from viral rna [80] . single virus genomics (svg) was introduced by allen and collaborators to selectively isolate viruses before sequencing [81] . svg uses flow cytometry to sort viruses based on a method originally described by brussard et al. [82] . following the sorting, dna of different sizes is immobilized in agarose gel, and then amplified using the multiple displacement amplification (mda) method. the svg approach can also be applied to rna viruses provided a reverse-transcription step is inserted between the flow cytometry and mda. high-throughput sequencing. early metagenomic applications involved the generation of shotgun libraries and direct sequencing of the total dna content using the sanger enzymatic dideoxy-sequencing method. this approach permitted the discovery of novel phages in marine environments [61, 66] . 
the sanger technique had been the standard method for sequencing since it was first described in 1977 [83]. development of the 'next-generation' sequencing platforms offered the combined advantages of speed, automation and high throughput, thereby increasing sequencing capabilities by a factor of 100 to a million relative to the sanger technology. the illumina/solexa and roche 454 next-generation sequencing platforms have been used most often in virus discovery (figure 1). the illumina/solexa method is based on sequencing-by-synthesis chemistry using fragments of the sample dna ligated to oligonucleotide adapters. the adapters on a solid support act as primers for dna polymerase to incorporate reversible terminator nucleotides, each labeled with a different fluorescent dye. a typical sequencing run can generate up to 18 gigabases of data with an average read length of 75-100 nucleotides [21]. the sweetpotato badnavirus and the sweetpotato mastrevirus are examples of viruses discovered using the illumina/solexa sequencing platform [84]. the 454 flx titanium pyrosequencer commercialized by roche has been the most used for the discovery and characterization of novel viruses (http://www.454.com/publications-and-resources/publications.asp?postback=true). this platform was used for the identification of an uncharacterized mycovirus [85], solenopsis invicta virus 3 [86], merino walk virus and a new arenavirus [87, 88], among others (figure 1b). for sequencing, dna is fragmented and ligated to specific biotinylated linkers. the dna/linker complex is attached to streptavidin-coated beads that anchor the dna inside a droplet of water and pcr reagents in an oil emulsion. each fragment is first amplified to produce the template for the sequencing reaction.
sequencing is carried out by annealing primers to the linker portion of the template complex, followed by the incorporation of nucleotides by dna polymerase, which extends the complementary strand. the pyrophosphate released at each incorporation is detected through the production of light [89, 90]. the roche 454 system thus measures the pyrophosphate released as a result of nucleotide incorporation during dna synthesis. the amount of light released is proportional to the number of nucleotides incorporated, and the signal is captured by a charge-coupled device (ccd) camera, which converts it into digital data [91, 92]. a typical optimum run using a 454 pyrosequencer yields about one million reads with an average length of 350-450 nucleotides, totaling about 0.4 gigabases. bioinformatic analyses. the analysis of the copious data generated by high-throughput sequencing is the most challenging aspect of metagenomics. an inherent difficulty in assigning taxonomic designations to viral sequences is that there is no universally homologous nucleic acid component present in all viruses that can be used to build phylogenetic trees, a factor that also fuels the debate over whether or not viruses belong in the tree of life [11, [93] [94] [95] [96]. in most metagenomic studies, sequences generated by high-throughput sequencing are compared, using homology search tools, against previously documented sequences stored either in a local database or in public databases such as genbank. unfortunately, homology searches against known sequences in genbank cannot characterize unknown viruses (figure 3). the analysis of metagenomic libraries requires fast computation and the right algorithms to characterize sequences as belonging to putative viruses.
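to give a sense of the data volumes involved, the per-run figures quoted above can be checked with simple arithmetic. the short sketch below uses only the numbers stated in the text (about one million 454 reads of 350-450 nucleotides; up to 18 gigabases of illumina data at 75-100 nucleotides per read); the function names are made up for the example.

```python
def total_bases(reads, mean_read_length):
    """Total sequenced bases for a run: number of reads x mean read length."""
    return reads * mean_read_length


def reads_for_yield(bases, read_length):
    """Number of reads needed to reach a given total yield at a given read length."""
    return bases / read_length


# Roche 454: about one million reads of 350-450 nucleotides.
yield_454_low = total_bases(1_000_000, 350)    # 3.5e8 bases
yield_454_mid = total_bases(1_000_000, 400)    # 4.0e8 bases, i.e. about 0.4 gigabases
yield_454_high = total_bases(1_000_000, 450)   # 4.5e8 bases

# Illumina/Solexa: up to 18 gigabases at 75-100 nucleotides per read.
illumina_reads_min = reads_for_yield(18e9, 100)  # 1.8e8 reads
illumina_reads_max = reads_for_yield(18e9, 75)   # 2.4e8 reads
```

on these figures, a single illumina run corresponds to roughly 180-240 million short reads, against about one million longer reads for a 454 run, which is the trade-off between depth and read length that downstream assembly and homology searches have to accommodate.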
to ensure that bioinformatic analyses are performed only on high quality data, the reads are typically processed through a software pipeline to remove any background sequences, including host and bacterial dna that had not been removed by the filtration, chloroform, and dnase i treatments [97] [98] [99]. the resulting sequence reads are assembled with strict parameters to generate contigs, each made of sequences derived from the same organism or quasi-species. using a stringent assembly parameter is critical to avoid sequence chimerization. the contig sequences are then compared to the genbank non-redundant nucleotide database using blast [100] or usearch [101]. note that a database containing only viral sequences cannot identify bacterial, archaeal or eukaryotic sequences, and its use will lead to an overestimation of the fraction of unknowns (see below). with the increasing amount of data generated from different studies, there is a need for a cross-metagenome meta-analysis [102, 103]. this is particularly important because of the diversity of viral metagenomic protocols and the lack of a standard algorithm for downstream data analysis. the following items should be included in any report on viral metagenomic studies: firstly, the sequencing platform and its version number; secondly, raw sequence data accession numbers in a public database; thirdly, details about the bioinformatic analysis, including the homology search tool and the database being used to assign the taxonomy, and their versions; fourthly, a list of known and previously unknown viruses found, clearly showing if the 'novel' viruses are new strains of a previously described species or completely different viruses; and fifthly, causality evidence if any. the most intriguing aspect of viral metagenomics is the fact that a large number, usually the majority, of sequences have no significant similarity to anything known.
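the effect of database choice on the fraction of unknowns can be made concrete with a small sketch. it assumes that the best homology hit for each contig has already been reduced to a (taxonomic group, e-value) pair; the contig names, hit values and the e-value cutoff of 1e-3 are hypothetical choices for illustration, not values taken from any particular study or tool.

```python
def classify_contigs(contig_ids, best_hits, max_evalue=1e-3):
    """Label each contig with the group of its best hit, or 'unknown'.

    best_hits maps contig id -> (group, e-value); contigs with no hit, or
    with a hit weaker than max_evalue (an assumed cutoff), are 'unknown'.
    """
    labels = {}
    for cid in contig_ids:
        hit = best_hits.get(cid)
        if hit is None or hit[1] > max_evalue:
            labels[cid] = "unknown"
        else:
            labels[cid] = hit[0]
    return labels


def unknown_fraction(labels):
    """Fraction of contigs labelled 'unknown'."""
    return sum(1 for group in labels.values() if group == "unknown") / len(labels)


contigs = ["c1", "c2", "c3", "c4", "c5"]

# Hypothetical best hits against a comprehensive (all-taxa) database.
hits_full_db = {
    "c1": ("virus", 1e-30),
    "c2": ("bacteria", 1e-50),
    "c3": ("eukaryote", 1e-10),
    "c4": ("virus", 0.5),      # too weak to count as a significant hit
}

# The same search restricted to a viral-only database: non-viral hits vanish.
hits_viral_db = {cid: hit for cid, hit in hits_full_db.items() if hit[0] == "virus"}

frac_full = unknown_fraction(classify_contigs(contigs, hits_full_db))    # 0.4
frac_viral = unknown_fraction(classify_contigs(contigs, hits_viral_db))  # 0.8
```

in this toy case the viral-only database doubles the apparent fraction of unknowns, because the bacterial and eukaryotic contigs are no longer recognized and are counted as unknown instead.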
in this review, we refer to these sequences as the 'unknown' (figure 3a). a typical human or environmental viral metagenome can contain between 60% and 99% unknown sequences (figure 3) [...] searches (figure 3b). depending on how they are viewed, the unknowns can represent either a formidable challenge or a treasure trove for virus discovery. although researchers often tend to consider the unknowns as 'junk,' these sequences could be a valuable blueprint for the discovery of novel viruses [112, 124, 125]. thus far, there is a lack of suitable bioinformatic methods to characterize the unknown sequences. a tentative solution is to compare the sequences between samples in order to at least gain some insight about the viral entities that are shared between them. a program such as phaccs (phage communities from contig spectra) can be used to assess the biodiversity of uncultured viral communities by mathematically modeling the community structure using the contig spectrum of metagenome assemblies [126]. this method was extended to assess cross-assemblies of reads from different samples [65], providing a homology-independent tool for the comparison of metagenomes with a high proportion of unknown sequences. although phaccs may provide a glimpse of the composition of, and differences between, metagenomes, it has limited value for the characterization of novel viruses. two tools can be used to predict whether unknown sequences are from bacteriophages with lytic or lysogenic lifestyles. one such tool, described by deschavanne et al. [127], compares the genome signatures of query sequences against those of candidate host genomes in order to identify host-phage relationships and information about the phage lifestyle. the second method, phacts, depends on residual homology between the putative unknown sequence and sets of randomly selected viral proteins from known viruses (k mcnair et al., phacts: a computational approach to classifying the lifestyle of phages, unpublished data).
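the idea behind a genome-signature comparison can be sketched as follows: each sequence is summarized by its oligonucleotide (k-mer) frequency profile, and a query is matched to the candidate host whose profile is closest. this is only a minimal illustration under simplifying assumptions (plain k-mer frequencies, euclidean distance, toy sequences); it is not the procedure of deschavanne et al. [127], and all names below are hypothetical.

```python
import math
from collections import Counter
from itertools import product


def kmer_profile(seq, k=4):
    """Relative frequencies of all 4**k DNA k-mers, in a fixed lexicographic order."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[km] for km in kmers)  # k-mers with ambiguous bases are ignored
    return [counts[km] / total if total else 0.0 for km in kmers]


def profile_distance(p, q):
    """Euclidean distance between two frequency profiles (an assumed metric)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def closest_host(query, hosts, k=4):
    """Name of the candidate host whose k-mer profile is nearest to the query."""
    query_profile = kmer_profile(query, k)
    return min(
        hosts,
        key=lambda name: profile_distance(query_profile, kmer_profile(hosts[name], k)),
    )


# Toy sequences (hypothetical): one AT-rich and one GC-rich "host genome".
hosts = {
    "at_rich_host": "ATATATTTAAATATTA" * 10,
    "gc_rich_host": "GCGCGGCCGCGGGCCC" * 10,
}
query_contig = "ATTTATAATATAAATT" * 5
best = closest_host(query_contig, hosts, k=2)
```

real applications use longer sequences and larger k (tetranucleotides are common), and short contigs give noisy profiles, so any such assignment should be read as a hint rather than a classification.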
alternatively, viruses may be classified by basic sequence properties. for instance, the circularity of the contig, its oligonucleotide profile [128] , and the open reading frame (orf) structure (s akhter et al., phispy: a novel algorithm for finding prophages in bacterial genomes that combines similarity-based and composition-based strategies, under review) may all provide clues whether the unknown sequence could be from a potential novel virus. these properties can be combined into a prediction network used to classify viruses into lifestyle groups or taxonomic clades. although newly discovered viruses are often labeled 'novel,' the question remains whether these sequences represent truly novel viruses or ancient viruses that simply have never been observed before. the age of a sequence has traditionally been determined by multiple alignments of query sequences with their homologs and by calculating the divergence times from a common ancestral node on a phylogenetic tree. dates can be estimated using either a molecular clock [129] or by assigning a calibration date to a specific node in the tree based on fossil or other evidence [130] [131] [132] . for viral metagenomic sequences, however, building a phylogenetic tree is itself problematic because often the sequenced reads may represent non-overlapping subregions of an unknown viral genome. moreover, there is no fossil data available to calibrate the age of nodes in the tree. a promising approach might be to estimate divergence times from assembled viral contigs. de novo assembly allows non-overlapping regions to be combined into a single consensus sequence. for a given molecular clock, snp analysis of the contributing reads could provide an estimation of how long ago the sequenced reads diverged. such estimates may be critical when addressing the question of the origin of a newly identified infectious agent. until recently, virus discoveries were made in the context of disease etiology. 
thus, virus discovery studies were biased mainly because of the use of convenient samples available from patients. because of the difficulties involved, the investment of effort and resources required to isolate viruses often could not be justified outside the disease context. it is likely that the disease context has also led to the misconception that all viruses are pathogenic. this dogma was challenged by the discovery of viruses such as torque teno virus (ttv) and hepatitis g virus (gbv-c), which were originally associated with post-transfusion hepatitis [53, 133, 134, 135] and were subsequently shown to be classical examples of viral commensals [136, 137]. the widely accepted notion that viruses act as obligatory pathogens is beginning to give way to the concept that viruses can be part of the normal flora of the human body. considering their high abundance in the gastrointestinal tract, on skin and even in blood and lungs [138], it is unlikely that viruses could only be pathogenic without any benefits for their hosts. the abundance of viruses, particularly phages, in the lung, an environment previously thought to be sterile, may reflect their beneficial role in keeping bacterial populations in check [138]. the view of gbv-c has shifted from pathogen to the more radical designation of a 'good' virus in cases of coinfection with hiv. indeed, gbv-c has been associated with a more favorable prognosis for patients with hiv infection by slowing the progression to aids [139, 140]. similarly, dengue virus, a known pathogen, has been shown to limit hiv-1 replication and to reduce the viral load [141]. these examples need to be taken into account when a metagenomic approach is applied to virus discovery. the characterization of a novel virus can be easily achieved in silico with limited bioinformatics tools but the determination of causation may not always be trivial.
causality is not always conclusive even when the suspect virus is found at the scene of the crime. in other words, finding a virus in a sample from a patient with an illness of unknown etiology and even demonstrating the association does not always prove causation. for this reason, strict guidelines proposed by robert koch and later modified by rivers [142] have been used to assign causality to infectious agents. one of koch's postulates requires that the candidate etiological agent be isolated from a diseased organism and grown in pure culture. however, many viruses cannot be propagated by culture techniques [143]. new molecular biology techniques have been used for virus discovery, bypassing the prerequisites of koch's postulates. for instance, the merkel cell polyomavirus (mcv) was identified as the causative agent of merkel cell carcinoma without satisfying all of the requisites of koch's postulates [144]. similarly, the sea turtle tornovirus 1 was associated with fibropapillomatosis using a culture-independent metagenomic approach [118]. the methodological shift, from culture to metagenomics, will likely create a paradigm shift in the demonstration of disease causation. in many instances koch's postulates will no longer be satisfied if culture techniques are used to prove causality. falkow [145] proposed the modified koch's postulates, which use molecular methods to monitor the role played by genes in distinct bacterial virulence. to satisfy the revised molecular koch's postulates, a strong association must be established between the phenotype or property under investigation and the pathogenic members of a genus or pathogenic strains of a species. the gene of interest should be found in all pathogenic members of the genus or species but be absent in nonpathogenic strains. at best, the nonpathogenic strains could carry the gene with critical mutations that could render the strain non-virulent.
however, new molecular methods do not always distinctively characterize virulence genes and make a clear association with a disease of unknown etiology. this could be because genes can be expressed at different time-points during infection. genes can be turned on and off and may require intrinsic factors in order to trigger the disease process. alternatively, we propose the metagenomic koch's postulates, which focus on the identification of metagenomic traits in disease subjects. the metagenomic traits are molecular markers such as sequence reads, assembled contigs, genes or full-genomes that can uniquely distinguish diseased metagenomes from those obtained from matched healthy control subjects (figure 4) . the metagenomic traits found in diseased patients can be monitored in healthy individuals exposed to the suspected infectious agent. although this novel approach requires separation or isolation of remaining co-occurring disease candidates (figure 4.3) , it does not necessarily require the isolation of the pathogen in tissue culture or pure culture media unlike the original koch's postulates. therefore, the genetic make-up of the agent responsible for a disease can provide early clues before its isolation by tissue culture. the modified metagenomic koch's postulates proposed in this paper require that: firstly, the diseased metagenome be significantly different from the metagenome constructed with the same sample type obtained from a healthy matched control subject. the suspected metagenomic traits must be present and more abundant in the diseased subject compared to matched control (figure 4.1) . secondly, inoculating a healthy individual with a sample from a diseased subject must result in disease state (figure 4 .2). 
differential metagenomic traits in step (1) recovered in the newly induced diseased subject may be the biomarker of the candidate etiological agent; and finally, selective inoculation of samples from the disease subject (in step 2) must induce disease in another healthy control subject if the metagenome contains the trait associated with the etiological agent of the disease, or phenotype under investigation (figure 4.3). assuming that the metagenomic trait 'e' (figure 4.3) is a contig sequence from a previously unknown and unculturable virus, its early identification using the metagenomic approach could spearhead the effort to generate diagnostic assays such as elisa and pcr, well before the isolation and the characterization of the viruses by culture techniques. fulfilling this metagenomic model of koch's postulates is possible when one or multiple viral agents are involved in disease causation. with the original koch's postulates or the modified molecular koch's postulates, it is difficult enough to prove causality with one suspected agent using the culturing prerequisite. the complexity is even greater when multiple viruses are involved in the causation of a disease. a similar approach, the sirna-ome, used previously by kreuze et al. [84], led to the detection of etiological viruses causing diseases in plants despite the low copy number of the suspected traits. the modified metagenomic koch's postulates could also be tested in diseases such as the murine mink cell leukemia caused by a c-type retrovirus, named the mink cell focus-inducing virus (mcfiv) [146]. mcfiv requires cooperative interaction with other viruses to increase its propensity to cause leukemia [146]. another example is the burkitt's lymphoma caused by epstein-barr virus (ebv) in regions holoendemic for plasmodium falciparum, the etiological agent of malaria [147].
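the first two steps of the metagenomic koch's postulates listed above can be sketched in python as a simple filtering exercise. this is a minimal sketch under stated assumptions: each metagenome is reduced to a dictionary mapping a trait label to an abundance, 'more abundant' is approximated by a hypothetical fold-change threshold (min_fold), and all names and numbers are invented for illustration. a real study would need replicates and a proper statistical test.

```python
def differential_traits(diseased, control, min_fold=2.0):
    """Step 1: traits present in the diseased profile and at least
    min_fold more abundant than in the matched control.
    Traits absent from the control also qualify."""
    selected = set()
    for trait, abundance in diseased.items():
        if abundance <= 0:
            continue
        baseline = control.get(trait, 0.0)
        if baseline == 0 or abundance / baseline >= min_fold:
            selected.add(trait)
    return selected


def candidate_traits(diseased, control, induced, min_fold=2.0):
    """Step 2: keep only the step-1 traits that are recovered in the
    newly induced diseased subject."""
    step1 = differential_traits(diseased, control, min_fold)
    return {trait for trait in step1 if induced.get(trait, 0.0) > 0}


# Hypothetical abundance profiles (arbitrary units).
diseased_profile = {"a": 10, "b": 5, "e": 8}
control_profile = {"a": 9, "b": 5}
induced_profile = {"a": 7, "e": 6}

print(candidate_traits(diseased_profile, control_profile, induced_profile))
```

with these toy profiles only trait 'e' passes both steps, because 'a' and 'b' are about as abundant in the control as in the diseased subject.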
metagenomics could become the future method of choice, enabling the simultaneous analysis of multiple agents in a sample and assessment of the association and disease causality without the limitations imposed by culture techniques [138, 148, 149]. the etiology of many diseases remains unknown. these ailments are collectively defined as diseases of unknown etiology when all conventional laboratory testing techniques are unsuccessful. yet, the diseases with unknown origin have high rates of morbidity and mortality. for example, as many as 40% of cases of infantile diarrhea, which alone claims 1.8 million fatalities annually, have no known specific causative agent [112]. infantile diarrhea, the pyrexia of unknown origin, influenza-like illnesses, chronic fatigue syndrome, alzheimer's disease, various forms of tumors such as diffuse large b-cell lymphoma and many other diseases of unknown origin can benefit directly from the metagenomic technology. the success of metagenomics in identifying novel viruses in a wide variety of samples opens doors to new application areas, particularly in public health and the prevention of infectious diseases. although metagenomic technology is not yet part of routine diagnostics, results from clinical virology research provide valuable proof of concept for a new era in clinical virology practices. for example, finkbeiner et al. analyzed samples from 12 children using metagenomics and identified a large number of known eukaryotic viruses as well as sequences from putatively novel viruses [112]. another study identified a novel picornavirus, the human cosavirus e1 (hcosv-e1), in a child with acute diarrhea [150]. these initial studies identified promising viral candidates to establish the etiology in these cases of diarrhea. the 2009 pandemic of influenza a (2009 h1n1) provided proof of concept in that metagenomics was effective in rapidly characterizing the full genome of the flu virus [151].
using the metagenomic approach, palacios et al. discovered an arenavirus in samples which had tested negative by culture, pcr, serology and a microarray assay using oligonucleotide probes from a wide range of infectious agents [87], suggesting a potential causative agent for unexplained cases of post-transplantation death. in another study, towner et al. described a new ebola virus responsible for an outbreak of hemorrhagic fever in the district of bundibugyo, uganda [152]. rapid identification of these agents would provide the blueprint for the development of a therapeutic regimen or preventive vaccine. prevention is better than cure. potentially, a single jump or multiple jumps of an animal virus to humans can have serious consequences. one way to prevent infectious diseases is through vaccine development. but the development of a vaccine takes time and demands a huge amount of resources. preventing the introduction of an unknown virus to human populations is a rather far-reaching goal unless methods of virus identification and characterization are put in place. a simple and practical strategy would be to assess the danger posed by viruses that thrive in animals and could cross to humans through zoonosis. zoonosis is a source of up to 75% of emerging infectious diseases in humans [153]. as such, cross-species transfer from animals to humans has serious repercussions not only for public health but also for socio-economic and political stability [68, [154] [155] [156] [157] [158]. the detection and characterization of novel viruses are of paramount importance in the forecasting of future outbreaks of viral diseases in humans. surveying natural reservoirs for potential zoonotic infection [69] and human populations such as bush meat hunters who are exposed to animals could help prevent major outbreaks before viruses spread widely in the human population.
data obtained from the early identification of viruses are valuable for forecasting new emerging and re-emerging viral epidemics. the experience gained from studying marine environments and hostile mine environments can be applied in public health programs that seek to determine the normal viral population and monitor changes in different geographical settings. we have termed such an approach public health viral metagenomics surveillance (phvms). viral metagenomics surveillance is defined as the survey of the functional and taxonomic signatures representing the viruses normally circulating within a population in the absence of noticeable epidemics. in the event of a zoonotic outbreak, these functional and taxonomic signatures of the virome will likely show detectable shifts. figure 5 shows a hypothetical rank abundance curve for six viruses (a-f). the introduction of a highly pathogenic species (g) can be expected to result in a disruption of the normal virome, including the appearance of opportunistic viral infections (h). using phaccs analysis [126], several parameters can be compared between the normal and disturbed viromes, including the total number of viral species (richness) and their relative abundance (evenness). another approach would be to determine the normal virome, a background viral metagenome to refer to in case of an outbreak. lessons learned from studies of bacterial microbial metagenomes suggest that different environments often have different microbial signatures [159], including the functional metabolic information, the nucleotide usage and the proportion of different species. disrupting key metabolic processes of an environment can lead to disruption of the balance in that ecosystem. similarly, the viromes in different human populations in different locations may display functional profiles characteristic of their respective environment, lifestyle and viruses circulating in each region.
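the richness and evenness comparison described above can be made concrete with a short python sketch. it does not reproduce phaccs, which models community structure from contig spectra; it only computes richness, shannon-based evenness and a rank abundance ordering from abundance tables that are assumed to be already known. the abundance values for viruses a-h are invented to mirror the hypothetical scenario of figure 5.

```python
from math import log


def richness(abundances):
    """Number of viral species with non-zero abundance."""
    return sum(1 for a in abundances.values() if a > 0)


def shannon_evenness(abundances):
    """Shannon entropy divided by its maximum, ln(richness); 1.0 means a
    perfectly even community. Returns 0.0 when fewer than two species
    are present."""
    counts = [a for a in abundances.values() if a > 0]
    if len(counts) < 2:
        return 0.0
    total = sum(counts)
    entropy = -sum((c / total) * log(c / total) for c in counts)
    return entropy / log(len(counts))


def rank_abundance(abundances):
    """Species sorted from most to least abundant."""
    present = [(name, a) for name, a in abundances.items() if a > 0]
    return sorted(present, key=lambda item: item[1], reverse=True)


# Hypothetical viromes before and during an epidemic.
normal_virome = {"a": 40, "b": 25, "c": 15, "d": 10, "e": 6, "f": 4}
epidemic_virome = dict(normal_virome, g=120, h=20)

for label, virome in (("normal", normal_virome), ("epidemic", epidemic_virome)):
    print(label, richness(virome), round(shannon_evenness(virome), 3),
          rank_abundance(virome)[0][0])
```

in this toy example the newly dominant species g raises richness while lowering evenness, which is the kind of shift a surveillance program would look for.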
the magnitude of disturbance of the virome profile will depend on the fitness and virulence of the newly introduced pathogens and the immune fitness of the host. the viral communities in two different metagenomes can be compared using xipe [160] . this statistical approach was developed for comparing metagenomic sequences derived from samples collected from the sargasso sea and from acid mine drainage and was able to accurately predict the physiology, metabolic potential and ecology of each ecosystem [160] . during the last decade, we have witnessed the emergence of metagenomics as a powerful novel tool with endless areas of applications in virology. epidemiological data suggest that novel viruses are likely to be introduced into the human population through zoonosis [153, 158] . also, the danger of intentional introduction of viruses through bioterrorism cannot be ignored. viral metagenomics is a powerful, fast and sensitive technique available for identifying viruses including those that cannot be detected by conventional culture and sequence-dependent methods. monitoring of emerging infectious diseases using a metagenomic approach. a hypothetical example of the potential use of the public health viral metagenomics surveillance (phvms) approach for virus discovery based on comparison of viromes sampled before (i) and during (ii) an epidemic. depicted here are the rank abundance curves for viral species (a-h), where g represents a newly introduced, highly pathogenic species and h a less virulent virus. 
papers of particular interest, published within the period of review, have been highlighted as: of special interest of outstanding interest high abundance of viruses found in aquatic environments prokaryotes: the unseen majority consider something viral in your research analysis of a marine picoplankton community by 16s rrna gene cloning and sequencing comparing bacterial communities inferred from 16s rrna gene sequencing and shotgun metagenomics bacterial 16s sequence analysis of severe caries in young permanent teeth a renaissance for the pioneering 16s rrna gene a comparison of random sequence reads versus 16s rdna sequences for estimating the biodiversity of a metagenomic library metagenomics -the key to the uncultured microbes a census of rrna genes and linked genomic sequences within a soil metagenomic library the phage proteomic tree: a genomebased taxonomy for phage metagenomics: genomic analysis of microbial communities biotechnological prospects from metagenomics opportunities to improve fiber degradation in the rumen: microbiology, ecology, and genomics cloning the soil metagenome: a strategy for accessing the genetic and functional diversity of uncultured microorganisms next-generation dna sequencing techniques a history of poliomyelitis. yale studies in the history of science and medicine the discovery of the poliovirus smallpox and its eradication the greatest killer -smallpox in history discovery of the first virus, the tobacco mosaic virus: 1892 or 1898? 
methods to detect infectious human enteric viruses in environmental water samples role of cell culture for virus detection in the age of technology rapid viral diagnostic techniques delwart el: viral metagenomics a very comprehensive description of metagenomic methods and important benchmarks achieved in the virus discovery effort isolation of a cdna clone derived from a blood-borne non-a, non-b viral hepatitis genome an assay for circulating antibodies to a major etiologic virus of human non-a, non-b hepatitis ethical reflections on edward jenner's experimental treatment metagenomics for the discovery of novel human viruses a comprehensive review describing metagenomic methods and important benchmarks achieved identification of a novel clade of human immunodeficiency virus type 1 in democratic republic of congo a novel simian immunodeficiency virus from black mangabey (lophocebus aterrimus) in the democratic republic of congo characterization of a novel simian immunodeficiency virus (sivmonng1) genome sequence from a mona monkey (cercopithecus mona) isolation and partial characterization of a lentivirus from talapoin monkeys (myopithecus talapoin) a novel simian immunodeficiency virus (sivdrl) pol sequence from the drill monkey, mandrillus leucophaeus isolation of a cdna from the virus responsible for enterically transmitted non-a, non-b hepatitis quantitative monitoring of gene expression patterns with a complementary dna microarray viral discovery and sequence recovery using dna microarrays identification of a novel gammaretrovirus in prostate tumors of patients homozygous for r462q rnasel variant prostate cancer: xmrv -contaminant, not cause? 
no association of xenotropic murine leukemia virus-related viruses with prostate cancer reliability and reproducibility issues in dna microarray measurements efficient isolation of genes differentially expressed on cellulose by suppression subtractive hybridization in agaricus bisporus virus discovery by sequenceindependent genome amplification suppression subtraction hybridization (ssh) and macroarray techniques reveal differential gene expression profiles in brain of sea bream infected with nodavirus suppression subtractive hybridization: a versatile method for identifying differentially expressed genes identification of herpesvirus-like dna sequences in aids-associated kaposi's sarcoma a novel dna virus (ttv) associated with elevated transaminase levels in posttransfusion hepatitis of unknown etiology identification of two flavivirus-like genomes in the gb hepatitis agent stat1-dependent innate immunity to a norwalk-like virus sequence-independent, single-primer amplification (sispa) of complex dna populations metagenomics and the molecular identification of novel viruses viruses in the faecal microbiota of monozygotic twins and their mothers hepatitis e virus (hev): the novel agent responsible for enterically transmitted non-a, non-b hepatitis the isolation and characterization of a norwalk virus-specific cdna identification of a novel astrovirus (astrovirus va1) associated with an outbreak of acute gastroenteritis detection of a novel astrovirus in brain tissue of mink suffering from shaking mink syndrome by use of viral metagenomics a virus discovery method incorporating dnase treatment and its application to the identification of two bovine parvovirus species laboratory procedures to generate viral metagenomes an excellent compilation of standard operating procedures to perform metagenomic analysis on different types of samples the marine viromes of four oceanic regions method for discovering novel dna viruses in blood using viral particle selection and 
shotgun sequencing analysis of the virus population present in equine faeces indicates the presence of hundreds of uncharacterized virus genomes multiple diverse circoviruses infect farm animals and are commonly found in human and chimpanzee feces bat guano virome: predominance of dietary viruses from insects and plants plus novel mammalian viruses viral diversity and dynamics in an infant gut rna viral community in human feces: prevalence of plant pathogenic viruses viral communities associated with healthy and bleaching corals metagenomic analysis of stressed coral holobionts assembly of viral metagenomes from yellowstone hot springs using pyrosequencing to shed light on deep mine microbial ecology microbes and health sackler colloquium: metagenomic detection of phage-encoded platelet-binding factors in the human oral cavity extraction of high molecular weight genomic dna from soils and sediments rapid amplification of plasmid and phage dna using phi 29 dna polymerase and multiply-primed rolling circle amplification assessment of whole genome amplification-induced bias through highthroughput, massively parallel whole genome sequencing whole transcriptome amplification for gene expression profiling and development of molecular archives single virus genomics: a new tool for virus discovery flow cytometric detection of viruses dna sequencing with chainterminating inhibitors complete viral genome sequence and discovery of novel viruses by deep sequencing of small rnas: a generic method for diagnosis, discovery and sequencing of viruses arbovirus detection in insect vectors by rapid, highthroughput pyrosequencing isolation and characterization of solenopsis invicta virus 3, a new positive-strand rna virus infecting the red imported fire ant, solenopsis invicta a new arenavirus in a cluster of fatal transplant-associated diseases genomic and phylogenetic characterization of merino walk virus, a novel arenavirus isolated in south africa parallel tagged sequencing on the 
454 platform targeted high-throughput sequencing of tagged nucleic acid samples the history of pyrosequencing a new method of sequencing dna the not so universal tree of life or the place of viruses in the living world reasons to include viruses in the tree of life viral genomes are part of the phylogenetic tree of life there is no such thing as a tree of life (and of course viruses are out!) quality control and preprocessing of metagenomic datasets fast identification and removal of sequence contamination from genomic and metagenomic datasets tagcleaner: identification and removal of tag sequences from genomic and metagenomic datasets basic local alignment search tool the minimum information about a genome sequence (migs) specification get the most out of your metagenome: computational analysis of environmental sequence data cloning of a human parvovirus by molecular screening of respiratory tract samples metagenomic analysis of coastal rna virus communities identification of a third human polyomavirus metagenomic characterization of chesapeake bay virioplankton a metagenomic survey of microbes in honey bee colony collapse disorder metagenomic and small-subunit rrna analyses reveal the genetic diversity of bacteria, archaea, fungi, and viruses in soil biodiversity and biogeography of phages in modern stromatolites and thrombolites viral genome sequencing by random priming methods metagenomic analysis of human diarrhea: viral detection and discovery novel borna virus in psittacine birds with proventricular dilatation disease rapid identification of known and new rna viruses from animal tissues next-generation sequencing and metagenomic analysis: a universal diagnostic tool in plant virology genetic detection and characterization of lujo virus, a new hemorrhagic fever-associated arenavirus from southern africa the complete genome of klassevirus -a novel picornavirus in pediatric stool discovery of a novel single-stranded dna virus from a sea turtle fibropapilloma by 
using viral metagenomics novel anellovirus discovered from a mortality event of captive california sea lions metagenomic analysis of viruses in reclaimed water novel circular dna viruses in stool samples of wild-living chimpanzees novel picornavirus in turkey poults with hepatitis the fecal virome of pigs on a high-density farm systematic artifacts in metagenomes from complex microbial communities metagenomics: facts and artifacts, and computational challenges phaccs, an online tool for estimating the structure and diversity of uncultured viral communities using metagenomic information the use of genomic signature distance between bacteriophages and their hosts displays evolutionary relationships and phage growth cycle determination metagenomic signatures of 86 microbial and viral metagenomes molecular dating in the evolution of vertebrate poxviruses genomic fossils calibrate the long-term evolution of hepadnaviruses fossil record of an archaeal hk97-like provirus r8s: inferring absolute rates of molecular evolution and divergence times in the absence of a molecular clock detection of a novel dna virus (ttv) in blood donors and blood products prevalence of gbv-c and hepatitis g virus variants in patients with fulminant hepatic failure in japan a prospective study of transfusion-transmitted gb virus c infection: similar frequency but different clinical presentation compared with hepatitis c virus transfusion transmission of highly prevalent commensal human viruses chronic viral hepatitis in hemodialysis patients metagenomic analysis of respiratory tract dna viral communities in cystic fibrosis and non-cystic fibrosis individuals persistent gb virus c infection is associated with decreased hiv-1 disease progression in the amsterdam cohort study gbv-c/hepatitis g virus (hgv) rna load in immunodeficient individuals and in immunocompetent individuals decrease in human immunodeficiency virus type 1 load during acute dengue fever viruses and koch's postulates 
sequence-based identification of microbial pathogens: a reconsideration of koch's postulates clonal integration of a polyomavirus in human merkel cell carcinoma significant paradigm change and a challenge to the existing koch's postulates and the proposal to use molecular methods to assign etiology to infectious agents a virus-virus interaction circumvents the virus receptor requirement for infection by pathogenic retroviruses etiology of endemic burkitt's lymphoma deep sequencing analysis of rnas from a grapevine showing syrah decline symptoms reveals a multiple virus infection that includes a novel virus the prostate cancer-associated human retrovirus xmrv lacks direct transforming activity but can induce low rates of transformation in cultured cells identification of a novel picornavirus related to cosaviruses in a child with acute diarrhea a metagenomic analysis of pandemic influenza a (2009 h1n1) infection in patients from north america newly discovered ebola virus associated with hemorrhagic fever outbreak in uganda risk factors for human disease emergence emerging disease: looking for trouble bushmeat hunting, deforestation, and prediction of zoonoses emergence emergence of unique primate t-lymphotropic viruses among central african bushmeat hunters naturally acquired simian retrovirus infections in central african hunters applying the theory of island biogeography to emerging pathogens: toward predicting the sources of future emerging zoonotic and vector-borne diseases functional metagenomic profiling of nine biomes an application of statistics to comparative metagenomics new dna viruses identified in patients with acute viral infection syndrome novel, divergent simian hemorrhagic fever viruses in a wild ugandan red colobus monkey discovered using direct pyrosequencing we are grateful to merry youle for helpful suggestions and editing of the manuscript. 
jm was supported by a grant from the ucsd center for aids research (niaid 5 p30 ai36214) and moores ucsd cancer center (nci 5p30 ca23100). bed was supported by the dutch science foundation (nwo) veni grant (016.111.075). key: cord-348427-worgd0xu authors: hatcher, eneida l.; zhdanov, sergey a.; bao, yiming; blinkova, olga; nawrocki, eric p.; ostapchuck, yuri; schäffer, alejandro a.; brister, j. rodney title: virus variation resource – improved response to emergent viral outbreaks date: 2017-01-04 journal: nucleic acids res doi: 10.1093/nar/gkw1065 sha: doc_id: 348427 cord_uid: worgd0xu the virus variation resource is a value-added viral sequence data resource hosted by the national center for biotechnology information. the resource is located at http://www.ncbi.nlm.nih.gov/genome/viruses/variation/ and includes modules for seven viral groups: influenza virus, dengue virus, west nile virus, ebolavirus, mers coronavirus, rotavirus a and zika virus. each module is supported by pipelines that scan newly released genbank records, annotate genes and proteins and parse sample descriptors and then map them to controlled vocabulary. these processes in turn support a purpose-built search interface where users can select sequences based on standardized gene, protein and metadata terms. once sequences are selected, a suite of tools for downloading data, multi-sequence alignment and tree building supports a variety of user directed activities. this manuscript describes a series of features and functionalities recently added to the virus variation resource. genome sequences have the potential to define evolutionary relationships, elucidate disease determinants and inform public health policy decisions. the public databases that comprise the international nucleotide sequence database consortium (insdc) are an invaluable resource to a variety of genome-related sequence analysis projects (1) . 
this collaboration between the national center for biotechnology information (ncbi), the european bioinformatics institute and the dna databank of japan supports free and unrestricted access to stored sequence data that are maintained as part of the scientific record. as nucleotide sequencing efforts extend into the future, the archival insdc databases will support comparisons between samples collected over generations and provide infrastructure to study the evolution and impact of viruses in real time. despite this potential, there are fundamental issues with archival databases that can only be resolved through resources that provide enhanced data such as the ncbi virus variation resource (http://www.ncbi.nlm.nih.gov/genome/viruses/variation/), which is described in this manuscript. genbank records (2) and other insdc sequence records are archival by design, and changes to them can be made only by one of the original submitters. hence, it is likely that the gene and protein annotations and information about the source of the sequence will remain unchanged after a sequence is deposited in an insdc database. this is problematic because even if communities develop sequence annotation standards, the pace of biochemical and genetic research effectively guarantees that annotations become outdated as new genetic features are characterized and naming conventions change. for example, while it has been known for some time that flavivirus genomes encode a polyprotein that is cleaved into mature peptides, sometimes with two rounds of cleavage (3) (4) (5) (6), recently, several flavivirus proteins have been identified that are translated (at least partially) from alternative reading frames (7). these alternative reading frame proteins and mature peptides, especially the products of the second round of cleavage, are not annotated in the vast majority of current genbank records for flavivirus genomes.
the limitations of an archival database can be illustrated by considering a common way in which it might be used: to obtain all of the nucleotide sequences that encode a particular gene of interest. take, for example, the rna-dependent rna polymerase (rdrp) of the ebolavirus. one would need to know that this gene is also sometimes called l-protein or l-polymerase and search the database with all three names to find all relevant protein sequences. in addition, not all genes or proteins are annotated in all database entries, so one would still likely miss some potential sequences. alternatively, a nucleotide blast search could be performed using the rdrp coding region from the zaire ebolavirus reference sequence (refseq accession number nc_002549.1). however, when matching sequences are obtained, there would still be no indication of potential problems with the sequences, such as frameshifts, which may affect the biological function of the resulting protein. even when an annotation pipeline is available to validate retrieved sequences, several additional steps would be needed to associate metadata, such as country of isolation or host, to the sequences. issues regarding the long term usability of sequence data were addressed in the ncbi influenza virus resource (8). this resource leveraged machine processing of genbank records, human curation and a unique search and retrieval interface to build a value-added user experience where researchers could search for sequences using defined, standardized terms (table 1). an annotation pipeline was added later to standardize gene and protein annotation and nomenclature across all sequences. this feature not only supports standardized annotation of sequences when submitted, but also provides a mechanism to update previously submitted sequences as new genes and proteins are described.
in many ways, the ncbi influenza virus resource paved a path for a variety of other resources that share the common goal of making viral sequence data more accessible (9) (10) (11) (12). these include the ncbi virus variation resource where the influenza virus resource data model was extended to include dengue and west nile viruses (13, 14). while the initial release of this resource provided a range of functionalities, the necessity of in-house annotation pipelines and internally developed tools imposed long development cycles, making it difficult to quickly provide new modules in response to emerging outbreaks and associated nucleotide sequencing efforts. here, we document a series of updates and improvements designed to make viral sequences more easily accessible and usable through the virus variation resource, a value-added database, as well as tools that make it simple to analyze genomic relationships. the resource now includes expanded data processing pipelines and analysis tools, and supports selection and retrieval of nucleotide and protein sequences from four new viral groups: ebolaviruses, mers coronavirus, rotavirus, and zika virus (table 2). the latest package of updates includes a variety of features designed to improve data usability and ease data retrieval. new processes have been added to parse source descriptor terms from genbank records and map these to controlled vocabulary, and the resource now supports retrieval of sequences based on standardized isolation source and host terms in addition to standardized gene and protein names. a new set of filters has also been developed to identify laboratory isolates, vaccine strains or environmental samples so that they can be included or excluded from searches.
a variety of updates have been made to the search interface and results table to better leverage these features, and a new set of multi-sequence alignment and tree building tools has been implemented to allow robust analysis of retrieved sequences. the ncbi virus variation resource provides users with a convenient way in which to search, download, and analyze viral nucleotide and protein sequences. the resource includes data processing pipelines that retrieve sequences from genbank, provide standardized gene and protein annotation, and map sequence source descriptors (i.e. metadata) to uniform vocabularies. this data processing enables users to select sequences based on standardized gene, protein and metadata terms using a purpose-built interface. once selected, sequences can then be downloaded with the standardized metadata in a variety of formats or analyzed using web-based alignment and tree building tools. there are currently seven discrete virus variation modules (dengue virus, ebolavirus, influenza virus, mers coronavirus, rotavirus a, west nile virus, and zika virus), and these include a total of nearly 550 000 nucleotide sequences (see table 2). example usages of the resources for dengue virus, ebolavirus, and rotavirus are klema et al. current development efforts have focused on expanding the virus variation model to include more viruses, enhancing the functionality of the resource and providing rapid support to emergent sequencing efforts. this last point has been particularly relevant over the past several years as emerging viral outbreaks of ebola and zika viruses and others have quickly led to large sequencing efforts. there was a clear need to support these sequencing efforts with bioinformatics resources, but timelines prevented traditional development paths where new virus modules and features were added over the course of months.
the first rapid deployment of a virus variation module was during the western african ebola virus outbreak that began in december of 2013. the outbreak was declared a public health emergency of international concern by the world health organization on august 8, 2014 (http://www.who.int/mediacentre/news/statements/2014/ebola-20140808/en/). by september, a virus variation resource specific to ebolaviruses was available to help access the sequences that had begun to pour into the insdc databases. similarly, a virus variation resource module was developed in september 2014 in response to the outbreak of middle east respiratory syndrome-related coronavirus (mers-cov). most recently, this rapid response model was repeated for the zika virus module, which was put in place in march 2016. this need-based deployment strategy is likely a model for future efforts, and much of our current development is geared toward harmonizing processes and interfaces among individual data and software modules so as to provide more support for more virus species within the resource and to respond more efficiently to emergent large-scale sequencing efforts. accurate gene and protein annotation is necessary both to identify sequences of interest and to analyze them. the virus variation resource employs annotation pipelines that support consistent gene and protein naming. initial processing for each annotation pipeline is the same: newly released genbank records are retrieved hourly based on their listed taxonomy. retrieved sequences are compared to nucleotide references for that virus group using blastn, and the best match is determined (8, 13, 18). this step confirms species taxonomy, identifies segment assignment if applicable and provides information about the lineage, genotype, type or subtype. the references used are listed in table 3, and sequences that fail to match a reference within established metrics are pushed to a curation interface where they can be reviewed manually.
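the reference-matching step can be illustrated with a minimal sketch. it assumes blastn has already been run with the standard 12-column tabular output (-outfmt 6) and keeps the highest-scoring reference for each query. the identity and length thresholds and the function name are hypothetical stand-ins for the 'established metrics', which the text does not specify.

```python
import csv
import io

# Default column order of BLAST+ tabular output (-outfmt 6).
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def best_reference_hits(tabular_text, min_identity=80.0, min_length=200):
    """Pick the best-scoring reference per query sequence.

    Returns (assigned, curation): a dict mapping each query to its best
    reference, and a list of queries whose best hit misses the thresholds.
    The threshold values are illustrative assumptions, not the resource's.
    """
    best = {}
    reader = csv.DictReader(io.StringIO(tabular_text),
                            fieldnames=COLUMNS, delimiter="\t")
    for row in reader:
        query = row["qseqid"]
        # Keep only the hit with the highest bit score for each query.
        if query not in best or float(row["bitscore"]) > float(best[query]["bitscore"]):
            best[query] = row

    assigned, curation = {}, []
    for query, row in best.items():
        if float(row["pident"]) >= min_identity and int(row["length"]) >= min_length:
            assigned[query] = row["sseqid"]
        else:
            curation.append(query)  # would be reviewed manually
    return assigned, curation
```

queries left in the second list correspond to the sequences that would be pushed to the curation interface.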
once a sequence has been matched to a reference, one of three pipelines is employed to determine the span of gene and protein features and to assign standardized names to these features. the first pipeline uses a reference protein guided approach based on the prosplign tool as described previously (8, 13, 18). here, protein reference sequences are aligned with potential translations of the query sequence. the highest scoring translation alignment to any protein reference is then chosen and parsed to determine that it meets specific criteria: the presence of a start codon, exact matches to mature peptide cleavage sites or premature stop sites. post-transcriptional and translational exceptions can be accounted for by this tool by adjusting parameters and allowing multiple transitions from different open reading frames to be assembled into a single alignment. one advantage of this approach is that new viruses can be incorporated by adding new reference protein sequences and adjusting the criteria used for validating a particular translation. such was the case for zika virus annotation where the existing dengue virus pipeline was updated with new zika virus reference sequences (see table 3). a second approach to gene and protein annotation was implemented in the ebola virus and mers coronavirus rapid deployment modules. here, there was a need to quickly develop a pipeline that could validate the annotation on genbank records and assign consistent gene and protein names so that these could be accurately used as search criteria. to accomplish this, a blast-based pipeline was developed that compares genes and proteins as annotated on genbank records to reference proteins derived from the best reference nucleotide match. if a protein matches the reference sequence with >70% identity as measured by blastp, then the presence of this protein is stored. genes are validated in the same manner using blastn and reference nucleotide sequences.
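the >70% identity rule of the blast-based pipeline can be sketched as below. this is an illustration under stated assumptions: percent identities are taken as already computed (for example by blastp), and the `FeatureHit` type, the function name and the example protein names are hypothetical.

```python
from typing import NamedTuple


class FeatureHit(NamedTuple):
    annotated_name: str      # name as written on the GenBank record
    reference_name: str      # standardized name of the matched reference protein
    percent_identity: float  # identity to the reference, e.g. from blastp


def validate_features(hits, threshold=70.0):
    """Split annotated features into validated ones and ones needing curation.

    A feature is validated only if its identity to the reference is strictly
    greater than the threshold, mirroring the '>70%' rule in the text.
    """
    validated, needs_curation = {}, []
    for hit in hits:
        if hit.percent_identity > threshold:
            # Store the feature under its standardized reference name.
            validated[hit.annotated_name] = hit.reference_name
        else:
            needs_curation.append(hit.annotated_name)
    return validated, needs_curation
```

validated entries give each record a consistent name that can be used as a search term, whatever the submitter originally called the feature.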
sequences with genes and proteins that cannot be validated are pushed to the curation interface where they can be manually examined. ultimately these approaches support both search and analysis functionality but are not capable of generating standardized annotation across all sequences belonging to a particular virus. our experience has emphasized the importance of accurate annotation pipelines that can be applied to new viruses rapidly in response to emergent needs. though our current pipelines are effective, they are also very specific to particular viruses, and application to new viruses requires much work developing reference sequences, defining processing parameters and manually reviewing annotation results. with that in mind we are now implementing a new, third approach to annotation that can be adapted rapidly when needed and is scalable to multiple virus groups. this new approach is built around two important considerations. first, it uses annotations contained within the so-called reference sequence records (19) that are created by our group to represent important taxonomic and sequence space groups. the nucleotide and protein sequences within these records can be invaluable for the unambiguous assignment of sequences to defined groups and can also serve as repositories of reference sequence feature annotation maintained by in-house curation efforts, often in collaboration with other scientists (20) (21) (22) (23) (24). second, this approach includes a comprehensive list of error flags that provide extensive information about sequences and can provide warnings about potential problems. this error coding not only allows staff to quickly sort through thousands of annotations during the development of new pipelines, but also gives resource users potential criteria for selecting or filtering sequences.
this new approach was used to annotate polyprotein and mature peptide genomic intervals in west nile virus (wnv), and this annotation will be available soon through the virus variation resource. these annotations were calculated as follows: first, genbank west nile sequences were classified as one of the two common lineages of wnv (lineage 1 or lineage 2) using a combination of blastn (25) against the two refseq sequences and expert knowledge. the principal characteristic that distinguishes lineage 1 from lineage 2 is that the additional protein warf4 occurs only in lineage 1 wnv genomes and is believed to occur in most of them (7). there is some evidence that a small proportion of wnv genomes do not fit neatly into lineage 1 or lineage 2 (7), but these were classified as lineage 2 in our annotations. second, the annotation pipeline built a covariance model (cm) for each of 16 mature peptides present in the nc_009942 refseq annotation and for the 15 mature peptides in the nc_001563 refseq. the cms are built using the cmbuild program of the infernal homology search software package (26). infernal is typically used for modeling the sequence and secondary structure of rnas, and because the sequences we are modeling lack structure (i.e. basepairs between positions), the cms we created are effectively identical to sequence-only profile hidden markov models. in the current version of our pipeline, each model was derived from the single refseq nucleotide sequence encoding each mature peptide. third, the cms built from the refseq to which that genome was assigned were used to predict each mature peptide coding sequence using infernal's cmscan program. the annotation software then runs a variety of validation checks and produces error codes that assist in curation of sequences. for example, the pipeline checks for the existence of any in-frame stop codons within the predicted regions.
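the in-frame stop codon check can be sketched as follows. this is an illustration rather than the pipeline's own code: coordinates are assumed to be 0-based and half-open, the standard stop codons taa, tag and tga are assumed, a stop at the very end of the region is reported like any other, and the function name is hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def first_in_frame_stop(sequence, start, end):
    """Return the position of the 5'-most in-frame stop codon, or None.

    The predicted region is sequence[start:end] and is read in the frame
    that begins at `start`. The returned value is the index (in the full
    sequence) of the first base of the stop codon.
    """
    # Accept RNA or DNA letters in either case; lengths are unchanged.
    region = sequence[start:end].upper().replace("U", "T")
    for offset in range(0, len(region) - 2, 3):
        if region[offset:offset + 3] in STOP_CODONS:
            return start + offset
    return None
```

a non-`None` result marks a predicted region that needs attention and could be recorded as one of the error codes.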
if one or more is found, the prediction boundaries are modified to terminate at the 5'-most stop found. coding sequence (cds) coordinates are determined implicitly based on the predicted mature peptide coordinates. lineage 1 (nc_009942) has three cdss and lineage 2 (nc_001563) has two cdss. for each cds, the predictions for the corresponding mature peptides that make up each cds are tested for consistency by ensuring that mature peptide coding sequences that are adjacent (separated by 0 nucleotides) in the refseq are also adjacent in the predictions. the start position of the first mature peptide and end position of the final mature peptide that comprise each cds are then used as the start and stop position for that cds. cds annotations are not made if the mature peptide consistency check fails. in addition to checking for early stop codons and the adjacency of mature peptide coding sequences, the annotation pipeline identifies other unusual or unexpected features in each sequence and reports those as 'error codes'. there are 17 possible error codes, which provide an easy way for users to gauge the quality of each sequence and its annotations, and should facilitate the selection of subsets of the sequence data that meet specific user-defined quality standards. a more detailed description of the new annotation pipeline and error flags will eventually be given in a separate manuscript, as well as in the help documents available at the virus variation resource. another important aspect of sequence analysis is to place a given sequence within biological, temporal and geospatial contexts. such associations can provide profound health policy and scientific insights, but unfortunately, descriptors that provide information about the source of nucleotide sequences are notoriously inconsistent.
to resolve this issue, the virus variation database loading pipeline parses genbank records, identifies important metadata terms, such as sample isolation host, date, country and source, and maps these to a standardized vocabulary using a hierarchical approach. for example, isolation host terms are first identified in the host field and failing that, then isolate or strain fields, then isolation source, note and finally organism name. this vocabulary mapping strategy follows the insdc practice of separating isolation host from source. in this convention host refers to an organism (and hence has an organism's name that can be mapped to the ncbi taxonomy tree) and isolation source refers to a physical, environmental or local geographic location (1). for human pathogens isolation source often refers to a host tissue or bodily fluid, and the virus variation vocabulary mapping strategy attempts to combine similar clinical terms into biologically relevant groups. for example, the parsed terms 'serum,' 'plasma' and 'lymphocytes' are all mapped to the standardized vocabulary term 'blood'. to support more efficient data retrieval, host terms are mapped in a hierarchy, and once a species term such as 'accipiter cooperii' is identified, it is mapped to both the group name 'bird' and the common name 'accipiter.' other metadata terms such as those for disease associations and clinical/laboratory manipulations are more difficult to parse. to this end, laboratory isolates, vaccine strains and environmental samples are identified by searching for key terms, such as 'tissue culture' or 'sewage,' from all fields. disease terms for dengue virus are also found using a similar strategy.
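the hierarchical mapping described above can be sketched in a few lines. the field order and the example terms come from the text; the dictionaries are tiny illustrative excerpts, the substring matching is a simplifying assumption, and the function names are hypothetical.

```python
# Order in which source fields are searched for an isolation host term.
HOST_FIELD_ORDER = ["host", "isolate", "strain", "isolation_source", "note", "organism"]

# Illustrative excerpts of the controlled vocabularies.
HOST_HIERARCHY = {
    "accipiter cooperii": {"group": "bird", "common_name": "accipiter"},
}
SOURCE_VOCABULARY = {"serum": "blood", "plasma": "blood", "lymphocytes": "blood"}


def find_host(record):
    """Return standardized host terms from the first field that yields a match."""
    for field in HOST_FIELD_ORDER:
        text = record.get(field, "").lower()
        for species, labels in HOST_HIERARCHY.items():
            if species in text:
                return {"species": species, **labels}
    return None  # no known term found: a candidate for manual curation


def map_isolation_source(term):
    """Map a parsed isolation source term to its standardized group, if known."""
    return SOURCE_VOCABULARY.get(term.strip().lower())
```

records for which `find_host` returns `None` are the ones that would need manual review and possibly new vocabulary entries.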
in all cases these strategies require extensive examination of sequence records and documentation of specific terms that can be accurately mapped to controlled vocabulary gleaned from established ontologies such as the environmental ontology (https://bioportal.bioontology.org/ontologies/envo) and the infectious disease ontology (https://bioportal.bioontology.org/ontologies/ido). this process is supported by a curation interface that lists records where parsing fails to identify expected terms, leading to good old-fashioned manual curation and the identification of new terms, common misspellings, regional spelling differences and the manual incorporation of metadata from relevant literature into the virus variation database. in total, these vocabulary remapping strategies can have a profound impact on data usability as large numbers of parsed terms can be mapped to controlled vocabularies (table 4). the virus variation annotation and metadata mapping pipelines create standardized terms that can then be leveraged by the resource search interface. a link to this interface can be found on the home page of each virus module, which also includes links to help documents, other ncbi resources, and relevant external resources (for an example, please see http://www.ncbi.nlm.nih.gov/genome/viruses/variation/dengue/). to access the search interface from the module home page, select the link to 'search nucleotide and protein sequences.' here, users can select between protein and nucleotide searches (see figure 1). when searching protein sequences, selecting the 'full-length sequences only' filter limits retrieved sequences to those with a complete coding region as determined by comparison to the relevant reference. the same filter limits nucleotide searches to full-length genomes, where the completeness of a given genome is operationally determined by comparing the genes/proteins present on a given sequence to those on the relevant, full-length reference genome.
currently, noncoding, terminal regions are not included in this determination. during both protein and nucleotide searches, users can define explicitly the genomic regions present on retrieved sequences using drop-down menus that support multiple selections. additionally, sequences can be filtered using standardized source metadata terms for host, region/country and isolation source using similar pull down menus. the host and country menus are arranged so that aggregate terms are listed in the top portion of the menu and more discrete terms below. in addition to these common filters, there are module-specific filters for species, types, and disease for ebolaviruses and dengue virus respectively. the influenza virus module also provides some module-specific search options. for example, a user can select 'full length only' to include sequences with complete coding regions or 'full length plus' to include sequences with complete coding regions, but no start and/or stop codon. several other specific filters are also available on the influenza module search interface, such as h and n subtypes, minimum or maximum sequence lengths, and inclusion or exclusion of pandemic h1n1 viruses. a second set of functions and filters is included within the 'additional filters' menu. here users can search for keywords in the genbank record deflines or strings within sequences. there are also filters to include or exclude laboratory isolates, vaccine strains, and environmental isolates. one can also select specific rotavirus segment types based on assignment by the rotavirus classification working group (27, 28), or select specific sequences by genbank accession. once the parameters for a specific search are selected, a user can choose to add the query to the query builder and define another search, or they can go directly to the results.
several searches can be run and added to the query builder where the combination of filters and number of retrieved sequences is displayed for each search. the number of unique sequences can be displayed using the 'collapse identical sequences' checkbox. individual searches can then be selected and/or combined and sent to the results page for further refinement and analysis.
figure 1. the ebolavirus module search interface with all elements opened and several example searches displayed in the query builder. the search page is divided into three elements. the first element supports selection of protein or nucleotide sequences based on standardized metadata terms generated by processing pipelines described in the text. menus support filtering of sequences based on gene or protein names, host, isolation country and isolation source, and collection and release date ranges can be set with text boxes. additional filters are accessible with a drop-down arrow revealing options for environmental or laboratory isolates, vaccine strains, keyword or sequence string searches, and optional menus tailored to specific viruses. the second element supports searches based on genbank accessions, either using the text box or by uploading a text file of accessions. the third element includes the query builder where the number of sequences retrieved from individual searches can be viewed by clicking one of the 'add query' buttons. when multiple searches are added to the query builder, the total number of unique sequence records is also summed. a checkbox is provided that allows identical sequences to be collapsed and represented by the oldest sequence on the results table. clicking the 'show results' button opens a separate browser tab and displays all of the sequences meeting the criteria in each of the checked queries in the results interface.
the results page supports selection of sequences from the search set for analysis or download. search parameters are displayed at the top of the results page, and a table displays retrieved sequences and associated metadata. the individual columns within the table can be selected to display specific sets of metadata and hyperlinked genbank and biosample accessions (29). biosample records store an extended set of sample descriptors and are linked to sequence read archive (sra) (30) records, allowing users to easily find sequence read data associated with retrieved genbank sequences when available. one new feature is the ability to collapse identical retrieved sequences for all viruses as described in the preceding section. when identical sequences are collapsed on the query page, they will be represented by a single sequence on the results page with the number of collapsed sequences shown in the 'identical sequences' column (see figure 2). clicking the arrows in the 'identical sequences' column displays the individual sequences and makes them selectable. users can now customize sequence titles including the fasta defline of downloaded sequences and tree labels using the 'customize label' tool. the defline can be modified to include various types of data such as the sequence accession number, calculated genomic region, host, isolation source, collection date or country, as well as field-separators such as pipes or slashes. user-selected titles will also be displayed in multi-sequence alignments and trees as described in the following section. users can build multiple sequence alignments or trees from selected sequences, and these in turn can be downloaded in various formats. the influenza module uses previously described tools for these functions (8, 13, 31), but a new set of tools has been developed for other viruses.
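the collapsing of identical sequences and the customizable deflines can be illustrated with a small sketch. it is not the resource's implementation: records are assumed to be plain dictionaries, 'oldest' is assumed to mean the earliest release date (given as an iso date string), the reported count includes the representative, and all names are hypothetical.

```python
from collections import defaultdict


def collapse_identical(records):
    """Group records with identical sequences; the oldest one represents each group.

    Each record is a dict with 'accession', 'sequence' and 'release_date'
    (ISO 'YYYY-MM-DD' strings, which sort chronologically as text).
    """
    groups = defaultdict(list)
    for record in records:
        groups[record["sequence"].upper()].append(record)

    collapsed = []
    for members in groups.values():
        members = sorted(members, key=lambda r: r["release_date"])
        collapsed.append({
            "representative": members[0],
            "identical_sequences": len(members),
            "members": [r["accession"] for r in members],
        })
    return collapsed


def make_defline(record, fields, separator="|"):
    """Build a FASTA defline from the selected metadata fields."""
    return ">" + separator.join(str(record.get(field, "")) for field in fields)
```

for instance, `make_defline(rec, ["accession", "host", "country"], separator="/")` would give a slash-separated title for a record `rec` holding those fields.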
multiple sequence alignments are constructed using an optimized version of muscle, and rooted trees are generated using the unweighted pair group method with six base nucleotide or amino acid k-mers (32) (see figure 3). the multiple sequence alignment display includes a navigation map above the alignment, a variation histogram and a consensus sequence. characters are colored to indicate variable positions. the alignment can be downloaded in fasta, clustal, phylip, nexus, or asn.1 formats. the tree display supports a variety of layouts including rectangular and slanted cladograms, radial trees and circular trees. the image can be downloaded as a pdf, and the tree file can be downloaded in asn text or binary, newick, or nexus formats. these options are accessible through the 'tools' menu in the viewer. the data labels on multi-sequence alignments and trees can be customized from the results table before the tree is calculated using the 'customize label' options, making it easier to identify the distribution of sample/sequence characteristics. when certain download formats are selected, customized labels will be included in the downloaded files (fasta and asn.1 for the multiple alignments, and all files for the trees). a url is also provided to make sharing a tree easy. the virus variation resource described here provides a number of features that improve the usability of archival sequence data. the resource now includes more than 20% of the genbank sequences that are assigned viral taxonomy. further improvement will be dependent on which viruses are added in the future and on updates to the various pipelines, interfaces and tools so that they can further support user needs. our plan is to increase the pace at which new virus species are added to the virus variation resource, and we are currently developing layers of data processing, the least transformative of which could be applied across all viral sequences but still provide basic information about a sequence.
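the alignment-free tree building mentioned earlier (the unweighted pair group method applied to six-base k-mers) can be sketched as follows. this is a simplified illustration, not the resource's tool: the k-mer distance used here (one minus the fraction of shared k-mers) is an assumption, branch lengths are omitted, and the function names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def kmer_counts(sequence, k=6):
    """Count all overlapping k-mers in a sequence."""
    sequence = sequence.upper()
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def kmer_distance(a, b, k=6):
    """Distance = 1 - shared k-mers / k-mers in the shorter sequence (assumed measure)."""
    counts_a, counts_b = kmer_counts(a, k), kmer_counts(b, k)
    shared = sum((counts_a & counts_b).values())  # multiset intersection
    smaller = min(sum(counts_a.values()), sum(counts_b.values()))
    return 1.0 - shared / smaller if smaller else 1.0


def upgma(names, sequences, k=6):
    """Cluster sequences with UPGMA and return a Newick topology string."""
    clusters = {i: (name, 1) for i, name in enumerate(names)}  # id -> (newick, size)
    dist = {frozenset(pair): kmer_distance(sequences[pair[0]], sequences[pair[1]], k)
            for pair in combinations(range(len(names)), 2)}
    next_id = len(names)
    while len(clusters) > 1:
        closest = min(dist, key=dist.get)          # pair of clusters to merge
        i, j = sorted(closest)
        (newick_i, size_i), (newick_j, size_j) = clusters.pop(i), clusters.pop(j)
        for other in clusters:
            # Size-weighted average of the distances to the merged clusters.
            d_i = dist.pop(frozenset((i, other)))
            d_j = dist.pop(frozenset((j, other)))
            dist[frozenset((next_id, other))] = (d_i * size_i + d_j * size_j) / (size_i + size_j)
        del dist[closest]
        clusters[next_id] = (f"({newick_i},{newick_j})", size_i + size_j)
        next_id += 1
    return next(iter(clusters.values()))[0] + ";"
```

because no alignment is required, a k-mer based tree of this kind can be computed quickly, which fits its use as a fast first look at the selected sequences.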
the search interface and data displays will be revised so that they better support user-required comparative genomic functions across a much larger number of viral species from the same query page. we also intend to support searches based on author names and more detailed sample information, such as clinical symptoms or laboratory handling. though we will begin parsing the potentially rich metadata sets from biosample records, the success of this effort will ultimately rest on improved community awareness and more consistent submission of metadata to public databases. given the unbridled growth and clear potential of nucleotide sequencing efforts, one must assume the current virus variation resource is just scratching the surface of future bioinformatic needs. the current resource model is suited to viruses that have experimentally validated annotation, and similar modules are in development for additional viral species. however, the vast majority of viruses do not have strong experimental evidence for protein coding regions, making it difficult to build a virus variation module including an annotation pipeline. in these cases annotation will need to be inferred from related, experimentally studied viruses, requiring new approaches and better ways of standardizing gene and protein information across multiple groups of viruses. our current annotation pipeline development is directed toward these goals, and we intend to extend public access to these pipelines beyond our current influenza virus module. we also intend to reveal resource-derived annotation as tracks on multiple sequence alignments, making annotated sequences available for download and improving access to our data sets. this will also enable users to limit downloads and multiple sequence alignments to selected mature peptides for polyprotein sequences, and trees to be built from selected genomic regions. finally, there are a variety of enhancements to our tools under development.
we are developing improved tree visualizations that support better search and markup functions, similar to those currently used in the influenza virus module. some limitations of the tree function will be addressed at a later time by giving the user the option of viewing the quick tree which is currently offered, or a more sophisticated combination of muscle-multiple sequence alignment and phylogenetic tree. we are also interested in supporting blast-based searches within our data sets to support more precise sequence associations. ultimately, the presumed very large sequencing datasets of the future will require better ways to evaluate data retrieved from searches which, in turn, will require better integration of search functions with data visualizations such as trees. members of the scientific community are encouraged to contact the ncbi help desk (ncbi-help@ncbi.nlm.nih.gov) to make suggestions to improve the virus variation resource, or to assist with establishing annotation or metadata standards.
figure 3. virus variation resource tree and multi-sequence alignment displays. (a) a sample tree is shown depicting the use of standardized metadata terms as sequence labels. the tree was built from 31 west nile virus complete polyprotein sequences collected since 2013. sequence labels are based on genbank accessions, host, country of isolation and isolation date. left clicking a node highlights the lineage, and hovering over a node with the cursor displays a menu that includes descriptors for that particular sample, including genbank accession and available standardized metadata terms for host, country, isolation source, etc. the menu also includes a function to reroot the tree around that sequence. (b) a multi-sequence alignment is shown for the same 31 west nile polyprotein sequences. individual genbank accessions are listed to the left next to sequences. left clicking the accession displays a menu that includes the standardized metadata label chosen in the results interface, a link to the sequence in genbank, and a function to use that sequence as an anchor for the alignment. differences between residues in a given sequence and the consensus are highlighted in red. a histogram above the alignment shows coverage in blue and the frequency of changes in red.
nucleic acids research, 2017, vol. 45, database issue d489
the international nucleotide sequence database collaboration
flaviviridae replication organelles: oh, what a tangled web we weave
natural history of hepatitis c
west nile virus
a glance at subgenomic flavivirus rnas and micrornas in flavivirus infections
west nile alternative open reading frame (n-ns4b/warf4) is produced in infected west nile virus (wnv) cells and induces humoral response in wnv infected individuals
the influenza virus resource at the national center for biotechnology information
hiv sequence compendium
database issue institute of allergy and infectious diseases bioinformatics resource centers: new assets for pathogen informatics
vipr: an open bioinformatics database and analysis resource for virology research
the papillomavirus episteme: a central resource for papillomavirus sequence data and analysis
virus variation resources at the national center for biotechnology information: dengue virus
virus variation resource-recent updates and future directions
dengue virus nonstructural protein 5 (ns5) assembles into a dimer with a unique methyltransferase and polymerase interface
genome sequence analysis of ebola virus in clinical samples from three british healthcare workers
genomic constellation and evolution of ghanaian g2p[4] rotavirus strains from a global perspective
flan: a web server for influenza virus genome annotation
reference sequence (refseq) database at ncbi: current status, taxonomic expansion, and functional annotation
uniformity of rotavirus strain nomenclature proposed by the rotavirus classification working group (rcwg)
towards viral genome annotation standards, report from the 2010 ncbi annotation workshop. viruses
microbial virus genome annotation-mustering the troops to fight the sequence onslaught
filovirus refseq entries: evaluation and selection of filovirus type variants, type sequences, and names
ncbi viral genomes resource
gapped blast and psi-blast: a new generation of protein database search programs
infernal 1.1: 100-fold faster rna homology searches
recommendations for the classification of group a rotaviruses using all 11 genomic rna segments
rotac: a web-based tool for the complete genome classification of group a rotaviruses
bioproject and biosample databases at ncbi: facilitating capture and organization of metadata
the sequence read archive: explosive growth of sequencing data
visualization of large influenza virus sequence datasets using adaptively aggregated trees with sampling-based subscale representation
muscle: multiple sequence alignment with high accuracy and high throughput
key: cord-310734-6v7oru2l authors: bolatti, elisa m.; zorec, tomaž m.; montani, maría e.; hošnjak, lea; chouhy, diego; viarengo, gastón; casal, pablo e.; barquez, rubén m.; poljak, mario; giri, adriana a. title: a preliminary study of the virome of the south american free-tailed bats (tadarida brasiliensis) and identification of two novel mammalian viruses date: 2020-04-09 journal: viruses doi: 10.3390/v12040422 sha: doc_id: 310734 cord_uid: 6v7oru2l bats provide important ecosystem services as pollinators, seed dispersers, and/or insect controllers, but they have also been found harboring different viruses with zoonotic potential. virome studies in bats distributed in asia, africa, europe, and north america have increased dramatically over the past decade, whereas information on viruses infecting south american species is scarce.
we explored the virome of tadarida brasiliensis, an insectivorous new world bat species inhabiting a maternity colony in rosario (argentina), by a metagenomic approach. the analysis of five pooled oral/anal swab samples indicated the presence of 43 different taxonomic viral families infecting a wide range of hosts. by conventional nucleic acid detection techniques and/or bioinformatics approaches, the genomes of two novel viruses were completely covered; they clustered into the papillomaviridae (tadarida brasiliensis papillomavirus type 1, tbrapv1) and genomoviridae (tadarida brasiliensis gemykibivirus 1, tbgkyv1) families. tbrapv1 is the first papillomavirus type identified in this host and the prototype of a novel genus. tbgkyv1 is the first genomovirus reported in new world bats and constitutes a new species within the genus gemykibivirus. our findings extend the knowledge about oral/anal viromes of a south american bat species and contribute to understanding the evolution and genetic diversity of the novel characterized viruses.
bats belong to the order chiroptera, which is the second-largest mammalian group, comprising 21 families and 1411 species distributed globally, with the exception of polar areas [1, 2]. approximately 25% of the world's bat species are endangered, causing concerns about the negative conservation impact and its influence on the ecosystem services these bats provide, such as arthropod regulation, seed dispersal, and pollination [1, 3]. on the other hand, certain specific aspects of bats (including their relatively long lifespan in relation to their body size [4], the reliance of some species on prolonged torpor [5], and flight) may make them suitable for hosting a wide variety of viruses [6], including zoonotic viruses highly pathogenic to humans [6], such as severe acute respiratory syndrome (sars)-related coronavirus, ebola virus, nipah virus, and hendra virus [7] [8] [9] [10]. nevertheless, little is known about their own pathogens [3]. 
in addition, the gregarious behavior of many bat species, such as free-tailed bats tadarida brasiliensis (i. geoffroy saint-hilaire, 1824), may facilitate rapid transmission of pathogens between bats and other species [6]. t. brasiliensis is the most abundant migratory and cosmopolitan species of the new world bats, widespread throughout the americas [11] [12] [13] and protected by international agreements [14]. using next-generation sequencing (ngs) technologies, an enormous variety of viral species and genotypes [15, 16] have been identified in the tissues and feces of bats mainly inhabiting asia [17, 18], africa [19, 20], europe [21, 22], and north america [23, 24]. on the other hand, the viromes of bats from south america remain understudied [25, 26]. for example, the identification of viruses infecting t. brasiliensis is principally limited to detection of specific viral families, such as rabies lyssavirus [27], alphacoronaviruses [28], polyomaviruses [29], circoviruses [30], and anelloviruses [31]. in order to contribute to the preservation of t. brasiliensis and to evaluate its possible role as a pathogen reservoir, greater efforts directed at identifying the viruses present in this species are needed. in this study we report a detailed description of two novel complete genome sequences, one describing a new papillomavirus type and the other representing a novel gemykibivirus species. in addition, we report a preliminary overview of the t. brasiliensis virome composition. altogether, our findings add to the knowledge of viral diversity in a south american bat species, providing insights for understanding their role as reservoirs, as well as their own pathogens, which may have consequences for the animals' health.
the bat colony investigated occupies the attic of the law school building at the universidad nacional de rosario in downtown rosario, argentina (32°56′36.76″ s, 60°39′02.09″ w) [32]. in this place, t. 
brasiliensis (molossidae) establishes a maternity colony every year that can reach about 30,000 individuals during the maternity season (november to march), after which they migrate [32, 33]. a total of 98 swab samples (49 oral and 49 anal) were collected from 49 adult female specimens inhabiting this colony from december 2015 to february 2016. briefly, bats were manually captured from the walls and held in individual cotton bags for determination of their species based on anatomical and morphological characters, reproductive condition, and age [33]. the oral cavity and anal regions of each individual were sampled using individual sterile cotton-tipped swabs (deltalab, barcelona, spain), rolled back and forth (10 times), suspended in 200 µl of saline solution (nacl 0.9%), and stored at 4 °c until further processing. the bats were rehydrated and released. during this study, every effort was made to minimize interference with and suffering of the animals; no breeding or pregnant females were captured, and no animals were sacrificed. sample collection was conducted by trained professionals as approved by the ministry of environment of the argentinian santa fe province (file 021010016257-1) and facultad de ciencias bioquímicas y farmacéuticas (universidad nacional de rosario) animal ethics committee (file 6060/243, 20 march 2015).
samples were processed according to previously published protocols that have been successfully applied for identification of papillomavirus (pv) in human skin swab samples [34] [35] [36]. briefly, the cells were centrifuged at 13,000× g for 5 min and the pellets were resuspended in 100 µl te buffer (qiagen, hilden, germany) containing 100 µg of proteinase k (qiagen), and incubated overnight at 55 °c. following proteinase k inactivation (95 °c for 10 min), the lysates were stored at −20 °c. 
subsequently, the obtained samples were tested for the presence of pv dna using improved versions of fap [37, 38] and cut pcrs [35], as described previously [36, 39]. circular dna molecules in lysates of five selected pv-positive samples (four anal and one oral swab) were enriched using rolling-circle amplification (rca) with the illustra templiphi 100 amplification kit (ge healthcare, chicago, il, usa) [40] [41] [42]. the pool of rca-enriched samples was sequenced on an illumina hiseq 4000 instrument at the sequencing facility of gatc biotech (ebersberg, germany). sequencing libraries were prepared using the gatc automatic library preparation approach, and the sequencing reads were generated in the 2 × 150 bp format. reads were subjected to quality trimming and filtering using the bbduk program (bbtools v38.42). end trimming was performed on the first and last 15 bases of each read, clipping bases with phred scores below 15. trimmed reads shorter than 120 bp and with an average phred score below 20 were discarded. read pairs in the metagenomic sample that shared k-mers (sliding-window subsequences of 27 nt) with the sequencing datasets of the samples (six in total) that were processed and analyzed in the same sequencing batch were discarded using the bbduk program (referred to as laboratory-batch background screen in figure 1). the primary purpose of this step was to conservatively limit the possibility of falsely identifying viral taxa that did not originate from the bat metagenomic sample and that could have been introduced by aerosol during sample processing or index hopping during sequencing. in order to limit the content of bacterial reads, the metagenomic dataset was mapped to the bacterial reference-index files (obtained 6 november 2017, from ftp://ftp.ccb.jhu.edu/pub/infphilo/centrifuge/data/p_compressed.tar.gz) using the centrifuge sequence classification system (centrifuge version 1.0.3-beta) [43]. 
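the trimming, filtering, and laboratory-batch background screen described above can be sketched in a few lines of python. this is only an illustration of the stated rules, not the bbduk implementation: the function names are hypothetical, a read is assumed to be discarded when it fails either the length or the mean-quality criterion, and reverse-complement k-mers are ignored for brevity.

```python
from statistics import mean


def trim_ends(seq, quals, window=15, min_q=15):
    """Clip bases with quality < min_q from each end, at most `window` bases per end."""
    start = 0
    while start < min(window, len(seq)) and quals[start] < min_q:
        start += 1
    end = len(seq)
    while end > start and len(seq) - end < window and quals[end - 1] < min_q:
        end -= 1
    return seq[start:end], quals[start:end]


def passes_filter(seq, quals, min_len=120, min_mean_q=20):
    """Assumption: a read is kept only if it is long enough AND good enough on average."""
    return len(seq) >= min_len and mean(quals) >= min_mean_q


def kmers(seq, k=27):
    """All sliding-window subsequences of length k (empty set for short reads)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def screen_pairs(read_pairs, background_reads, k=27):
    """Drop every read pair that shares at least one k-mer with the background reads."""
    background = set()
    for read in background_reads:
        background |= kmers(read, k)
    return [
        (r1, r2)
        for r1, r2 in read_pairs
        if kmers(r1, k).isdisjoint(background) and kmers(r2, k).isdisjoint(background)
    ]
```

because a single shared 27-mer is enough to remove a pair, a screen of this kind is deliberately conservative: it trades sensitivity for protection against cross-sample contamination.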
reads not mapping to any bacterial taxon were used in further metagenomic analyses (unless stated otherwise). two types of metagenomic characterization workflows were used: (1) taxonomic classification of ngs read pairs and (2) taxonomic classification of contigs assembled de novo from ngs read pairs. in both cases, the centrifuge metagenomic classification system with the reference nucleic sequence index files, obtained from ftp://ftp.ccb.jhu.edu/pub/infphilo/centrifuge/data/p_compressed+h+v.tar.gz (version 12 june 2016, downloaded 6 november 2017), was used to obtain the final taxonomic calls (default parameter settings). taxonomic classification of sequences was further summarized to the taxonomic level of family using pavian [44]. de novo assembly was performed with two different de bruijn graph assembly tools: spades (v3.11) [45] and unicycler (obtained from github: 27 october 2017; github commit: 220d5daebc8267d37378f191e14acb5c5a1ff757), adapting various parameter settings. altogether, six different metagenomic de novo assemblies were constructed, using settings specified in table s1. all contigs assembled de novo by any of the six approaches exhibiting a minimum length of 500 nt were collected and subjected to taxonomic classification (workflow 2). the circularity of the complete genome assemblies was determined by matching the sequence stretches (minimum match length 50 nt) at their 5′ and 3′ ends. coverage statistics of the novel complete viral genome sequences were obtained by remapping the trimmed read dataset to the constructed genome assemblies using bowtie2 (v2.2.6) [46].
figure 1. metagenomic characterization workflows: taxonomic classification of ngs read pairs (1) and contigs assembled de novo (2). read-pair quality filtering and trimming were performed with the bbduk program (bbtools v38.42). the centrifuge metagenomics classification system was used for the taxonomic classification of read pairs and contigs (centrifuge version 1.0.3-beta) [43]. de novo assembly was performed using spades v3.11 [45] and unicycler. 
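the circularity check mentioned in the methods (matching sequence stretches of at least 50 nt at the 5′ and 3′ ends of an assembly) can be illustrated as follows. this is a minimal sketch assuming exact matches only; the function names are hypothetical and the authors' actual procedure is not specified beyond the minimum match length.

```python
def terminal_overlap(contig, min_match=50):
    """Length of the longest prefix that is repeated exactly as a suffix, or 0."""
    for k in range(len(contig) // 2, min_match - 1, -1):
        if contig[:k] == contig[-k:]:
            return k
    return 0


def circularize(contig, min_match=50):
    """Return the contig with the redundant 3' copy removed, or None if it looks linear."""
    k = terminal_overlap(contig, min_match)
    return contig[:-k] if k else None
```

a contig assembled from a circular template typically runs past its own starting point, so the repeated terminal stretch is trimmed once to give the unit-length genome.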
sequences of the e1, e2, l2, and l1 genes of 376 reference pv genomes, downloaded from pave (http://pave.niaid.nih.gov/ on 13 march 2019), and the corresponding genes from the novel pv, tadarida brasiliensis papillomavirus type 1 (tbrapv1), were used in the phylogenetic analysis. additional information on nucleotide sequences (genbank accession number and virus name abbreviations) used in these analyses is summarized in table s2 . the e1e2l2l1 concatenation was constructed by first obtaining the amino acid-guided multiple sequence alignments of each gene. multiple sequence alignments were obtained using muscle (v3.8.31) [47] . as suggested by bernard et al. (2010) [48] , the multiple sequence alignments used for phylogenetic analysis of tbrapv1 were guided by the amino acid alignments and the pv identity calculation was based on patristic distance measurements, as determined by seaview [49] . phylogenetic clustering was conducted using iq-tree [50] . the most appropriate substitution models were determined using modelfinder [51] , according to the bayesian information criterion. branch support values were calculated using uf bootstrap (1000 replicates) [52] , sh-alrt (1000 replicates), and abayes tests [53] . phylogenetic analysis of genomoviridae was conducted using a reference dataset of 166 complete genome nucleotide sequences and 166 rep protein sequences downloaded from ncbi genbank (19 march 2019). 
the complete genome sequences were rotated to all begin at the start codon of the rep gene using circlator (version: github commit a4befb8c9dbbcd4b3ad1899a95aa3e689d58b638) [54], and the two subunits of the rep gene were concatenated into a single protein sequence in cases where the genbank record indicated them as parts of different genes/coding sequences. pairwise sequence identity values used for taxonomic classification of tadarida brasiliensis gemykibivirus 1 (tbgkyv1) were obtained using sequence demarcation toolkit (sdt v1.0) [55]. in this scope, the pairwise sequence alignments were produced using muscle (v3.8.31) [47]. phylogenetic trees were rendered using figtree (v1.4.4) (http://tree.bio.ed.ac.uk/software/figtree/), and sequence identity histograms were visualized using gnuplot (v5). open reading frames (orfs) of novel viral genome sequences were marked using orffinder (ncbi); the manual annotation process of the orfs was guided by the use of ncbi blastp, and the identification of viral-family specific sequence motifs was performed using regular expressions with the linux grep utility (v2.16). members of papillomaviridae and genomoviridae identified in bat species so far are summarized in table s3.
the complete genome sequence of the novel pv type (tbrapv1) was obtained by generating four overlapping amplicons in different pcr reactions, using 1× pcr buffer, 3.5 mm of mgcl2, 200 µm of each dntp (thermo fisher, waltham, ma, usa), 1.25 u of gotaq hotstart polymerase (promega, madison, wi, usa), and 0.8 µm of each of the primers (tbrapv1-1f 5′-cagggtattcagggtgtttctcc-3′ and tbrapv1-1r 5′-aatgtttctaatctgcaacc-3′; tbrapv1-2f 5′-gtgcgcggcgacttctcatactta-3′ and tbrapv1-2r 5′-tcagcctcattgtcctcatcattg-3′; tbrapv1-3f 5′-tgggcttgaaacctggacactaca-3′ and tbrapv1-3r 5′-atgcccgggaatatggatgga-3′; tbrapv1-4f 5′-ggcctgcaagaccacctac-3′ and tbrapv1-4r 5′-gggggcatctgacctgtta-3′). 
cycling conditions for the four reactions were the same and were performed as follows: initial denaturation at 95 °c for 2 min, followed by 45 cycles of 40 s at 94 °c, 40 s at 50 °c, and 2 min at 72 °c, with a final extension at 72 °c for 5 min. the amplicons were resolved by 1% agarose gel electrophoresis and the ~2 kb fragments were gel purified, ligated into the pgem-t easy vector (promega), and transformed into e. coli cells. sanger sequencing was performed using sequencing facilities at macrogen inc. (seoul, korea). in august 2019, dna clones and the corresponding nucleotide sequences were submitted to the animal papillomavirus reference center (http://www.animalpv.org/) for confirmation and official designation.
the genbank/embl/ddbj accession numbers for the novel viruses reported in this paper are tbrapv1 (mn329804) and tbgkyv1 (mn329805). the relevant raw high throughput sequencing data obtained in this study were deposited at the ncbi sequence read archive (sra) with the following accession number: prjna615356. the contigs, obtained by de novo assembly as part of the metagenomic workflow (2), have also been made available for download (supplementary data s1).
ngs data analysis workflows and centrifuge-based taxonomic assignments of reads and contigs are depicted in figure 1 and table 1, respectively. briefly, a total of 10,409,798 read pairs were obtained from the rca-enriched samples, and 10,220,118 of them passed the quality filtering and trimming procedures. out of these, 6,738,566 read pairs were removed during the laboratory-batch background screen and an additional 878,852 read pairs were identified as originating from bacteria. the final metagenomic characterization was carried out using the remaining 2,602,700 read pairs (figure 1). 
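the read-pair accounting in the preceding paragraph can be checked with simple arithmetic. the snippet below uses only the counts reported in the text; the function and key names are illustrative.

```python
def filtering_summary(raw, passed_qc, background, bacterial):
    """Track how many read pairs survive each filtering step."""
    remaining = passed_qc - background - bacterial
    return {
        "removed_by_qc": raw - passed_qc,
        "remaining": remaining,
        "fraction_of_raw_remaining": remaining / raw,
    }


summary = filtering_summary(
    raw=10_409_798,        # read pairs obtained from the rca-enriched pool
    passed_qc=10_220_118,  # after quality trimming and filtering
    background=6_738_566,  # removed by the laboratory-batch background screen
    bacterial=878_852,     # classified as bacterial by centrifuge
)
print(summary["remaining"])  # 2602700, matching the reported figure
```

only about a quarter of the raw read pairs reach the final classification step, which is worth keeping in mind when interpreting the low absolute number of viral reads.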
metagenomic analysis revealed that only a small proportion of read pairs (13,897 read pairs; 0.534%) and de novo assembled contigs longer than 500 nt (153 out of total 42,891; 0.357%) mapped to viral taxa. overall, a large number of phage-related sequences were detected (77.3% of viral read pairs and 39.9% of viral contigs), likely representing the most abundant entities infecting bacteria present in the bat digestive system, which exhibited similarity mostly to the families inoviridae, siphoviridae, and myoviridae ( table 1 ). the eukaryotic viral sequences (insect, invertebrate, plant, protist, and vertebrate viruses) could be summarized into a total of 35 viral families, 22 corresponding to viral families with dna genomes, and 13 to families with rna genomes. sequences of 10 different viral families infecting insects and crustaceans, mostly found related to the families baculoviridae, ascoviridae, iridoviridae, and nimaviridae, were detected (0.900% of viral read pairs and 12.4% of assembled viral contigs). on the other hand, sequences related to viruses infecting plants (five viral families, 0.446% of viral read pairs, and 4.57% of viral contigs) were mostly associated with phycodnaviridae or potyviridae, whereas those related to viruses infecting protists (five viral families, 13.6% of viral read pairs, 0.654% of viral contigs) clustered predominantly in the family mimiviridae. sequences classified as originating from vertebrates, predominantly mammalian viruses, were represented by 7.07% of viral read pairs and 35.9% of viral contigs. the principal viral families identified included retroviridae, genomoviridae, herpesviridae, papillomaviridae, and poxviridae. the analysis also identified (although in low counts) viral sequences related to the family alloherpesviridae, which infects fish and amphibians. 
of note, the metagenomic analysis indicated that 205 out of 461 read pairs were assigned to the family retroviridae (1.48% of viral read pairs), and 22 assembled contigs (14.4% of viral contigs; table 1) exhibited resemblance to the nucleotide sequence of desmodus rotundus endogenous retrovirus isolate 824 (genbank accession number: nc_027117) [56]. however, a more detailed analysis of this sequence revealed the presence of two flanking regions at the 5′ and 3′ ends of approximately 1000 nt, which probably derived from the host genome (desmodus rotundus). in fact, contigs and reads previously classified as retroviridae in our study aligned with these flanking regions. this finding explained the initial misclassification and indicated that great care should be taken when using genbank sequence data as reference material because a large portion of sequences and their respective annotations may not have been curated adequately. finally, a total of 96 read pairs and 10 assembled contigs were classified as similar to viruses unassigned to taxonomical families (table s4).
the metagenomic analysis suggested that a total of 90 read pairs (workflow 1) and 14 contigs (workflow 2) could be attributed to pvs (table 1). the longest contig obtained by assembly de novo covered the complete genome of tbrapv1, which was subsequently confirmed by conventional molecular methods (pcr, cloning, and sanger sequencing; data not shown). remapping read pairs (quality-, background-, and bacteria-filtered read pairs) to the confirmed tbrapv1 genome sequence indicated a complete sequence length of 8151 nt, with a gc content of 46%. the complete novel genome sequence was covered on average 252.6× by a total of 6935 read pairs (13,870 reads). detailed analysis of the tbrapv1 viral genome (table 2 and figure 2a) showed a typical genomic organization of bat pvs, potentially encoding four early genes (e6, e7, e1, and e2) and two late genes (l2 and l1) [57, 58]. 
a putative e4 gene was found overlapping the e2 gene (figure 2a), with its own start and stop codons, and the presence of e4-characteristic proline-rich stretches (12.7%) with an important role in cell cycle arrest was found [59]. the upstream regulatory region (urr) of tbrapv1 contained two typical tata boxes, three putative e2 protein binding sites, an e1 protein binding site [60], and three putative polyadenylation sites for late gene transcripts (table 2). multiple potential binding sites for transcriptional regulatory factors, such as ap-1, nf-1, and sp-1, were also present within the urr (data not shown).
typical domains were additionally identified in the putative viral proteins encoded by tbrapv1. the e6 protein contained two characteristic zinc-binding domains, separated by 36 amino acids [61], and four internal and likely not functional pdz-binding motifs (rtnv, isdl, ssil, lssl) [62]. the e7 protein contained a prb-binding motif (lwcde) [63] and a single zinc-binding domain (table 2). analysis of the e1 protein, the largest protein encoded by tbrapv1 (table 2), showed the typical atp-binding site of the atp-dependent helicase (gpsnsgks) [64], and several cdk-phosphorylation and cyclin-binding sites. 
a highly conserved bipartite-like nuclear localization signal (nls) and a leucine-rich crm1-dependent nuclear export signal (nes) (lspvlekvti), which together allow shuttling of the e1 protein between the cell nucleus and the cytoplasm in most human pvs [65] [66] [67], were identified at the n-terminus of the e1 protein. no conserved leucine zipper domain was present at the c-terminus of the putative tbrapv1 e2 protein, in agreement with other bat pvs (espv2 and rfpv1) [58]. at the n-terminus of the l2 protein, a highly conserved furin cleavage motif (rrkr), as well as a transmembrane-like domain (gtggggrgvpigprvatgrpggpinsvg) [68], were identified. in addition, a canonical polyadenylation site, necessary for regulation of early viral transcripts [69], was also found in the tbrapv1 l2 gene (table 2).
phylogenetic analysis, based on 377 pv l1 gene nucleotide sequences, indicated peak sequence identities of tbrapv1 to hpv41 and rfpv1, which amounted to 61.5 and 60.6%, respectively (figure 3, charts a2 and a4). maximum likelihood phylogenetic clustering of l1 sequences (figure 4) suggested common ancestry of tbrapv1, edpv1, and hpv41, and that tbrapv1 branched away prior to the delineation of edpv1 and hpv41, with high sh-alrt, abayes, and uf bootstrap support values. further phylogenetic analyses were conducted based on the concatenated alignments of 377 e1, e2, l2, and l1 gene nucleotide sequences, and they indicated a peak sequence identity of tbrapv1 to rfpv1 (54.1%, figure 3, charts a1 and a3), whereas maximum likelihood phylogenetic clustering of the concatenated pv genes indicated analogous common ancestry (figure 5), with high sh-alrt, abayes, and uf bootstrap support values.
the metagenomic analysis indicated the presence of several genomovirus-related read pairs and sequence contigs (table 1), and the complete genome sequence (2196 nt) of tbgkyv1 was recovered using de novo assembly. 
remapping to the tbgkyv1 genome sequence indicated a mean coverage of 59.4× by a total of 430 read pairs (860 paired reads + 7 unpaired reads) and did not reveal any abnormalities that would indicate misassembly. completeness of the circular genome sequence was determined by matching 5′ and 3′ ends, and the sequence was rotated to begin with the characteristic genomoviridae nonanucleotide motif [70]. three non-overlapping orfs, with a minimum protein length of 120 aa, were found, exhibiting peak amino acid similarities to the genomoviridae rep and cp/cap proteins. the rep protein of tbgkyv1 was encoded across two different rep-encoding orfs separated by an intron, representing a catalytic and a central protein domain (figure 2b, table 3), which is characteristic for replication-associated proteins encoded by circular rep-encoding single-stranded (cress) dna viruses [71]. phylogenetic analysis and pairwise sequence comparison demonstrated that tbgkyv1 shares its highest identity in the rep protein (77.5%) and at the complete genome level (nucleotide sequence, 77.2%) with the "mongoose associated gemykibivirus 1" (genbank accession number kp263545) [70] (figures 3b and 6).
figure 3. (a) nucleic sequence identity histograms of the 377 pv nucleotide sequences (a1 and a2) and tbrapv1 (a3 and a4) based on the concatenated e1, e2, l2, and l1 gene sequences (a1 and a3) and on the l1 gene sequence (a2 and a4). multiple sequence alignments were constructed using muscle (v3.8.31) [47], and the distance matrices were estimated using seaview v4.7 [49], as suggested in bernard et al. (2010). red arrows indicate the maximum sequence similarities of tbrapv1 for each of the sequence contexts (e1, e2, l2, and l1 concatenation and the l1 gene sequence). the blue arrows (a1 and a2) indicate the overall maximum sequence identity in the depicted context (e1, e2, l2, and l1 concatenation and the l1 gene sequence). the histograms were visualized using gnuplot (v5). 
(b) pairwise sequence identity histograms based on 167 complete genome nucleotide (purple) and rep gene (green) amino acid sequences from the genomoviridae family, based on the entire pairwise identity distance matrices (b1) and on the matrix slices representing pairwise identities of tbgkyv1 to all other 166 genomoviridae sequences (b2). the pairwise similarity matrices were obtained through pairwise sequence alignments (muscle v3.8.31) [47], using sequence demarcation toolkit (sdt v1) [55]. for the arrows in the histogram: red = the species demarcation threshold/criteria (sdc) for genomoviridae [70]; green and blue = maximum pairwise identity values of tbgkyv1 for the complete genome (green) and the rep protein sequence contexts (blue). histograms were visualized using gnuplot (v5).
nt = nucleotide, aa = amino acid, sdc = species demarcation threshold/criteria [70]. cg: complete genome. 
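for readers unfamiliar with how pairwise identity values such as those in the histograms are derived, the following sketch computes identity from two already-aligned sequences. it assumes identity is the fraction of matching positions among columns where neither sequence has a gap, which is one common convention; the exact settings used with sdt are not given in the text, and the function name is hypothetical.

```python
def pairwise_identity(aln_a, aln_b, gap="-"):
    """Fraction of identical residues over gap-free columns of a pairwise alignment."""
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for a, b in zip(aln_a, aln_b):
        if a == gap or b == gap:
            continue  # columns with a gap in either sequence are not scored
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        raise ValueError("no gap-free columns to compare")
    return matches / compared


# toy alignment: 8 gap-free columns, 7 of them identical
print(pairwise_identity("acgt-acgt", "acgtaacct"))  # 0.875
```

comparing the highest such value for a new genome against a family-specific demarcation threshold is the usual basis for deciding whether it represents a new species.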
figure 6. phylogenetic tree of the genomovirus rep amino acid sequence. the tree was constructed using the lg+i+g4 substitution model, and branches are annotated with sh-alrt (1000 replicates), abayes, and uf bootstrap support (1000 replicates) values, respectively. maximum support values are shown with asterisks (*). novel tbgkyv1 is depicted in bold. genomoviridae genera were classified according to varsani and krupovic (2017). unclassified sequences were not depicted.
table 3 and figure 2b. the large intergenic region of the novel virus contains the characteristic nonanucleotide (tataaatag) motif, which is likely to be important for rolling-circle replication initiation [72] [73] [74]. 
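rotating a circular genome so that it begins with the nonanucleotide motif, as described for tbgkyv1, can be sketched as below. this is an illustrative helper with a hypothetical name; it searches the given strand only and handles a motif that spans the arbitrary linearization point.

```python
def rotate_to_motif(genome, motif="tataaatag"):
    """Rotate a circular sequence so that it starts with `motif`."""
    # append the first len(motif) - 1 bases so a motif spanning the junction is found
    doubled = genome + genome[:len(motif) - 1]
    idx = doubled.find(motif)
    if idx == -1:
        raise ValueError("motif not found on this strand")
    return genome[idx:] + genome[:idx]
```

for example, `rotate_to_motif("aaatagccggtat")` returns a sequence of the same length that starts with `tataaatag`, even though the motif is split across the ends of the input.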
the rep protein's catalytic domain of tbgkyv1 contained rolling circle replication motif i (lftysq), possibly involved in the recognition of iterative dna sequences associated with the origin of replication [70], motif ii (thlhv), which may regulate the nicking/joining endonuclease activity at the origin of dna replication [73, 75], motif iii (yatk), involved in the double-strand dna cleavage [73, 76], and a geminivirus rep protein sequence (grs) (rlfdvenfhpnivpsr), which allows appropriate spatial arrangements of motifs ii and iii [77]. furthermore, rep helicase motifs walker-a (gpsrtgkt), walker-b (vfddi), and walker-c (wlmn) [78] [79] [80] [81] were identified in the central rep protein domain. walker motifs contribute to atp binding, which is used as an energy source to unwind the dsdna intermediate in the 3′-5′ direction by the rep helicase [80, 81].
more than 200 viruses from 27 taxonomic families have been isolated from or detected in bats so far [16], with a few of them implicated in the etiology of several severe diseases in humans. except for rabies, no direct evidence of zoonotic diseases transmitted by new world bats has been found [27]. new world, especially south american, and old world bat species had different evolutionary histories, leading to distinct immunological features [3, 16]. viruses infecting south american bat species have been poorly studied [24, 26], and further research focused on evaluating their viromes is required. here, a first attempt to assess the virome composition in pooled oral and anal swabs of t. brasiliensis was presented.
during our study a total of 6,738,566 read pairs were removed during the laboratory-batch background screening, due to traces of sequences from co-processed samples. 
the power of ngs stems from non-specific sampling of nucleic acids and automated repetition, yielding vast numbers of sequencing reads, providing the opportunity to characterize populations of nucleic acids with unprecedented sensitivity, accuracy, and non-specificity. due to its super-sensitivity, even the slightest addition of environmental nucleic acids to a sample may be detected using ngs and can potentially further complicate the interpretation of the results. laboratory background and/or dna isolation kit-derived contamination has been addressed previously as a major factor that can severely impede the interpretation of high throughput sequencing data, and the use of negative controls has been proposed [82] . moreover, positive/negative control samples have been recommended in metagenomic experiments aimed at detecting pathogens in clinical samples [83] . in this study, the metagenomic sample was processed alongside six samples of molluscum contagiosum skin lesions, and laboratory background filtering was initially considered due to the detection of approximately 90 molluscum contagiosum virus read pairs in the trimmed read dataset. in order to prevent the identification of human skin microflora in the pooled bat swab sample, a strict k-mer based negative filtering was used, effectively removing any read pair that contained at least one 27 bp subsequence that could be identified in any of the read pairs from the six background datasets. because the background samples also originated from mammalian (human) skin, it could be that a large portion of mammalian reads were removed during this step, explaining the somewhat extreme number of read pairs classified as laboratory-batch background. 
moreover, it is also likely that the viral composition of human and bat skin could be shared to some degree, but due to the filtering scheme reads originating from bats' anal and oral microflora sharing nucleotide sequence similarity to that of human skin may have also been removed prior to metagenomic classification. as a consequence, taken strictly, only the subset of viruses that are present in t. brasiliensis, but not in human skin, was explored here. although dna spillovers could be suspected and controlled for to some degree, in this case it may be a greater challenge in studies where multiple samples from different species or anatomical sites of bats are processed. in light of the present results, it would likely be beneficial to process the samples in such studies as independently as possible and to include negative controls that would characterize the laboratory background, such as sequencing libraries of buffer solutions that underwent the same treatment as the samples, as suggested previously [84, 85] . specifically, a total of 13,897 virus-related read pairs (0.53% out of 2,602,700) and 153 virus-related contigs (0.36% out of 42,891) were assembled de novo, mapped to viral taxa, and identified by ngs. although the proportion (and number) of virus-related sequences detected in this study (<1%) is comparable to reports of previous studies of bat viromes based on illumina sequencing [19, 21, 86] , it may be that additional physical viral dna/rna enrichment steps, such as centrifugation, filtration, and/or nuclease-treatment, could further augment the viral read yields, as suggested previously [87] . initially, this study was focused on the identification of pvs in t. brasiliensis, in order to explore their diversity in different hosts. accordingly, swab samples included in this work were first processed using experimental protocols, designed previously to suit our aforementioned initial aim [34] [35] [36] . 
viral dna was enriched using rca, as suggested previously by others [40] [41] [42] . however, it should be noted that rca may have favorably facilitated the amplification of circular genomes and, as a consequence, hindered the detection of linear genomes. thus, more than 80% of the classified viral sequences (11,162/13,801) identified in our analysis corresponded to circular viral genomes. in this study about 4.5% of viral reads and 60% of contigs corresponded to sequences from 35 eukaryotic viral families, mostly with dna genomes. interestingly, virus-associated sequences from rna viruses belonging to 14 families were also detected, most likely reflecting the presence of traces of viral rna. limited but highly accurate reverse transcriptase activity has indeed previously been reported for the phi29 dna polymerase, used in rca [88] . most of the insect-infecting viral sequences detected belonged to viral families infecting lepidopteran adults or larvae, which may represent the diet of t. brasiliensis, as detected previously in feces and anal swabs from various insectivorous bat species (myotis sp., rhinolophus sp., molossus sp., neoromicia sp.), including t. brasiliensis [19, 21, 23, 26] . in addition, the detection of various plant viral families in this study could reflect the plant diet of the insects ingested by the bats. a total of 15 different mammalian viral families were identified in t. brasiliensis samples, representing approximately 43% (15/35) of the eukaryotic viral families interrogated herein. several mammalian viral families, supported by the contigs and sequencing reads, have been identified previously in new world [23, 24, 26] and old world [17, 18, 89] bat species. the mammalian viral families identified in t. brasiliensis included typical zoonotic viruses identified previously in bats, such as polyomaviridae [29] , rhabdoviridae [90] , coronaviridae [23, 28] , poxviridae, flaviviridae, and adenoviridae [23] . 
the identification of circoviridae and astroviridae in t. brasiliensis was also in line with the results of previous studies [23, 31]. on the other hand, this study indicated the presence of genomoviridae, alloherpesviridae, papillomaviridae, herpesviridae, paramyxoviridae, and reoviridae in t. brasiliensis for the first time. notably, the presence of incorrect annotations in public databases, such as the sequences assigned to the retroviridae family in this study, highlights the need for the curation of data (whenever possible) to avoid the under- and/or overestimation of the classified sequences derived from metagenomics studies. pvs are a highly diverse family of non-enveloped, double-stranded circular dna viruses that are known to infect a wide variety of mammals, as well as birds, reptiles, and fish [91, 92]. various human and non-human pvs, including bat pvs, have frequently been identified in healthy epithelia and may represent part of the native epithelial microflora [34, 57, 58]. several studies have suggested the presence of pvs in old world bat species using conventional [57, 58, 93] or ngs approaches [20, 41, 94, 95]. the only pv type (mmopv1) identified in a new world bat species (molossus molossus) has recently been described [26], suggesting a crude sampling imbalance and a severe lack of information to elucidate the evolutionary mechanisms driving pv diversification on the global scale. in this study, tbrapv1 has been identified in pooled oral and anal swabs of t. brasiliensis by ngs, and its sequence has been completely characterized by conventional molecular techniques. tbrapv1 is the first reported pv type found in t. brasiliensis and the second pv type identified in new world bat species.
according to the current ictv papillomaviridae classification guidelines (published in june 2018), based on the nucleotide sequence of the l1 gene [96], tbrapv1 is the founding member of a novel pv genus in the firstpapillomavirinae taxonomical subfamily, sharing more than 45% sequence identity with other pv types included in this subfamily. although nucleotide sequence analysis of the l1 gene indicated that tbrapv1 shares 61.5% identity with hpv41 (nu-pv) and should therefore be included within the same genus (more than 60% nucleotide identity in the l1 gene) [91], the same demarcation criteria suggest a visual inspection of phylogenetic trees derived from concatenated e1, e2, l1, and l2 nucleotide sequences to delineate pv genera [96]. in the present study, such analysis clustered tbrapv1 basal to the delineation of hpv41 (nu-pv) and edpv1 (sigma-pv), the latter identified in a north american porcupine (erethizon dorsatum); tbrapv1 may therefore represent a novel genus within the papillomaviridae family. this phylogenetic clustering also indicated that tbrapv1 shares common ancestry with other bat pvs such as espv1, espv3, rfpv1, ehpv1, and mscpv2. on the other hand, tbrapv1 is only distantly related to rapv1, ehpv2, ehpv3, espv2, mscpv1, and mrpv1, which have been identified from different tissues and bat species. the idea that different bat pvs evolved during a process of strict host coevolution is further refuted by the observation that different bat pvs appear scattered around the papillomaviridae phylogenetic tree in a highly polyphyletic manner [57, 58]. in addition, under strict host coevolution it would be expected for tbrapv1 and mmopv1, both molossid pvs, to be closely related; nevertheless, mmopv1 has a basal taxonomic position with respect to tbrapv1. these observations support the idea of multiple evolutionary forces as drivers of pv evolution, including coevolution, adaptive radiation, broad host range, host switch, and recombination [58].
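the l1-based demarcation discussed at the start of this paragraph can be sketched in a few lines of python. this is an illustrative sketch only, not the authors' method: the function names (`percent_identity`, `l1_rank_hint`) are hypothetical, the two l1 sequences are assumed to be already aligned to the same length, and skipping gapped columns is one of several possible conventions. the 60% and 45% cut-offs are the ones quoted in the text.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity between two pre-aligned sequences.

    Assumption: columns where either sequence has a gap ('-') are skipped.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    matches = 0
    for x, y in zip(aligned_a.upper(), aligned_b.upper()):
        if x == "-" or y == "-":
            continue
        compared += 1
        if x == y:
            matches += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return 100.0 * matches / compared


def l1_rank_hint(identity_pct: float) -> str:
    """Map an L1 nucleotide identity to the rank suggested by the quoted cut-offs.

    This is only a first-pass hint; the text notes that genera are finally
    delineated by inspecting trees built from concatenated E1, E2, L1 and L2.
    """
    if identity_pct > 60.0:
        return "same genus (by L1 identity alone)"
    if identity_pct > 45.0:
        return "same subfamily, different genus"
    return "below the subfamily cut-off"
```

with the 61.5% l1 identity reported for tbrapv1 and hpv41, identity alone points to the same genus, which illustrates why the phylogeny of the concatenated genes is needed for the final assignment.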
genomoviruses are single-stranded circular dna viruses that belong to the recently proposed genomoviridae family [97]. genomes of members of this family contain two genes (cap/cp and the rolling-circle replication-associated protein, rep) and an intergenic region. it has been proposed that complete genome sequences belonging to the same species share more than 78% similarity [70]. in addition, in previous studies, the authors aimed to establish nine genera within the family genomoviridae based on pairwise comparisons of complete genome sequences [70]. cress dna viruses, including genomoviruses, have been found in association with a great variety of animal species, such as camels [98], bats [89], mongooses, badgers [99], wolves [100], pigs [101], and humans [102], as well as in environmental-associated [103] and plant-associated [104] samples. however, no direct association with a disease has been demonstrated so far. in particular, bat-associated genomoviruses have been identified from feces [105, 106] or pharyngeal and anal swab samples [89] of asiatic [89, 105] or european [106] bat species and have been attributed to various taxonomic genera of the genomoviridae family [70]. tbgkyv1 is a novel species within the gemykibivirus genus according to the classification criteria [70]. it should be noted that the rep and cap proteins of tbgkyv1 exhibited different percentages of similarity, the cap being considerably more divergent than the rep, indicating differences in their evolutionary histories due to their respective molecular functions [70, 106]. to the best of our knowledge, this is the first report demonstrating the presence of genomoviral sequences in mucosal swab samples of a new world bat species. finally, it is worth noting that the results of our metagenomic screening of the pooled t. brasiliensis oral and anal swab samples are effectively provided at three different levels of specificity/sensitivity.
the two complete genome sequences (tbrapv1 and tbgkyv1), that have been described at the highest level of detail, also confer the highest level of confidence. in other words, we have complete confidence that these two viruses were present in the pooled nucleic acid samples. assigning sequences that did not assemble at the level of complete viral genomes to taxonomical families could therefore be a valid approach offering a higher level of sensitivity than only the complete genome assemblies, but at the cost of diminished specificity. in addition, identifying a sequence fragment that resembles a known viral genome in a given genomic region, may not always be sufficient to infer that that specific virus was present in the sample. viruses are highly promiscuous entities, which can easily exchange parts of their genomes with their hosts, be integrated or even naturalized into the host genomes [107] . taxonomical viral families identified among the assembled contigs could be interpreted as viral families that were probably represented in our samples. lastly, and due to the low number of assembled contigs, that were found related to known viral sequences, we attempted to increase the sensitivity even further by obtaining taxonomical family mappings also for the source read pairs. these results, however, should be interpreted with utmost caution, because they likely confer a very low level of specificity due to the limited sequence length (2 × 150 bp). taxonomical viral families identified by read-pair taxonomy mapping only, merely suggest the possibility that these families were present and should be replicated in the future by taxonomical mappings conferring greater specificity, for example with longer sequences, such as those assembled from illumina or nanopore reads. this study presents an initial description of the oral/anal virome composition of t. 
brasiliensis, a widely-distributed new world bat species living in close contact with the human population, for the first time. although their biological significance is not clear, this work contributes to a better understanding of the evolution and genetic diversity of these viruses. using conventional nucleic acid detection techniques and/or bioinformatics approaches, the whole genomes of two novel viruses were completely covered, tbrapv1 and tbgkyv1, clustering into the papillomaviridae and genomoviridae families. tbrapv1 is the first pv type identified in this host and the prototype of a novel genus in the firstpapillomavirinae taxonomic subfamily. tbgkyv1 is the first genomovirus reported in new world bats and constitutes a new species within the genus gemykibivirus. future studies are required to investigate the possible health impact of the viruses described on bat colonies and to identify the factors that contribute to their dispersal. supplementary materials: the following are available online at http://www.mdpi.com/1999-4915/12/4/422/s1, supplementary data 1: details of the assembled nucleotide sequences obtained by de novo assembly and used in the taxonomic classification as a part of the metagenomic workflow (2) (figure 1 ). table s1 : de novo assembly settings in metagenomic workflow: taxonomic classification of contigs assembled de novo. post-assembly contig correction was applied in all cases. in the case of unicycler assembly, the post-assembly contig-correction program was pilon v1.22 [108] . table s2 : list of the 376 reference pv genomes used for the phylogenetic analyses shown in figures 4 and 5 . details of viral names, abbreviations, host names and genbank ids are provided. nucleotide sequences were downloaded from pave (http://pave.niaid.nih.gov/). table s3 : members of papillomaviridae and genomoviridae identified in bat species so far. 
details of viruses' taxonomic classification and isolation source as well as host phylogeny and distribution are provided. table s4 : read pairs and contigs classified as similar to viruses and not taxonomical assigned to viral families identified in anal and oral swab samples of tadarida brasiliensis obtained by metagenomics using illumina technology. a world of science and mystery bat species of the world: a taxonomic and geographic database diseases and causes of death in european bats: dynamics in disease susceptibility and infection rates bats and birds: exceptional longevity despite high metabolic rates periodic arousal from hibernation is necessary for initiation of immune responses in ground squirrels a comparison of bats and rodents as reservoirs of zoonotic viruses: are bats special? bats are natural reservoirs of sars-like coronaviruses fruit bats as reservoirs of ebola virus characterization of nipah virus from naturally infected pteropus vampyrus bats isolation of hendra virus from pteropid bats: a natural reservoir of hendra virus the iucn red list of threatened species 2015: e.t21314a22121621. available online bonn convention on the conservation of migratory species of wild animals 1979 bats: important reservoir hosts of emerging viruses bats and zoonotic viruses: can we confidently link bats with emerging deadly viruses? 
virome profiling of bats from myanmar by metagenomic analysis of tissue samples reveals more novel mammalian viruses virome analysis for identification of novel mammalian viruses in bats from southeast china a metagenomic viral discovery approach identifies potential zoonotic and novel mammalian viruses in neoromicia bats within south africa metagenomic study of the viruses of african straw-coloured fruit bats: detection of a chiropteran poxvirus and isolation of a novel adenovirus a preliminary study of viral metagenomics of french bat species in contact with humans: identification of new mammalian viruses european bats as carriers of viruses with zoonotic potential bat guano virome: predominance of dietary viruses from insects and plants plus novel mammalian viruses metagenomic analysis of the viromes of three north american bat species: viral diversity among different bat species that share a common habitat bats host major mammalian paramyxoviruses virome analysis of two sympatric bat species (desmodus rotundus and molossus molossus) in french guiana high diversity of rabies viruses associated with insectivorous bats in argentina: presence of several independent enzootics bat coronavirus in brazil related to appalachian ridge and porcine epidemic diarrhea viruses genomic characterization of two novel polyomaviruses in brazilian insectivorous bats genomic characterization of novel circular ssdna viruses from insectivorous bats in southern brazil a novel anelloviridae species detected in tadarida brasiliensis bats: first sequence of a chiropteran anellovirus behavior and demography in an urban colony of tadarida brasiliensis (chiroptera: molossidae) in rosario estado actual de la colonia de tadarida brasiliensis (chiroptera, molossidae) del sicom "facultad de derecho natural history of human papillomavirus infection of sun-exposed healthy skin of immunocompetent individuals over three climatic seasons and identification of hpv209, a novel betapapillomavirus new 
generic primer system targeting mucosal/genital and cutaneous human papillomaviruses leads to the characterization of hpv 115, a novel beta-papillomavirus species 3 high prevalence of gammapapillomaviruses (gamma-pvs) in pre-malignant cutaneous lesions of immunocompetent individuals using a new broad-spectrum primer system, and identification of hpv210, a novel gamma-pv type a broad range of human papillomavirus types detected with a general pcr method suitable for analysis of cutaneous tumours and normal skin a broad spectrum of human papillomavirus types is present in the skin of australian patients with non-melanoma skin cancers and solar keratosis improved detection of human papillomavirus harbored in healthy skin with fap6085/64 primers a novel papillomavirus in adélie penguin (pygoscelis adeliae) faeces sampled at the cape crozier colony a single bat species in cameroon harbors multiple highly divergent papillomaviruses in stool identified by metagenomics analysis unique genome organization of non-mammalian papillomaviruses provides insights into the evolution of viral early proteins centrifuge: rapid and sensitive classification of metagenomic sequences interactive analysis of metagenomics data for microbiome studies and pathogen identification spades: a new genome assembly algorithm and its applications to single-cell sequencing fast gapped-read alignment with bowtie 2 muscle: a multiple sequence alignment method with reduced time and space complexity classification of papillomaviruses (pvs) based on 189 pv types and proposal of taxonomic amendments seaview version 4: a multiplatform graphical user interface for sequence alignment and phylogenetic tree building iq-tree: a fast and effective stochastic algorithm for estimating maximum-likelihood phylogenies fast model selection for accurate phylogenetic estimates improving the ultrafast bootstrap approximation survey of branch support methods demonstrates accuracy, power, and robustness of fast 
likelihood-based approximation schemes automated circularization of genome assemblies using long sequencing reads sdt: a virus classification tool based on pairwise sequence alignment and identity calculation a novel endogenous betaretrovirus in the common vampire bat (desmodus rotundus) suggests multiple independent infection and cross-species transmission events multiple evolutionary origins of bat papillomaviruses novel papillomaviruses in free-ranging iberian bats: no virus-host co-evolution, no strict host specificity, and hints for recombination the e4 protein; structure, function and patterns of expression identification of a novel human papillomavirus, type hpv199, isolated from a nasopharynx and anal canal, and complete genomic characterization of papillomavirus species gamma-12 predicted alpha-helix/beta-sheet secondary structures for the zinc-binding motifs of human papillomavirus e7 and e6 proteins by consensus prediction averaging and spectroscopic studies of e7 pdz domains: fundamental building blocks in the organization of protein complexes at the plasma membrane binding of the lxcxe insulin motif to a hexapeptide derived from retinoblastoma protein role of the atp-binding domain of the human papillomavirus type 11 e1 helicase in e2-dependent binding to the origin the human papillomavirus e6 protein and its contribution to malignant progression classical nuclear localization signals: definition, function, and interaction with importin alpha nuclear export of human papillomavirus type 31 e1 is regulated by cdk2 phosphorylation and required for viral genome maintenance l2, the minor capsid protein of papillomavirus regulation of human papillomavirus gene expression by splicing and polyadenylation sequence-based taxonomic framework for the classification of uncultured single-stranded dna viruses of the family genomoviridae eukaryotic circular rep-encoding single-stranded dna (cress dna) viruses: ubiquitous viruses with small genomes and a diverse host 
range determination of the origin cleavage and joining domain of geminivirus rep proteins identification of the nicking tyrosine of geminivirus rep protein a single rep protein initiates replication of multiple genome components of faba bean necrotic yellows virus, a single-stranded dna virus of plants geminivirus replication proteins are related to prokaryotic plasmid rolling circle dna replication initiator proteins conserved sequence and structural motifs contribute to the dna binding and cleavage activities of a geminivirus replication protein functional analysis of a novel motif conserved across geminivirus rep proteins a new superfamily of putative ntp-binding domains encoded by genomes of small dna and rna viruses a common set of conserved motifs in a vast variety of putative nucleic acid-dependent atpases including mcm proteins involved in the initiation of eukaryotic dna replication the oligomeric rep protein of mungbean yellow mosaic india virus (mymiv) is a likely replicative helicase dna helicase activity is associated with the replication initiator protein rep of tomato yellow leaf curl geminivirus contaminating viral sequences in high-throughput sequencing viromics: a linkage study of 700 sequencing libraries clinical metagenomic next-generation sequencing for pathogen detection development and optimization of metagenomic next-generation sequencing methods for cerebrospinal fluid diagnostics quality control implementation for universal characterization of dna and rna viruses in clinical respiratory samples using single metagenomic next-generation sequencing workflow metagenomic analysis of viruses from bat fecal samples reveals many novel viruses in insectivorous bats in china evaluation of rapid and simple techniques for the enrichment of viruses prior to metagenomic virus discovery limited reverse transcriptase activity of phi29 dna polymerase deciphering the bat virome catalog to better understand the ecological diversity of bat viruses and the bat 
origin of emerging infectious diseases high diversity of rabies viruses associated with insectivorous bats in argentina: presence of several independent enzootics classification of papillomaviruses the clinical importance of understanding the evolution of papillomaviruses genetic characterization of the first chiropteran papillomavirus, isolated from a basosquamous carcinoma in an egyptian fruit bat: the rousettus aegyptiacus papillomavirus type 1 viral metagenomics of six bat species in close contact with humans in southern china identification of a novel bat papillomavirus by metagenomics a new family of widespread single-stranded dna viruses identification of diverse viruses in upper respiratory samples in dromedary camels from united arab emirates fecal virome analysis of three carnivores reveals a novel nodavirus and multiple gemycircularviruses identification of an enterovirus recombinant with a torovirus-like gene insertion during a diarrhea outbreak in fattening pigs viral gut metagenomics of sympatric wild and domestic canids, and monitoring of viruses: insights from an endangered wolf population small circular single stranded dna viral genomes in unexplained cases of human encephalitis, diarrhea, and in untreated sewage detection and molecular characterization of gemycircularvirus from environmental samples in brazil identification and molecular characterization of a single-stranded circular dna virus with similarities to sclerotinia sclerotiorum hypovirulence-associated dna virus 1 genome sequences of poaceae-associated gemycircularviruses from the pacific ocean island of tonga diverse replication-associated protein encoding circular dna viruses in guano samples of central-eastern european bats global organization and proposed megataxonomy of the an integrated tool for comprehensive microbial variant detection and genome assembly improvement this article is an open access article distributed under the terms and conditions of the creative commons 
attribution (cc by) license. the authors thank irene villa, german saigo, mauricio taborda, and violeta di domenica for collecting bat samples. the authors declare no conflict of interest. viruses 2020, 12, 422 key: cord-331698-rwow1ydx authors: latorre-pérez, adriel; pascual, javier; porcar, manuel; vilanova, cristina title: a lab in the field: applications of real-time, in situ metagenomic sequencing date: 2020-08-20 journal: biol methods protoc doi: 10.1093/biomethods/bpaa016 sha: doc_id: 331698 cord_uid: rwow1ydx high-throughput metagenomic sequencing is considered one of the main technologies fostering the development of microbial ecology. widely used second-generation sequencers have enabled the analysis of extremely diverse microbial communities, the discovery of novel gene functions, and the comprehension of the metabolic interconnections established among microbial consortia. however, the high cost of the sequencers and the complexity of library preparation and sequencing protocols still hamper the application of metagenomic sequencing in a vast range of real-life applications. in this context, the emergence of portable, third-generation sequencers is becoming a popular alternative for the rapid analysis of microbial communities in particular scenarios, due to their low cost, simplicity of operation, and rapid yield of results. this review discusses the main applications of real-time, in situ metagenomic sequencing developed to date, highlighting the relevance of this technology in current challenges (such as the management of global pathogen outbreaks) and in the near future of industry and clinical diagnosis. for many years, culture-dependent approaches were the only tools available for the study of microorganisms, although the vast majority of microbial species (>99%) cannot be cultivated [1].
this limitation lasted until the development of molecular techniques, such as the automation of sanger sequencing [2], molecular markers [3], cloning [4], or fluorescence in situ hybridization [5], among many others. however, these molecular techniques presented other weaknesses, like the inability to access low-abundance microorganisms, generating a bias towards the most abundant taxa. the term metagenomics was proposed in the 1990s [6] to define the set of genomes that could be found in a given environment. the fundamental aim of metagenomics is the study of microorganisms in the context of their community by means of sequencing genomic fragments from the entire microbiome simultaneously. nevertheless, this goal can be partially accomplished by sequencing marker genes, even though this approach should not be considered as true metagenomics [7]. in marker-gene studies, generic, relatively universal primers are used to amplify a fragment of a given gene through polymerase chain reaction (pcr) (e.g. 16s rrna for bacteria/archaea, 18s/its for fungi) from all genomes present in a given sample, and the resulting pool of amplicons is sequenced. next, the sequences are clustered into operational taxonomic units (otus), and each otu is taxonomically identified and compared across samples. traditionally, otus were constructed by grouping sequences according to a defined similarity threshold (typically 97%). however, otus are being replaced by amplicon sequence variants, which group sequences that are completely identical [8]. while fast and inexpensive, this method does not give any information on the hundreds of thousands of functional genes encoded by other parts of the (meta)genomes as these remain unsequenced. whole-genome sequencing (wgs), or shotgun metagenomics, can offer an alternative and complementary method since it is based on the application of sequencing techniques to the entirety of the genomic material in the microbiome of an environmental sample.
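the difference between threshold-based otus and exact sequence variants mentioned above can be illustrated with a toy sketch. it rests on several assumptions that are not in the text: the function names (`identity`, `greedy_otus`, `exact_variants`) are hypothetical, amplicons are assumed to be pre-aligned and of equal length, and clustering is done greedily against the first member of each cluster. in practice, dedicated tools use proper alignment for otus and error-model-based denoising for sequence variants, so exact grouping here is a simplification.

```python
from collections import Counter
from typing import Dict, List


def identity(a: str, b: str) -> float:
    """Fraction of identical positions between two equal-length, pre-aligned reads."""
    if len(a) != len(b):
        raise ValueError("toy identity expects equal-length sequences")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_otus(seqs: List[str], threshold: float = 0.97) -> List[List[str]]:
    """Greedy clustering: each read joins the first cluster whose centroid
    (its first member) is at least `threshold` identical; otherwise it
    founds a new cluster."""
    clusters: List[List[str]] = []
    for seq in seqs:
        for cluster in clusters:
            if identity(seq, cluster[0]) >= threshold:
                cluster.append(seq)
                break
        else:
            clusters.append([seq])
    return clusters


def exact_variants(seqs: List[str]) -> Dict[str, int]:
    """Group only completely identical reads and count their abundance."""
    return dict(Counter(seqs))
```

for example, two 100-bp reads that differ at two positions fall into the same otu at the 97% threshold but are counted as two distinct variants by exact grouping.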
sequencing the genomes of all microorganisms can provide information about the diversity of functional genes, and allows the assignment of each metabolic function to specific taxa, the identification of novel genes or proteins so far unknown, and the assembly of genomes in order to study evolutionary relationships. the number of metagenomic studies has dramatically increased in recent years, mainly due to the emergence of high-throughput sequencing technologies and the development of bioinformatic tools that facilitate the assembly of data and the assignment of sequences through a process called binning [9]. the binning process consists of grouping assembled sequences (contigs) into discrete units (bins), which ideally represent draft genomes of individual microorganisms [10]. overall, both high-throughput sequencing and bioinformatics have proven powerful tools that have generated, at a relatively low cost, a huge amount of genetic information [11]. high-throughput sequencing technologies can be divided into second- and third-generation ones. two of the most widely used second-generation sequencing (sgs) technologies are illumina and ion torrent. although both techniques are based on sequencing-by-synthesis, they have methodological differences. in illumina sequencers, short dna fragments are attached to a glass slide or micro-well and amplified to produce clusters. fluorescence-labelled nucleotides are then washed across the flowcell and are incorporated into the complementary dna sequence of the clustered fragment. then, fluorescence from the incorporated nucleotides is detected, revealing the dna sequence. on the other hand, ion torrent is based on the use of semiconductor materials that detect the release of protons (h+) while the dna molecule is synthesized [12, 13]. third-generation sequencing (tgs), also known as long-read sequencing, is based on single-molecule sequencing, which speeds up the sequencing process.
this technology is currently under active development and includes platforms such as pacific biosciences (pacbio) or oxford nanopore technologies (ont). pacbio is based on single-molecule, real-time sequencing technology. an engineered dna polymerase is attached to a single strand of dna, and these are placed into micro-wells called zero mode waveguides (zmws) [14] . during polymerization, the incorporated phospholinked nucleotides carry a fluorescent tag (different for each nucleotide) on their terminal phosphate. the tag is excited and emits light which is captured by a sensitive detector (through a powerful optical system). eventually, the fluorescent label is cleaved off and the polymerization complex is then ready to extend the strand [15] . on the other hand, in ont, a single-strand of dna passes through a protein nanopore, resulting in changes in the electric current that can be measured. the dna polymer complex consists of double-stranded dna and an enzyme that unwinds the double-strand and passes the singlestranded dna through the nanopore. as the dna bases pass through the pore, there is a detectable disruption in the electric current, and this allows the identification of the bases on the dna strand [16, 17] . three substantial improvements have been made in tgs technologies with regard to sgs: 1. increase in read length. while the sgs technologies produce many millions of short reads (150-400 bp), tgs typically produce much longer reads (6-20 kb)-without theoretical length limit for ont-albeit far fewer reads per run (typically hundreds of thousands). short reads produced by sgs lead to highly fragmented assemblies when it comes to de novo assembly of larger genomes because of difficulties in resolving repetitive sequences in the genome. 2. reduction of sequencing time (from days to hours or even minutes for real-time applications). 
while major sgs platforms use sequencing-by-synthesis technologies, tgs technologies directly target single dna molecules and, in the case of ont platforms, reads are available for analysis as soon as they have passed through the sequencer.
3. reduction or elimination of sequencing biases introduced by pcr amplification [18].
despite these improvements, tgs technologies have high systematic error rates (~5-15%), unlike sgs technologies (<1%) [19]. nevertheless, the accuracy of tgs can reach up to 99.9% in consensus sequences thanks to recent software developments [20]. in 2014, ont released the minion sequencing system which, unlike the bulky sequencing installations needed for the other technologies, is a palm-sized device producing long reads in real time. when launched, the minion read length was ~6-8 kb [21, 22]; however, lab protocols yielding longer sequences (>100 kb) have been reported [23]. minion is the smallest sequencing device currently available (10 × 3 × 2 cm, weighing just 90 g). it is inexpensive (less than €1000) in comparison with pacbio (more than €100 000), which gives laboratories with limited economic resources access to this technology. it can be plugged directly into a standard usb3 port on a computer with a simple configuration: a computer with a solid-state drive, >8 gb of ram, and >128 gb of hard disk space can be used for sequencing. the sequencer periodically outputs a group of reads in the form of raw current signals, which are then base-called on a laptop or on ont's ultraportable minit. furthermore, sequence analyses (such as sequence alignment and genome polishing) can be performed on a mobile phone [24].
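as a rough illustration of why consensus sequences can be far more accurate than individual reads, the following python sketch computes the probability that a simple majority vote over overlapping reads recovers the correct base. it assumes independent, random errors and a plain majority vote, which is a deliberate simplification: the text above notes that tgs errors are partly systematic, and the consensus software cited in [20] uses more sophisticated models. the function name and the chosen depths are illustrative only.

```python
from math import comb


def majority_consensus_accuracy(per_read_error: float, depth: int) -> float:
    """Probability that a strict majority of `depth` reads shows the correct base.

    Assumption (not from the source): errors are independent between reads.
    Ties at even depth are counted as failures, so this is a conservative bound.
    """
    if not 0.0 <= per_read_error <= 1.0:
        raise ValueError("per_read_error must be between 0 and 1")
    if depth < 1:
        raise ValueError("depth must be at least 1")
    p_correct = 1.0 - per_read_error
    needed = depth // 2 + 1  # smallest count that forms a strict majority
    return sum(
        comb(depth, k) * p_correct**k * per_read_error ** (depth - k)
        for k in range(needed, depth + 1)
    )


if __name__ == "__main__":
    # Per-read error rates taken from the ~5-15% range quoted in the text.
    for error in (0.05, 0.10, 0.15):
        for depth in (1, 5, 15, 25):
            acc = majority_consensus_accuracy(error, depth)
            print(f"error={error:.2f} depth={depth:2d} consensus accuracy={acc:.5f}")
```

under this idealized model, accuracy rises quickly with depth; real consensus accuracy is limited by the systematic component of the error, which a majority vote cannot remove.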
therefore, the ultra-portability, affordability, and speed in data production make the minion technology suitable for real-time sequencing in a variety of settings, such as ebola surveillance in west africa during the last outbreak [25], the inspection of microbial communities in the arctic [26], dna sequencing on the international space station (iss) [27], and even the sequencing of the recently emerged pandemic coronavirus sars-cov-2 [28, 29]. this review describes a range of applications in which portable, low-cost, fast, and robust technologies allowing in situ analysis of samples are key to addressing important challenges.
exploring the microbial diversity of natural environments via dna sequencing techniques has become routine in the last decade. large-scale studies like the earth microbiome project have led to the massive characterization of microbial populations inhabiting different environments on our planet [30]. moreover, metagenomic sequencing has proved to be very useful for a wide range of applications such as recovering new genomes from unculturable organisms, mining microbial enzymes with potential industrial applications, or discovering new biosynthetic gene clusters [31-33]. these studies have typically relied on next-generation sequencing platforms like illumina, which usually require shipping samples to a centralized sequencing facility. however, biodiversity assessment studies are usually carried out in remote locations with limited access to dna sequencing services, forcing scientists to design intensive sampling expeditions and to return to their home institutions to perform the sequencing and the data analysis. ont sequencers have emerged as an alternative to these traditional approaches, allowing the creation of mobile, in-field laboratories. figure 1 depicts a general workflow for the metagenomic analysis of samples using adapted protocols and a minion device. pomerantz et al. [34] and menegon et al.
[35] designed portable laboratories that included thermocyclers and centrifuges powered by external batteries, and a minion device connected to a laptop to perform in situ dna sequencing. neither work focused on metagenomic applications; both evaluated the taxonomic identity of different animal specimens (reptiles and amphibians) via targeted sequencing of the 16s rrna gene or other mitochondrial genes. however, the applied methodologies and lab configurations could easily be adapted to metataxonomic approaches relying on the amplification and massive sequencing of marker genes. the feasibility of minion-based metagenomic sequencing protocols has been tested especially in extremely cold environments. edwards et al. [36] reported for the first time the use of mobile laboratories for the in situ characterization of the microbiota of a high arctic glacier. they were able to adapt the widely used powersoil dna isolation kit (mobio, inc.) for in-field use, and to perform the data analysis both online and offline. the report included new results from in situ metagenomics and 16s rrna sequencing of different glacier samples, and a benchmark of the performance of in-field sequencing protocols using mock communities as well as real samples. in the latter case, they compared the resulting taxonomic profiles with the microbial composition assessed by sgs platforms, describing strongly positive pearson correlations at the phylum level. goordial et al. [26] were also able to perform in situ minion sequencing at the mcgill arctic research station. in this case, a permafrost sample was analysed using two different library preparation kits on the same extracted dna. a similar percentage of bacteria and archaea was detected using both kits, but differences in the relative abundance of viruses and eukaryotic organisms were noted. the taxonomic profile of the same permafrost sample was also obtained by means of 16s rrna illumina sequencing.
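the phylum-level comparison between platforms described above can be sketched as follows: two taxonomic profiles are converted to relative abundances over the union of observed phyla and compared with the pearson correlation coefficient. the read counts below are invented purely for illustration and do not come from any of the cited studies, and the helper names are hypothetical.

```python
from math import sqrt


def relative_abundance(counts):
    """Convert raw read counts per taxon into fractions that sum to 1."""
    total = sum(counts.values())
    if total <= 0:
        raise ValueError("profile contains no reads")
    return {taxon: n / total for taxon, n in counts.items()}


def pearson(x, y):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / sqrt(var_x * var_y)


def profile_correlation(counts_a, counts_b):
    """Correlate two profiles over the union of taxa (missing taxa count as 0)."""
    a = relative_abundance(counts_a)
    b = relative_abundance(counts_b)
    taxa = sorted(set(a) | set(b))
    return pearson([a.get(t, 0.0) for t in taxa], [b.get(t, 0.0) for t in taxa])


# Invented phylum-level read counts, for illustration only.
nanopore_counts = {"proteobacteria": 520, "actinobacteria": 210,
                   "bacteroidetes": 150, "cyanobacteria": 120}
illumina_counts = {"proteobacteria": 4800, "actinobacteria": 2500,
                   "bacteroidetes": 1400, "cyanobacteria": 900,
                   "firmicutes": 400}

if __name__ == "__main__":
    r = profile_correlation(nanopore_counts, illumina_counts)
    print(f"phylum-level pearson r = {r:.3f}")
```

working on relative abundances rather than raw counts keeps the comparison independent of the very different read numbers produced by the two platforms.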
notably, similar taxonomic groups were identified in all cases at the phylum level, although relative abundances varied among the different methodologies. in a parallel work, johnson et al. [37] used portable field techniques to isolate dna from desiccated microbial mats collected in the antarctic dry valleys, construct metagenomic libraries, and sequence the samples outdoors (taylor valley; temperature = -1 °C) and in the mcmurdo station (room temperature, rt). longer reads were achieved by sequencing at rt, but average and median read length did not depend on ambient temperature. the study also reported that cold temperatures (4 °C) reduced the quality of the generated sequences, even when working with high-quality dna (lambda phage). finally, gowers et al. [38] designed a miniaturized lab and transported it across europe's largest ice cap (vatnajökull, iceland) by ski and sledge. they adapted dna extraction and sequencing protocols to be performed in a tent during the expedition, using solar energy and external batteries to power the hardware. offline basecalling was achieved in situ using guppy (oxford nanopore, oxford, uk), but the metagenomic data analysis could not be carried out due to code errors while running the local version of kaiju [39]. in addition to cold environments, ont sequencers have also been applied to sequence a biofilm sample at a depth of 100 m within a welsh coal mine [40]. this work presented the 'metagenomad', a suite of off-the-shelf tools for metagenomic sequencing in remote areas using battery-powered equipment. the authors were able to perform the data analysis in situ by using centrifuge [41] and a local database for characterizing the microbial composition of the sample. interestingly, minion devices have also allowed dna sequencing off the earth. a first study by castro-wallace et al. [27] compared the performance of nanopore sequencing on the iss with ground control experiments, obtaining similar results.
as a proof of concept, the authors used equimolar mixtures of genomic dna from lambda bacteriophage, escherichia coli (strain k12, mg1655) and mus musculus (female balb/c mouse) for the metagenomic sequencing. data analysis could not be carried out on the iss because no laptop with the necessary tools installed was available, but it was demonstrated on the ground that sequencing analysis and microbial identification are completely feasible aboard the iss. recently, burton et al. [42] reported that the preparation and sequencing of 16s rrna libraries are also achievable on the iss. specifically, the zymobiomics microbial community dna standard (zymo research) was used as the input dna. again, the results were comparable to the microbial profiles obtained on earth. remarkably, carr et al. [43] determined that ont sequencers performed consistently in reduced-gravity environments, which would allow the use of nanopore sequencing in space expeditions to mars or icy moons. although the viability of nanopore sequencing has been widely demonstrated even under extremely harsh conditions, the vast majority of the studies obtained reduced yields compared to the current metagenomic output of the minion (table 1), which can reach up to 27 gbp using a single flowcell [48]. this highlights the need to optimize in-field protocols in order to maximize the use of sequencing resources and reduce the price per sample, which is a key factor in some applications. recently, urban et al. [44] studied the microbial communities present in the surface water of the river cam (cambridge). all the protocols were carried out in the lab, and the authors were able to achieve up to ~5.5 million full-length 16s rrna sequences with exclusive barcode assignments in a single minion run. other groups have used minion devices for characterizing river water [45], seawater [46], and marine sediments [47] through metagenomic sequencing.
even though these experiments were not implemented in the field, they demonstrated the possibility of obtaining higher sequencing yields (table 1). the described outputs are compatible with more ambitious metagenomic analyses, such as the de novo recovery of single genomes directly from complex environmental samples. for that reason, the adaptation of sequencing protocols to field conditions still needs further optimization.
microbiology has been present in industry for centuries. in fact, human beings used microorganisms for their own benefit long before they knew that microscopic life existed. even today, most microbiome-driven industrial processes are still not completely understood. metagenomic sequencing has been widely applied to shed light on the microbial and metabolic transitions occurring in these industrial transformations. some examples include the investigation of the link between microorganisms and their key roles or prevalence in microbial-based food products [49, 50]; the interaction of plants and root-associated bacteria for enhancing plant mineral nutrition [51]; and the description of the adverse effects of industrial by-products used as soil fertilizers [52]. ont portable sequencers are a valuable tool not only for characterizing industrial microbiomes, but also for detecting and monitoring crucial microorganisms in real time (fig. 2). hardegen et al. [53] used full-length 16s rrna sequencing to analyse changes in the archaeal community present in anaerobic digesters operating under different conditions. higher proportions of methanosarcina spp. were detected in the reactors achieving elevated biogas production. although the sequencing was not carried out in situ, the suitability of minion for monitoring and evaluating an industrial process through a microbial marker was demonstrated.
bacteriomes involved in biogas production have also been studied through nanopore sequencing [54, 55], producing results that could be coupled with the lotka-volterra model to analyse the microbial interactions occurring in the reactor [56]. water quality and wastewater management is another area of great interest for microbial monitoring. in fact, it has been proposed that sewage could serve for tracking infectious agents excreted in urine or faeces, such as sars-cov-2 [57]. in this particular context, the in situ and real-time assessment of pathogenic microorganisms by means of minion sequencing would be especially advantageous. hu et al. [58] reported correlations between e. coli culturing counts and the proportion of nanopore reads mapping to a comprehensive human gut microbiota gene dataset, highlighting the potential use of this molecular technique as an indicator of faecal contamination. ont metagenomic sequencing results were similar to those obtained with illumina 16s rrna sequencing, but were obtained in less time using minion. nanopore sequencing could also be employed for evaluating antibiotic resistance genes (args) and antimicrobial-resistant pathogens present in wastewater treatment plants [59]. in this case, both illumina and nanopore shotgun sequencing revealed comparable abundances of major arg types. agreement between the two platforms has also been described for the analysis of different water sources in nepal through 16s rrna sequencing [60]. although long reads allowed the classification of 59.41% of the reads down to the species level (no illumina reads were classified at this level), a significant number of false positives arose. these results were consistent with observations from [61], which showed that bacterial identification at the genus level was reliable.
species-level misclassifications could be partially addressed by employing different, optimized bioinformatic approaches for the taxonomic classification [45, 62], by sequencing the complete 16s-its-23s region of the ribosomal operon [63, 64], or by coupling minion sequencing with complementary quantitative pcr assays [60]. the agro-food industry would also benefit from real-time sequencing. for instance, nanopore metagenomic sequencing could be useful for the quick detection of plant pathogens infecting crops. hu et al. [65] were able to identify the fungal species causing diseases in wheat plants that had previously been infected with known microbes. co-occurrences between fungal and bacterial genera were also detected. viral infectious diseases could be monitored in situ using this technology, allowing a rapid and improved response to outbreaks [66]. other successful applications of ont in the food industry include the characterization of the microbiome of a salmon ectoparasite (caligus rogercresseyi), revealing its potential role as a reservoir for fish pathogens [67]; and the determination of the fish species present in complex mixtures, which would help to prevent, and rapidly detect, food fraud [68]. overall, nanopore results generally agreed with those obtained by illumina sequencing when available, thus validating the use of this technology for the vast majority of applications. despite the huge potential shown, the suitability of minion sequencing in an industrial context has yet to be ascertained, since none of the discussed works was carried out under field conditions.
in fact, some critical points must be addressed before this technique can become a standard in industry: (i) sequencing cost should be reduced; (ii) rapid and reliable in situ dna extraction and library preparation protocols should be designed and validated; (iii) minimal sequencing yields should be determined for each specific application; (iv) fast, real-time pipelines should be created and tested; and (v) the level of expertise required for managing the data and the samples should be notably reduced.
microbial infections are an increasingly relevant problem in intensive care units worldwide. in particular, the emergence of multidrug-resistant microorganisms is one of the main threats our society is facing from a clinical point of view [69]. current diagnostics for pathogen identification in hospitals still depend mainly on culture- and molecular-based approaches, which have several limitations regarding specificity, bias, sensitivity, and time to diagnosis. the revolution of high-throughput sequencing and the decreasing costs associated with sgs have strongly empowered clinical diagnostics and other aspects of medical care [70]. in the particular case of clinical infections, high-throughput metagenomic sequencing allowed for the first time the precise strain-level identification of multiple pathogenic agents in single, all-inclusive diagnostic tests [71]. however, the limitations of sgs regarding cost and time to results (as described in previous sections) hamper its application when a fast analysis is needed. for instance, in the case of sepsis, patients are usually treated with broad-spectrum antibiotics until the first results of culture-based analysis (including determination of antibiotic susceptibility) are obtained 36-48 h later. when available, sgs approaches can speed up the process to ~24 h, but they are expensive, labour intensive, and informatically challenging for most hospitals and healthcare centres [72].
in this context, minion sequencing (fig. 1) paves the way towards a diagnostic alternative within a clinically critical timeframe, which could reduce the morbidity and mortality associated with major microbial infections. the first reports on minion sequencing in clinical diagnosis focused on the detection of single pathogens during outbreaks. flagship examples of such applications are the fast (<24 h) detection of ebola virus during the 2015 outbreak in west africa [16, 73] and the fast (<6 h) phylogenomic analysis of salmonella strains during a hospital outbreak [74]. other significant efforts have focused on the fast identification of single clinical isolates [75], including the analysis of args in a timeframe of <6 h [76, 77]. however, a range of use cases in the clinical field require metagenomic sequencing to unveil the identity of viral or microbial communities rather than single isolates. in the case of viruses, the seminal work of greninger et al. [78] reported the detection of several viral pathogens in human blood in <6 h from sample collection, using cdna conversion and random amplification prior to sequencing. despite the notable error rate observed in the sequences, all viruses (chikungunya virus, ebola virus, and hepatitis c virus) were correctly identified and most of their genomes were recovered with high accuracy (97-99%). a similar approach was reported for the rapid identification of mosquito-borne arboviruses [79] and of other viruses causing co-infections, including dengue, from human serum samples [80]. on the other hand, an extensive number of reports have focused on the analysis of infections caused by bacterial communities (table 2), using different approaches that resulted in different analysis times.
even though a range of pcr-free protocols have been developed for minion sequencing, one of the main problems associated with the analysis of microbial communities in clinical samples is the overwhelming concentration of host dna, which hampers the detection of bacterial sequences during the first hours of the sequencing runs [89, 90]. several strategies have been applied to partially overcome this limitation. on the one hand, pcr-based approaches targeting the 16s rrna gene have proved to be the most rapid methods for identifying pathogenic agents in human samples. particular examples are the metagenomic analysis of empyema patients with pleural effusion [83] and the metagenomic analysis of patients with acute respiratory distress syndrome [84]; both studies reported obtaining the first results only 2 h after the collection of samples. on the other hand, the use of human cell-free samples allows the application of wgs protocols for the analysis of the communities, yielding not only taxonomic information but also the identification of putative antimicrobial resistance genes, which are highly relevant for the selection of effective treatments. in 2017, pendleton et al. [86] analysed lavage fluids from patients with pneumonia and managed to identify the bacterial pathogens in the lungs in <9 h using a wgs strategy. similar approaches performed on urine samples [87] and resected valves from patients with endocarditis [85] yielded a diagnosis in 4 h. for the analysis of bacterial sepsis, recent reports describe the application of minion metagenomic sequencing to cell-free samples (<6 h from samples to results) [81] and to faecal samples from preterm infants (obtaining results in <5 h) [82]. the depletion of human dna prior to metagenomic sequencing has also proved a useful alternative for reducing total analysis time [88].
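the impact of host dna on sequencing effort can be illustrated with a back-of-the-envelope calculation. assuming that reads are sampled in proportion to dna abundance (a simplification that ignores read-length and library-preparation biases), the total number of reads needed to obtain a given number of microbial reads is the target divided by the microbial fraction. the fractions used below are hypothetical and are not taken from the cited studies; the function names are illustrative.

```python
import math


def total_reads_needed(target_microbial_reads: int, microbial_fraction: float) -> int:
    """Total reads to sequence so that, on average, the target number is microbial.

    Assumption: each read is microbial with probability `microbial_fraction`.
    """
    if target_microbial_reads < 0:
        raise ValueError("target_microbial_reads must be non-negative")
    if not 0.0 < microbial_fraction <= 1.0:
        raise ValueError("microbial_fraction must be in (0, 1]")
    return math.ceil(target_microbial_reads / microbial_fraction)


def depletion_gain(fraction_before: float, fraction_after: float) -> float:
    """Fold reduction in required reads when host depletion raises the microbial fraction."""
    if fraction_before <= 0 or fraction_after <= 0:
        raise ValueError("fractions must be positive")
    return fraction_after / fraction_before


if __name__ == "__main__":
    target = 10_000  # hypothetical number of microbial reads wanted
    for fraction in (0.001, 0.01, 0.10, 0.50):  # hypothetical microbial fractions
        print(f"microbial fraction {fraction:>5}: "
              f"{total_reads_needed(target, fraction):>10,d} total reads")
    print(f"gain from 1% to 10% microbial dna: {depletion_gain(0.01, 0.10):.1f}-fold")
```

the same arithmetic explains why both host depletion and selective sequencing are attractive: every increase in the microbial fraction translates directly into fewer reads, and therefore less time, to reach a given depth.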
in the current sars-cov-2 outbreak, minion sequencing has been proposed not only as a rapid tool for wgs, but also as a metagenomics-based approach for the rapid diagnosis of polymicrobial/viral infections associated with coronavirus disease (covid-19). this is especially relevant for optimizing the treatment of patients suffering severe symptoms of the disease. finally, other advantages of minion sequencing besides the reduction of analysis time should also be highlighted. given the low price of the devices and consumables (in comparison to sgs equipment), minion has enabled the metagenomic analysis of clinical samples in areas with limited resources [25, 91]. also, from a technical point of view, the generation of long reads increases the resolution of the taxonomic analysis of the samples, reaching in most cases a species-level identification of the most abundant members of the communities [92, 93].
the 'read until' strategy: towards cost-effective in situ metagenomics
metagenomic applications are often limited by the nature of the samples to be analysed. for instance, the characterization of prokaryotes or viruses present in a sample dominated by host dna via direct shotgun sequencing could be really challenging and would require high sequencing depth, thus increasing the cost of the analysis [94, 95]. although it is possible to enrich samples in particular fractions (e.g. by differential centrifugation and filtration) or dna fragments (e.g. by pcr amplification and dna hybridization) [96, 97], several factors should be taken into account when considering a fast, in situ application. mainly, it would be especially difficult to adapt enrichment protocols to field conditions, and they could cause substantial losses of genetic material, add extra time to sample preparation, and introduce a significant bias. in this context, targeted or selective real-time sequencing, also known as 'read until', is a new approach for focusing the sequencing process on specific dna fragments of interest.
read until is based on the ability to program nanopore sequencers to reject individual dna molecules while they are being read [98], freeing the individual nanopore to sequence another dna fragment. ont sequencing speed is estimated to be 450 bp/s [98-100], and it is relatively common to achieve sequences longer than 100 kbp [24, 101]. theoretically, discarding a read after only a few seconds of translocation through the nanopore would prevent wasting sequencing capacity, which could be saved for sequencing targeted dna fragments [99]. in a metagenomic context, the read until strategy could be used to deplete the sequencing of undesirable dna (e.g. host dna) or to enrich specific genes/genomes. these depletion/enrichment procedures would not require any experimental steps, thus facilitating their use under field conditions. selective sequencing was first demonstrated by loose et al. [102]. later, edwards et al. [103] showed the ability of read until strategies to enrich e. coli genomic sequences over human dna. however, the actual revolution in targeted ont sequencing has taken place in recent months, with three different approaches being released almost simultaneously (table 3). the first one, named boss-runs, introduced the dynamic selection of dna regions of interest [100]. this method consists of focusing sequencing efforts on areas that have achieved low coverage during the run, thus compensating for sequencing bias. with this methodology, de maio et al. [100] were able to effectively enrich multiple loci of interest within a bacterial genome, enabling up to a 5-fold coverage improvement. in the field of metagenomics, boss-runs could be applied to improve the characterization of samples by ensuring the deep sequencing of clade-specific genetic markers [104]. on the other hand, kovaka et al. [99] recently developed uncalled, a tool able to directly map ont raw signals in order to detect wanted/unwanted sequences.
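the potential saving in pore time can be estimated from the figures quoted above (450 bp/s and reads of about 100 kbp), and the accept/reject logic behind coverage-driven selection can be sketched in a few lines of python. this is an illustrative sketch only: it does not use the ont read until api, boss-runs, or uncalled. the mapping step is replaced by a pre-assigned region label, and the decision length of 450 bases (about one second of sequencing), the region names, and the coverage target are assumptions made for the example.

```python
SEQUENCING_SPEED_BP_PER_S = 450  # speed quoted in the text


def seconds_to_sequence(length_bp, speed=SEQUENCING_SPEED_BP_PER_S):
    """Time a pore spends translocating a molecule of the given length."""
    return length_bp / speed


def pore_time_saved(read_length_bp, decision_bp, speed=SEQUENCING_SPEED_BP_PER_S):
    """Seconds of pore time freed by ejecting a read after `decision_bp` bases.

    Ignores the (assumed small) time needed to reverse the voltage and eject.
    """
    if decision_bp >= read_length_bp:
        return 0.0  # the read finished before a decision could be made
    return (read_length_bp - decision_bp) / speed


def keep_read(region, coverage, target_coverage, unwanted=frozenset()):
    """Decide whether to keep sequencing a molecule.

    region: label assigned by a (hypothetical) mapper, or None if unmapped.
    coverage: dict mapping region label -> coverage reached so far in the run.
    unwanted: labels to always reject, e.g. host dna (depletion).
    """
    if region is None:
        return True  # assumption: unmapped reads are sequenced normally
    if region in unwanted:
        return False
    # enrichment: only regions still below the target coverage are kept
    return coverage.get(region, 0.0) < target_coverage


if __name__ == "__main__":
    print(f"full 100 kbp read: {seconds_to_sequence(100_000):.0f} s of pore time")
    print(f"saved by rejecting after 450 bases: {pore_time_saved(100_000, 450):.0f} s")
    coverage = {"marker_a": 35.0, "marker_b": 4.0}  # hypothetical values
    for label in ("marker_a", "marker_b", "host", None):
        decision = keep_read(label, coverage, 30.0, unwanted={"host"})
        print(label, "->", "keep" if decision else "reject")
```

in practice the difficult part is the step that this sketch takes for granted: assigning a read to a region from its first few hundred bases (or from the raw signal) quickly enough for the rejection to pay off.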
they used this approach for sequencing a mock community (zymobiomics high molecular weight) containing seven bacteria and one yeast. the objective was to map the generated signals to a database containing the references for the bacterial genomes (29 mbp), rejecting dna strands when a match was detected. bacterial sequencing depletion resulted [98]. in this work, the same zymobiomics mock community was used, but the enrichment of the yeast genome was achieved in a different way. briefly, sequencing started with default parameters, but when a pre-defined coverage was reached for a specific microorganism, its genome sequence was given to the read until application in order to reject dna strands coming from this microorganism. interestingly, the pipeline was adapted to incorporate a metagenome classifier (centrifuge) [41], allowing the use of this strategy without prior knowledge of the sample. overall, selective sequencing has proved useful for different metagenomic applications. nevertheless, an associated reduction in total yield per flowcell has been reported [98, 99]. this could be explained by two main reasons: (i) rejecting dna strands increases the time during which a nanopore is not reading a molecule, and (ii) the voltage changes needed for rejecting the fragments may produce pore blockages [98]. a nuclease flush could potentially help to overcome this situation, although current throughputs are sufficient for enriching dna sequences and reducing the time needed to reach the desired coverage [98, 99], which is a key factor in many in situ applications.
in this work, we have reviewed the state of the art, current research, and applications of real-time, in situ metagenomics.
the spectacular development of metagenomic technologies in recent years, together with the number and importance of current and new challenges (including biomedical hazards) that could be addressed with portable metagenomic sequencing, reveals the importance of further developing this technology to match a variety of niches that we can already foresee. for example, we can envisage a near future in which microbial ecologists will be equipped with small, minion-like devices that will allow them to extract dna, carry out fast sequencing, and obtain the results in a very short time. the understandability of the results and the minimization of the visible bioinformatic background will be very important to allow non-specialized staff to use such portable devices. the recent covid-19 outbreak as well as the surveillance of ebola, zika, and many other emergent diseases will need an army of (not necessarily specialized) detectors, for which easy-to-run, easy-to-understand platforms will be needed. alternatively, raw sequencing data will have to be transmitted through secure internet-based applications to centralized points, where specialist staff will further process and finally analyse the information. such portable, easy-to-use, cheap devices will be used in quality control of all sorts of foods and ingredients; in the identification of crop pathogens on an individual plant basis; in forensic investigations; in the assessment of the energetic potential of different substrates or batches for biogas production; or for the identification of the best soils for specific crops, as deduced from the soil microbial (either taxonomic or functional) profile. in order to realize all these possibilities (which we have ambitiously described in the future and not the conditional tense), five traits will have to be combined. the in situ, portable platform of the future will (have to) be: inexpensive, robust, fast, easy to use, and connectable.
a platform with these features will have a game-changing effect on the way we perform, and understand, microbial ecology.
the uncultured microbial majority
nucleotide sequence of bacteriophage φx174 dna
the analysis of natural microbial populations by ribosomal rna sequences
analyzing natural microbial populations by rrna sequences
phylogenetic identification and in situ detection of individual microbial cells without cultivation
molecular biological access to the chemistry of unknown soil microbes: a new frontier for natural products
shotgun metagenomics, from sampling to analysis
exact sequence variants should replace operational taxonomic units in marker-gene data analysis
time series community genomics analysis reveals rapid shifts in bacterial species, strains, and phage during infant gut colonization
maxbin: an automated binning method to recover individual genomes from metagenomes using an expectation-maximization algorithm
metagenomics and bioinformatics in microbial ecology: current status and beyond
an integrated semiconductor device enabling non-optical genome sequencing
progress in ion torrent semiconductor chip based sequencing
zero-mode waveguides for single-molecule analysis at high concentrations
next generation sequencing technology: advances and applications
characterization of individual polynucleotide molecules using a membrane channel
sequence-specific detection of individual dna strands using engineered nanopores
a window into third-generation sequencing
minion analysis and reference consortium: phase 1 data release and analysis
connet: accurate genome consensus in assembling nanopore sequencing data via deep learning
improved data analysis for the minion nanopore sequencer
successful test launch for nanopore sequencing
minion-based long-read sequencing and assembly extends the caenorhabditis elegans reference genome
f5n: nanopore sequence analysis toolkit for android smartphones
real-time, portable genome sequencing for ebola surveillance
in situ field sequencing and life detection in remote (79°26′n) canadian high arctic permafrost ice wedge microbial communities
nanopore dna sequencing and genome assembly on the international space station
isolation and characterization of sars-cov-2 from the first us covid-19 patient
amplicon based minion sequencing of sars-cov-2 and metagenomic characterisation of nasopharyngeal swabs from patients with covid-19
a communal catalogue reveals earth's multiscale microbial diversity
genomes from uncultivated prokaryotes: a comparison of metagenome-assembled and single-amplified genomes
marine metagenome as a resource for novel enzymes
recovering genomics clusters of secondary metabolites from lakes using genome-resolved metagenomics
real-time dna barcoding in a rainforest using nanopore sequencing: opportunities for rapid biodiversity assessments and local capacity building
on site dna barcoding by nanopore sequencing
in-field metagenome and 16s rrna gene amplicon nanopore sequencing robustly characterize glacier microbiota
real-time dna sequencing in the antarctic dry valleys using the oxford nanopore sequencer
entirely off-grid and solar-powered dna sequencing of microbial communities during an ice cap traverse expedition
fast and sensitive taxonomic classification for metagenomics with kaiju
deep sequencing: intraterrestrial metagenomics illustrates the potential of off-grid nanopore dna sequencing
centrifuge: rapid and sensitive classification of metagenomic sequences
off earth identification of bacterial populations using 16s rdna nanopore sequencing
nanopore sequencing at mars, europa and microgravity conditions
freshwater monitoring by nanopore sequencing
metagenomic profiling of microbial pathogens in the little bighorn river
microbial diversity characterization of seawater in a pilot study using oxford nanopore technologies long-read sequencing
near-complete lokiarchaeota genomes from complex environmental samples using long and short read metagenomic analyses
complete, closed bacterial genomes from microbiomes using nanopore sequencing
sequencing of the cheese microbiome and its relevance to industry
the grapevine and wine microbiome: insights from high-throughput amplicon sequencing
the role of soil microorganisms in plant mineral nutrition - current knowledge and future directions
genome-resolved metagenomics of sugarcane vinasse bacteria
methanogenic community shifts during the transition from sewage monodigestion to co-digestion of grass biomass
ammonia removal during leach-bed acidification leads to optimized organic acid production from chicken manure
shedding light on biogas: phototrophic biofilms in anaerobic digesters hold potential for improved biogas production
chemically stressed bacterial communities in anaerobic digesters exhibit resilience and ecological flexibility
how sewage could reveal true scale of coronavirus outbreak
stationary and portable sequencing-based approaches for tracing wastewater contamination in urban stormwater systems
mobile antibiotic resistome in wastewater treatment plants revealed by nanopore metagenomic sequencing
a comparative assessment of conventional and molecular methods, including minion nanopore sequencing, for surveying water quality
targeting the 16s rrna gene for bacterial identification in complex mixed samples: comparative evaluation of second (illumina) and third (oxford nanopore technologies) generation sequencing technologies
computational methods for 16s metabarcoding studies using nanopore sequencing data
microbiota profiling with long amplicons using nanopore sequencing: full-length 16s rrna gene and the 16s-its-23s of the rrn operon
multi-locus and long amplicon sequencing approach to study microbial diversity at species level using the minion™ portable nanopore sequencer
pathogen detection and microbiome analysis of infected wheat using a portable dna sequencer
real time portable genome sequencing for global food security
nanopore sequencing of microbial communities reveals the potential role of sea lice as a reservoir for fish pathogens
toward on-site food authentication using nanopore sequencing
update on the antibiotic resistance crisis
the next-generation sequencing revolution and its impact on genomics
actionable diagnosis of neuroleptospirosis by next-generation sequencing
analysis of culture-dependent versus culture-independent techniques for identification of bacteria in clinically obtained bronchoalveolar lavage fluid
nanopore sequencing as a rapidly deployable ebola outbreak tool
rapid draft sequencing and real-time nanopore sequencing in a hospital outbreak of salmonella
rapid identification of pathogens from positive blood culture bottles with the minion nanopore sequencer
rapid nanopore sequencing of plasmids and resistance gene detection in clinical isolates
integrating informatics tools and portable sequencing technology for rapid detection of resistance to anti-tuberculous drugs
rapid metagenomic identification of viral pathogens in clinical samples by real-time nanopore sequencing analysis
metagenomic arbovirus detection using minion nanopore sequencing
assessment of metagenomic nanopore and illumina sequencing for recovering whole genome sequences of chikungunya and dengue viruses directly from clinical samples
rapid next-generation sequencing-based diagnostics of bacteremia in septic patients
rapid minion profiling of preterm microbiota and antimicrobial-resistant pathogens
a portable system for rapid bacterial composition analysis using a nanopore-based sequencer and laptop computer
real-time diagnostic analysis of minion™-based metagenomic sequencing in clinical microbiology evaluation: a case report
identification of pathogens in culture-negative infective endocarditis cases by metagenomic analysis
rapid pathogen identification in bacterial pneumonia using real-time metagenomics
identification of bacterial pathogens and
antimicrobial resistance directly from clinical urines by nanopore-based metagenomic sequencing nanopore metagenomics enables rapid clinical diagnosis of bacterial lower respiratory infection real-time analysis of nanopore-based metagenomic sequencing from infected orthopaedic devices culture-independent analysis of liver abscess using nanopore sequencing rapid sequencingbased diagnosis of infectious bacterial species from meningitis patients in zambia species-level evaluation of the human respiratory microbiome rapid and real-time identification of fungi up to the species level with long amplicon nanopore sequencing from clinical samples human and extracellular dna depletion for metagenomic analysis of complex clinical infection samples yields optimized viable microbiome profiles sputum dna sequencing in cystic fibrosis: non-invasive access to the lung microbiome and to pathogen details comprehensive viral enrichment enables sensitive respiratory virus genomic identification and analysis by next generation sequencing metagenomic nanopore sequencing of influenza virus direct from clinical respiratory samples nanopore adaptive sequencing for mixed samples. whole exome capture and targeted panels targeted nanopore sequencing by real-time mapping of raw electrical signal with uncalled boss-runs: a flexible and practical dynamic read sampling framework for nanopore sequencing linear assembly of a human centromere on the y chromosome real-time selective sequencing using nanopore technology real-time selective sequencing with rubric: read until with basecall and reference-informed criteria metaphlan2 for enhanced metagenomic taxonomic profiling the authors want to thank all their colleagues from darwin bioprospecting excellence who contributed to the development of minion sequencing protocols and field applications. 
adriel latorre is a recipient of a doctorado industrial fellowship from the ministerio de ciencia, innovación y universidades (reference di-17-09613). conflict of interest statement. none declared. key: cord-339915-8j04y50s authors: deng, wei; luan, yihui title: dv-curve representation of protein sequences and its application date: 2014-05-08 journal: comput math methods med doi: 10.1155/2014/203871 sha: doc_id: 339915 cord_uid: 8j04y50s based on the detailed hydrophobic-hydrophilic (hp) model of amino acids, we propose a dual-vector curve (dv-curve) representation of protein sequences, which uses two vectors to represent each letter of a protein sequence. this graphical representation not only avoids degeneracy but also remains easy to visualize no matter how long the sequences are, and it reflects the length of the protein sequence. we then transform the 2d graphical representation into a numerical characterization that facilitates quantitative comparison of protein sequences. the utility of this approach is illustrated by two examples: one is a similarity/dissimilarity comparison among different nd6 protein sequences based on their dv-curve figures; the other is a phylogenetic analysis of coronaviruses based on their spike proteins. graphical representation has become a very common way to analyze the huge amount of gene data. generally, with this method we can first inspect the sequences visually and qualitatively in order to recognize major differences among similar gene sequences, and then derive mathematical characterizations of the sequences to analyze their similarity/dissimilarity and evolutionary homology. the letter sequence representation (lsr) of dna sequences represents each base by one of four letters: a, t, g, and c. dna sequences can be represented in spaces of different dimensions. for example, the g-curve and h-curve [1] were first proposed by hamori and ruskin more than thirty years ago.
later, gates [2] established a 2d graphical representation that was simpler than the h-curve. however, gates's graphical representation has high degeneracy because circuits can appear in its curve. in recent studies, several researchers have outlined different kinds of graphical representations of dna sequences in 2d [3-11], 3d [12-15], 4d [16], 5d [17], and 6d [18]. among these methods, we stress here the dv-curve representation proposed by zhang [10]. the dv-curve uses two vectors to represent each letter of a dna sequence and avoids degeneracy and loss of information. furthermore, the dv-curve remains easy to visualize no matter how long the sequences are and reflects the length of the dna sequence. the lsr of protein sequences represents each amino acid by one of twenty letters: a, r, n, d, c, q, e, g, h, i, l, k, m, f, p, s, t, w, y, and v. although protein sequences and dna sequences are both symbolic sequences, methods for the graphical representation of protein sequences are less common than those for dna sequences. the key reason is that extending a dna graphical representation to protein sequences enormously increases the number of possible alternative assignments for the 20 amino acids. the amino acid sequence is the key to discovering protein structure and function in the cell, so the analysis of amino acid sequences is a very important part of postgenomic studies. the study of graphical representations of protein sequences emerged only recently. the first protein visualization model was not proposed until 2004, by randić et al. [19]. since then, researchers have studied graphical representations of protein sequences from different perspectives [20-29]. in this paper, we introduce the dv-curve graphical representation of protein sequences based on the detailed hydrophobic-hydrophilic (hp) model of amino acids.
because it is based on hydropathy, an important property of amino acids, this approach involves a relatively small number of arbitrary choices in the graphical representation of proteins. the representation also visualizes protein sequences well, describing them in a perceivable way. as applications, we analyze the similarity/dissimilarity among some nd6 sequences and construct the phylogenetic tree of 35 coronavirus spike proteins. 2.1. classification of protein sequences. the amino acid sequence is closely related to biological function. the closer the genetic relationship is, the smaller the difference in amino acid composition will be. over the past thirty years, the characteristics of protein sequences have been studied by establishing different classification models [21-24, 26, 27]. a well-known model of protein sequences is the hydrophobic (h, or nonpolar)-hydrophilic (p, or polar) model, that is, the hp model. however, the hp model may be too simple and does not sufficiently account for the heterogeneity and complexity of the natural set of residues [30]. based on brown's work [31], the 20 kinds of amino acids are divided into four groups: nonpolar (np), negative polar (nep), uncharged polar (up), and positive polar (pp). this is called the detailed hp model, which can provide more information than the original hp model. for a given protein sequence $S = s_1 s_2 \cdots s_n$ of length $n$, where $s_i$ is the letter in the $i$th position of the protein sequence ($i = 1, 2, \ldots, n$), we define a primary protein sequence as a symbolic sequence over four letters according to a rule that assigns each amino acid to one of the four groups above. so $\varphi(s_i)$ is the substitution for $s_i$, and we obtain a sequence $\varphi(S) = \varphi(s_1)\varphi(s_2)\cdots\varphi(s_n)$. here $\varphi(s_i)$ is a letter of the alphabet $\{a_1, a_2, a_3, a_4\}$. any given protein primary sequence can thus be transformed into a new four-letter sequence according to the above rule. in this section, we will construct the dv-curve representation of a protein sequence.
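the reduction to a four-letter sequence can be sketched in a few lines of python. the explicit grouping of residues below is an assumption: the classification rule itself did not survive in this text, so a commonly quoted version of the detailed hp grouping is used, and the names `GROUPS` and `reduce_sequence` are hypothetical.

```python
# Assumed residue grouping for the detailed HP model. The rule is not
# recoverable from the text above, so treat this table as illustrative.
GROUPS = {
    "NP": "AILMFPWV",   # nonpolar
    "NEP": "DE",        # negative polar
    "UP": "NCQGSTY",    # uncharged polar
    "PP": "RHK",        # positive polar
}

# Invert the table: one-letter amino acid code -> group label.
RESIDUE_TO_GROUP = {aa: group for group, residues in GROUPS.items() for aa in residues}


def reduce_sequence(protein):
    """Replace every amino acid letter by the label of its group."""
    return [RESIDUE_TO_GROUP[aa] for aa in protein.upper()]
```

a call such as `reduce_sequence("MKDS")` returns one group label per residue; a letter outside the twenty standard codes raises a `KeyError`, which makes unexpected input visible instead of silently dropping it.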
given any protein primary sequence of length $n$, we can transform it into a new sequence over the character set $\{a_1, a_2, a_3, a_4\}$. as shown in figure 1, each of these letters is assigned a pair of consecutive vectors. we connect adjacent dots with lines and thus obtain a dual-vector curve. this process is shown in figure 2. based on the construction of the dv-curve, we obtain two mathematical models: one is "from protein sequence to dv-curve," and the other is "from dv-curve to protein sequence." first, we introduce some common symbols and variables. (1) according to the classification rule, we describe a protein sequence as $\varphi(S) = \varphi(s_1)\varphi(s_2)\cdots\varphi(s_n)$, which means that the protein sequence is a concatenation of these letters. (2) $(x_i, y_i)$ is the coordinate of the $i$th point of the dv-curve, and $(x_0, y_0) = (0, 0)$ is the start point. model one. given a primary protein sequence, we can draw its dv-curve: according to four formulas, the coordinate $(x_i, y_i)$ of each point can be calculated. then we connect all the points with straight lines, and the dv-curve is obtained. model two. given a dv-curve, we can also obtain the coarse-grained description of the protein sequence based on the detailed hp model. in order to facilitate quantitative comparisons of sequences, we give a numerical characterization of the graphical curve as a descriptor. in general, we transform the graphical representation into a mathematical object such as a matrix in order to derive some invariants. the frequently used matrices include four matrices proposed by randić et al. [6, 8, 32-34]. of course, there are other matrix invariants such as the average matrix element, the average row sum, the wiener number, and the ale-index. these methods have been widely used and have proved useful. here, we use an alternative sequence invariant proposed by liao et al.
[35]. this index is relatively simple to calculate, which makes it convenient for long sequences. if we change the order in which $a_1, a_2, a_3, a_4$ are matched to the basic dual vectors, we get another curve. so for a given sequence, we can get 4! = 24 different dv-curves in total. therefore, a protein primary sequence can be characterized by a 24-component vector. the comparison of biological sequences is one of the most important parts of bioinformatics when analyzing similarities of function and properties. in this section, we give two main applications of this new graphical representation. one is similarity analysis based on visual graphics. generally, similarity analysis follows one of two methodologies: sequence alignment or comparison of sequence descriptors. our brain is good at recognizing figures, which helps when analyzing similarity among multiple sequences, so it is desirable to perform similarity analysis by inspecting the dv-curves of proteins. the other application is evolutionary homology analysis based on the numerical characterization of the dv-curve, for which we construct a 24-component vector to characterize any protein sequence. as further work, the phylogenetic tree of 35 coronavirus spike proteins is constructed. since smith and waterman developed a dynamic programming algorithm in 1981, many alignment algorithms that identify whether two biological sequences are similar to each other have been studied. these methods have proved to be efficient. however, multiple sequence alignment (msa) of several hundred sequences has always been a bottleneck. in 1994, msa was proved to be an np-complete problem by wang and jiang [36]. moreover, no deterministic polynomial-time algorithm is known for any np-complete problem, so an exact msa of many sequences can require a prohibitive amount of computing time.
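the construction of the dv-curve, its inversion, and the 24-component vector can be sketched together as follows. this is a minimal sketch under stated assumptions, not the authors' implementation: the unit steps `RISE` and `FALL`, the way letters are matched to pairs of steps, and the descriptor `mean_height` are all hypothetical stand-ins, because the vector assignment and the invariant of liao et al. are not recoverable from the text above.

```python
from itertools import permutations

LETTERS = ("NP", "NEP", "UP", "PP")          # labels of the four groups
RISE, FALL = (1, 1), (1, -1)                 # assumed unit steps
BASIC_PAIRS = ((RISE, RISE), (RISE, FALL), (FALL, RISE), (FALL, FALL))


def dv_curve(letters, assignment):
    """Model one: reduced sequence -> list of (x, y) points, starting at (0, 0)."""
    points = [(0, 0)]
    for letter in letters:
        for dx, dy in assignment[letter]:    # two steps per letter
            x, y = points[-1]
            points.append((x + dx, y + dy))
    return points


def letters_from_curve(points, assignment):
    """Model two: read the letters back from consecutive pairs of steps."""
    pair_to_letter = {pair: letter for letter, pair in assignment.items()}
    letters = []
    for i in range(0, len(points) - 2, 2):
        (x0, y0), (x1, y1), (x2, y2) = points[i:i + 3]
        letters.append(pair_to_letter[((x1 - x0, y1 - y0), (x2 - x1, y2 - y1))])
    return letters


def mean_height(points):
    """Stand-in descriptor (an assumption): mean y over the points after the origin."""
    ys = [y for _x, y in points[1:]]
    return sum(ys) / len(ys)


def feature_vector(letters):
    """One descriptor value per ordering of the four letters: 4! = 24 components."""
    return [mean_height(dv_curve(letters, dict(zip(order, BASIC_PAIRS))))
            for order in permutations(LETTERS)]
```

because every step moves one unit to the right, the curve never crosses itself and its last x coordinate is twice the sequence length, which matches the two properties claimed for the representation. `feature_vector` expects a non-empty sequence.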
besides the long computational time, multiple sequence alignments may also be biased by multiple occurrences of highly similar sequences [37]. however, our brain is much more powerful than a computer at recognizing different figures, so it can help us to analyze the similarity among multiple sequences. if we can provide a simple, intuitive, clear, and nondegenerate 2d graphical representation of protein sequences, molecular biologists can easily find out which sequence is most similar or dissimilar to a given target sequence. they can then use alignment algorithms for further confirmation. according to our proposed definition of the protein dv-curve, we can draw the curves of some nd6 (nadh dehydrogenase subunit 6) proteins in order to compare them conveniently. the protein sequences used to test our approach were downloaded from genbank: human ( 003024037.1), gorilla ( 008223), chimpanzee ( 008197), wallaroo ( 007405), harbor seal (h. seal) ( 006939), gray seal (g. seal) ( 007080), rat ( 004903), and mouse ( 904339); the same data set was used in [26, 27]. in figure 3, it is evident that the protein graph of wallaroo differs from those of the other species, because wallaroo is the species most remote from the remaining mammals. furthermore, we can see that human and chimpanzee have similar curves, that the curves of harbor seal and gray seal are almost identical, and that the curves of rat and mouse are very similar. all these results are not only consistent with the conclusions drawn from the smith-waterman algorithm but also agree well with the known facts of evolution and with results obtained by other authors [26, 27, 38-40]. in particular, compared with the conclusion of [27], the dv-curve representation reflects the similarities of sequences in a simpler, more intuitive, and more visible way. coronaviruses. coronaviruses belong to order nidovirales, family coronaviridae, and genus coronavirus.
they are a diverse group of large, enveloped, single-stranded rna viruses that cause respiratory and enteric diseases in humans and other animals. generally, coronaviruses can be divided into three groups: the first and second groups come from mammals; the third group comes from poultry (chicken and turkey). a novel coronavirus has been identified as the cause of the outbreak of severe acute respiratory syndrome (sars). previous phylogenetic analysis based on sequence alignments shows that sars-covs form a new group distantly related to the above three groups of previously characterized coronaviruses [41, 42]. the spike (s) protein, which is common to all known coronaviruses, is crucial for viral attachment and entry into the host cell. to illustrate the use of the dv-curve of protein sequences, we construct the phylogenetic tree of the 35 coronavirus spike proteins listed in table 1. the datasets used in this paper were downloaded from genbank (see table 1 for details). corresponding to the 35 spike proteins, a 35 × 35 real symmetric matrix $D = (d_{ij})$ is obtained and used to reflect the evolutionary distances among them. using the upgma program included in the phylip package 3.65, we can construct the phylogenetic tree of these 35 species [43, 44]. the branch lengths are not scaled according to the distances; only the topology of the tree is of concern. figure 4 shows that the coronaviruses can be divided overall into four groups. furthermore, it is evident that the sars-covs cluster together and form a separate branch, which can be distinguished easily from the other three groups of coronaviruses. rtcov11, mhv8, mhv10, hcov16, bcov13, bcov12, bcov15, bcov14, mhv9, and mhv7, which belong to group 2, are situated on an independent branch, while tgev5, fcov2, ccov6, tgev4, fcov1, and pedv3, which belong to group 1, tend to cluster together. meanwhile, the group 3 coronaviruses, including ibv22, ibv20, ibv23, ibv19, ibv18, ibv21, and ibv17, tend to cluster together in another branch.
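the step from descriptor vectors to a tree topology can be sketched as below. this is an illustrative pure-python stand-in, not the phylip program used in the paper; the euclidean distance between descriptor vectors is an assumption, since the text does not say how the entries of the distance matrix are computed, and the function names are hypothetical.

```python
import math


def euclidean(u, v):
    """Assumed distance between two descriptor vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def distance_matrix(vectors):
    """Symmetric matrix of pairwise distances, zero on the diagonal."""
    return [[euclidean(u, v) for v in vectors] for u in vectors]


def upgma(names, dist):
    """Return the UPGMA topology as nested tuples (branch lengths are ignored)."""
    clusters = {i: (name, 1) for i, name in enumerate(names)}   # id -> (subtree, size)
    d = {(i, j): dist[i][j] for i in clusters for j in clusters if i < j}
    next_id = len(names)
    while len(clusters) > 1:
        a, b = min(d, key=d.get)                                # closest pair, a < b
        (tree_a, n_a), (tree_b, n_b) = clusters.pop(a), clusters.pop(b)
        merged = {}
        for c in clusters:
            d_ac = d[(min(a, c), max(a, c))]
            d_bc = d[(min(b, c), max(b, c))]
            # size-weighted average keeps every leaf equally weighted
            merged[(c, next_id)] = (n_a * d_ac + n_b * d_bc) / (n_a + n_b)
        d = {pair: value for pair, value in d.items() if a not in pair and b not in pair}
        d.update(merged)
        clusters[next_id] = ((tree_a, tree_b), n_a + n_b)
        next_id += 1
    (tree, _size), = clusters.values()
    return tree
```

for example, `upgma(names, distance_matrix(vectors))` with one descriptor vector per protein returns a nested tuple whose nesting mirrors the clusters of the tree, which is enough when only the topology matters.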
the resulting monophyletic clusters agree well with the established taxonomic groups [45, 46]. the conclusion is similar to that reported by other authors [23, 24]. compared with the result of [24], it is noteworthy that a closer look at the subtree of the first branch shows that coronaviruses from three different species, that is, mhv, bcov, and hcov, are clearly separated, while they cluster together in one subtree with li's method. in this respect, our conclusion is more consistent with the known facts of evolution. according to the detailed hydrophobic-hydrophilic (hp) model of amino acids, we can reduce a protein primary sequence containing 20 amino acids to a four-letter sequence, which can be treated as a coarse-grained description of the protein primary sequence. we cannot avoid losing some information in the reduced sequences, but we can focus our attention on the part of interest. several alignment-free methods for analyzing dna sequences have been proposed, but there are few alignment-free methods for analyzing protein sequences. our method generalizes dna graphical representations to proteins and can be seen as a valid supplement to the graphical representation of protein sequences. we are also the first to propose combining the dv-curve with the detailed hp model to describe protein sequences. the similarity/dissimilarity results obtained with the dv-curve are consistent with those of the classical smith-waterman algorithm. in addition, the advantage of our method is that it can visualize local and global features of different proteins no matter how long the sequences are, while avoiding degeneracy. the new approach is applied in two ways: one is an intuitive similarity analysis of nd6 protein sequences of several species, and the other is a phylogenetic analysis of 35 coronaviruses based on their spike proteins. the results show that the proposed method is intuitive, simple, effective, and feasible.
h curves, a novel method of representation of nucleotide series especially suited for long dna sequences
a simple way to look at dna
two-dimensional graphical representation of dna sequences and intron-exon discrimination in intron-rich sequences
a novel 2-d graphical representation of dna sequences of low degeneracy
on the uniqueness of quantitative dna difference descriptions in 2d graphical representation models
analysis of similarity/dissimilarity of dna sequences based on novel 2-d graphical representation
a class of new 2-d graphical representation of dna sequences and their application
graphical representations of dna as 2-d map
h-l curve: a novel 2d graphical representation for dna sequences
dv-curve: a novel intuitive tool for visualizing and analyzing dna sequences
analysis of similarity/dissimilarity of dna sequences based on chaos game representation
a 3d graphical representation of dna sequences and its application
a group of 3d graphical representation of dna sequences based on dual nucleotides
new graphical representation of a dna sequence based on the ordered dinucleotides and its application to sequence analysis
analysis of similarity/dissimilarity of dna sequences based on a condensed curve representation
novel 4d numerical representation of dna sequences
on the similarity of dna primary sequences based on 5-d representation
analysis of similarity/dissimilarity of dna sequences based on nonoverlapping triplets of nucleotide bases
unique graphical representation of protein sequences based on nucleotide triplet codons
a 2-d graphical representation of protein sequences based on nucleotide triplet codons
protein-based phylogenetic analysis by using hydropathy profile of amino acids
2-d graphical representation of proteins based on physico-chemical properties of amino acids
2-d graphical representation of protein sequences and its application to coronavirus phylogeny
new 3-d graphical representation of protein sequences and its application
a 2d graphical representation of protein sequence and its numerical characterization
similarity/dissimilarity studies of protein sequences based on a new 2d graphical representation
new technique: protein sequence analysis based on hydropathy profile of amino acids
3d graphical representation of protein sequences and their statistical characterization
similarity/dissimilarity analysis of protein sequences using the spatial median as a descriptor
modeling study on the validity of a possibly simplified representation of proteins
on 3-d graphical representation of dna primary sequences and their numerical characterization
novel 2-d graphical representation of dna sequences and their numerical characterization
compact 2-d graphical representation of dna
application of 2-d graphical representation of dna sequence
on the complexity of multiple sequence alignment
a probabilistic measure for alignment-free sequence comparison
an information-based sequence distance and its application to whole mitochondrial genome phylogeny
a new sequence distance measure for phylogenetic tree construction
a weighted least-squares approach for inferring phylogenies from incomplete distance matrices
a novel coronavirus associated with severe acute respiratory syndrome
the genome sequence of the sars-associated coronavirus
the principles and practice of numerical classification
characterization of a novel coronavirus associated with severe acute respiratory syndrome
severe acute respiratory syndrome coronavirus-like virus in chinese horseshoe bats
the authors thank all the anonymous reviewers for their valuable suggestions and support. this research is supported by the national science foundation of china grants 11371227 and 10921101. the authors declare that there is no conflict of interests regarding the publication of this paper.
key: cord-304869-l6a68tqn authors: bielińska-wąż, dorota title: graphical and numerical representations of dna sequences: statistical aspects of similarity date: 2011-08-28 journal: j math chem doi: 10.1007/s10910-011-9890-8 sha: doc_id: 304869 cord_uid: l6a68tqn new approaches aiming at a detailed similarity/dissimilarity analysis of dna sequences are formulated. several corrections that enrich the information which may be derived from the alignment methods are proposed. the corrections take into account the distributions along the sequences of the aligned bases (neglected in the standard alignment methods). as a consequence, different aspects of similarity, as for example asymmetry of the gene structure, may be studied either using new similarity measures associated with four-component spectral representation of the dna sequences or using alignment methods with corrections introduced in this paper. the corrections to the alignment methods and the statistical distribution moment-based descriptors derived from the four-component spectral representation of the dna sequences are applied to similarity/dissimilarity studies of β-globin gene across species. the studies are supplemented by detailed similarity studies for histones h1 and h4 coding sequences. the data are described according to the latest version of the embl database. the work is supplemented by a concise review of the state-of-art graphical representations of dna sequences. in an article published by fuchs in nature in 2002 we read "future generations may be able to determine whether the sequencing of the human genome in 2001 indeed led to a paradigm shift in biology and biomedicine as some predicted, or whether the impact of this event was more gradual instead" [1] . 
the author observes that "so far, the history of biology has been characterized by a continuous shift from the whole organism down to the molecular level, from the descriptive characterization of species over macroscopic observations and morphological and physiological studies to today's molecular dissection of individual genes". novel experimental techniques require new computational methods. in order to create good models describing the experimental results, researchers from different areas of science joined computational biology and medical sciences. as a consequence, a new interdisciplinary field adapting methods from many different branches of mathematics, physics, chemistry, and computer science emerged. a fundamental task coming from sequencing is to understand the code written in the sequence of four letters. a lot has been done to reveal some global characteristics of long dna sequences. for example, herzel et al. [2] created a model that describes thousands of nearly identical dispersed repetitive sequences present in dna sequences of higher organisms. the hypothetical model sequences consist of independent equidistributed symbols with randomly interspersed repeats. the model, which can be treated analytically, predicts that the entropy of dna sequences, which measures their information content, is much lower than suggested by earlier empirical studies. a systematic analysis of statistical properties of coding and noncoding dna sequences has been performed by mantegna et al. [3]. the authors compared the statistical behavior of coding and noncoding regions in eukaryotic and viral dna sequences by adapting two tests developed for the analysis of natural languages and symbolic sequences. the authors analyzed some similarities and dissimilarities of statistical properties of coding and noncoding regions.
in particular, they found that for the three chromosomes they studied, the statistical properties of noncoding regions appear to be closer to those observed in natural languages than those of the coding regions. the characterization of correlation structures of dna sequences has been the subject of many statistical studies (for reviews see [4, 5]). in particular, foss [6], using the spectral density of individual base positions, demonstrated long-range fractal correlations as well as short-range periodicities. arneodo et al. [7] used the wavelet transform to demonstrate the existence of long-range correlations in genes containing introns and noncoding regions. buldyrev et al. [8] used standard fourier transform analysis and detrended fluctuation analysis to answer a question of computational molecular biology: whether long-range correlations are present in both coding and noncoding dna sequences. for that purpose, the authors analyzed the sequences available in genbank in 1995. for noncoding sequences, they found long-range correlations. azbel in his work [9] demonstrated a universality in the statistical structure of dna using an autocorrelation function. however, no long-range correlations have been found in any of the studied dna sequences. peng et al. [10] studied long-range correlations by constructing a map of the nucleotide sequence onto a walk, which they referred to as a dna walk. using such an approach they found long-range correlations in intron-containing genes and in nontranscribed regulatory dna sequences, but not in complementary dna sequences or intron-less genes. the visualization technique proposed by peng et al. is based on a one-dimensional dna walk showing the relative occurrence of purines and pyrimidines along the sequence. silverman and linsker introduced a vectorial representation of the bases in three dimensions [11]. they used the unit vectors of 3d space to construct a fourier transform.
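the one-dimensional dna walk described above can be illustrated with a short python sketch. the sign convention used here (a pyrimidine moves the walker up, a purine moves it down) is an assumption of this sketch, and the function name `dna_walk` is hypothetical.

```python
def dna_walk(sequence):
    """Cumulative purine/pyrimidine walk, starting from position 0."""
    # Assumed convention: pyrimidines (C, T, U) step up, purines (A, G) step down.
    steps = {"C": 1, "T": 1, "U": 1, "A": -1, "G": -1}
    position, walk = 0, [0]
    for base in sequence.upper():
        position += steps[base]
        walk.append(position)
    return walk
```

plotting the returned list against the position index gives the kind of landscape whose fluctuations are analyzed in studies of long-range correlations; a region rich in purines shows up as a downward trend and a region rich in pyrimidines as an upward one.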
such fourier transform graphs representing the sequences were used as measures of dna periodicity. another visualization technique, based on a dna walk plotted in a three-dimensional cartesian coordinate system, has been introduced by berger [5]. in his work, berger also gave a good review of visualization techniques based on the dna walk and of their applications to the analysis of dna sequences, i.e., the study of correlation information, sequence periodicities, and other sequence characteristics. more examples of studies focused on statistical properties of dna sequences and on their biological interpretation may be found in [12, 13]. another class of studies develops methods aiming at detailed sequence comparisons. the methods most commonly used in computational biology and medical sciences are global and local alignment methods, for example clustal w [14], blast [15], the needleman-wunsch algorithm [16], and t-coffee [17] (for reviews see [18, 19]). an alternative to the alignment methods are alignment-free methods, which can be divided into two groups: numerical similarity/dissimilarity analysis of dna sequences and similarity/dissimilarity analysis based on graphical representations of dna sequences. there is a variety of numerical alignment-free methods (for a review up to 2003 see [20]). recently, new numerical alternative methods have been developed, as for example [21-31]. another group within the numerical alignment-free methods is formed by multidimensional graphical representations. conceptually, they are analogous to the graphical representations, but their visualization is difficult (if possible at all). in particular, 4d numerical representations [32-34], a 5d representation [35], and a 6d representation [36] have been introduced.
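as a concrete instance of the global alignment methods mentioned above, the score of an optimal needleman-wunsch alignment can be computed with a short dynamic program. the scoring values (match +1, mismatch -1, linear gap penalty -1) are illustrative assumptions, and the sketch returns only the score, not the alignment itself.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-1):
    """Optimal global alignment score of strings a and b (linear gap penalty)."""
    # Row 0: aligning the empty prefix of a with prefixes of b costs only gaps.
    previous = [j * gap for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        current = [i * gap]                      # prefix of a against empty b
        for j, cb in enumerate(b, start=1):
            diagonal = previous[j - 1] + (match if ca == cb else mismatch)
            current.append(max(diagonal,             # align ca with cb
                               previous[j] + gap,    # gap in b
                               current[j - 1] + gap))  # gap in a
        previous = current
    return previous[-1]
```

the two nested loops visit every cell of a table with (len(a) + 1) × (len(b) + 1) entries, so the cost grows with the product of the sequence lengths; only two rows are kept in memory because the traceback is not needed for the score.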
due to interdisciplinary character of research on dna, many groups of methods have been developed independently and very often without any knowledge about analogous results obtained in different groups of scientists. in particular, dna walk has been independently discovered by the scientists working on statistical properties of dna sequences [5] and by scientists working on graphical representations. even among researchers working on graphical representations one can find analogous visualization tools discovered independently (see subsequent chapters). this work is focused on graphical representations of dna sequences. biological sequences are often very long, and it is not obvious how to represent them graphically in an easy way that shows the main features of these objects. the size of the plots is restricted by the human abilities of perception. how to restrict the graphs representing the sequences to two-dimensional plots and how to avoid degeneracies has been the subject of numerous studies which resulted in many graphical representations (see subsequent chapters). graphical representations offer both numerical and visualization tool for similarity/dissimilarity analysis. these methods are still restricted to small groups of users. computing codes calculating optimal sequence alignment are implemented using dynamic programming and are freely accessible in the internet and that makes them attractive for potential users. however, they are computationally expensive, and methodologically offer too simplistic similarity/dissimilarity analysis. they restrict the multidimensional similarity space of complex objects and show only one aspect of similarity. it becomes more and more popular to replace the alignment methods by alternative ones. in particular, hönl and ragan consider numerical alignment-free methods that can replace multiple-sequence alignment to infer a phylogenetic tree that represents the history of a set of molecular sequences [37] . 
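to make the cost of the dynamic-programming alignment mentioned above concrete, the following python sketch computes a global alignment score in the spirit of the needleman-wunsch algorithm. the scoring values (match +1, mismatch -1, gap -1) and the function name are illustrative assumptions, not parameters taken from this work, and the sketch returns only the score, without a traceback.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-1):
    """Optimal global alignment score of a and b (score only, no traceback)."""
    # Row 0 of the table: aligning the empty prefix of a with prefixes of b.
    previous = [j * gap for j in range(len(b) + 1)]
    for i, base_a in enumerate(a, start=1):
        current = [i * gap]  # Column 0: prefix of a against the empty prefix of b.
        for j, base_b in enumerate(b, start=1):
            diagonal = previous[j - 1] + (match if base_a == base_b else mismatch)
            current.append(max(diagonal, previous[j] + gap, current[j - 1] + gap))
        previous = current
    return previous[-1]


if __name__ == "__main__":
    print(needleman_wunsch_score("ACGT", "AGT"))
```

the two nested loops visit every pair of positions, so the work grows with the product of the two sequence lengths, and the result is a single number per pair of sequences.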
graphical representations have also been used for the construction of phylogenetic trees. since the multiple alignment strategy does not work for all types of data, liao et al. [38] proposed to use a similarity matrix based on their 2d graphical representation of dna sequences [39] to construct a phylogenetic tree. the authors considered mitochondrial sequences belonging to different species. the same graphical representation has also been used by the authors to obtain the phylogenetic relationships of h5n1 avian influenza virus [40] and of coronaviruses [41] . another 2d graphical representation [42] has been used by yu et al. to construct the phylogenetic tree of coronaviruses and lentiviruses [43] . a 3d graphical representation has also been used to construct a phylogenetic tree [44] . wang and zhang studied the molecular phylogeny of h5n1 avian influenza viruses in asia using 2d and 3d graphical representations of dna sequences [45] . graphical representations of dna sequences have also been generalized to the analysis of similarity/dissimilarity between rna secondary structures, as for example in [46, 47] . a 2d graphical representation has been used for the characterization of the neuraminidase rna sequences of h5n1 [48, 49] and of h1n1 [50] strains. graphical representations of proteins have also been created [51] [52] [53] [54] . graphical representations of biological sequences (dna, rna, proteins) can be applied to all problems that require similarity/dissimilarity analysis. similarity analysis is not unique to sequences in biology. for instance, the concept of similarity has been developed and applied in computational pharmacology and has resulted in methods such as qsar and qspr [55] [56] [57] [58] [59] [60] [61] , which aim at the prediction of molecular properties. the basic paradigm of quantitative structure-activity relationship (qsar) studies is that compounds with similar structure have similar properties.
this implies a smooth transient behavior in the relation between structure and property/activity, i.e., for any small change in the structure, the magnitude of the physico-chemical property or biological activity changes smoothly rather than in an abrupt, all-or-none way. the molecular similarity measures are based on a large number of descriptors, i.e. numerical indices characterizing molecules. the basis for these studies is the development of various kinds of mathematical descriptors [62, 63] . in the theory of molecular similarity it is commonly accepted that different descriptors and different similarity measures reveal different aspects of similarity. a pair of complex objects may be similar in one aspect and not similar in another. using different similarity measures, one usually obtains contradictory results, which may be relevant in different contexts. the first qsar studies on biological sequences using graphical representations of sequences have already been performed [64] . the present work describes the development of fundamental studies related to graphical representations of dna sequences. first, corrections to the alignment methods are proposed in order to enrich the information related to different aspects of similarity. new similarity measures are created for the alignment distributions. second, a critical review of graphical representations and their numerical characterization is given. in the last chapter, new aspects of the four-component spectral representation, a graphical representation of dna sequences recently introduced by the author of this work [65] , are described. it is shown there that by using the four-component spectral representation one can recognize a difference of one base between a pair of sequences, so it can be used for single nucleotide polymorphism (snp) analyses, which are the subject of many investigations, as for example in a recent work by bhasi et al. [66] .
another important problem is to identify protein coding regions of genomic sequences [67, 68] . first attempts of identifying protein coding genes using graphical representations of dna sequences based on z curve [69] or based on trinucleotides [70] have been already performed. it has been shown that the similarity relations are different for exons, and sequences with introns using the four-component spectral representation (see subsequent chapters). such an observation suggests that the four-component spectral representation that reveals detailed aspects of similarity, as for example the comparisons of asymmetry of the gene structure, can be used to study this problem [71] . in this section i introduce corrections that reveal some aspects of similarity which cannot be identified in the standard alignment methods. the similarity space of complex objects is multidimensional. only simple 1d objects can be classified in a unique way using a single similarity measure. complex objects may be similar in one aspect and very different in another one. for example, in the case of atoms, if a similarity measure based on their atomic numbers is considered then the periodic table of elements is obtained. however, considering ionization energies as descriptors, the similarity relations between atoms change. a final, single similarity measure is a result of averaging over different aspects of similarity or, a consequence of neglecting most of the aspects of similarity. the new similarity measures representing different aspects of similarity can be considered separately, or they can be combined in any way to search for their correlations with different biological functions. the information about similarity of the sequences derived from the alignment methods is rather limited. for example, according to the standard alignment methods, in the following two different cases: the similarity value is the same (50%). 
the non-zero contributions to the final result come from different positions in the sequences. in the first case, the a bases are spread over the whole sequence (positions 1 and 3 give non-zero contributions). in the second case, the a bases are clustered together. in this sense, the alignment methods are degenerate: different structures give the same result, and within the alignment methods these structures are indistinguishable. this is certainly one of the weakest points of the alignment methods. the degree of degeneracy may be very large and increases with the lengths of the sequences. obviously, the degree of degeneracy in the model example is larger than 2: one can add more cases that give the same score as the two model cases. for example, the bases that give non-zero contributions can also be c, t, or g, and the positions of the aligned bases can be different. such details usually have biological consequences but they are not taken into account in the alignment methods. in order to describe different aspects of similarity in more detail, let us define a discrete alignment distribution $n_p$ for a pair of dna sequences:

$$n_p = \begin{cases} 1, & \text{if the } p\text{-th positions in the two sequences are occupied by two identical bases,} \\ 0, & \text{otherwise.} \end{cases} \tag{1}$$

let us introduce a variable $x_p$ running along the sequence (eq. 2), where p = 1, 2, ..., k is the position in the sequence and r is the resolution, which can be selected, depending on the length of the distribution, in a way convenient for the calculations by changing the units of length. k is the length of the sequences or subsequences for which the alignment is calculated. two bases belonging to different sequences, both located at the p-th position, are represented by a pair of numbers, $\{x_p, n_p\}$. let us now consider multiple alignments. analogously to the case of a pair of sequences, according to the standard alignment methods the similarity value is the same (50%) in two different model cases.
thus, the alignment method is highly degenerate (different bases at different positions give non-zero contributions and these situations are indistinguishable). as a consequence, additional similarity information should be added for a proper description of the objects. this information is necessary to remove the degeneracy, i.e. to distinguish between different cases. analogously to the case of a pair of sequences, we can define a discrete alignment distribution $n_p$ of several (m) dna sequences:

$$n_p = \begin{cases} 1, & \text{if the } p\text{-th positions in all } m \text{ sequences are occupied by identical bases,} \\ 0, & \text{otherwise.} \end{cases} \tag{3}$$

m bases belonging to different sequences, located at the p-th positions, are represented by $\{x_p, n_p\}$, where $x_p$ is defined in eq. 2. as is known from statistics, distributions can be conveniently characterized by their moments. distribution moments are the basic quantities in statistical spectroscopy. the aim of statistical spectroscopy [72] [73] [74] is to construct global characteristics of a spectrum. the individual eigenvalues, the experimental energy levels or the intensities of spectral lines are considered as statistical ensembles. such an approach may be used in many areas of physics to study different kinds of problems. let me mention several applications of statistical spectroscopy. originally, methods of statistical spectroscopy were used in nuclear physics [75] , where the character of the interparticle interactions is not exactly known. by assuming different forms of the hamiltonian matrix and comparing the distributions of the densities of the energy levels derived from this matrix with those from experiment, some information about the hamiltonian may be derived. statistical spectroscopy may also be used to study the locations of the individual eigenvalues of the hamiltonian.
in a way this is an inverse problem to the one from which the statistical spectroscopy originated: from global characteristics of the eigenvalues one tries to obtain some information about details of the spectrum. examples of generating individual energy levels using methods of statistical spectroscopy may be found in the theory of nuclear, atomic, molecular, and solid-state spectra [76] [77] [78] . approximating the eigenvalues by statistical quantities (spectral density distribution moments) is not limited by dimensions of the matrices and that is an advantage comparing to the standard methods based on diagonalization of the hamiltonian matrix. in particular, we have studied statistical properties of spectra of the heisenberg hamiltonian [78] . the distribution of the eigenvalues have been found to be gaussian-like, well approximated by several-term gram-charlier expansions [79] . the exact spectra (obtained by the diagonalizations of the hamiltonian matrices) have been compared with the ones derived from the moment-generated spectral density distributions. this approximation gives a very good description of the spectrum in its central part however, as one should expect, deteriorates at the extremes. relations between the exact and the moment-generated spectra are analyzed for several kinds of the lattices as a function of the number of moments. it has been observed that the quality of the statistical description improves with an increase of the dimension of the problem and with a lowering of the symmetry of the lattice. another attractive application of the statistical spectroscopy is a description of the shapes of molecular electronic bands [80] [81] [82] [83] . initially, the method of generating of envelopes of the intensities has been introduced for the transitions in crystals [84] and in atoms [85] . replacing the calculations line by line by the statistical approach with much shorter computing time became also attractive in molecular physics. 
the shape of a molecular band may be defined as an envelope of the rovibrational lines which constitute the band. the method of determining the shapes of molecular electronic bands consists of several steps. first, the expressions for the intensity distribution moments for the considered system are derived. then these expressions are used to calculate the moments corresponding to the solution of the pertinent quantum chemical model. finally, a smooth function for which several lowest moments are equal to the exact ones is derived. this function is an approximation to the envelope of the electronic band in a molecular spectrum. in particular, i have used this algorithm to derive the intensity spectrum corresponding to the transitions in the h$_2$ molecule using a 3-moment trial function [86] . in that paper i have also shown that the quality of the approximation depends on the choice of the trial function rather than on the number of moments taken into account. adding moments of order higher than 4 does not improve the results when the gram-charlier expansion is taken as the trial function. this process may even be divergent. in some cases a 4-moment gram-charlier expansion may give worse results than the 3-moment one. for example, this happens in a spectrum derived from a model based on the harmonic oscillator potential. in the case of the h$_2$ molecule a non-standard 3-moment trial function has to be applied in order to get a high quality approximation of the spectrum (treated as a statistical distribution). distributions are commonly, and very conveniently, characterized by their moments. in the present work i describe dna sequences as distributions and apply the distribution moments to study similarity between these sequences. a similarity measure $\tilde{M}_q$ based on the q-th moment of the discrete alignment distribution $n_p$ is defined as

$$\tilde{M}_q = c \sum_{p=1}^{k} x_p^{q} \, n_p, \tag{4}$$

where q = 0, 1, 2, .... in this work r = 1. therefore, the values of $x_p$ are equal to p (eq. 2).
the normalization constant c is defined so that the zeroth moment of the distribution is equal to one ($\tilde{M}_0 = 1$):

$$c = \left( \sum_{p=1}^{k} n_p \right)^{-1}. \tag{5}$$

when comparing sequences, one is usually interested in quantities that are independent of the lengths of the sequences. for that purpose, moments for which the mean value is equal to 0 ($M_1 = 0$) and the variance is equal to 1 ($M_2 = 1$) can be used as similarity measures:

$$M_q = c \sum_{p=1}^{k} \left[ \frac{x_p - \tilde{M}_1}{\sqrt{\tilde{M}_2 - \tilde{M}_1^{2}}} \right]^{q} n_p. \tag{6}$$

table 1 shows a model example of the alignment distribution $n_p$ for a pair of sequences (r = 1). the choice of the query sequence has no influence on the results. the length of sequence 1 is 12 and the length of sequence 2 is 15. if k = 15 is chosen then for p > 12 the distribution is defined as zeros: $n_{13} = n_{14} = n_{15} = 0$, so these positions do not contribute to the sums in the moments. table 2 shows a model example of the multiple alignment distribution for m = 3 (eq. 3). analogously, the value of k may be set equal to the length of any of the three sequences. thus, independently of the choice of k the moments remain the same. in this work, $\tilde{M}_q$ and $M_q$ are proposed as new similarity measures that can be treated as corrections to the alignment methods. they describe features of similarity that cannot be identified in the alignment methods. in particular, the two model cases defined at the beginning of this chapter can be distinguished using the new measures. the new numerical characterization of dna sequences is exemplified using the β-globin gene of different species. the species and the locations of the sequences in the genes, as well as the lengths of the sequences, $n_1$ and $n_2$, are listed in table 3 . tables 4, 5, 6, 7, 8, 9, 10, 11 show similarity matrices based on the new measures for the sequences listed in table 3 . tables 4, 5, 6, 7 correspond to the coding sequences of the first exon, exon 1 cds, and tables 8, 9, 10, 11 correspond to the second exon, exon 2 cds.
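the pairwise alignment distribution and its moments can be sketched in python as follows. the two pairs of four-base sequences are hypothetical and were chosen only to mimic the two model cases (aligned a bases at positions 1 and 3 versus aligned a bases next to each other); the function names are likewise illustrative. the sketch assumes r = 1, so that the position variable equals p, and a normalization by the number of aligned positions.

```python
def alignment_distribution(seq_a, seq_b, k=None):
    """n_p = 1 if both sequences carry the same base at position p, else 0.

    Positions beyond the end of the shorter sequence are filled with zeros.
    """
    k = k or max(len(seq_a), len(seq_b))
    shared = min(len(seq_a), len(seq_b))
    return [1 if p < shared and seq_a[p] == seq_b[p] else 0 for p in range(k)]


def raw_moment(n, q):
    """q-th moment of the distribution n, normalized so that the zeroth moment is 1."""
    total = sum(n)
    if total == 0:
        raise ValueError("no aligned positions")
    # Positions are counted from 1, i.e. x_p = p for r = 1.
    return sum(p ** q * n_p for p, n_p in enumerate(n, start=1)) / total


if __name__ == "__main__":
    spread = alignment_distribution("ACAG", "ATAC")     # hypothetical case 1
    clustered = alignment_distribution("AACG", "AATC")  # hypothetical case 2
    for name, n in (("spread", spread), ("clustered", clustered)):
        print(name, n, sum(n) / len(n), raw_moment(n, 1))
```

both hypothetical pairs have the same fraction of identical positions, yet their first moments differ, and padding a distribution with trailing zeros leaves the moments unchanged.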
the similarity matrices are based on different measures: $\tilde{M}_1$ (tables 4, 8), $M_3$ (tables 5, 9), $M_4$ (tables 6, 10), and $M_5$ (tables 7, 11). as an illustration, consider a model distribution described by $x_1 = 1$, $x_2 = 2$, and $x_3 = 3$. in this case $\tilde{M}_1 = 4/2 = 2$ and $x_2 = 2$ is the middle of the sequence. one can normalize the similarity matrix based on $\tilde{M}_1$ by dividing all its elements by k + 1. then one can easily see whether the mean value is larger or smaller than 1/2. if the normalized $\tilde{M}_1$ is equal to 1/2 then the mean value of the distribution is located in the middle. if it is greater than 1/2 it is shifted towards the end of the distribution. since the $M_q$ are independent of the lengths of the sequences, it is convenient to keep at least one similarity measure ($\tilde{M}_1$) that carries information both about the lengths of the sequences and about the distributions of the aligned bases. therefore, $\tilde{M}_1$ is not normalized in this work. an example of a similarity measure independent of the lengths of the sequences is $M_3$, which describes the asymmetry of the aligned distributions. for symmetric distributions, in particular if identical sequences are compared, $M_3$ is equal to zero. it is negative for left-skewed distributions and positive for right-skewed distributions. one can observe that the asymmetry of the aligned distributions for exon 1 cds (table 5) is different from the one for exon 2 cds (table 9). the number of negative values of $M_3$ is 41 for exon 1 cds and 21 for exon 2 cds. for example, in the case of the human-mouse sequences, $M_3$ is negative for exon 1 cds and positive for exon 2 cds. another similarity measure independent of the lengths of the sequences is $M_4$. this is the kurtosis parameter, i.e. a measure of the peakedness of the distribution. analogously to the lower order moments, $M_4$ is different for different parts of a gene (tables 6, 10). for example, the similarity measure based on $M_4$ for the gallus-lemur sequences is 1.86 for exon 1 cds and 1.70 for exon 2 cds.
the similarity relations based on $M_4$ between all the sequences are shown in fig. 1 . the horizontal axis represents the values listed in table 6. the plots of the values listed in table 5 versus the values listed in table 7 (fig. 2, panel b) and of the values listed in table 9 versus the values listed in table 11 show that the information coming from $M_5$ is similar to that coming from $M_3$. therefore, the corrections $M_5$ can be neglected. the information coming from $M_4$ is different from that coming from $M_3$, which is seen in figs. 2 and 3, panels a.

table 13: model example of four-component multiple alignment distribution (k = 15, m = 3, r = 1)

in order to further enrich the information that can be derived from the alignment methods, one can introduce the four-component alignment distributions, separately for the a, c, t, and g bases. a specific γ-component of this distribution, referred to as the γ-distribution, is defined as

$$n_p^{\gamma} = \begin{cases} 1, & \text{if the } p\text{-th positions in all } m \text{ sequences are occupied by the base } \gamma, \\ 0, & \text{otherwise,} \end{cases} \tag{7}$$

where γ = a, c, t, g denotes one of the bases. now, for each of the γ-distributions one can calculate the corrections (the appropriate moments). such distributions can be created for a pair of sequences (m = 2) and also for multiple alignment studies (m ≥ 3). a model example of the four-component distribution with m = 3 is shown in table 13 . the same definitions of the distributions (eqs. 1, 3, 7) can also be used after the maximally scoring alignment of the sequences has been found. a model example of the optimized multiple alignment distribution based on eq. 3 for m = 3 is shown in table 14 . the maximally scoring alignment is obtained if two gaps are introduced in sequence 3. obviously, the information is more detailed if the four-component optimized distribution is created. table 15 shows such a distribution (eq. 7) for the same sequences as shown in table 14 . the application of the new similarity measures to all kinds of distributions is simple and straightforward. in this way, different aspects of similarity can be revealed.
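the length-independent measures (moments with zero mean and unit variance) can be sketched in the same spirit. the function name and the small model distributions below are illustrative assumptions; positions are counted from 1 and r = 1 is assumed.

```python
from math import sqrt


def normalized_moment(n, q):
    """q-th moment of n after shifting to zero mean and scaling to unit variance."""
    total = sum(n)
    if total == 0:
        raise ValueError("no aligned positions")
    indexed = list(enumerate(n, start=1))  # pairs (p, n_p) with x_p = p
    mean = sum(p * n_p for p, n_p in indexed) / total
    variance = sum((p - mean) ** 2 * n_p for p, n_p in indexed) / total
    if variance == 0:
        raise ValueError("distribution has zero width")
    sigma = sqrt(variance)
    return sum(((p - mean) / sigma) ** q * n_p for p, n_p in indexed) / total


if __name__ == "__main__":
    print(normalized_moment([1, 0, 1], 3))     # symmetric model distribution
    print(normalized_moment([1, 1, 0, 1], 3))  # long tail towards the end
    print(normalized_moment([1, 0, 1, 1], 3))  # long tail towards the beginning
```

for the symmetric model distribution the third moment vanishes, while the two asymmetric ones give third moments of opposite sign, in line with the interpretation of the third moment as an asymmetry measure.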
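the four-component alignment distribution described above can be sketched as follows. the three sequences are hypothetical and the function name is an assumption made for illustration; positions at which any sequence is too short, or carries a symbol other than a, c, t, g (for instance a gap character), contribute zero to every component.

```python
def four_component_distribution(sequences, k=None):
    """Return one 0/1 list per base: 1 where all sequences carry that base."""
    k = k or max(len(s) for s in sequences)
    distribution = {base: [0] * k for base in "ACTG"}
    for p in range(k):
        # Set of symbols found in column p; None marks a sequence that is too short.
        column = {s[p] if p < len(s) else None for s in sequences}
        if len(column) == 1:
            base = column.pop()
            if base in distribution:
                distribution[base][p] = 1
    return distribution


if __name__ == "__main__":
    dist = four_component_distribution(["ACGTA", "ACTTA", "AGGTA"])
    for base, n in dist.items():
        print(base, n)
```

summing the four components position by position gives back the multiple alignment distribution in which the identity of the aligned base is ignored, so the four-component form only adds information.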
graphical representations are an attractive alternative to the time-consuming alignment methods. they reveal different aspects of similarity and offer both a numerical characterization of similarity and a visualization. the computing effort is also very small in this case. in this section graphical methods are discussed. in the original approaches, dna sequences were plotted as either three-dimensional [87] or two-dimensional [88] [89] [90] curves. the shapes of the curves were determined by a walk in a space spanned by four vectors that represent the four bases. in the first article on this subject, hamori proposed a graphical representation method in which the information about the dna sequence is mapped into a three-dimensional space curve. a unit vector of a characteristic direction is assigned to each of the four nucleotides: adenine a, cytosine c, thymine t, and guanine g. in this approach the shape of the curve (called h-curve) representing the sequence of nucleotides is obtained by joining the vectors in the order of the nucleotides in the sequence. by changing the resolution one can see short-range details or global trends of the distribution of nucleotides. for example, the h-curve is shifted in a characteristic direction if the sequence is rich in certain nucleotides. it is also easy to recognize the locations of the repeating elements in the sequence. hamori also published the first mathematical representation in nature in 1985 under the title "novel dna sequence representation" [91] . the same year another article about a new graphical representation, titled "simpler dna sequence representations", was published in nature by gates [88] . in this approach, guanine is represented by a unit vector in the positive x-axis direction, the complementary cytosine is represented by a negative x-axis unit vector, and adenine and thymine are represented by unit vectors in the positive and negative y-axis directions, respectively.
using such an approach all sequences can be represented in two dimensions in a unique manner, while using the hamori approach, dna structure may be viewed from any chosen perspective in two-dimensional plots. obviously, a chosen perspective of a 3d curve in 2d space gives only a part of the total information about the sequence. however, some information may also be lost in the graphical representation proposed by gates, as is shown in a subsequent part of this work. about 10 years later, nandy (independently of gates) published an article "a new graphical representation and analysis of dna sequence structure: i. methodology and application to globin genes" [89] . the idea is very similar to the one presented by gates. in the scientific correspondence [92] , the author explains that it has just been brought to his attention that a similar technique was presented by gates and indicates some advantages of his method: the nontrivial choice of the coordinate system a-g, c-t (purine-pyrimidine) instead of the axis system proposed by gates (c-g, a-t) may give more significant biological information. one year later (independently of nandy), leong and morgenthaler proposed two new graphical representations of dna sequences [90] . the first one is a slight modification of the gates method: they change the unit vectors corresponding to the particular bases. the x-axis represents c and a and the y-axis g and t. according to the authors such a change makes it possible to exhibit the distribution of purines (a and g) and pyrimidines (c and t). the authors noticed that some information may be lost if a walk moves several times over the same ground. however, the authors found a good solution for identifying in the plot the regions in which parts of the sequences are hidden: the scale is visible even for long sequences and the numbers that label the bases in the plot are marked every one hundred bases.
although the authors have not proposed any numerical characteristics, indicating the hidden parts in the plot (by labeling or coloring) seems to be a good solution. leong and morgenthaler also proposed another interesting graphical representation: gap plots that give information about the distances between particular bases. independently of the graphical representations introduced by gates [88], nandy [89] , leong and morgenthaler [90] , similar graphs, also based on vectorial representations of the four bases and constructing 2d dna walks, have been constructed by mizraji and ninio [93] and by lobry [94] . lobry also used orthogonal directions but his choice of the unit vectors representing the four bases was different from the ones used in refs. [88] [89] [90] . surprisingly, these two important contributions remained rather unnoticed. the specific choice of the basis vectors made by mizraji and ninio seems to solve many problems. the vectors have been chosen so that it is easy to distinguish between coding and noncoding parts of the sequence and the graphs are nondegenerate. mizraji and ninio also proposed a graphical representation which shows purine/pyrimidine distributions along the sequence. however, the authors did not propose any numerical representation associated with these graphs. as a consequence, four similar 2d graphical representations have been created. they differ from each other in the choice of the coordinate systems: x-axis: g-c (gates), a-g (nandy), c-a (leong and morgenthaler), a-t (lobry). the graphical representation proposed by nandy, known as nandy plots, became the most popular. however, such a two-dimensional representation may lead to some parts of the sequence being hidden if the walk is performed back and forth along the same trace (so-called repetitive walks). labeling and coloring only approximately localizes the regions in the sequences where the hidden parts are located. a 2d walk does not retain the history of the graph.
this is not a linear method: a particular part of the graph may come from different parts of the sequences. the advantage is the small size of the graph representing long sequences, and very often the information coming from such a plot may be sufficient. in order to eliminate, or to minimize, the degeneracy caused by the repetitive walks, many different methods have been introduced. for example, guo et al. [95] introduced a new graphical representation, also based on a walk in 2d space, obtained by changing the angles between the basis vectors: the four nucleic acid bases are represented by the vectors a by $(-1, 1/d)$; t by $(1/d, -1)$; g by $(1, 1/d)$; and c by $(1/d, 1)$, where d is a positive integer. the authors have shown that the degree of degeneracy of the new graphs is lower than for nandy plots, but it is still present and depends on the value of d. a further modification of the graphical representation based on a walk in 2d space has been introduced by changing the vectors in such a way that the basis vectors corresponding to pyrimidines (t, c) are located in the first quadrant of the cartesian coordinate system and those corresponding to purines (a, g) in the fourth one [38, 39] . related representations include that of ref. [96] and also the h-l curve representation proposed recently by huang et al. [97, 98] . an interesting 2d ladder-like graphical representation has also been proposed by li and wang [99] . their graphs are based on the division of bases according to their chemical properties. the four bases can be classified into groups: 1. purine r = a, g, pyrimidine y = c, t; 2. strong h bond s = c, g, weak h bond w = a, t; and 3. amino m = a, c, keto k = g, t. the method is also based on a walk in 2d space with basis vectors (0,1), (1,0) for the characteristic sequences (m, k), (r, y) and (w, s). recently, we have proposed another method aimed at some improvement of the original 2d walk method (nandy plots) [100] .
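the two-dimensional walk with the axis assignment attributed above to gates (g and c along the positive and negative x-axis, a and t along the positive and negative y-axis) can be sketched in python. the function name and the choice of the origin as the starting point are assumptions made for illustration.

```python
# Unit vectors following the axis assignment described for the Gates walk.
GATES_VECTORS = {"G": (1, 0), "C": (-1, 0), "A": (0, 1), "T": (0, -1)}


def gates_walk(sequence):
    """Return the list of points visited by the 2D walk, starting at the origin."""
    x = y = 0
    points = [(x, y)]
    for base in sequence.upper():
        dx, dy = GATES_VECTORS[base]
        x, y = x + dx, y + dy
        points.append((x, y))
    return points


if __name__ == "__main__":
    print(gates_walk("GATC"))
    # Two different sequences that trace the same set of points.
    print(set(gates_walk("GC")) == set(gates_walk("GCGC")))
```

the second line of output illustrates that a walk retracing its own path leaves the same picture as a shorter one, since only the visited points are drawn.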
we have called this representation 2d-dynamic graph because its numerical representation, i.e. the set of descriptors, is analogous to the one used in the dynamics (see subsequent chapter). this method is based on nandy plots but it removes the degeneracy coming from the repetitive walks. the dna sequence is represented as a set of material points in 2d space. figure 4 , panel a, shows the method of plotting the 2d-dynamic graph for a model sequence ctc and fig. 5 , panel a, for ctct one. the first base in the sequence is c and we make a shift along the vertical axis in the positive direction. at the end of this vector (position (0,1)) we locate the point with the mass equal to 1. the second base in the sequence is t and we make the second shift along the vertical axis in the negative direction starting from the end of the last vector. at the end of the second vector again we locate the point with the mass also 1 (position (0,0)) and so on. if the ends of vectors meet several times at the same point then the mass of this point increases (it is equal to the sum of all masses located in this point). the total mass in the graph is equal to the total number of bases in the sequence (3 in fig. 4 and 4 in fig. 5 ). different masses are represented by different symbols in the plots. please note that both sequences ctc and ctct are represented by the identical nandy plots (figs. 4, 5, panels b) since the last shift in fig. 5 is made along the same trace as the previous one. 2d-dynamic graph removes this degeneracy (the masses of the points (0,0) are different: 1 for ctc and 2 for ctct). the difference between the two sequences is also revealed in the mass-density distributions which we create for x and y directions [101] . the masses are projected onto two orthogonal directions and then summed for each x and y. in the model examples the results of the projection and of the summation of the masses are shown in figs. 1 and 2 (panels a). 
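the construction of the 2d-dynamic graph and of its mass-density distributions can be sketched in python for the two model sequences ctc and ctct. the shifts for c and t follow the description above; the shifts for a and g along the horizontal axis (a to the left, g to the right) and the function names are assumptions made for illustration.

```python
from collections import Counter

# C and T shifts as described in the text; A and G directions are assumed.
STEPS = {"C": (0, 1), "T": (0, -1), "A": (-1, 0), "G": (1, 0)}


def dynamic_graph(sequence):
    """Map each visited end point to its mass (number of vectors ending there)."""
    masses = Counter()
    x = y = 0
    for base in sequence.upper():
        dx, dy = STEPS[base]
        x, y = x + dx, y + dy
        masses[(x, y)] += 1
    return masses


def mass_density(masses, axis):
    """Project the masses onto one axis (0 for x, 1 for y) and sum them."""
    density = Counter()
    for point, mass in masses.items():
        density[point[axis]] += mass
    return density


if __name__ == "__main__":
    for seq in ("CTC", "CTCT"):
        graph = dynamic_graph(seq)
        print(seq, dict(graph), dict(mass_density(graph, 0)), dict(mass_density(graph, 1)))
```

the two sequences visit the same points, but the mass at (0,0) differs, and so do the masses projected onto the x and y directions.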
for example, in the x direction, they are 3 and 4 in figs. 4 and 5, respectively. the mass-density distributions are composed of single lines located at the coordinates corresponding to the projected masses (x = 0 for ρ x and y = 0, y = 1 for ρ y ). the intensities of the lines correspond to the projected masses. the center panels in figs. 4 and 5 correspond to mass-density distributions for x direction (ρ x ) and the right ones for y directions (ρ y ). these distributions create another way of visualization of the 2d-dynamic graphs. however, the main reason for the creation of the mass-density distributions is deriving new descriptors related to 2d-dynamic graphs (see the subsequent chapter). the modifications of the original 2d walk methods resulted also in graphs which became linear-like representations (1d), extending along one direction in 2d space. in such kind of methods only the horizontal axis is associated with the positions of the bases. therefore these methods are free of the effects of self-overlapping of the graphs. the cost we have to pay for the reduction of the degeneracy, is worse visualization of long sequences. a combination of a linear-like method with a dna walk has recently been proposed by zhang [102] . the author has chosen basis vectors in such a way that the walk is performed along a horizontal axis. one nucleotide is represented by a pair of basis vectors instead of a single vector: (1, 1),(1, 1) corresponds to a, (1, 1), (1, −1) corresponds to t, (1, −1), (1, 1) corresponds to c, and (1, −1), (1, −1) corresponds to g. since one base is represented by a double vector, the author calls his graphical representation of dna sequences a dv-curve. a recently introduced graphical representation of dna sequences based on the neighboring dual niclueotides (dinucleotides) [103, 104] is another example of a linear representation. the authors plot a dinucleotide (dn) curve representing the distributions of pairs of nucleotides along the sequence. 
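the dv-curve construction described above can be sketched in python using the pairs of basis vectors quoted in the text. the function name and the choice of the origin as the starting point are assumptions made for illustration.

```python
# Each base contributes two consecutive steps, as quoted in the text.
DV_VECTORS = {
    "A": ((1, 1), (1, 1)),
    "T": ((1, 1), (1, -1)),
    "C": ((1, -1), (1, 1)),
    "G": ((1, -1), (1, -1)),
}


def dv_curve(sequence):
    """Return the points of the curve; every base adds two points."""
    x = y = 0
    points = [(x, y)]
    for base in sequence.upper():
        for dx, dy in DV_VECTORS[base]:
            x, y = x + dx, y + dy
            points.append((x, y))
    return points


if __name__ == "__main__":
    print(dv_curve("AT"))
```

since every step advances the horizontal coordinate by one, the curve never returns to an x value it has already visited, which is why such a representation is free of self-overlapping.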
several years ago a four-horizontal-line graphical representation was proposed by randić et al. [105, 106]. instead of considering the four directions along the cartesian coordinate axes, they draw four horizontal lines separated by unit distances. each line is associated with one base: a, t, g, and c, from the top. the sequence is written below the lowest line, with unit distances between the neighboring bases. a dot (or rectangle) is put on a line wherever the corresponding base appears in the sequence. this graphical representation resembles medieval musical scripts having a staff of four lines [107]. for a better visualization the adjacent points are connected by a line, and a zigzag-like curve is obtained. the idea proposed by randić of visualizing a dna sequence by zigzag curves has been extended by different combinations of labeling the lines and by different numbers of graphs representing one sequence (characteristic graphs). usually, the horizontal lines are not plotted. another linear graphical representation has been proposed by li and wang [108]. the graphical representation is composed of three characteristic graphs, each of them consisting of two horizontal lines. each line in each graph is assigned to more than one base. then, the sequence is represented by more than one characteristic graph. the lines in particular graphs are labeled by the following bases: a similar graphical representation has been proposed by song and tang [109]. in this approach, three classifications are applied to construct six characteristic graphs representing one sequence. two graphs correspond to one classification. for example, in the two graphs corresponding to the classification purine (r) - pyrimidine (y), the middle lines correspond to purines and pyrimidines in the first and in the second graph, respectively.
the other two lines correspond to those bases that are not purines or not pyrimidines, respectively, i.e. a, y, g label the lines in the first graph (top, middle, and bottom, respectively) and c, r, t label the lines in the second graph. another example of an analogous graphical representation has been proposed by liao and wang [110]. the sequence is represented by three graphs, each of them consisting of two horizontal lines. the lines are labeled as follows: in another analogous graphical representation, proposed by wang and zhang [111], also consisting of three characteristic graphs, the lines are labeled by the following bases: graph 1: non-a = g, c, t and a; graph 2: non-g = a, c, t and g; graph 3: non-c = a, g, t and c. a slight modification of this method has been proposed by yao and wang [112]. the authors proposed to use cells instead of horizontal lines. they considered different shapes of cells, for example a rectangle. each corner of the rectangle is assigned to a particular base. the cells are placed next to each other. each base of the sequence is put in the proper corner of its own cell. the adjacent dots are connected by a line and a zigzag curve representing the sequence is obtained. the methods described above, based on several horizontal lines, can also be considered as spectral-like representations (lines with some intensities appear in the positions corresponding to the bases in the sequences). this point of view has been expressed in a recent article by randić [113]. the author presents four-horizontal-line graphs and chaos-game 2d maps [114, 115] in the form of spectrum-like graphical representations. recently, i have introduced another spectral-like graphical representation called the four-component spectral representation [65]. the method is very sensitive: within this model, a difference of only one base can be detected.
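the horizontal-line representations discussed above reduce to assigning each base a position along the sequence and a line. the sketch below illustrates this for the four-line zigzag curve (a, t, g, c from the top) and for a two-line characteristic graph of the "base versus non-base" type. the numeric heights of the lines, the choice of which line is the upper one, and the names `LINE_HEIGHT`, `zigzag` and `characteristic_graph` are assumptions made only for illustration.

```python
# Heights of the four lines; the text fixes only the order a, t, g, c from the
# top, so the numeric values 4..1 are an assumption of this sketch.
LINE_HEIGHT = {"a": 4, "t": 3, "g": 2, "c": 1}


def zigzag(sequence):
    """Vertices (position, line height) of the four-line zigzag curve."""
    return [(i, LINE_HEIGHT[base])
            for i, base in enumerate(sequence.lower(), start=1)]


def characteristic_graph(sequence, group):
    """Two-line graph: 1 if the base belongs to `group`, otherwise 0.

    With group={"a"} this mimics an "a versus non-a" graph; putting the
    selected bases on the upper line is an assumption.
    """
    return [(i, 1 if base in group else 0)
            for i, base in enumerate(sequence.lower(), start=1)]


seq = "atggtgcacc"
print(zigzag(seq))
print(characteristic_graph(seq, {"a"}))       # a versus non-a
print(characteristic_graph(seq, {"a", "g"}))  # purines versus pyrimidines
```

since the horizontal coordinate is simply the position in the sequence, two different sequences of the same length always differ in at least one vertex, which is how these representations avoid degeneracy.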
by using linear graphical representations of dna sequences the problem of degeneracy can be overcome. however, in technical terms, the visualization of long sequences is rather inconvenient. a good solution for this drawback is introducing a resolution parameter for linear representations as it was done for the four-component spectral representations (for details see sect. 4). another solution is to combine the compact form of the plots characteristic for 2d walks and zigzag curve method, as proposed by randić et al. [116, 117] . in the last approach, the sequence is represented by a zigzag spiral, known in the literature as the worm curve. the worm curve represents a path of a robot [116] . it does not intersect itself and uses a little space for the graphical representations of long sequences. another compact graphical representation, four-color map, has also been proposed by randić et al. [118] . the map is constructed as a spiral of square cells. the first base is located at the central square of the spiral, and the last base finishes the spiral. then four different colors are assigned to particular squares representing different bases: red for g, blue for t, green for c, and yellow for a. the original 3d method proposed by hamori has been also extended by various authors. in particular, a modified hamori curve representation of dna sequences has been recently introduced by pesek and zerovnik [119] . moreover, methods based on a walk in 3d space with different vectors corresponding to particular bases were introduced: vectors located along tetrahedral directions a(1,−1,−1), g(−1,1,−1), c(−1,−1,1), t(1,1,1) [120] or agc-t curve, where the vectors are chosen as a(1,0,0), g(0,1,0), c(0,0,1), t(1,1,1) [121, 122] . examples of other 3d graphs are representations of one sequence by a set of characteristic 3d curves [123] [124] [125] [126] [127] [128] . 
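the 3d walks mentioned above differ only in the vectors assigned to the bases. a minimal sketch, using the two vector sets quoted in the text, is given below; returning only the end points of the successive steps (without the origin) and the names `TETRAHEDRAL`, `AGC_T` and `walk_3d` are assumptions of this illustration.

```python
# Vector sets quoted in the text: tetrahedral directions [120] and the
# AGC-T curve [121, 122].
TETRAHEDRAL = {"a": (1, -1, -1), "g": (-1, 1, -1), "c": (-1, -1, 1), "t": (1, 1, 1)}
AGC_T = {"a": (1, 0, 0), "g": (0, 1, 0), "c": (0, 0, 1), "t": (1, 1, 1)}


def walk_3d(sequence, vectors=TETRAHEDRAL):
    """Return the end point reached after each base of the sequence."""
    x = y = z = 0
    points = []
    for base in sequence.lower():
        dx, dy, dz = vectors[base]
        x, y, z = x + dx, y + dy, z + dz
        points.append((x, y, z))
    return points


print(walk_3d("agct"))         # tetrahedral walk
print(walk_3d("agct", AGC_T))  # AGC-T walk
```

with the tetrahedral vectors the four steps a, g, c, t sum to zero, so a walk can return to a previously visited point; with the agc-t vectors every step has non-negative components and each base changes at least one coordinate, so the curve never revisits a point.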
another 3d graphical representation, called the z curve, combines the properties of several characteristic curves [129]. a single z curve contains the information about the distributions of purine/pyrimidine, amino/keto and strong h-bond/weak h-bond bases. recently, new 3d graphical representations based on the frequencies of occurrence of pairs of nucleotides (dual nucleotides or dinucleotides) or of trinucleotides in dna sequences have been created. four nucleotides form 16 dinucleotides and 64 trinucleotides. by assigning different vectors in 3d space to each pair or to each trinucleotide, 3d curves are obtained. the curves contain the information about neighboring bases and their distributions along the sequence. dual nucleotides can also be divided into groups according to their chemical properties, for example purine dinucleotides (ag, ga), pyrimidine dinucleotides (ct, tc), amino ones (ac, ca), keto ones (tg, gt), weak h-bond (at, ta) and strong h-bond (cg, gc). a 3d graphical representation of one sequence by four characteristic curves based on dinucleotides has been proposed by cao et al. [130]. other 3d graphical representations based on dinucleotides (pn-curves) [131], (dn-curves) [132], (d-curves) [133], or based on trinucleotides (tn-curves) [134] have also been proposed. graphical representations constitute a tool allowing visual inspection of the sequences. moreover, each graph can be characterized by quantities which, in the theory of molecular similarity, are called descriptors. the descriptors, which represent some properties of the sequences numerically, can be used for similarity/dissimilarity analysis of the sequences. the computing time needed to calculate the descriptors is low, which makes the numerical comparison of long sequences attractive. the algorithm for computing the descriptors is independent of the visualization tool. therefore, the graphical representations can be regarded as both numerical and graphical tools separately.
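the three distributions combined in the z curve can be made concrete with a short sketch. the text above does not spell out the formulas, so the coordinates below follow the commonly cited definition of the z curve (cumulative counts of purines minus pyrimidines, amino minus keto bases, and weak minus strong h-bond bases); this choice, the sign conventions, and the name `z_curve` should be read as assumptions.

```python
def z_curve(sequence):
    """Return the (x, y, z) point of the Z curve after each base.

    x: purines (a, g) minus pyrimidines (c, t)
    y: amino (a, c) minus keto (g, t)
    z: weak h-bond (a, t) minus strong h-bond (c, g)
    The sign conventions are an assumption of this sketch.
    """
    counts = {"a": 0, "c": 0, "g": 0, "t": 0}
    points = []
    for base in sequence.lower():
        counts[base] += 1
        a, c, g, t = counts["a"], counts["c"], counts["g"], counts["t"]
        points.append(((a + g) - (c + t), (a + c) - (g + t), (a + t) - (c + g)))
    return points


print(z_curve("agct"))
```

each coordinate is a running difference of two groups of bases, so a single curve carries the three classifications that the two-line characteristic graphs display separately.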
however, each descriptor represents some specific properties of the graphs and it is not obvious how to characterize graphical objects by numerical values (for a review of methods related to the creation of mathematical descriptors of dna sequences up to 2006 see [135]). one of the methods most commonly used to describe graphs numerically is transforming the plots to matrices. the method was initially introduced by randić et al. for 3d graphical representations [120]. the authors introduced distance matrices, d/d. the numerator in the matrix element (i, j) stands for the euclidean distance between vertices i and j, and the denominator stands for the graph-theoretical distance (the number of arcs separating the two vertices). the authors proposed the leading eigenvalues of the matrices as the descriptors. the normalized leading eigenvalue of a d/d matrix offers a measure of the degree of folding of a chain-like structure or a curve. the authors also introduced the higher-order matrix ^k d/^k d that is constructed by raising the matrix elements of the d/d matrix to the power k. in the limit k → ∞, the resulting matrix reduces to a binary matrix ^∞ d/^∞ d. the authors also proposed the leading eigenvalues of these matrices as descriptors. such descriptors can be viewed as an index of flexibility (or stiffness) of the structure. the methods of transforming graphs to matrices stimulated the introduction of new kinds of matrices. different kinds of matrices associated with the graphs have been introduced by song and tang [109]. the authors introduced the euclidean matrix e, whose (i, j) element is defined as the euclidean distance between vertices (dots) i and j of the curve. they also introduced the m/m matrix, whose elements are defined as the quotient of the euclidean distance between two vertices of the curve and the number of arcs between the two vertices.
the third kind of matrix introduced by these authors is l/l matrix whose elements are defined as a quotient of the euclidean distance between two vertices of the curve and the sum of geometrical lengths of arcs between the two vertices. as the descriptors the authors chose the leading eigenvalues of m/m and l/l matrices. the authors considered characteristic linear curves and their descriptors characterize the distribution of bases with different chemical structures. the authors also considered higher-order l/l matrices. new kind of matrices has been also proposed by liao et al. [38] . the authors introduced covariance matrices associated with the graphs. usually, the leading eigenvalues of the matrices are taken as descriptors. a discussion of the properties of such kind of descriptors may be found in a recent article by yuan et al. [136] . some authors propose to consider more eigenvalues or matrix elements as descriptors of the sequences. wang and zhang proposed to take as a descriptor the sum of the maximal and minimal eigenvalues for the matrices associated with their graphical representation, called three non-base representation [111] . the authors suggested that the information reflected only by the leading eigenvalue might not be comprehensive enough. liao et al. [38] took all (two) eigenvalues of the 2 × 2 covariance matrices. li and wang proposed as descriptors normalized matrix norms instead of the eigenvalues [99] . randić et al. considered as the descriptors average matrix elements of the matrices associated with the four-color map representation of dna sequences [118] . liao and wang proposed as descriptors the average bandwidths [125] . they can be obtained by summing the distance matrix elements along each of the lines parallel to the main diagonal if the matrix is in the canonical form. qi and fan took all elements of the matrix as descriptors of the sequences of equal lengths [131] . 
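the matrix-based descriptors described above can be sketched for an ordered list of curve vertices. the code below builds the euclidean matrix e, the quotient matrix with the number of arcs in the denominator (m/m, which has the same form as d/d) and the l/l matrix, and estimates the leading eigenvalue by power iteration. using power iteration instead of a full eigensolver, taking consecutive vertices as joined by one arc, and the names `quotient_matrices` and `leading_eigenvalue` are choices made only for this illustration.

```python
import math


def quotient_matrices(points):
    """Return (E, M/M, L/L) for an ordered list of distinct curve vertices."""
    n = len(points)
    # Cumulative geometric length of the curve up to each vertex.
    length = [0.0]
    for k in range(1, n):
        length.append(length[-1] + math.dist(points[k - 1], points[k]))
    e = [[math.dist(points[i], points[j]) for j in range(n)] for i in range(n)]
    # Euclidean distance divided by the number of arcs between i and j.
    mm = [[0.0 if i == j else e[i][j] / abs(i - j) for j in range(n)]
          for i in range(n)]
    # Euclidean distance divided by the summed arc lengths between i and j.
    ll = [[0.0 if i == j else e[i][j] / abs(length[i] - length[j])
           for j in range(n)] for i in range(n)]
    return e, mm, ll


def leading_eigenvalue(matrix, iterations=500):
    """Power iteration for a symmetric matrix with non-negative entries."""
    n = len(matrix)
    v = [1.0] * n
    lam = 0.0
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            return 0.0
        v = [x / norm for x in w]
        # Rayleigh quotient of the normalized vector.
        lam = sum(v[i] * sum(matrix[i][j] * v[j] for j in range(n))
                  for i in range(n))
    return lam


straight = [(0, 0), (1, 0), (2, 0), (3, 0)]
folded = [(0, 0), (1, 1), (2, 0), (3, 1)]
for name, curve in (("straight", straight), ("folded", folded)):
    _, mm, ll = quotient_matrices(curve)
    print(name, leading_eigenvalue(mm), leading_eigenvalue(ll))
```

for a straight chain every off-diagonal element of the l/l matrix equals 1, so its leading eigenvalue is the number of vertices minus one; any folding lowers some elements and hence the eigenvalue, which is consistent with reading the normalized leading eigenvalue as a measure of the degree of folding.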
pesek and zerovnik proposed to take as the numerical characterization of the modified hamori curve a product of first ten and last ten eigenvalues of the descending ordered eigenvalue list of the matrix l/l [119] . numerical representation of 2d or 3d graphical representations of dna sequences based on transforming the graphs into matrices and deriving the descriptors from these matrices has been widely used by many authors. these descriptors characterizing a sequence can be used as components of similarity measures between a pair of sequences. examples of similarity analysis of dna sequences using this method may be found in [137, 105, 106, 108, 116, 123, 110, 112, 125, 124, 109, 118, 138, 126, 111] . numerical representation of a graphical representation can be also performed directly from the coordinates or from the properties of the graphs without transforming the graphs to matrices. gates plotted each sequence as a graph of the cumulative manhattan distance (from the origin) against the sequence position [139] . manhattan or city-block distance considered by gates is calculated as the arc length between points. for sequences of equal lengths it is convenient to plot the differences of the graphs. as descriptors of the sequences he proposed the means of the manhattan and euclidean "fractal" dimensions. raychaudhury and nandy proposed mean x and y coordinate values, and the radius of the graph as descriptors of dna sequences [140] . guo and nandy introduced also improved mean x and y coordinate values, and the radius of the graph, reducing the degeneracy of the previously defined descriptors of dna sequences [141] . yao et al. extended these descriptors to three dimensions defining 3d radius and adding mean z-coordinate as a descriptor [122] . we have extended the set of these 2d descriptors to higher-order moments of the mass-density distributions. 
the mean x and y coordinate values are equal to the first-order moments (m x,1 , m y,1 ) of the mass-density distributions ρ x and ρ y , respectively. in particular, if in a 2d-dynamic graph we put all masses equal to 1, then the 2d-dynamic graph becomes the nandy plot and all the moments of the two graphs are identical. when masses different from 1 are introduced, the mean x and y coordinate values become the coordinates of the center of mass of the graph and differ from those of the nandy plot. as the new descriptors we proposed moments of the mass-density distributions ρ x and ρ y up to the sixth order [101] and up to the eighth order [142]. higher-order moments give more specific information about the distribution of masses. for example, second-order moments (m x,2 , m y,2 ) give the information about the width of ρ x and ρ y . we have shown that the third- (m x,3 ), fourth- (m x,4 ), fifth- (m x,5 ), and sixth-order (m x,6 ) x-moments of the mass-density distributions representing histone h4 coding sequences have different values for plants than for vertebrates [101]. in the present work, 2d-plots of one x-moment against another of a different order (m x,q − m x,q′ ) are proposed instead of the 1d-plots (descriptors versus labels of the sequences) that were shown in [101]. the 2d-plots are shown in fig. 6. in all the plots we observe clusterization of evolutionarily similar organisms: plants are located in different parts of the plots than the vertebrates. the differences between histone h4 coding sequences across the species are not big and it is rather difficult to find descriptors that reveal the clusterization. please note that y-moments and also x-moments of order smaller than 4 do not lead to clusterization in this case. in particular, this means that using the nandy plots, for which the descriptors are taken as the mean values (first-order moments) of x and y, we cannot get the clusterization.
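the moment descriptors discussed above can be illustrated on the model sequences ctc and ctct. the defining equations of the moments are not reproduced in this section, so the sketch assumes the usual form of a normalized moment of a discrete distribution (masses times coordinate to the power q, divided by the total mass); the centered variant and the names `moment` and `centered_moment` are likewise assumptions.

```python
def moment(density, q):
    """q-th moment of a discrete mass-density distribution {coordinate: mass}.

    Normalizing by the total mass is an assumption of this sketch.
    """
    total = sum(density.values())
    return sum(mass * coord ** q for coord, mass in density.items()) / total


def centered_moment(density, q):
    """q-th moment taken about the mean (first moment) of the distribution."""
    mean = moment(density, 1)
    total = sum(density.values())
    return sum(mass * (coord - mean) ** q for coord, mass in density.items()) / total


# rho_y of the 2D-dynamic graphs of ctc and ctct (masses at y = 0 and y = 1).
rho_y_ctc = {0: 1, 1: 2}
rho_y_ctct = {0: 2, 1: 2}
# Same points with all masses set to 1, i.e. the Nandy-plot limit.
rho_y_nandy = {0: 1, 1: 1}

for name, rho in (("ctc", rho_y_ctc), ("ctct", rho_y_ctct), ("nandy", rho_y_nandy)):
    print(name, [moment(rho, q) for q in (1, 2, 3)], centered_moment(rho, 2))
```

in this toy case the first moment of ρ y already separates ctc from ctct, whereas with all masses equal to 1 both sequences give the same values.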
i have also found another set of descriptors (related to the four-component spectral representation) that reveal clusterization for histone h4 and h1 coding sequences (for details see the subsequent chapter). analogous (2d visualization) is introduced in the present work for the recently proposed molecular descriptors. figure 7 shows moment-based classification of the molecules: m 1 − m 2 (top), m 3 − m 4 (middle), and m 5 − m 6 (bottom). we have shown that the new molecular descriptors (moments of the intensity distributions) have different values for two kinds of molecules: nitriles and amides. in our recent paper, 1d plots have been presented (descriptors versus labels of the molecules) [143] . figure 7 shows 2d plots. we observe that the descriptors representing nitriles are located in different parts of the plots than those representing amides. figures 6 and 7 represent different objects: dna sequences and molecules, respectively. however, the idea is the same. the clusterization of the descriptors indicates that these descriptors can be a good tool for similarity/dissimilarity analysis. the descriptors cluster (have similar values) for similar objects so they exhibit some properties of the considered objects. moreover, some of the plots reveal similar shapes, as for example, middle and bottom parts of fig. 7 . this may suggest correlations between some of the descriptors. however, the shape is similar but not identical. the problem of correlation and extracting the minimal set of moments we studied in ref. [144] . we concluded that a universal set of independent moments does not exist. usually 4 lowest moments are sufficient to describe the object but also the information coming from higher-order moments cannot be neglected in some cases. as the new descriptors of dna sequences we also proposed the angles between the x axis and the principal axis of inertia of the 2d-dynamic graph (axes for which the tensor of moment of inertia is diagonal) [100] . 
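the inertia-based descriptors of a 2d-dynamic graph can be sketched by treating the graph as a planar set of point masses. the code below computes the inertia tensor about the center of mass, its two principal moments, and the angle between the x axis and the principal axis with the smaller moment. which principal axis is meant in [100] is not stated here, so this choice, the sign convention of the off-diagonal element, and the name `principal_inertia` are assumptions.

```python
import math


def principal_inertia(graph):
    """Principal moments and axis angle for a graph given as {(x, y): mass}.

    Returns ((i_min, i_max), angle), where angle (in radians, folded into
    (-pi/2, pi/2]) is measured from the x axis to the principal axis that
    has the smaller moment of inertia.
    """
    total = sum(graph.values())
    xc = sum(m * x for (x, _), m in graph.items()) / total
    yc = sum(m * y for (_, y), m in graph.items()) / total
    # Planar inertia tensor about the center of mass.
    ixx = sum(m * (y - yc) ** 2 for (_, y), m in graph.items())
    iyy = sum(m * (x - xc) ** 2 for (x, _), m in graph.items())
    ixy = -sum(m * (x - xc) * (y - yc) for (x, y), m in graph.items())
    mean = (ixx + iyy) / 2.0
    half_gap = math.hypot((ixx - iyy) / 2.0, ixy)
    i_min, i_max = mean - half_gap, mean + half_gap
    # Direction of the eigenvector belonging to the larger moment ...
    theta_max = 0.5 * math.atan2(2.0 * ixy, ixx - iyy)
    # ... and the perpendicular direction belonging to the smaller one.
    theta_min = theta_max + math.pi / 2.0
    if theta_min > math.pi / 2.0:
        theta_min -= math.pi
    return (i_min, i_max), theta_min


# Two unit masses on the diagonal: the easy axis of rotation is the diagonal.
moments, angle = principal_inertia({(1, 1): 1, (-1, -1): 1})
print(moments, math.degrees(angle))
```

masses lying close to an axis contribute little to the moment about that axis, so a small principal moment indicates that the graph is stretched along the corresponding direction.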
we also introduced the principal moments of inertia as the descriptors of dna sequences associated with the 2d-dynamic graph [100]. they are associated with the rotations about the principal axes. the moment of inertia of an object about a given axis describes how difficult it is to induce an angular rotation of the object about this axis. if the mass is concentrated close to the axis of rotation, the object is easier to spin up and the moment of inertia is smaller. as a consequence, these descriptors give the information about the concentrations of masses around the axes. another kind of new descriptors has been recently proposed by huang et al. [98]. the authors proposed to take as the descriptors the set of characteristic vectors representing all bases in the sequence. guo and wang obtained smooth curves from the zigzag curves and took curvatures of the smooth curves as descriptors of the sequences [145]. yu et al. proposed two kinds of descriptors: a set of coordinates of tn curves, and the probabilities of occurrence of particular trinucleotides among all 64 trinucleotides in the sequence [134]. yu et al. composed a 6d vector associated with the d-curve as a descriptor of dna sequences [133]. another kind of non-standard descriptors has also been introduced for the four-component spectral representation (for the details see the next chapter). the descriptors are the numerical characteristics of the sequences. the next step would be the creation of similarity measures between sequences. in most of the similarity studies the set of descriptors characterizing a sequence is treated as components of a vector. usually, the euclidean distance between the vectors corresponding to a pair of sequences is taken as the similarity measure. in particular, for identical sequences, this similarity measure is equal to zero. recently, non-standard measures have been introduced. for example huang et al.
defined a measure that changes from 0 to 1 and is equal to 1 for identical sequences [98] . chen et al. constructed cosine value that is a similarity measure of the mean x, y, z coordinates of their graphs [127] . we have used the manhattan distance normalized by the mean value of the descriptors for the similarity studies of the sequences represented by the 2d-dynamic graphs [101, 146] . for identical sequences this measure is equal to zero, as it is assumed for most of similarity studies. another non-standard similarity measure, also normalized to zero for identical sequences, is introduced in this work for four-component spectral representation (for details see subsequent chapter). however, in the alignment studies the similarity measure changes from 0 to 100 for identical sequences. such a measure is also defined for four-component spectral representation ( [147] , see next chapter). another similarity measure, also normalized to 100 for identical sequences, we have used for comparisons of 2d-graphs. this similarity measure introduces nonconventional treatment of graphs and their similarity analysis. we have not calculated the descriptors but the similarity measure has been directly obtained from the graphs. in our studies we treated the graphs as rigid bodies, as in the classical dynamics. as a similarity measure for a pair of sequences represented by the graphs we took mass overlaps [146] . using the genetic methods, very efficient in problems of optimization, we found the locations of a pair of graphs for which their mass overlap reaches maximum. in this position the similarity measure is defined as a mass overlap of a pair of graphs. in the process of maximization of the mass overlap we considered shifts and rotations of the graphs. recently, i have introduced another graphical representation [65] . in this section, the details and new aspects of this representation are described. 
graphically, this representation resembles a molecular spectrum, so i call it the spectral representation. the dna sequence is represented by a four-component function (or, graphically, by a four-component spectrum). a single dna sequence is represented by four abstract spectra: one for bases a, one for c, one for t and one for g. this means that i decompose each sequence into four components. i call each γ-component a γ-spectrum, where γ = a, c, t, g denotes one of the bases. each γ-component is given by a function that is a superposition of gaussian functions (eq. 8), where n is the length of the sequence, and the parameter r is the resolution of i γ (x). for the visualization of long sequences it is convenient to take small r. the resolution parameter r determines the distances between the maxima of the gaussians. the details of the spectra are better visible when r is large, i.e. when the neighboring maxima are well separated. with an increasing r the resolution becomes larger. if r = 1 then the maximum corresponding to the first base (p = 1) is located at x = x_1 = 0 and the maximum corresponding to the last base is located at x = x_n = n − 1. generally, the locations of the consecutive bases in one of the four γ-spectra correspond to x = 0, r, 2r, . . . , (n − 1)r, i.e. each single gaussian function makes a contribution to one of the four γ-spectra. if the neighboring γ bases are closely packed then the intensities (i γ ) increase. if the sequence does not contain one of the γ bases then the corresponding γ-component is zero and all the contributions are located in the three other γ-spectra. generally, the distributions of particular bases along the sequences are asymmetric and this information is reflected in the form of i γ (x). in principle, x may change from −∞ to +∞. however, in practical terms, i γ (x) = 0 if x < −r or x > nr. therefore one can assume that the graphs extend over the range of x from −r to nr.
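the construction of the γ-spectra can be sketched numerically. the defining equation is not reproduced in this section, so the code assumes unit-height gaussians of the form exp[−(x − x_p)^2] centered at x_p = (p − 1) r, one for each base of type γ, which is the form suggested by the later discussion of the figures. the names `gamma_spectrum` and `four_component_spectrum` are hypothetical.

```python
import math


def gamma_spectrum(sequence, gamma, x, r=1.0):
    """Value at x of the gamma-spectrum of a sequence.

    Assumed form: a sum of gaussians exp(-(x - x_p)**2) with x_p = (p - 1) * r
    over all positions p (counted from 1) occupied by the base gamma.
    """
    return sum(
        math.exp(-(x - index * r) ** 2)      # index = p - 1
        for index, base in enumerate(sequence.lower())
        if base == gamma
    )


def four_component_spectrum(sequence, x, r=1.0):
    """All four components at the point x, keyed by base."""
    return {gamma: gamma_spectrum(sequence, gamma, x, r) for gamma in "actg"}


# Model sequence atat: a contributes at p = 1, 3 and t at p = 2, 4.
for x in (0.0, 1.0, 2.0, 3.0):
    print(x, four_component_spectrum("atat", x))
```

for the model sequence atat only the a- and t-spectra are non-zero, and their maxima alternate along x, which mirrors the alternation of the two bases in the sequence.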
in this way the first and the last bases are considered in the same way as the other ones. however, for the numerical characterization related to this graphical representation the range from −∞ to +∞ is considered. as the numerical characterization of the four-component spectral representation i propose the properly scaled distribution moments. analogously to the definitions of the moments of a discrete distribution (eqs. 4-6), the q-th moment of the continuous distribution i γ (x) reads where is the normalization constant and r(x) is the range of x for which the integrand does not vanish. the normalization has been introduced for the numerical characteristics of the sequences. visualization is independent of the numerical calculations and it is clearer to consider the unnormalized plots defined as γ-spectra in eq. 8. good descriptors of the distributions are also the centered moments m for which the first moment is equal to 0, and also the moments m (eq. 13) for which the first moment is equal to 0 and the second one is equal to 1. when considering several lowest moments it is convenient to perform the integrations over the whole range of x (from −∞ to +∞). the integration can be performed analytically and where in the graphical representation defined in eq. 8, the summations are performed from p = 1 to p = n for each γ. however, the contributions of many terms are zero. only the terms for which the occupation number is different from zero give a non-zero contribution to the γ-spectrum, and their number is n γ , which is the number of γ bases in the sequence and let us take an example of a model sequence atat. the nonvanishing terms that contribute to the a-spectrum are those for p = 1, 3. in the case of the t-spectrum p = 2, 4, and for the g- and c-spectra all the contributions are zero. as a consequence, the four-component spectrum is the descriptors associated with the four-component spectral representation (d γ q ) have been defined as properly scaled distribution moments [65].
in particular and for q ≥ 3. as it has been shown in the article [65] , due to the division by r, d [147] . in particular, these diagrams can be used for an identification of genes. in this kind of visualization, different types of classified objects are clustered in different areas of the plots. as a similarity measure between a pair of sequences labeled by i and j is proposed, where q = 1, 2, 3, 4 [147] . though q may be easily increased up to higher-orders, as we shall see, the information about similarity sequences is specific enough up to the fourth order. let us note that d γ q is consistent with standard measures used in biology: for the identical sequences the similarity value equals 100% and it decreases (approaching 0) if the difference between the two d γ q increases. the average information about the similarity of a pair of sequences is contained in the measure where are referred to as the weights, n γ (i) is the number of γ bases in the i-th sequence, and is the length of the i-th sequence. in order to study the problem of convergence of the method with respect to the higher-order moments i consider, for a pair of sequences labeled by i and j, the similarity measure where n is the maximum order of moments taken into account. all definitions may be easily generalized for multiple similarity studies. if j sequences labeled by i ≡ {i 1 , i 2 , . . . i j } are matched then the measures are defined as and the weights are equal to the relative numbers of γ bases in all the considered sequences and the measures defined in eqs. 28, 29 and 31 may change from 0% to 100%, analogously to the ones defined, respectively, in eqs. 24, 25 and 27. an alternative similarity measure is defined in this work as s γ q is equal to 0 if the descriptors of the i-th and the j-th sequences are the same (d γ q (i) = d γ q ( j)) and approaches 1 if the difference between the two descriptors increases. 
this similarity measure is analogous to the one that we have introduced in the molecular similarity studies [56]. i also introduce a similarity measure between the sequences labeled by i and j that carries the information about several (n) properties where i 1 < i 2 < · · · < i n and w i 1 . . . w i n denote the weights. s i 1 ,i 2 ,...i n γ (i, j) is also normalized to values belonging to the range from 0 (identical properties) to 1. for example, if we consider the similarity of three properties, the width, the asymmetry and the kurtosis of the γ-spectrum, which are described by s γ 2 , s γ 3 and s γ 4 respectively, then n = 3, i 1 = 2, i 2 = 3, i 3 = 4 and the similarity measure is in this work all the weights in eqs. 33 and 34 are equal to 1. the units of descriptors d i k (eq. 23) are normalized for i k ≥ 3. as a consequence, for example, s 3,4 γ is a convenient measure for the comparison of sequences of different lengths, if we are interested in similarity information that is not related to the lengths of the sequences. if the information about the mean value d 1 or about the width d 2 of the γ-spectra needs to be compared, then s i 1 ,i 2 ,...i n γ with i k equal to 1 or 2 may be considered. all the panels (a-d) in the figure represent the same model sequence. the difference is the resolution: r = 1, r = 2, r = 3, r = 4 in panels a, b, c, d, respectively. the particular bases are represented by gaussians centered at x_p = (p − 1)r, where p = 1, 2, . . . 50. the first base is represented by a gaussian with the maximum located at x_1 = 0 in all the cases and the last one at x_50 = 49, x_50 = 98, x_50 = 147, x_50 = 196 for r = 1, r = 2, r = 3, r = 4, respectively. for smaller r the bases are located close to each other; as a consequence the neighboring gaussian functions overlap and we observe the envelope of the spectrum. in particular, if all the bases are the same, the spectrum becomes rectangular (fig. 8, panel a).
when the resolution is increased, the range for which the spectrum is different from zero becomes larger and we have a chance to look into the details of the spectra. the details are the locations of particular bases along the sequence. for long sequences, a balance has to be found between the details of the spectra and the range of the plot, which is determined by the location of the last gaussian, x_n = (n − 1)r. theoretically, the resolution may change from a small positive value to infinity. however, changing the resolution does not always change the information coming from the spectrum. for example, if in the model example the resolution is taken as smaller than 1 then a rectangular representation is also obtained. figure 9 shows the i a spectrum for this model example where r = 0.5. the differences from the case r = 1 (fig. 8, panel a) are the range (x_50 = 24.5 for r = 0.5) and the maximum values of i a . for smaller resolution the range of the spectrum is smaller and the neighboring maxima are located close to each other. as a consequence of the closely located gaussian functions exp[−(x − x_p)^2], the resulting maxima of the spectrum i a are larger (around 3 in fig. 9 and around 2 in fig. 8, panel a). however, the qualitative information is the same in fig. 8, panel a, and in fig. 9. in the case of real sequences, there is a natural separation between the neighboring bases. usually the resolution r = 1, or even smaller, is sufficient for a good visualization. in fig. 10, the spectral representation of the histone h1 coding sequence of arabidopsis thaliana is shown (i = 19, table 16). the length of the sequence is n = 822. the resolution has been taken as r = 1. the numbers of particular bases are n a = 259, n c = 167, n t = 188, and n g = 208. the largest number of a bases can be easily seen (a large number of lines with large intensities as an effect of overlapping closely located gaussians representing a bases). the same sequence but with the resolution ten times smaller is shown in fig. 11.
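the effect of the resolution on the model sequence of 50 identical bases can be checked with a short calculation. the sketch assumes the gaussian form exp[−(x − x_p)^2] with x_p = (p − 1) r quoted above and evaluates the spectrum at the center of a gaussian in the middle of the sequence; the name `homopolymer_peak` is hypothetical.

```python
import math


def homopolymer_peak(n, r):
    """Spectrum height at a central maximum for n identical bases.

    Assumes gaussians exp(-(x - x_p)**2) centered at x_p = (p - 1) * r.
    """
    centers = [index * r for index in range(n)]   # index = p - 1
    x_mid = centers[n // 2]
    return sum(math.exp(-(x_mid - c) ** 2) for c in centers)


for r in (0.5, 1.0, 2.0, 3.0, 4.0):
    last_center = (50 - 1) * r                    # location of the last gaussian
    print(r, last_center, round(homopolymer_peak(50, r), 3))
```

under these assumptions the central height is about 1.77 for r = 1 and about 3.54 for r = 0.5, in line with the values of roughly 2 and 3 read from the figures, while for r ≥ 2 the gaussians barely overlap and the height approaches 1, so the individual bases become visible.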
the resolution r = 0.1 seems to be sufficient to distinguish between those ranges of x for which the density of bases is larger comparing to ranges that are poor in the considered bases. a very convenient way of a direct comparison of the difference between a pair of sequences labeled by i and j is plotting the difference i γ i j . clearly, for both sequences i γ (x) must be represented with the same resolution in order to compare the distribution of γ bases along the sequence. figures 12 and 13 show the differences between a pair of sequences. in fig. 12 the differences with resolution r = 1 between the spectra representing histone h4 coding sequence of human (i = 9, table 17 ) and histone h4 coding sequence of maize ( j = 1, table 17 table 16 ). the distributions of particular bases along the sequences. in particular, the number of lines in i a i j , i t i j , i g i j is smaller than for i c i j . this means that the difference of the distributions of c bases along the sequences is the largest comparing to the differences of the distributions of other bases. moreover comparing the number of lines that are positive to the ones that are negative, for a particular plot, one can easily estimate the differences between the numbers of the particular bases. for example, n c = 79 for the sequence of human and n c = 96 for the sequence of maize so the number of negative lines for i c i j is larger then the number of the positive ones. analogously, the table 16) number of negative lines for i g i j can be seen: n g = 100 for the sequence of human and n g = 111 for the sequence of maize. since the number of a and t bases are larger for the sequence of human then for the sequence of maize, one can observe in i a i j and i t i j plots more positive lines than the negative ones. in fig. 
13 the differences with the resolution r = 1 between the spectra representing histone h4 coding sequence of human (i = 9, table 17) and histone h4 coding sequence of mouse (j = 7, table 17) are shown. as a result of the difference between the numbers of a bases one can observe in i a i j more positive lines than negative ones: n a = 73 for the sequence of human and n a = 65 for the sequence of mouse. the difference in c bases is also clearly seen. there are more negative than positive lines in the i c i j plot: n c = 79 for the sequence of human and n c = 96 for the sequence of mouse. generally, comparing fig. 12 and fig. 13 one can see that the differences between the human and maize spectra are larger than the differences between the human and mouse spectra (the number of lines in fig. 12 is larger than the number of lines in fig. 13).
fig. 12 differences between the spectra for histone h4 coding sequence of human m60749 and histone h4 coding sequence of maize m13377 (i = 9, j = 1, table 17)
fig. 13 differences between the spectra for histone h4 coding sequence of human m60749 and histone h4 coding sequence of mouse v00753 (i = 9, j = 7, table 17)
as the descriptors of the four-component spectral representation, i have proposed d γ q . figure 14 shows the d g 1 − d g 4 diagram for ten sequences listed in table 17 and for one additional sequence (one point in the figure represents the descriptors of one sequence). the additional sequence is histone h4 coding sequence of human (m16707). in many articles the ten sequences were treated as a model set to introduce new graphical and numerical representations. however, there was a mistake in the old version of the embl database. obviously, the length of this coding sequence should be 312 and not 311 as it was in the old version of the embl database. the additional base is g, located at the last position of the sequence. the descriptors d γ q of spectral representation are very sensitive.
the difference by only one base can be detected using these descriptors. moreover, the approximate location of this base can be indicated. the descriptors characterizing the same sequence calculated using the old and new version of the embl database have been denoted using different symbols in fig. 14. their locations are different in the diagram. it is remarkable that the difference by this very base may be recognized in the plots. figures 15, 16, 17, 18 show the diagrams also for the sequences listed in table 17. in particular, fig. 15 shows diagrams for g-descriptors, fig. 16 for a-descriptors, fig. 17 for c-descriptors, and fig. 18 for t-descriptors. panels a in the figures correspond to d [...] indicates the location in the sequence of the base that is different for a pair of sequences. the additional g base in the new sequence causes the shift to larger values of the mean of the distribution (d g 1 becomes larger, fig. 14, fig. 15, panel a). the width of the distribution also increases (d g 2 for the new sequence is larger than for the old one, fig. 15 [...] follows: considering the properties of g and a-spectra (g and a-descriptors shown in figs. 15, 16, respectively) one can observe clusterization of evolutionarily similar organisms: plants and vertebrates, which are represented by different symbols in the plots (plants-circles, vertebrates-triangles). considering the properties of c-spectra (fig. 17) one can find properties that are specific for plants and different from those for vertebrates, and one can also find properties that are common for plants and vertebrates.
for example in panels a and c where d it is interesting to note that most of the similarity measures (both the standard ones and many alternative ones) indicate larger or equal similarity values between histone h1 coding sequences of chicken (labeled by i = 4, 5 in table 17 ) and plants (labeled in this table by j = 1, 2, 3, 6) than between these of chicken and of vertebrates (labeled table 18 similarity measures between a pair of sequences labeled by i and j sim(i, j), where i and j are defined in the first column of table 17 sim sim (5, 6) sim ( 100 19 by j = 7, 8, 9) . however, using new similarity approach it is possible to extract such components of the similarity measures that cluster the sequence of chicken with the ones of vertebrates rather than with the ones of plants [147] . table 18 shows similarity values obtained using different similarity measures "sim". using alignment method (sim=cl) the similarity value "chicken-plant" cl(5,6) is the same as the similarity value "chicken-vertebrate" cl (5, 7) . considering different aspects of similarity, using d γ 3 , one can see that the clusterization of the sequence of chicken with vertebrates is obtained for γ = g, a, c. however the asymmetry of the gene structure for t bases is identical for the sequence of chicken and of plants (d t 3 (5, 6) = 100) and the similarity value is small in case "chicken-vertebrate" (d t 3 (5, 7) = 19). figures 19 and 20 show the diagrams for the sequences listed in table 16 (histone h1 coding sequences of different species). in particular, fig. 19 shows d and d γ 2 is linear. the most regular linear dependence is for g-descriptors (fig. 20, panel d) . however, using the diagrams for the descriptors independent of the lengths of sequences (fig. 19) , for a and g-descriptors (panels a, d respectively) the clusterization of plants and vertebrates is observed. for c-descriptors, the effect of clusterization is smaller. the effect of clusterization is not observed for t-descriptors. 
t-descriptors representing sequences of plants and vertebrates even overlap. these observations are the same as in the case of histone h4 coding sequences. figures 21, 22, 23, 24, 25 show the relations between the standard calculations clustal w (cl) and the new measures (eqs. 24, 25, 27) for the sequences listed in table 17 (histone h4 coding sequences). figures 21, 22, 23, 24, 25, 26, 27, 28, 29 are plotted in the same way as it has been done in chapter 2 (figs. 1, 2, 3). each point in the plot corresponds to one case: comparison of the sequence of species no. i with the sequence of species no. j using different methods. for example, the horizontal axis in fig. 21 corresponds to the similarity matrix between sequences of different species using the clustal w method (cl) and the vertical axes correspond to the similarity matrix between the same sequences using different components of the alternative similarity measures d γ q . as a consequence each plot represents two similarity matrices, which gives a better visualization of the relations between two different similarity measures. in the figures, the functions x = y, where x and y represent, respectively, the horizontal and vertical axes, are plotted (dashed lines). comparing the distributions of the points around the dashed lines it is easy to recognize those aspects of similarity for which the relations are the same. if the points are concentrated close to the lines then the similarity relations represented by the x and y axes are also close to each other. the similarity matrix cl based on the clustal w approach for the considered sequences is given in [146]. a small range of similarity measures indicates small differences between the sequences of different species. the range of values of cl is from 78% to 100%. the ranges of values of d γ q for q = 1, q = 2, and q = 4 are smaller than for q = 3 for all γ . d γ 3 changes from about 15% to 100% for all γ .
the differences between sequences across species using d γ 4 are smaller than using d since each γ -component is related in a different way to the standard measure, one may expect that it carries independent similarity information. averaging the measures over γ , and then averaging over q, d m e an q (eq. 25) and d n (i, j) (eq. 27) are obtained. figure 25 shows the relations of d n with the standard measure. the convergence of d n measures to the standard measure cl we have discussed in [147] . in the present paper this effect is shown in detail adding d 1 table 17 term. d 1 is very different from cl (the points are located far away from the dashed line, panel a, fig. 25 ). adding higher-order terms, the points are pushed towards the dashed lines (panels b, c, d fig. 25 ). figures 26, 27, 28, 29, 30 show similarity relations for β-globin gene across species using similarity measures defined in eqs. 32 and 33. these data are the standard ones for alternative methods. since the sequences in the database are not complete for some species, they are unified in this work and the appropriate locations in the gene are listed in the tables. in particular, the sequences of mouse and of chicken belong to the standard set of data used by many authors. however, several bases are ambiguous for the third exons for the sequences of the two species. as it was already mentioned, the method used in this work is so sensitive that even a difference in a single base can influence the results. therefore the sequences of mouse and of chicken are omitted from this consideration. moreover, in gorilla and chimpanzee sequences the stop codons are not available in the database. therefore for all the species the stop codons are excluded from the calculations. this means that the length of the coding sequence n cds is three times larger than the corresponding length of the protein sequence for all the species. 
in this way (excluding the stop codons) all the data used in the calculations are consistent. table 17 the locations in the gene, the numbers of γ bases, n γ k , for each k-th exon according to the latest version of the embl database are specified in tables 19, 20 table 17. 5. the whole first exons which are given in the embl database only for three species with the length n w 1 , denoted exon 1, 6. the coding sequences with the lengths n cds = 3 k=1 n k , denoted cds. γ are also shown in fig. 28 . in this figure the measures are compared for different parts of the β-globin gene. the horizontal axes correspond to the sequences with introns, plusi. the vertical axes correspond to the coding sequences of particular exons: column 1 to exon 1 c ds , column 2 to exon 2 c ds and column 3 to exon 3 c ds . the first row of subfigures correspond to a bases, the second row to c bases, the third row to t bases and the fourth row to g bases. we observe that the points are concentrated around the dashed lines in the middle column (exon 2 c ds ) comparing to the first and to the third columns. small deviations from the dashed lines mean that the second exon is most representative in the whole sequence, plusi (the similarity relations across species fulfilled by plusi and and by exon 2 c ds are closer to each other than the relations fulfilled by plusi and by the other exons). we have also shown that the similarity relations across species fulfilled by cds and and by exon 2 c ds are closer to each other than the relations fulfilled by cds and by the other exons [71] . if we compare the distributions of the points between different bases (rows) one can extract some properties common for particular bases and for some parts of the genes. by a common property we understand close to zero s 3,4 γ (small values correspond to large similarities). 
in particular, small differences between sequences across species are revealed for g bases for the first and for the second exons (panels j, k) and also for c and for t bases for the second exon (panels e, h). generally, larger differences are seen for longer sequences. however, also for plusi one can extract properties more common for the species (small ranges of s 3,4 a [plusi] and s 3,4 t [plusi], first and third rows). figure 29 shows similarity relations for different exons using the standard alignment method clustal w version 2.0 [148]. as was mentioned before, the alignment methods do not take into account which bases are aligned. the alignment of all the bases contributes to the final result and, as a consequence, the similarity is large for all the parts. it is not possible to extract detailed properties of similarities. the information coming from these calculations is averaged. finally, the similarity values for different exons are the same for all the species since most of the points are concentrated close to the dashed lines. complete sequences for the first exons are given only for three species (table 23). the whole sequences of the first exons for human and gorilla differ by only one base. as we see in fig. 30 this is a g base. the descriptors d a 1 , d c 1 , d t 1 are exactly the same for human and gorilla sequences. the difference caused by this single base is recognized by d g 1 (panel d). summarizing, the four-component spectral representation has been used for similarity/dissimilarity analysis of histone h4 coding sequences across species (figs. 12, 13, 14, 15, 16, 17, 18, 21, 22, 23, 24, 25), of histone h1 coding sequences across species (figs. 19, 20), and of different parts of β-globin gene across species (figs. 26, 27, 28, 29, 30).
since many authors use slightly different data for β-globin gene, the locations of different subsequences in this gene and their full description listed in the tables may be helpful for some alternative similarity studies. the numbers of particular bases in all the sequences are also given. it has been shown that the four-component spectral representation can be used for classification studies (clusterization of the descriptors representing histones h4 and h1 coding sequences of plants and of vertebrates). analogous clusterization is also obtained using some descriptors related to 2d-dynamic graphs (sect. 4). the sensitivity of the four-component spectral representation has also been shown. in particular, a difference between a pair of sequences by only one base can be recognized. the approximate location of the difference and the base which is different in the compared sequences can also be determined. it has been shown that if higher-order terms of the similarity measure based on the descriptors of the four-component spectral representation are added and normalized in the same way as in the alignment methods then a convergence to clustal w results may be obtained. this means that the results obtained with the alignment method may be interpreted as an average of the considered components of the alternative similarity measures. calculating an average is always related to some loss of information, i.e. a large degree of degeneracy may appear. as we know, this is an inconvenient feature of similarity/dissimilarity analysis. for example, using the alignment methods the two situations (1) aaaa compared with aaaa and (2) tttt compared with tttt cannot be distinguished. therefore, using the four-component spectral representations one has a chance to decompose the similarity information and remove the degeneracy. reducing the degeneracy can also be achieved by adding corrections to the alignment methods related to different aspects of similarity, as is proposed in sect. 2 of this work.
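The degeneracy of the two situations (aaaa with aaaa, tttt with tttt) can be made concrete with a small sketch. It uses ungapped percent identity as a simplified stand-in for an alignment score, which is an assumption made for illustration and not the clustal w calculation; the per-base tally is likewise only a toy decomposition, not the descriptors of the four-component spectral representation. The function names are hypothetical.

```python
from collections import Counter


def percent_identity(a, b):
    """Ungapped percent identity of two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


def matched_bases(a, b):
    """Count the matching positions separately for each base."""
    return Counter(x for x, y in zip(a, b) if x == y)


if __name__ == "__main__":
    for first, second in (("aaaa", "aaaa"), ("tttt", "tttt")):
        # Both pairs give 100.0, but the per-base tallies differ.
        print(percent_identity(first, second), dict(matched_bases(first, second)))
```

The single averaged score is identical for both pairs, whereas keeping the contributions of the four bases separate distinguishes them, which is the sense in which a base-resolved description removes the degeneracy.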
it has been shown that each part of β-globin gene demonstrates different similarity relations across species. the relations also change when different aspects of similarity are compared (asymmetry of the gene structure or kurtosis of the distributions). therefore, when different descriptors or different graphical representations are used, the results may be, and very often should be, contradictory. different alternative methods describe different aspects of similarity. in particular, most of the alternative studies that have been performed for exon 1 cds of β-globin gene often give contradictory results. for example, the similarity value of exon 1 cds human-goat is larger than human-mouse if the methods described in the works [106, 112, 126, 137, 149] are used. the reverse situation, i.e. a similarity value between the sequences of exon 1 cds human-goat that is smaller than human-mouse, occurs if methods taken from [32, 33, 36, 108, 110, 122, 150-152] are applied. many authors introducing new graphical representations for β-globin gene try to avoid considering chimpanzee and gorilla sequences, not only because the data are not complete but also because the results are often different from our expectations. we expect the largest similarity for human-chimpanzee sequences. however, detailed similarity/dissimilarity analysis of β-globin gene using the four-component spectral representation indicates that this is not true for all parts of the β-globin gene and for all γ -components of the similarity measures. according to the definition of the new measures, s 3,4 γ becomes smaller if the sequences are more similar. considering the second exon, i obtain the largest similarity in the case of human-chimpanzee sequences. this means that s 3,4 γ is the smallest for the two sequences for all γ , and in particular s 3,4 γ = 0 for γ = a, c, t . the difference between the two sequences is only in the distribution of g bases.
it is interesting to note that s 3,4 γ = 0 for the second exon, both for c and for t bases, in three cases: human-chimpanzee, human-gorilla and gorilla-chimpanzee sequences. however, for other exons, s 3,4 γ is not always the smallest in the case of human-chimpanzee sequences compared with human-other species sequences. if the sequence with introns, plusi, is considered then s 3,4 c is the smallest for human-chimpanzee sequences and, for γ = a, t, g, s 3,4 γ are the smallest for human-gorilla sequences. each descriptor may be related to a different biological function. since we are at the beginning of the way of understanding in which contexts particular descriptors may play the key role, the creation of new alternative methods aiming at similarity/dissimilarity analysis of biological sequences is of particular importance.
open access this article is distributed under the terms of the creative commons attribution noncommercial license which permits any noncommercial use, distribution, and reproduction in any medium, provided the original author(s) and source are credited.
parts of the first exons starting with the start codon (coding sequences of the first exons) with the length n 1 , denoted exon 1 c ds , 3.
the second exons with the length n 2 biological sequence analysis introduction to computational biology: maps, sequences, and genomes: interdisciplinary statistics advances in molecular similarity topological indices and related descriptors in qsar and qspr symmetry, spectroscopy and schur a new view on similarity of dna sequences annual review of nuclear and particle science the advanced theory of statistics symmetry and structural properties of condensed matter the oxford companion to music key: cord-334394-qgyzk7th authors: edgar, robert c.; taylor, jeff; altman, tomer; barbera, pierre; meleshko, dmitry; lin, victor; lohr, dan; novakovsky, gherman; al-shayeb, basem; banfield, jillian f.; korobeynikov, anton; chikhi, rayan; babaian, artem title: petabase-scale sequence alignment catalyses viral discovery date: 2020-08-10 journal: biorxiv doi: 10.1101/2020.08.07.241729 sha: doc_id: 334394 cord_uid: qgyzk7th public sequence data represents a major opportunity for viral discovery, but its exploration has been inhibited by a lack of efficient methods for searching this corpus, which is currently at the petabase scale and growing exponentially. to address the ongoing pandemic caused by severe acute respiratory syndrome coronavirus 2 and expand the known sequence diversity of viruses, we aligned pangenomes for coronaviruses (cov) and other viral families to 5.6 petabases of public sequencing data from 3.8 million biologically diverse samples. to implement this strategy, we developed a cloud computing architecture, serratus, tailored for ultra-high throughput sequence alignment at the petabase scale. from this search, we identified and assembled thousands of cov and cov-like genomes and genome fragments ranging from known strains to putatively novel genera. we generalise this strategy to other viral families, identifying several novel deltaviruses and huge bacteriophages. 
to catalyse a new era of viral discovery we made millions of viral alignments and family identifications freely available to the research community. expanding the known diversity and zoonotic reservoirs of cov and other emerging pathogens can accelerate vaccine and therapeutic developments for the current pandemic, and help us anticipate and mitigate future ones. viral zoonotic disease has had a major impact on human health over the past century despite dramatic advances in medical science, notably by the spanish flu, aids, sars, ebola and covid-19 pandemics. there are an estimated 320,000 mammalian viruses [1] from which emerging infectious diseases in humans may arise [2] . uncovering this viral biodiversity is a prerequisite for predicting and preventing future epidemics and is therefore the focus of consortia such as usaid predict [3] and the global virome project [4] as well as hundreds of government and academic research projects worldwide. these efforts can be aided through re-analysis of petabases of high-throughput sequencing data available in public databases such as the sequence read archive (sra) [5] . this data spans millions of ecologically diverse biological samples, many of which capture viral transcripts that may be incidental to the goals of the original studies [6] . to expand the known repertoire of viruses and catalyse global virus discovery, in particular for coronaviridae (cov) family, we developed the serratus cloud computing architecture for ultra-high throughput sequence alignment. from a screen of 3.8 million libraries comprising 5.6 petabases of sequencing reads, we report 11,120 assemblies, including sequences from 13 previously uncharacterised or unavailable cov or cov-like operational taxonomic units (otus), defined by clustering amino sequences of the rna dependent rna polymerase (rdrp) gene at 97% identity. 
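Grouping sequences into otus at a fixed identity threshold can be sketched as follows. The text states only that rdrp sequences were clustered at 97% identity; the greedy centroid scheme below, the assumption that sequences are already aligned to equal length, and the function names are all illustrative choices, not the procedure used by the authors.

```python
def identical_positions(a, b):
    """Count identical positions; assumes a and b are pre-aligned to equal length."""
    if len(a) != len(b):
        raise ValueError("sequences must be pre-aligned to the same length")
    return sum(x == y for x, y in zip(a, b))


def greedy_otus(sequences, threshold_percent=97):
    """Greedy clustering: each sequence joins the first OTU whose centroid it matches.

    A sequence that matches no existing centroid at the threshold starts a new OTU
    and becomes its centroid. Integer arithmetic avoids floating-point edge cases.
    """
    otus = []  # each OTU is a list whose first element is its centroid
    for seq in sequences:
        for otu in otus:
            matches = identical_positions(seq, otu[0])
            if matches * 100 >= threshold_percent * len(seq):
                otu.append(seq)
                break
        else:
            otus.append([seq])
    return otus


if __name__ == "__main__":
    base = "A" * 100              # toy amino acid strings, for illustration only
    near = "C" + "A" * 99         # 99% identical to base
    far = "C" * 10 + "A" * 90     # 90% identical to base
    print([len(otu) for otu in greedy_otus([base, near, far])])
```

Greedy clustering depends on input order, so a production pipeline would normally rely on a dedicated clustering tool; the sketch only shows what an identity threshold means operationally.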
to demonstrate the broader utility of our approach, we also report six novel deltaviruses related to the human pathogen hepatitis δ virus (hdv), and expand the described members of the recently characterised family of huge bacteriophages (phages). viral discovery is a first step in preparing for the next pandemic. sequencing reads for thousands of uncharacterised viruses already exist and require careful curation. to accelerate this process, we established a freely available and explorable resource of all vertebrate viral alignment data generated by serratus at https://serratus.io. this work lays the foundation for years of future research by enabling the exploration of viruses which have been captured by more than a decade of high-throughput sequencing studies. serratus is a freely available, open-source cloud-computing platform designed to enable petabase-scale sequence alignment against a set of references. using serratus, we aligned in excess of one million short-read sequencing datasets per day for under 1 us cent per dataset (extended figure 1 ). this was achieved by leveraging commercially available computing infrastructure to employ up to 22,250 virtual cpus simultaneously (see methods). we aligned 3,837,755 public rna-seq, meta-genome, meta-virome and meta-transcriptome datasets (termed a sequencing run [5] ) against a collection of viral family pangenomes comprising all genbank cov records clustered at 99% identity plus all non-retroviral refseq records for vertebrate viruses (see methods and extended table 1 ). to uncover more divergent viruses, we re-analysed 370,014 runs in a translated nucleotide search against a query comprising panproteomes for cov and other families. we performed de novo assembly on 52,772 runs potentially containing cov sequencing reads by combining 37,131 sra accessions identified by the serratus search with 18,584 identified by an ongoing cataloguing initiative of the sra called stat [5] . 
11,120 of the resulting assemblies contained putative cov contigs, of which 4,179 aligned to cov rdrp (extended table 2). of these, we identified 13 novel otus from a total of 129, i.e. otus not represented by coronaviridae in genbank (figure 1a and extended figure 2). the protein domains of these otus are consistent with a cov or cov-like genome organisation (extended figure 3). three of the novel cov otus fell within the alphacoronavirus (αcov) genus. the first (exemplar run: err2756788) was from two desmodus rotundus bat metagenomes yielding 29.1 and 25.4 kb cov contigs respectively in the nyctacovirus subgenus. these cov were noted by the data collectors [7], but the sequences were not public and thus novel to our analysis. the second otu (srr9643845) was from a pipistrellus pipistrellus bat metagenome collected in 2016 in china. the third, from five libraries (err2744266) generated for a study on the metagenomic effects of the burying beetle nicrophorus vespilloides on a mouse carcass, was a luchacovirus related to the rodent lucheng rn rat coronavirus (83% genome nucleotide identity to nc_032730.1). from a rodent virome study which identified several novel cov [8], a sample from an unknown species contained a βcov embecovirus (srr5447167), with the closest matching genome being an unclassified βcov from vietnam (77.52% to mh687971). finally, the δcov otu (srr5447167) appears to be from a currently unpublished avian virome study in china. we designated the eight remaining otus as group e, noting that all were found in samples from non-mammal aquatic vertebrates falling outside of δcov in the tree (extended figure 2).
figure 1: expanded characterisation of cov and related otus. a, radial cladogram derived from maximum likelihood tree of cov and related otus. inset is a phylogram of the same tree annotated with cov genera (greek letters) and group e cov-like nidoviruses. otus were generated by clustering the rdrp gene at 97% identity. diversity within each such 97% otu was characterised by counting the number of 99% identity otus it contained. an otu (97% or 99%) was considered to be known if it contained a genbank sequence, otherwise to be a novel otu discovered by serratus. hosts were considered novel if the source organism annotated by the sra belonged to a species not annotated as a host in any genbank record, noting that the annotated source may differ from the viral host (e.g., faecal contamination in a plant sample). hosts are classified as primates, fowl (galliformes), bats (chiroptera), aquatic (amphibia and osteichthyes), or other. b, length distribution for assemblies of sra datasets classified as likely cov-positive, showing a peak around the typical cov genome length of 30 knt. c, triangular matrix showing median rdrp sequence identities between selected nidovirales and group e viruses. d, phylogram of group e cov-like nidoviruses.
a sister taxon to coronaviridae was recently proposed [9] following the characterisation of a corona-like virus, microhyla alphaletovirus 1 (mlev), in the frog microhyla fissipes, and soon after a related pacific salmon nidovirus (psnv) was described in the endangered oncorhynchus tshawytscha [10]. two of our otus were in these host species and the described viruses proved to be near-perfect matches. we expand this recently characterised group with six additional members: five similar to psnv, in takifugu pardalis (fugu fish; tparnv), syngnathus typhle (broad-nosed pipefish; stypnv), hippocampus kuda (seahorse; hkudnv) [11], puntigrus tetrazona (tiger barb; ptetnv) and ambystoma mexicanum (axolotl; amexnv), and a more distant member in caretta caretta (loggerhead sea turtle; ccarnv). notably, the ambystoma mexicanum (axolotl) nidovirus (amexnv) was assembled in 18 runs, 11 of which yielded 19 kb contigs. easing the criteria of requiring an rdrp match, 28/44 (63.6%) of the runs from the associated studies were amexnv positive.
gene structure of the amexnv and related contigs suggests that there is genomic segmentation within this clade (extended figure 3), and a homologous assembly gap is present in the published psnv genome [10]. these contigs were obtained from experimental animals from two different research groups [12-14]; the common factor is the animal stock centre used by these studies, which is therefore likely to be the source of the virus. axolotl are critically endangered in the wild; determining the distribution and pathophysiology of amexnv in these animals can assist with conservation efforts. infectious agents are the leading cause of pyrexia of unknown origin (puo) in children and immunocompromised adults [15]. in addition to identifying genetic diversity within cov, we cross-referenced cov+ library meta-data to identify possible zoonoses and infer vectors of transmission. discordant libraries, those in which a cov is identified and the expected viral host does not match the sequencing library source taxa, were rare, accounting for only 0.92% of cases (extended table 2e). in a 2010 virome sequencing study [16] of children with febrile illness, we identified sequencing runs from two children, one febrile (id:9007) and one afebrile (id:9090), with reads mapping to the βcov murine hepatitis virus (mhv). we assembled a complete 31.3 kb mhv genome from each replicate taken from the febrile child and a partial genome from the afebrile child. mhv can infect human cells in vitro [17], but may be rare in humans. this highlights how rapid and unbiased metagenomic sequence analysis can resolve the etiology of a subset of puo; centralisation of these data (stripped of human-identifying reads) also serves as a public-health surveillance system for zoonosis. an important consideration for these analyses is that the nucleic acid reads do not prove that viral infection has occurred in the nominal host species.
for example, we identified four libraries in which a porcine or avian coronavirus was found in plant samples. a more likely explanation than cross-kingdom cov transmission is that cov was present in faeces/fertiliser originating from a mammalian or avian host. coronaviridae is a well-characterised family ( figure 2 and extended figure 4 ), yet our re-analysis of the sra yielded eleven novel or under-reported otus. there are at least 4,497 more high-confidence (score ≥80) and diverged (≤90% identity) virus-containing datasets. in particular picornaviridae and reoviridae are enriched and numerous within this category ( figure 2 ). serratus exploration of under-characterised viruses can potentially fill these gaps in our knowledge. the global mortality from viral hepatitis exceeds that of hiv/aids, tuberculosis or malaria, due to acute and chronic liver cirrhosis and subsequent hepatocellular carcinoma [18] . hepatitis delta virus (hdv) is a small ( 1.7 knt) rna satellite virus infecting hepatocytes. alone, hdv is unable to produce infectious viral particles, as it requires the envelope protein from its helper virus, hepatitis b (hbv) [19] . hdv infection aggravates liver cirrhosis caused by hbv and worsens clinical outcomes [20] . prior to 2018, hdv was the sole known member of its genus; ten members have since been characterised [21-25]. we identified an additional six deltaviruses ( figure 3a ) and assembled complete circular genomes for five (extended figure 6 ). the evolutionary histories of these deltaviruses are explored further in a companion manuscript [24] . one of these novel deltaviruses, mmondv, was identified in marmota monax (eastern woodchuck), a model organism used over the last three decades for the study of viral-induced hepatitis and hepatocellular carcinoma following woodchuck hepatitis virus (whv) infection, a hepadnavirus similar to hbv [26]. 
from a 2015 study of 24 woodchucks born in captivity and experimentally infected with whv [30], liver biopsy rna-seq from four (16.7%) animals contained >5 mmondv-mapping reads in at least one time-point of the 26 week study (figure 3c). woodchuck hepatitis virus can support replication of human hdv and is in fact a model for hdv pathogenesis [31, 32], so it is probable that whv is also the helper virus for mmondv. inter-animal variation of whv-induced liver cirrhosis can be substantial [30], and cryptic mmondv infection may have been underlying some of this variability over the past three decades of research using this model system, which warrants further investigation. to explore the utility of broad-scale read archive searches for microbiome research, we sought to locate phages whose genomes encode proteins related to the terminases and major capsid proteins from recently reported huge phages [33]. to focus on phages whose genomes are substantially larger than normal (the average size is 52 kbp [33]), we prioritised assembled sequences of ≥140 kbp (figure 4a). assembly of 287 high-scoring runs returned 252 terminase-containing long contigs, primarily from cats, dogs, cattle and whales. the phylogenetic analysis of these sequences resolves new groups of phages with large genomes, some of which comprise sequences from only one animal genus. however, in a few cases we identified closely related phages in different animal orders, including one case where related phages were found in a human from bangladesh (err866585) and groups of cats (prjeb9357) and dogs (prjeb34360) from england, sampled 5 years apart. this result parallels the finding of 545 kbp lak phage genomes in pigs, baboons and humans [34]. these newly recovered sequences substantially expand the previously defined clades and reveal members of these clades in new habitats (figure 4b).
overall, these findings amplify that phages with large genomes are prevalent in human and animal microbiomes. since the completion of the initial draft of the human genome, the cost of dna sequencing has outpaced moore's law with a corresponding increase in the sizes of sequence databases [35] . serratus offers researchers access to over a decade of data collected by the global research community in a rapid and a cost-effective manner. while our first priority was viral discovery in the context of an ongoing global health crisis, we believe that serratus and further extensions of petabase scale metagenomics will shape a new era in computational biology, and enable radically new approaches to gene discovery, pathogen surveillance, pangenomic evolutionary analysis amongst other applications. rapid translation of large datasets, such as those generated by serratus, into meaningful biomedical advances requires concerted collaboration between specialists [36] and underscores a greater need for prompt, free and unrestricted data sharing in the community, not only of raw data (reads) but also of analyses such as assemblies and annotations. to facilitate such progress, we established a data warehouse of the 5.7 terabytes of viral alignments containing known, and yet to be characterized, viral species, each requiring domain expertise for curation. these data can be explored via a graphical web interface at https://serratus.io or programatically through the r package tantalus (https://github.com/serratus-bio/tantalus) which interfaces to a postgresql-server hosting high-level data summaries. computational biology is outpacing the rate at which classical isolation-or culture-based validation can be performed. reverse genetics and synthetic nucleic acids offer a path to biological validation when virions are unavailable, such as those predicted from sequence alone [37, 38] . 
innovative fields such as high-throughput functional viromics [39] leverage these broad and rapidly growing collections of viral sequences, and can inform evidence-based policies responding to emerging pandemics [40, 41] . human population growth and encroachment on animal habitats is bringing more species into proximity, leading to increased zoonosis [2] and accelerating the anthropocene mass extinction [42, 43] . while the availability of computation and data analysis is increasing, the opportunity to capture the rich genetic diversity of endangered species and their associated microorganism biodiversity is not. the need to invest in field studies for the collection and curation of rare and biologically diverse samples has never been as pressing as it is today. if not for the conservation of endangered species, then to conserve our own. figure 1 ). the processing of each sequencing library is split into three modules dl (download), align, and merge. the dl module acquires compressed data (.sra format) via prefetch, from the aws s3 mirror of the sra, decompresses to fastq, and splits the data into fq-blocks of 1 million reads or read-pairs into a temporary s3 cache bucket. to mitigate excessive disk usage caused by a few large datasets, a limit of 100 million reads per dataset was imposed. the align module reads individual fq-blocks and aligns to an indexed database of user-provided query sequences using either bowtie2 each component is launched from a separate aws autoscaling group with its own launch template, allowing the user to tailor instance requirements per task. this enabled us to minimise the use of costly block storage during compute-bound tasks such as alignment. we used the following spot instance types; dl: 250gb ssd block storage, 8vcpus, 32gb ram (r5.xlarge) 1300 instances; align: 10gb ssd block storage, 8vcpus, 8gb ram (c5.xlarge) 4,300 instances; merge: 150gb ssd block storage, 4vcpus, 4gb ram (c5.large) 60 instances. 
users should note that it may be necessary to submit a service ticket to exceed the default limit of 20 ec2 instances. ec2 instances have higher network bandwidth (up to 1.25 gb/s) than block storage bandwidth (250 mb/s). to exploit this, we used s3 buckets as a data buffering and streaming system and to transfer data between instances following methods developed in a previous cloud architecture (https://github.com/fredhutch/sra-pipeline). this, combined with splitting of fastq files into individual blocks, effectively eliminated file input/output (i/o) as a bottleneck, since the available i/o is multiplied per running instance (conceptually analogous to a raid0 configuration). using s3 as a buffer also allowed us to decouple the input and output of each module. s3 storage is cheap enough that in the event of unexpected issues (e.g., exceeding ec2 quotas) we could resolve problems and resume processing. for example, we could shut down the align modules to hotfix a genome indexing problem without having to re-run the dl modules. the serratus scheduler node controls the number of desired instances to be created for each component of the workflow, based on the available work queue. we implemented a pull-based work queue. upon boot-up each instance launches a number of worker threads equal to the number of cpus available. each worker independently manages itself via a boot script, and queries the scheduler for available tasks. upon completion of the task, the worker updates the scheduler with the result (success or fail) and queries for a new task. under ideal conditions, this allows a worst-case response time in the hundreds of milliseconds, keeping cluster throughput high. each task typically lasts several minutes. the scheduler itself was implemented using postgres (for persistence and concurrency) and flask (to pool connections and translate rest queries into sql).
the flask layer allowed us to scale the cluster past the number of simultaneous sessions manageable by a single postgres instance. the work queue can also be managed manually by the user, to perform operations such as re-attempting the download of an sra accession upon a failure or pausing an operation while debugging. the system is designed to be fully self-scaling. an "autoscaling controller" was implemented which scales in or scales out the desired number of instances per task every five minutes based on the work queue. as a backstop, when all workers on an instance fail to receive work instructions from the scheduler, the instance is shut down. finally, a "job cleaner" component checks the active jobs against currently running instances. if an instance has disappeared due to spot termination or manual shutdown, it resets the job, allowing it to be picked up by the next available instance. to monitor cluster performance in real-time, we used prometheus and node exporter to retrieve cpu, disk, memory, and networking statistics from each instance, postgres exporter to expose performance information about the work queue, and python exporter to export information from the flask server. this allowed us to identify and diagnose performance problems within minutes to avoid costly overruns. we define a viral pangenome as the entire collection of reference sequences belonging to a taxonomic viral family, which may contain both full-length genomes and sequence fragments such as those based on rdrp amplicon sequencing. we developed a summarizer module written in python to provide a compact, human- and machine-readable synopsis of the alignments generated for each sra dataset. the method was implemented in serratus summarizer.py for nucleotide alignment and serratus psummarizer.py for amino acid alignments. reports generated by the summarizer are text files with three sections described in detail online (https://github.com/ababaian/serratus/wiki/.summary-reports).
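the pull-based work queue described above can be sketched in a few lines. the following is a minimal in-memory illustration only, not the serratus scheduler (which uses postgres and flask): the class WorkQueue, its methods, the state names and the helper run_workers are all hypothetical names chosen for this example, and a lock stands in for database-level concurrency control.

```python
import threading
from typing import Callable, Dict, List, Optional


class WorkQueue:
    """Hypothetical in-memory stand-in for the scheduler's task table."""

    def __init__(self) -> None:
        self.status: Dict[str, str] = {}   # task id -> "new" | "running" | "done" | "failed"
        self._lock = threading.Lock()      # stands in for database concurrency control

    def add(self, task_id: str) -> None:
        with self._lock:
            self.status[task_id] = "new"

    def pull(self) -> Optional[str]:
        """Atomically claim one unclaimed task; return None when no work is left."""
        with self._lock:
            for task_id, state in self.status.items():
                if state == "new":
                    self.status[task_id] = "running"
                    return task_id
        return None

    def report(self, task_id: str, ok: bool) -> None:
        """Record the outcome of a task: success or fail."""
        with self._lock:
            self.status[task_id] = "done" if ok else "failed"


def worker(queue: WorkQueue, handler: Callable[[str], None]) -> None:
    """Each worker pulls tasks on its own until the queue is empty."""
    while True:
        task_id = queue.pull()
        if task_id is None:
            return  # no work instructions received: the worker stops
        try:
            handler(task_id)
            queue.report(task_id, ok=True)
        except Exception:
            queue.report(task_id, ok=False)


def run_workers(queue: WorkQueue, handler: Callable[[str], None], n_workers: int) -> None:
    """Launch n_workers threads (e.g. one per cpu) and wait for them to finish."""
    threads: List[threading.Thread] = [
        threading.Thread(target=worker, args=(queue, handler)) for _ in range(n_workers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

because workers ask for work rather than being assigned it, a slow or terminated worker never blocks the others; a failed or abandoned task only needs its state reset to be picked up again.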
in brief, each contains a header section with alignment meta-data and one-line summaries for each virus family pangenome, reference sequence and gene respectively, with gene summaries provided for protein alignments only. for each summary line we include descriptive statistics gathered from the alignment data such as the number of aligned reads, estimated read depth, mean alignment identity, and coverage, i.e. the distribution of reads across each reference sequence or pangenome. coverage is measured by dividing a reference sequence into 25 equal bins and depicted as an ascii text string of 25 symbols, one per bin; for example oaooomouu:owwuuwowamwaauw. each symbol represents $\log_2(n + 1)$, where n is the number of reads aligned to a bin, in this order: .:uwaomuwaom^. thus, ' ' indicates no reads, '.' exactly one read, ':' two reads, 'u' 3-4 reads, 'w' 5-7 reads and so on; '^' represents $> 2^{13} = 8{,}192$ reads in the bin. for a pangenome, alignments to its reference sequences are projected onto a corresponding set of 25 bins. for a complete genome, the projected pangenome bin number $1, 2, \ldots, 25$ is the same as the reference sequence bin number. for a fragment, a bin is projected onto the pangenome bin implied by the alignment of the fragment to a complete genome. for example, if the start of a fragment aligns half way into a complete genome, bin 1 of the fragment is projected to bin $\lfloor 25/2 \rfloor = 12$ of the pangenome. the introduction of pangenome bins was motivated by the observation that bowtie2 selects an alignment at random when there are two or more top-scoring alignments, which tends to distribute coverage over several reference sequences when a single viral genome is present in the reads. coverage of a single reference genome may therefore be fragmented, and binning to a pangenome better assesses coverage over a putative viral genome in the reads while retaining pangenome sequence diversity for detection.
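the binning step described above can be illustrated with a short sketch. this is not the serratus summarizer code: the function name bin_counts, the use of 0-based alignment start positions, and the rule that a read is counted in the bin containing its start are assumptions made for this example. it only covers a single reference sequence and leaves out the symbol encoding and the projection of fragments onto pangenome bins.

```python
from typing import Iterable, List

N_BINS = 25  # number of equal-width coverage bins per reference sequence


def bin_counts(starts: Iterable[int], ref_length: int, n_bins: int = N_BINS) -> List[int]:
    """Count aligned reads per bin.

    starts     -- 0-based alignment start positions on the reference (assumption)
    ref_length -- length of the reference sequence in bases
    """
    if ref_length <= 0:
        raise ValueError("ref_length must be positive")
    counts = [0] * n_bins
    for pos in starts:
        if not 0 <= pos < ref_length:
            raise ValueError("position %d is outside the reference" % pos)
        # integer arithmetic maps positions 0..ref_length-1 onto bins 0..n_bins-1
        counts[pos * n_bins // ref_length] += 1
    return counts


def covered_bins(counts: List[int]) -> int:
    """Number of bins with at least one aligned read."""
    return sum(1 for n in counts if n > 0)
```

only 25 integers are kept per reference, which is why binning adds little overhead compared with tracking coverage for every base.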
the summarizer implements a binary classifier predicting the presence or absence of each virus family in the query. for a given family f , the classifier reports a score in the range [0,100] with the goal of assigning a high score to a dataset if it contains f and a low score if it does not. setting a threshold on the score divides datasets into disjoint subsets representing predicted positive and negative detections of family f . the choice of threshold implies a trade-off between false positives and false negatives. sorting by decreasing score ranks datasets in decreasing order of confidence that f is present in the reads. naively, a natural measure of the presence of a virus family is the number of alignments to its reference sequences. however, alignments may be induced by non-homologous sequence similarity, for example low-complexity sequence. the score for a family was therefore designed to reflect the overall coverage of a pangenome because coverage across all or most of a pangenome is more likely to reflect true homology, i.e. the presence of a related virus. ideally, coverage would be measured individually for each base in the reference sequence, but this could add undesirable overhead in compute time and memory for a process which is executed in the linux alignment pipe (fastq decompression → aligner → summarizer → alignment file compression). coverage was therefore measured by binning as described above, which can be implemented with minimal overhead. a virus that is present in the reads with coverage too low to enable an assembly may have less practical value than an assembled genome. also, genomes with lower identity to previously known sequences will tend to contain more novel biological information than genomes with high identity and will tend to have fewer alignments highly diverged segments. with these considerations in mind, the classifier was designed to give higher scores when coverage is high, read depth is high, and/or identity is low. 
this was accomplished as follows. let h be the number of bins with at least 8 alignments to f, and l be the number of bins with from 1 to 7 alignments. let s be the mean alignment percentage identity, and define the identity weight $w = (s/100)^{-3}$, which is designed to give higher weight to lower identities, noting that w is close to one when identity is close to 100% and increases rapidly at lower identities. the classification score for family f is calculated as $z_f = \min(w(4h + l), 100)$. by construction, $z_f$ has a maximum of 100 when coverage is consistently high across a pangenome, and is also high when identity is low and coverage is moderate, which may reflect high read depth but many false negative alignments due to low identity. thus, $z_f$ is greater than zero when there is at least one alignment to f and assigns higher scores to sra datasets which are more likely to support successful assembly of a virus belonging to f. )" (date accessed: may 17th 2020). retroviruses (n = 80) were excluded as preliminary testing yielded excessive numbers of alignments to transcribed endogenous retroviruses. each sequence was annotated with its taxonomic family according to its refseq record; those for which no family was assigned by refseq (n = 81) were designated as "unknown". the collection of these pangenomes was termed cov3m, and was the sequence reference used for this study. the protein search query was composed of the following sequences: (i) cov proteins (method described under to run serratus, a target list of sra run accessions is required. for this work, we designed target lists broadly classified as human, mouse, mammal, vertebrate, invertebrate, bat (including genome sequencing libraries), virome and metagenome (extended table 1c). each list contained accessions of rna-seq, meta-genomic, and metatranscriptome runs for these organisms; some run accessions appeared in more than one list.
prior to each serratus run, the lists were depleted for accessions already analyzed. re-processing of a failed dataset was attempted at least twice. in total we were able to generate alignments to the query pangenomes for 3,837,755/4,059,695 (94.5%) of the targeted sra accessions. we implemented an on-going, multi-tiered release policy for code and data generated by this study, as follows. all code, electronic notebooks and raw data are immediately available at https://github.com/ababaian/serratus and on the s3://serratus-public/ bucket, respectively. upon completion of a project milestone, a structured data-release is issued, depositing raw data into our viral data warehouse s3://lovelywater/. for example, at the time of writing the .bam alignment files from 3.84 million sra runs are stored in s3://lovelywater/bam/x.bam; .summary files are stored in s3://lovelywater/summary/x.summary, where x is an sra run accession. these structured releases enable downstream and third-party programmatic access to the data. summary files for every searched sra dataset are parsed into a postgresql relational database which can be queried remotely via an aws relational database service (rds) server. this enables users and programs to perform complex operations such as retrieving summaries and meta-data for all sra runs matching a given reference sequence with a classifier score above a given threshold. for example, one can retrieve all records containing at least 20 aligned reads to hepatitis delta virus (nc_001653.2) and the associated host taxonomy for the corresponding sra datasets. for users unfamiliar with sql queries we developed tantalus (https://github.com/serratus-bio/tantalus), an r programming-language package which directly interfaces with the serratus rds server to retrieve summary information as data-frames. tantalus also offers functions to explore and visualize the data. finally, the serratus data can be explored via a graphical web interface by accession, virus, or viral family at https://serratus.io.
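returning briefly to the family classification score defined earlier in this section, the calculation can be written out as a short sketch. the function names identity_weight and family_score are hypothetical, and the input is assumed to be a list of per-bin alignment counts for one family pangenome together with the mean percentage identity. the cap at 100 is implemented with min, consistent with the statement that the score has a maximum of 100.

```python
from typing import Sequence


def identity_weight(mean_identity_pct: float) -> float:
    """w = (s / 100) ** -3: close to 1 near 100% identity, larger at lower identity."""
    if not 0.0 < mean_identity_pct <= 100.0:
        raise ValueError("mean identity must be in (0, 100]")
    return (mean_identity_pct / 100.0) ** -3


def family_score(bin_alignments: Sequence[int], mean_identity_pct: float) -> float:
    """Score in [0, 100] for one family, from per-bin alignment counts."""
    h = sum(1 for n in bin_alignments if n >= 8)       # bins with at least 8 alignments
    l = sum(1 for n in bin_alignments if 1 <= n <= 7)  # bins with 1 to 7 alignments
    if h == 0 and l == 0:
        return 0.0  # no alignments to this family at all
    w = identity_weight(mean_identity_pct)
    return min(w * (4 * h + l), 100.0)
```

with 25 bins, a pangenome in which every bin holds at least 8 alignments reaches 4 × 25 = 100 at any identity, while a partially covered pangenome is scored higher when the mean identity is lower.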
the website uses javascript to access the rds server and create a graphical report with an overview of viral families found in each sra accession matching a user query. all four data access interfaces are under ongoing development, receiving community feedback via their respective github issue trackers to facilitate the translation of this data collection into an effective viral discovery resource. documentation for data access methods is available at https://serratus.io/access.
1.4 viral assembly and annotation
1.4.1 coronaspades
rna viral genome assembly faces several distinct challenges stemming from technical and biological bias in sequencing data. during library preparation, reverse transcription introduces 5′ end coverage bias, and gc-content skew and secondary structures lead to unequal pcr amplification [48]. technical bias is confounded by biological complexity such as intra-sample sequence variation due to transcript isoforms, as found in cov [49], and/or the presence of multiple strains. to address the assembly challenges specific to rna viruses, we developed coronaspades, described in detail in a companion manuscript [50]. in brief, rnaviralspades and the more specialized variant, coronaspades, combine algorithms and methods from several previous approaches based on metaspades [51], rnaspades [52] and metaviralspades [53] with an hmmpathextension step. coronaspades constructs an assembly graph from an rna-sequencing dataset (transcriptome, meta-transcriptome, and meta-virome are supported), removing expected sequencing artifacts such as low-complexity (poly-a / poly-t) tips, edges, single-strand chimeric loops or double-strand hairpins [52] and subspecies-based variation [53]. to deal with possible misassemblies and high-covered sequencing artifacts, a secondary hmmpathextension step is performed to leverage orthogonal information about the expected viral genome.
protein domains are identified on all assembly graphs using a set of viral hidden markov models (hmms), and similar to biosyntheticspades [54], hmmpathextension attempts to find paths on the assembly graph which pass through significant hmm matches in order. coronaspades is bundled with the pfam sars-cov-2 set of hmms [55], although these may be substituted by the user. this latter feature of coronaspades was utilized for hdv assembly, where the hmm model of hdag, the hepatitis delta antigen, was used instead of the pfam sars-cov-2 set. note that despite the name, these hmms are quite general, modeling domains found in all coronavirus genera in addition to rdrp, which is found in many rna virus families. hits from these hmms cover most bases in most known coronavirus genomes, enabling the recovery of strain mixtures and splice variants. accurate annotation of cov genomes is challenging due to ribosomal frameshifts and polyproteins which are cleaved into maturation proteins [56], and thus previously-annotated viral genomes offer a guide to accurate gene-calls and protein functional predictions. however, while many of the viral genomes we were likely to recover would be similar to previously-annotated genomes in refseq or genbank, we anticipated that many of the genomes would be taxonomically distant from any available reference. to address these constraints, we developed an annotation pipeline called darth [57] which leverages both reference-based and ab initio annotation approaches. in brief, darth consists of the following phases: canonicalize the ordering and orientation of assembly contigs using conserved domain alignments, perform reference-based annotation of the contigs, annotate rna secondary structure, perform ab initio gene-calling, generate files for aiding assembly and annotation diagnostics, and generate a master annotation file.
it is important to put the contigs in the "expected" orientation and ordering to facilitate comparative analysis of synteny and as a requirement for genome deposition. to perform this canonicalization, darth generates the six-frame translation of the contigs using transeq [58] and uses hmmer3 [59] to search the translations for pfam domain models specific to cov [60]. darth compares the pfam accessions from the hmmer alignment to the ncbi sars-cov-2 reference genome (ncbi nucleotide accession nc_045512.2) to determine the correct ordering and orientation, and produces an updated assembly fasta file. darth performs reference-based annotation using vadr [61], which provides a set of genome models for all cov refseq genomes [62]. vadr provides annotations of gene coordinates, polyprotein cleavage sites, and functional annotation of all proteins. darth supplements the vadr annotation by using infernal [63] to scan the contigs against the sars-cov-2 rfam release [64] which provides updated models of cov 5′ and 3′ untranslated regions (utrs) along with stem-loop structures associated with programmed ribosomal frameshifts. while vadr provides reference-based gene-calling, darth also provides ab initio gene-calling by using fraggenescan [65], a frameshift-aware gene caller. darth also generates auxiliary files which are useful for assembly quality and annotation diagnostics, such as indexed bam files created with samtools [66] representing self-alignment of the trimmed reads to the canonicalized assembly using bowtie2 [44], and variant-calls using bcftools from samtools. darth generates these files so that they can be easily loaded into a genome browser such as jbrowse [67] or igv [68]. as the final step darth generates a single generic feature format (gff) 3.0 file [69] containing the combined set of annotation information described above, ready for use in a genome browser, or for submitting the annotation and sequence to a genome repository.
the serratus searches described above identified 37,131 libraries (14,304 by nucleotide and 23,898 by amino acid) as potentially positive for cov (score ≥20 and ≥10 reads). to supplement this search we also employed a recently developed index of the sra called stat [5], which identified an additional 18,584 sra datasets not in the defined sra search space. the stat bigquery was where tax id=11118 and total count >1" accessed on june 24th 2020. we used aws batch to launch thousands of assemblies of ncbi accessions simultaneously. the workflow consists of four standard parts: a job queue, a job definition, a compute environment, and finally, the jobs themselves. a cloudformation template was created for building all parts of the cloud infrastructure from the command line. the job definition specifies a docker image, and asks for 8 virtual cpus (vcpus, corresponding to threads) and 60 gb of memory per job, corresponding to a reasonable allocation for coronaspades. the compute environment is the most involved component. we set it to run jobs on cost-effective spot instances (optimal setting) with an additional cost-optimization strategy (spot capacity optimized setting), and allowing up to 40,000 vcpus total. in addition, the compute environment specifies a launch template which, on each instance, i) automatically mounts an exclusive 1 tb ebs volume, allowing sufficient disk space for several concurrent assemblies, and ii) downloads the 5.4 gb checkv database, to avoid bloating the docker image. the peak aws usage of our batch infrastructure was 28,000 vcpus, performing 3,500 assemblies simultaneously. a total of 46,861 accessions out of 55,715 were assembled in a single day. they were then analysed by two methods to detect putative cov contigs. the first method is checkv, followed by selecting contigs associated with known cov genomes. the second method is a custom script that parses coronaspades bgc candidates and keeps contigs containing cov domain(s).
for each accession, we kept the set of contigs obtained by the first method (checkv) if it was non-empty, and otherwise we kept the set of contigs from the second method (bgc). a majority (76%) of the assemblies were discarded for one of the following reasons: i) no cov contigs were found by either filtering method, ii) reads were too short to be assembled, iii) batch job or sra download failed, or iv) coronaspades ran out of memory. a total of 11,120 assemblies were considered for further analysis. with rna-seq metagenomic reads, the number of reads per base may be highly variable at different locations in a viral genome. regions of high coverage may be adjacent to regions with low coverage or no reads, causing breaks between contigs. thus, a given base in a contig may have only one or very few reads as evidence, and as a consequence the reliability of base calls may be low in some regions of the assembly, which could degrade inference of biological variations between genomes. the assemblers used in this work do not provide a per-base quality score, and to address this issue we used two complementary approaches: (1) reporting contig average coverage as a proxy for quality, and (2) self-aligning reads to the assembly sequence and calling variants to enable facile visual inspection of per-base coverage levels and significant variants in genome browsers (see section 1.4.2). we developed a module, serratax, to predict taxonomy for cov genomes and assemblies (https://github.com/ababaian/serratus/tree/master/containers/serratax). serratax was designed with the following requirements in mind: provide taxonomy predictions for fragmented and partial assemblies in addition to complete genomes; report best-estimate predictions balancing over-classification and under-classification (too many and too few ranks, respectively); and assign an ncbi taxonomy database [70] identifier (taxid).
assigning a best-fit taxid was not supported by any previously published taxonomy prediction software to the best of our knowledge; this requires assignment to intermediate ranks such as sub-genus and ranks below species (commonly called strains, but these ranks are not named in the taxonomy database), and to unclassified taxa, e.g. taxid 2724161, unclassified buldecovirus, in cases where the genome is predicted to fall inside a named clade but outside all named taxa within that clade. serratax uses a reference database containing domain sequences with taxids. this database was constructed as follows. records annotated as cov were downloaded from uniprot [71], and chain sequences were extracted. each chain name, e.g. helicase, was considered to be a separate domain. to generate an alternate taxonomic annotation of an assembled genome, we created a pipeline based on phylogenetic placement, serraplace. to perform phylogenetic placement, a reference phylogenetic tree is required. to this end, we collected 823 reference amino acid rdrp sequences, spanning all coronaviridae. to this set we added an outgroup rdrp sequence from the torovirus family (nc_007447). we clustered the sequences to 99% identity using usearch ([46], uclust algorithm, v11.0.667), resulting in 546 centroid sequences. subsequently we performed multiple sequence alignment on the clustered sequences using muscle ([72], v3.8.31). we then performed maximum likelihood tree inference using raxml-ng ([73], protgtr+fo+g4, v0.9.0), resulting in our reference tree. to apply serraplace to a given genome, we first use hmmer ([59], v3.3) to generate a reference hmm, based on the reference alignment. we then split each contig into orfs using esl-translate, and use hmmsearch (p-value cutoff 0.01) to identify those query orfs that align with sufficient quality to the previously generated reference hmm. all orfs that pass this test are considered valid input sequences for phylogenetic placement.
subsequently, we use epa-ng ([74], v0.3.7) to place each sequence on the rdrp reference tree. this produces a set of likely placement locations on the tree, with an associated likelihood weight. we then use gappa ([75], v0.6.1) to assign taxonomic information to each query, using the taxonomic information for the reference sequences. gappa assigns taxonomy by first labelling the interior nodes of the reference tree by a consensus of the taxonomic labels of all descendant leaves of that node. if 66% of leaves share the same taxonomic label up to some level, then the internal node is assigned that label. then, the likelihood weight associated with each sequence is assigned to the labels of internal nodes of the reference tree, according to where the query was placed. from this result, we select the taxonomic label that accumulated the highest total likelihood weight as the taxonomic label of a sequence. note that multiple orfs of the same genome may result in a taxonomic label, in which case, we select the longest sequence as the source of the taxonomic assignment of the genome. we performed phylogenetic inferences using a custom snakemake pipeline (available at https://github.com/lczech/nidhoggr), using pargenes ([76], v1.1.2). pargenes is a tree search orchestrator, built on top of modeltest-ng [77] and raxml-ng, enabling higher levels of parallelisation for a given tree search. to infer the maximum likelihood phylogenetic tree displayed in extended figure 2, we performed a tree search comprising 100 distinct starting trees (50 random, 50 parsimony), as well as 1000 bootstrap searches. we used modeltest-ng to automatically select the best evolutionary model, which in this case was lg+iu+g4m. the pipeline also automatically produces versions of the best maximum likelihood tree annotated with felsenstein's bootstrap ([78]) support values, and transfer bootstrap expectation ([79]) values, the latter of which was used in extended figure 2.
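the taxonomic assignment logic described above for the placement results (consensus labelling of interior nodes, then accumulation of likelihood weights per label) can be illustrated with a simplified sketch. this is not gappa's implementation: the function names consensus_label and assign_label are hypothetical, taxonomic labels are represented as tuples of rank names from the highest rank downwards, the 66% rule is read as "at least 66%", and each placement is assumed to already carry the label of the node it was placed on.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Label = Tuple[str, ...]  # e.g. ("coronaviridae", "alphacoronavirus")


def consensus_label(leaf_labels: Sequence[Label], threshold: float = 0.66) -> Label:
    """Longest taxonomic prefix shared by at least `threshold` of the leaves."""
    consensus: List[str] = []
    depth = 0
    while True:
        counts: Dict[str, int] = defaultdict(int)
        for path in leaf_labels:
            # only leaves that agree with the consensus so far can extend it
            if len(path) > depth and list(path[:depth]) == consensus:
                counts[path[depth]] += 1
        if not counts:
            break
        name, n = max(counts.items(), key=lambda kv: kv[1])
        if n / len(leaf_labels) < threshold:
            break
        consensus.append(name)
        depth += 1
    return tuple(consensus)


def assign_label(placements: Sequence[Tuple[Label, float]]) -> Label:
    """Pick the label that accumulates the highest total likelihood weight."""
    if not placements:
        raise ValueError("at least one placement is required")
    totals: Dict[Label, float] = defaultdict(float)
    for label, weight in placements:
        totals[label] += weight
    return max(totals.items(), key=lambda kv: kv[1])[0]
```

summing weights per label means that several moderately supported placements within one clade can outweigh a single, slightly better supported placement elsewhere.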
archival copies of all code generated for this study are available at https://github.com/serratus-bio. electronic notebooks for experiments are available at https://github.com/ababaian/serratus. all data generated in this study can be accessed at https://serratus.io/access. assembled genome contigs for this study are available at https://serratus.io/access pending deposition into public repositories.
extended table 1: sra run queries and search nucleotide accessions. queries and accessions from this study. a sra queries to retrieve collections of datasets. b nucleotide accessions compiled into the cov3ma reference query and c the sequence mask applied to those sequences.
extended table 2: assembled coronaviridae in the sra. a run accessions, assembly statistics and select meta-data for the 11,120 runs for which coronaviridae, or coronaviridae-like sequences were assembled. b assignment of assembled runs to operational taxonomic units (otus) based on 97% identity of the rna dependent rna polymerase (rdrp) domain. c assignment of genbank records to rdrp otus. d assignment of expected viral host for genbank records. e taxonomic source for rdrp containing assemblies. f supporting data for figure 1.
extended figure 1: overview of the serratus architecture. a schematic and data workflow (b) as described in the methods for aligning to the viral pangenome (c). d a nucleotide alignment completion rate for serratus shows stable and linear performance to complete 1.29 million sra accessions in a 24-hour period. e cost breakdown for this run. compute costs between modules are an approximate comparison of cpu requirements of each step. the total average cost per completed sra accession was $0.0062 us dollars or $0.1892 us dollars per terabase processed.
extended figure 5 : distribution of dna and other viral families in the sra the total number of datasets matching each dna or other viral pangenome, binned by the average nucleotide identity and colored by score (see methods). an interactive and queryable version of this plot is available at https://serratus.io/family. figure 7 : deltavirus ribozymes evolutionary history a multiple sequence alignment of the genomic and anti-genomic deltavirus ribozymes based on muscle [72] and refined manually based on secondary structure. the shortening of the j1/2 loop and presence of the lg loop is specific to and conserved within the genomic ribozyme. consensus secondary structure of the b genomic and c anti-genomic ribozymes. d maximum-likelihood tree based on concatenated ribozyme sequences supports the topology of the δag amino-acid tree (figure 3) a strategy to estimate unknown viral diversity in mammals global trends in emerging infectious diseases. eng global shifts in mammalian population trends reveal key predictors of virus spillover risk the global virome project. en the sequence read archive the sensitivity of massively parallel sequencing for detecting candidate infectious agents associated with human tissue. eng demographic and environmental drivers of metagenomic viral diversity in vampire bats. en comparative analysis of rodent and small mammal viromes to better understand the wildlife origin of emerging infectious diseases description and initial characterization of metatranscriptomic nidovirus-like genomes from the proposed new family abyssoviridae, and from a sister group to the coronavirinae, the proposed genus alphaletovirus endangered wild salmon infected by newly discovered viruses comparative population genomics in animals uncovers the determinants of genetic diversity. 
en blastemal progenitors modulate immune signaling during early limb regeneration midkine is a dual regulator of wound epidermis development and inflammation during the initiation of limb regeneration ap-1 cfos/junb /mir-200a regulate the pro-regenerative glial cell response during axolotl spinal cord regeneration. en pyrexia of unknown origin sequence analysis of the human virome in febrile and afebrile children mouse hepatitis virus strain jhm infects a human hepatocellular carcinoma cell line. eng the global burden of viral hepatitis from 1990 to 2013: findings from the global burden of disease study infection by hepatitis delta virus. en pfam sars-cov-2 special update (part 2) en. library catalog: xfam.wordpress.com vadr: validation and annotation of virus sequence submissions to genbank coronavirus annotation using vadr en. library catalog: github infernal 1.1: 100-fold faster rna homology searches rfam coronavirus special release en. library catalog: xfam.wordpress.com fraggenescan: predicting genes in short and error-prone reads. en the sequence alignment/map format and samtools. eng jbrowse: a dynamic web platform for genome visualization and analysis. eng publisher: american association for cancer research section: focus on computer resources the sequence ontology: a tool for the unification of genome annotations the ncbi taxonomy database uniprot: a worldwide hub of protein knowledge muscle: multiple sequence alignment with high accuracy and high throughput raxml-ng: a fast, scalable and userfriendly tool for maximum likelihood phylogenetic inference. en epa-ng: massively parallel evolutionary placement of genetic sequences. en genesis and gappa: processing, analyzing and visualizing phylogenetic (placement) data pargenes: a tool for massively parallel model selection and phylogenetic tree inference on thousands of genes. en modeltest-ng: a new and scalable tool for the selection of dna and protein evolutionary models. 
en tobaniviridae (t), and roniviridae (r). b distribution of pair-wise sequence identities for rdrp sequences within and between distinct taxa at species, subgenus and genus rank, respectively. c distribution of pair-wise rdrp identities for coronaviridae genera hidden markov model (hmm) protein domain matches from the rdrp containing contigs or reference sequences for 47 exemplar operational taxonomic units (otus) grouped by genus extended figure 6: newly characterised deltavirus genomes genome structure and organisation of the five deltaviruses (pmacdv srr7910143; mmondv srr2136906; ovirdv srr4256033; tgutdv srr5001850; and ichidv srr8954566) and one deltavirus-like (bgladvl srr8242383; for which we could not identify a ribozyme sequence) sequence identified in our study. each circular rna virus shows characteristic rod-like genome folding and low free-energy (δg), similar to a hepatitis delta virus positive control, and two ribozymes and the serratus project is an initiative of the hackseqrna genomics hackathon (https://www.hackseq.com). we would like to thank the many contributors for code snippets and bioinformatic discussion; e. erhan, j. chu, i. birol, k. wellman, c. xu, m. huss, k. ha, e. nawrocki, r. mclaughlin, c. morgan-lang, c. blumberg, and the j. brister lab. a. rodrigues, s. mcmillan, v. wu, c. kennet, k. chao, and n. pereyaslavsky for aws support. we would also like to thank the j. joy lab, g. mordecai, j. taylor, s. roux, l. bergner, r. orton, and d. streicker for virology discussions. we are grateful to the entire team managing the ncbi sra. ta is grateful for the advanced research computing resource at the university of british columbia. pb was financially supported by the klaus tschira foundation, rc by anr transipedia and inception grants (pia/anr-16-conv-0005, anr-18-ce45-0020), ak and dm were supported by the russian science foundation (grant 19-14-00172) and computation was carried out in part by resource centre "computer centre of spbu". 
ak and dm are grateful to saint petersburg state university for the overall support of this work (project id: 51555639). project support and computing resources were kindly provided by the university of british columbia community health and wellbeing cloud innovation centre, powered by aws. and special thanks to our patient and understanding partners. ab conceived and led the study. ab and jt designed and implemented the serratus architecture. ab and rce constructed the viral pangenomes and panproteomes. rce developed the serratax and summarizer modules. pb developed the serraplace tree placement and taxonomy prediction code and calculated maximum likelihood trees. ta developed the darth annotation pipeline and submitted the annotated genomes to ena. dm and ak developed the coronaspades assembler. rc implemented the assembly pipeline, and deployed the assembly and annotation pipeline. ab, vl, and dl designed and developed https://serratus.io and the sql server. ab and gn developed the tantalus r package. ab, rce, ta, pb, dm, ak, and rc analysed the coronavirus and deltavirus data. bas and jb designed the phage panproteome, assembled phage genomes, and conducted phylogenetic analyses. all authors contributed to data interpretation and writing the manuscript. correspondence should be addressed to ab. does not apply. key: cord-342785-55r01n0x authors: lemmon, gordon h; gardner, shea n title: predicting the sensitivity and specificity of published real-time pcr assays date: 2008-09-25 journal: ann clin microbiol antimicrob doi: 10.1186/1476-0711-7-18 sha: doc_id: 342785 cord_uid: 55r01n0x background: in recent years real-time pcr has become a leading technique for nucleic acid detection and quantification. these assays have the potential to greatly enhance efficiency in the clinical laboratory. 
choice of primer and probe sequences is critical for accurate diagnosis in the clinic, yet current primer/probe signature design strategies are limited, and signature evaluation methods are lacking. methods: we assessed the quality of a signature by predicting the number of true positive, false positive and false negative hits against all available public sequence data. we found real-time pcr signatures described in recent literature and used a blast search based approach to collect all hits to the primer-probe combinations that should be amplified by real-time pcr chemistry. we then compared our hits with the sequences in the ncbi taxonomy tree that the signature was designed to detect. results: we found that many published signatures have high specificity (almost no false positives) but low sensitivity (high false negative rate). where high sensitivity is needed, we offer a revised methodology for signature design which may designate that multiple signatures are required to detect all sequenced strains. we use this methodology to produce new signatures that are predicted to have higher sensitivity and specificity. conclusion: we show that current methods for real-time pcr assay design have unacceptably low sensitivities for most clinical applications. additionally, as new sequence data becomes available, old assays must be reassessed and redesigned. a standard protocol for both generating and assessing the quality of these assays is therefore of great value. real-time pcr has the capacity to greatly improve clinical diagnostics. the improved assay design and evaluation methods presented herein will expedite adoption of this technique in the clinical lab. real-time pcr assays are gaining popularity as a clinical tool for detecting and quantifying the presence of both viral and bacterial pathogens, as reviewed in [1]. compared to traditional culturing methods used in identification, real-time pcr is fast and cost-effective. 
in addition, it can be quantitative and sensitive, in some cases greatly exceeding the sensitivity for conventional testing methods. commercially distributed kits are available for pcr-based pathogen diagnostics, and pcr is no longer thought of merely as confirmatory to culture. however, real-time pcr assays are limited by the quality of the primers and probes chosen. these primers and probes must be sensitive enough to match all target organisms yet specific enough to exclude all others. a common approach to developing a primer/probe combination is by using commercial software such as primerexpress ® (applied biosystems, foster city, ca, usa). this software asks the user to upload a dna sequence file, and then finds possible primer/probe sets that meet the assay criteria. generally a researcher will provide as input a gene region conserved throughout the taxa that the assay is being designed to detect. the software then provides possible primer/probe sets. the researcher chooses a representative signature. if there are single nucleotide polymorphisms (snps) within the chosen conserved region, a signature with consensus primers and probes is often chosen. next a blast [2] search is performed to ensure that the primers are not hitting other targets. finally the signature is verified in vitro with laboratory strains. while this design approach may work acceptably well in the research laboratory, the clinical laboratory calls for a more thorough analysis to ensure detection of novel, diverse, and uncommon strains. these may appear, for example, as a result of spread by foreign travel or migration. whole genome based automated signature design [3] presents a great improvement to the common method. however, in addition to better design strategies, methods for automated signature evaluation are needed. as additional sequence data becomes available, it is necessary to regularly reassess the predicted efficacy of a given signature. 
this analysis must include the predicted false negative and false positive rates for the developed signatures, and consider all available public sequence data. we have analyzed a number of real-time pcr assays found in the literature based on public sequence data. herein we report how well these signatures performed, offer a revised approach to pcr assay design, and use this approach to produce new assays predicted to have higher sensitivity and specificity. the literature was combed for recently published articles reporting real-time pcr assays for the clinical detection of bacterial and viral taxa. the primer and probe sequences were accumulated, with a preference for taqman assays. however, 3 intercalating dye assays were also selected. papers reporting nucleotide sequences that could not easily be copied from an online source were avoided. in total, 112 signatures from 32 papers were analyzed. local oracle databases have been constructed from the complete genome sequence data available at ncbi genbank, tigr, embl, img (jgi), and baylor hgsc. we used our "all_virus" and "all_bacteria" databases to find signature matches and predict false negatives and false positives. these databases were designed to contain only whole genomes and whole segments from segmented genomes. however, the heuristics used to separate whole genomes from partial sequences are not fail-proof due to inconsistency in sequence annotation within the public databases. consequently many sequences in these databases may show up as false negatives when they are actually just a section or segment of a genome that is not expected to contain the signature, and we manually sorted these sequences into true or false negatives. a freely available real time pcr analysis tool called taqsim [4] was used to find public sequences that would match the primer/probe assay in question. taqsim uses blast searches to find sequences that match both forward and reverse primers and probe. 
to be reported as a "hit" the primers and probe must match in the required orientations relative to one another and the primers must be in sufficiently close proximity. the forward/reverse primers may fall on either the plus or minus strand, so long as the orientation relative to one another is appropriate. there may not be mismatches at the 3' end of either primer. for each hit taqsim calculates the primer and probe melting temperatures as bound to the candidate hit sequence (accounting for mismatches) based on reaction conditions (reagent concentrations and hybridization temperature), and returns sequences predicted to be amplified. instead of replicating the various exact reaction conditions reported in each paper, very lenient settings were applied in all cases, essentially removing the screen for primer/probe vs. candidate hit tm by setting this threshold to 0 kelvin, and instead checking for specificity by requiring that hits have fewer than 3 mismatches per primer or probe. taqsim's predicted sequence hits were compared with sequences listed under a given set of ncbi taxonomy tree nodes. for instance, if a signature was reported to detect hepatitis b, then its set of taqsim hits would be compared with the set of sequences under node 10407, corresponding to hepatitis b virus. sequences in both sets were considered true positives, sequences in the taqsim output that were missing from the chosen taxonomy nodes were considered false positives, and sequences that were in the taxonomy tree but missing from the taqsim output were considered false negatives. test statistics such as specificity and sensitivity (power) were then calculated. in this paper we define sensitivity and specificity as follows: the primary research articles were read carefully to determine what the authors had designed their primers/probes to detect. ncbi taxonomy nodes were chosen to represent these target organisms. 
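the mismatch screen and the true positive / false positive / false negative bookkeeping described above can be sketched in a few lines of python. this is a minimal illustration under stated assumptions, not the taqsim implementation: the function names are hypothetical, each binding site is assumed to be given already aligned and in the same 5'-to-3' sense as the oligo, the "3' end" check is assumed to cover only the final base, and sensitivity is taken to be the standard tp / (tp + fn). specificity is left out because it would need a count of true negatives, which the text here does not specify.

```python
from typing import Optional, Set, Tuple

MAX_MISMATCHES = 2  # "fewer than 3 mismatches per primer or probe"


def count_mismatches(oligo: str, site: str) -> int:
    """Count positions where an oligo differs from its aligned binding site."""
    if len(oligo) != len(site):
        raise ValueError("oligo and site must be aligned to the same length")
    return sum(a != b for a, b in zip(oligo.upper(), site.upper()))


def oligo_passes(oligo: str, site: str, check_three_prime: bool) -> bool:
    """Apply the lenient screen: at most 2 mismatches per oligo.

    For primers (check_three_prime=True) the last base must also match.
    Checking only the final base is an assumption made for this sketch.
    """
    if count_mismatches(oligo, site) > MAX_MISMATCHES:
        return False
    if check_three_prime and oligo[-1].upper() != site[-1].upper():
        return False
    return True


def classify_hits(hits: Set[str], targets: Set[str]) -> Tuple[Set[str], Set[str], Set[str]]:
    """Compare predicted hits with the sequences under the chosen taxonomy nodes."""
    true_pos = hits & targets    # in both sets
    false_pos = hits - targets   # predicted hit, but not a target sequence
    false_neg = targets - hits   # target sequence that was not hit
    return true_pos, false_pos, false_neg


def sensitivity(true_pos: Set[str], false_neg: Set[str]) -> Optional[float]:
    """Standard sensitivity TP / (TP + FN); None when there are no targets."""
    total = len(true_pos) + len(false_neg)
    return len(true_pos) / total if total else None


if __name__ == "__main__":
    # Hypothetical accession labels, for illustration only.
    tp, fp, fn = classify_hits({"seqA", "seqB", "seqX"}, {"seqA", "seqB", "seqC", "seqD"})
    print(sorted(tp), sorted(fp), sorted(fn), sensitivity(tp, fn))
```

using plain set operations keeps the three categories mutually exclusive by construction, which matches the way the hits and the taxonomy node sequences are compared in the text.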
this was not a trivial task, since many articles lack clarity as to which taxa, specifically, their assay should detect. for instance, the cytomegalovirus assay did not detect all sequences in the cytomegalovirus genus (taxonomy node 10358), but rather all sequences in the human herpesvirus 5 species (taxonomy node 10359). none of the articles specified a taxonomy node for their signatures. perl scripting [5] was used to help compare blast hits and taxonomy node sequences, and count false negative, false positive and true positive sequence matches. however some sequences required hand sorting due to the wide array of sequence types and annotations. these often represented segmented genomes, in which case many of the would-be false negative sequences simply represent a different segment than that on which the signature lands, so we manually tabulated them as true negatives. they may also represent plasmids. in these situations, a careful review of the genbank entry, and sometimes of the primary article cited by genbank, was necessary to determine if the sequence of interest was truly a false negative. although we attempt to include only complete genomes in our sequence database, because of inconsistencies in the annotation of sequence data some partial sequences nevertheless make it into our databases. any of these partial cds's documented as containing the target gene on which the signature was supposed to land were counted as false negatives, but those partial cds's not documented to contain the target gene were eliminated from the false negative pool because it is possible the signature could land on the unsequenced section with the target gene. our database also contains "glued fragments", which represent draft genomes "glued" together with hundreds of "n"s as a simple way to keep the separate contigs associated as part of the same genome. 
while we report false negatives from these draft genomes, it is possible that the signatures could land on gaps between the contigs, and that finished sequencing could result in re-classification as a true positive. tables 1, 2, 3, 4, 5 summarize our analysis of various dna signatures. details of all true positive, false positive and false negative sequences are available from the authors. note that these results are in silico results; no laboratory testing was performed for verification, so that by stating that an organism is "detected" we mean that this is our prediction based on sequence data. a few notes of interest concerning the data in the tables are described below: the two human corona virus strains, 229e and oc43 are a frequent cause of the common cold [6] . a taqman assay for 229e was predicted to perform perfectly, while an assay for oc43 turned up a number of false positives, all of which were animal corona viruses. a coxsackie b3 virus assay [7] performed well, but a coxsackie b4 assay [8] hit many other human coxsackie, echo, and entero viruses. four out of 5 false negatives for a marburg virus assay [9] were of the lake victoria variety. false negatives associated with a yellow fever signature [10] included trinidad, french neurotropic, french viscerotropic, and vaccine strains. the filoviridae (ebola/marburg) assay [10] detected only ebola viruses. staphylococcus aureus [11] and enterobacteriaceae assays [12] had low sensitivity. an escherichia coli assay [12] hit shigella and vibrio sequences. many of these signatures [13] had high sensitivity. combining several of them into a multiplex assay would probably improve sensitivity further. these signatures were designed using a minimal set clustering approach [14] . while individual signatures have decent sensitivity, combining several signatures in one assay, as advocated in the publication greatly improved sensitivity. 
the signatures for hepatitis a are currently undergoing laboratory screening by the fda, and are performing well (g. hartman, personal communication). several reported signatures produced no predicted hits. these include assays for several flaviviruses [10, 15, 16], and 16s rrna assays [12] for several bacteria. examination of blast output showed that in these cases either a primer or internal oligo (probe) did not have blast hits to target, there were more mismatches per primer or probe sequence than the threshold specified in our analyses, or there were mismatches at the 3' end of a primer relative to target. (table note: all but three are taqman signatures; the three intercalating dye type assays are bolded.) it is possible that if the sequences of the samples used in the laboratory differ from available genomic data, or if the pcr reaction conditions are performed at low stringency (e.g. low annealing temperatures or high salt concentrations) these assays could in fact work in the laboratory. however, according to the genomic data available, a better match of primers and probes to target is possible and is usually desired for high sensitivity detection. targeting a number of the organisms for which currently published signatures were predicted to perform poorly, as well as some for which additional signatures may be desired (even though published signatures may perform well), we generated new signatures using minimal set clustering (msc) according to methods previously described [14, 17]. msc begins by removing non-unique regions from consideration as primers or probes from each of the target sequences relative to a database of nontarget bacterial and viral sequences. the remaining unique regions of each target sequence are mined for all or many candidate signatures, without regard for conservation among other targets, yet satisfying user specifications for primer and probe length, tm, gc%, avoidance of long homopolymer runs, and amplicon length. 
all candidate signatures are compared to all targets and clustered by the subset of targets they are predicted to detect. signatures within a given cluster are equivalent, in that they are predicted to detect the same subset of targets, so by clustering we reduce the redundancy and size of the problem to finding a small set of signatures that detect all targets. nevertheless, finding the optimal solution of the fewest clusters to detect all targets is an np-complete problem, so for large data sets we use a greedy algorithm to find a small number of clusters that together should pick up all targets. in the supplementary table, we often provide more than one alternative signature to detect a given equivalence group of genomes to serve as a backup should a signature perform poorly in laboratory testing. some of the signatures may have mismatches to some of their intended targets, although these mismatches are not predicted to reduce the tm of primer/probe hybridizing to target below typical taqman reaction conditions. none of these computationally predicted signatures have been screened in the laboratory, as this is beyond the scope of this paper. year target pseudomonas aeruginosa, escherichia coli, and neisseria meningitidis. as expected we found that false negatives were much more common than false positives. though signatures are generally based on conserved gene regions, they often fail to take into account all of the variation within a target set of organisms. this may be because the signatures were developed using sequence data from a handful of strains, rather than a thorough study of all strains publicly available. these false negatives may also represent sequences that have become available since the publication of the given signature. since new sequence data is made available at an ever increasing rate, there is great benefit in re-evaluating clinically used dna signatures regularly. 
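the clustering and greedy selection steps of minimal set clustering described above can be illustrated with a short python sketch. it is not the authors' implementation: the function names and the example signature and genome labels are hypothetical, and the greedy rule used here (repeatedly pick the cluster that covers the most still-undetected targets) is the textbook set-cover heuristic, assumed for illustration because the text only states that a greedy algorithm is used.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, Iterable, List, Set, Tuple


def cluster_signatures(detects: Dict[str, Set[str]]) -> Dict[FrozenSet[str], List[str]]:
    """Group signatures that are predicted to detect exactly the same targets."""
    clusters: Dict[FrozenSet[str], List[str]] = defaultdict(list)
    for signature, targets in detects.items():
        clusters[frozenset(targets)].append(signature)
    return dict(clusters)


def greedy_cover(
    clusters: Dict[FrozenSet[str], List[str]], all_targets: Iterable[str]
) -> Tuple[List[FrozenSet[str]], Set[str]]:
    """Pick clusters greedily until every target is covered or no cluster helps.

    Returns the chosen clusters (as target sets) and any targets left uncovered.
    """
    uncovered = set(all_targets)
    chosen: List[FrozenSet[str]] = []
    while uncovered and clusters:
        # Cluster that detects the most targets not yet covered.
        best = max(clusters, key=lambda targets: len(targets & uncovered))
        gained = best & uncovered
        if not gained:
            break  # remaining targets are detected by no candidate signature
        chosen.append(best)
        uncovered -= gained
    return chosen, uncovered


if __name__ == "__main__":
    # Hypothetical candidate signatures and the genomes each is predicted to detect.
    detects = {
        "sig1": {"g1", "g2"},
        "sig2": {"g1", "g2"},      # equivalent to sig1, so same cluster
        "sig3": {"g3"},
        "sig4": {"g2", "g3", "g4"},
    }
    clusters = cluster_signatures(detects)
    chosen, uncovered = greedy_cover(clusters, {"g1", "g2", "g3", "g4"})
    for targets in chosen:
        print(sorted(targets), "<-", clusters[targets])
    print("uncovered:", sorted(uncovered))
```

because every signature in a cluster detects the same targets, any member of a chosen cluster can serve as the primary signature and the others as backups, which is consistent with the alternative signatures mentioned for the supplementary table.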
when new sequence data leads to false negative predictions for a signature, one of two explanations can be given. the new sequences either represent recently recognized variation that has been around since the time the signature was published, or new variation, the result of mutation and natural selection. in either case, an improved or additional signature should be designed. high false positive or false negative rates do not necessarily indicate a "bad" dna assay. the quality of an assay must be considered in light of the milieu in which the testing will take place. in the clinical laboratory, a signature with high sensitivity but perhaps low specificity may be preferred over a test with lower sensitivity in cases where the putative pathogen requires immediate treatment or may spread quickly. the case of antibiotic resistant bacteria probably falls in this category. on the other hand, the nation's basis and biowatch programs insist on zero false positives, so as to avoid public disturbances due to false alarms, while still aiming for zero false negatives [18]. one must also consider the type of false negative and false positive results to determine their relevance. for instance, in this article the false positives for the human corona virus oc43 assay [6] were all animal corona viruses. what about such a match in a clinical lab in africa? on the other hand, the echovirus sequences that the coxsackie b4 assay [8] can detect could produce misleading results in any clinical lab. the false negative and false positive rates presented in this study may vary substantially from those seen empirically. this is because the strains available in a laboratory may differ significantly from the sequence data available, or because the empirical protocol is more or less stringent than the sequence-based requirements we imposed, which allowed no more than 2 mismatches per primer or probe for detection. 
we believe that as more target sequences become available, our predicted false negative rates will tend to increase for a given published signature both as a result of better sampling of diversity and as a result of failure to detect newly evolved variants. it has been estimated that a minimum of 3-4 genomes are needed in order to computationally design taqman pcr signatures likely to detect most strains, with those isolates chosen for sequencing that have been selected to span gradients of geographic, phenotypic, and temporal variation [19]. even more than 4 genomes are needed for particularly diverse organisms. thus, older signatures may not perform as well as newly developed signatures from the most up-to-date sequence data. a future study of interest would be a longitudinal look at how these rates continue to change over time as additional sequences become available. this study could be performed retrospectively, since sequence submission dates are easily obtained from public databases. we also hypothesized that the wider the intended scope of a signature, the lower its sensitivity would tend to be. the point is illustrated loosely in our data tables. twenty-six of the 28 signatures with fewer than 10 publicly available target sequences had sensitivities of 1 (i.e. zero false negatives), while signatures with 10 or more targets had an average sensitivity of 0.710116. however this approach only considers scope in the context of sequence data available. we tried to demonstrate the relationship between specificity and scope at a more fundamental level by grouping signatures by the taxonomic level of their target as shown in figure 1. however the results are misleading. in virology, taxonomic level is not a good indicator of nucleotide diversity. for instance, there is more diversity in the influenza a species than there is in the entire filoviridae family, which consists of only two known genera: ebola-like viruses and marburg viruses. 
a better approach might be to calculate nucleotide diversity as a function of phylogenetic branch length or shared k-mer clusters within a target taxonomy node. finally, we averaged the sensitivities of microbes by genome type as shown in table 6. note that the ssrna-rt category includes only hiv-1. this table demonstrates that creating signatures with high sensitivity becomes more difficult for target organisms with high mutation rates. current real-time pcr assay design approaches produce signatures with sensitivities generally too low for clinical use. we suggest that a rigorous approach involving false positive and false negative analysis should be the standard by which an initial assessment of signature quality is made. signatures must also regularly be reassessed as sequence data becomes available. for targets with wide nucleotide diversity, it becomes necessary to develop a set of signatures, for which we suggest a minimal set clustering approach that may also include signatures with degenerate/inosine bases. supplementary file newrealtimepcrsigatures: fifty seven taqman pcr primer/probe combinations we predict to have higher sensitivity/specificity than current published assays [http://www.biomedcentral.com/content/supplementary/1476-0711-7-18-s1.doc]. figure 1: sensitivity by taxonomy level. each colored diamond represents a real-time pcr assay examined in this paper. black bars indicate the mean, grey bars indicate the median. top and bottom of each box indicates 75th and 25th percentiles, and grey lines at whisker ends denote min and max values. the wide ranging sensitivities demonstrate both inconsistency in genetic diversity at a given taxonomy level, and inconsistency in signature design approaches. japanese encephalitis virus too many mismatches in either forward or reverse primer. 
several strains have 3 mismatches at 3' end of forward primer in addition to internal mismatches reverse primer only has a blast hit to one strain (angola71) saint louis encephalitis virus 2007 too many mismatches in the reverse primer, with 3 mismatches at 3' end as well as at other locations dengue virus 1 2006 reverse primer does not have any blast hits to target dengue virus 2 2006 forward primer has 3 or 12 mismatches at 3' end for most strains, the probe has blast hits to only 7 of the 57 genomes available, and reverse primer only has a blast hit to 1 genome dengue virus 4 2006 too many mismatches in forward primer and in some cases the probe too many mismatches in forward primer. however, they are at the 5' end, so assay could still work for some strains pseudomonas aeruginosa 2004 no blast hits of probe to pseudomonas aeruginosa probe is not between or even in close proximity to the forward and reverse primers stenotrophomonas maltophilia 2004 no blast hits of probe to stenotrophomonas maltophilia probe only matches in 17 of 22 bases, which is unlikely to give a strong signal, since probe is unlikely to bind prior to the primers as desired for real time taqman chemistry. 
we thank beth vitalis, jason smith, and tom slezak for helpful discussion and for encouraging this work, and kari allmon for entering the references. we gratefully acknowledge financial support from the intelligence technology innovation center. lawrence livermore national laboratory is operated by lawrence livermore national security, llc, for the u.s. department of energy, national nuclear security administration under contract de-ac52-07na27344. the authors declare that they have no competing interests. gl found real time pcr signatures in the literature, wrote perl scripts, and performed the analysis of published signatures. sg conceived of the research, designed new signatures, and provided guidance throughout the study. http://www.ann-clinmicrob.com/content/7/1/18
key: cord-330312-1pjolkql authors: liu, y.-t. title: infectious disease genomics date: 2017-01-20 journal: genetics and evolution of infectious diseases doi: 10.1016/b978-0-12-799942-5.00010-x sha: doc_id: 330312 cord_uid: 1pjolkql the history and development of infectious disease genomics have been closely associated with the human genome project (hgp) during the past 20 years. it has been emphasized since the beginning of the hgp that such effort must not be restricted to the human genome and should include other organisms including mouse, bacteria, yeast, fruit fly, and worm for comparative sequence analyses. a brief history is reviewed in this chapter. as of 2016, more than 7000 completed genome sequencing projects have been reported. one of the important motivations for these efforts is to develop preventative, diagnostic, and therapeutic strategies through the analysis of sequenced microorganisms, parasites, and vectors related to human health. a number of examples are discussed in this chapter.
the history and development of infectious disease genomics are closely associated with the human genome project (hgp) [1]. a series of important discussions about the hgp were held from 1985 to 1986 [1,2], which led to the appointment of a special national research council (nrc) committee by the national academy of sciences to address needs and concerns such as the project's impact, leadership, and funding sources. the committee recommended that the united states begin the hgp in 1988 [3]. they emphasized the need for technological improvements in the efficiency of gene mapping, sequencing, and data analysis capabilities. in order to understand potential functions of human genes through comparative sequence analyses, they also advised that the hgp must not be restricted to the human genome and should include model organisms including mouse, bacteria, yeast, fruit fly, and worm. in the meantime, the office of technology assessment (ota) of the us congress also issued a similar report to support the hgp [4]. in 1990, the department of energy (doe) and the national institutes of health (nih) jointly presented an initial 5-year plan for the hgp [5]. in october 1993, the sanger center/institute (hinxton, uk) was officially opened to join the hgp. the cost of dna sequencing was about $2 to $5 per base in 1990 and the initial aim was to reduce the costs to less than $0.50 per base before large-scale sequencing [5]. the sequencing cost gradually declined during the subsequent years. in 2004, the national human genome research institute (nhgri) challenged scientists to achieve a $100,000 human genome (3 gb/haploid genome) by 2009 and a $1000 genome by 2014 to meet the need of genomic medicine. in early 2014, illumina announced that the company would begin producing a new system to deliver full coverage human genomes for less than $1,000 [6]. the first complete genome to be sequenced was the phix174 bacteriophage (5.4 kb) by sanger's group in 1977.
[7] the complete genome sequence of sv40 polyomavirus (5.2 kb) was published in 1978 [8,9]. the human epstein-barr virus (170 kb) genome was determined in 1984 [10]. the first completed free-living organism genome was haemophilus influenzae (1.8 mb), sequenced through a whole-genome shotgun approach in 1995 [11]. the second sequenced bacterial genome, mycoplasma genitalium (600 kb), was completed in less than 1 month in the same year using the same approach [12]. the doe was the first to start a microbial genome program (mgp) as a companion to its hgp in 1994 [13]. the initial focus was on nonpathogenic microbes. along with the development of the hgp, there was exponential growth of the number of completely sequenced free-living organism genomes. the fungal genome initiative (fgi) [14] was established in 2000 to accelerate the slow pace of fungal genome sequencing since the report of the genome of saccharomyces cerevisiae in 1996 [15]. one of the major interests was to sequence organisms that are important in human health and commercial activities. with the explosion in the number of sequenced genomes, thanks to the development of next-generation sequencing methods, many genome-based studies have become popular. compared to 6 years ago when only 1100 completed genome projects were documented, the gold (genomes online database) contains information for 67,879 genome-sequencing projects, of which 7210 were completed, as of august 2015 [16,17]. the genomes of the human malaria parasite plasmodium falciparum and its major mosquito vector anopheles gambiae were published in 2002 [18,19]. historically, the effort to sequence the malaria genome began in 1996 by taking advantage of a clone derived from a laboratory-adapted strain [20]. notably, many parasites have complex life cycles that involve both vertebrate and invertebrate hosts and are difficult to maintain in the laboratory.
a few other important human pathogenic parasites, such as trypanosomes [21,22], leishmania [23], and schistosomes [24,25], have been either completely or partially sequenced [26,27]. in the meantime, the genome sequence of aedes aegypti, the primary vector for yellow fever and dengue fever, was published in 2007 [28]. the genome size (1376 mb) of this mosquito vector is about 5 times larger than the previously sequenced genome of the malaria vector a. gambiae. about 50% of the genome consists of transposable elements. in 2010, the genome sequence of the body louse (pediculus humanus humanus), an obligatory parasite of humans and the main vector of epidemic typhus (rickettsia prowazekii), relapsing fever (borrelia recurrentis), and trench fever (bartonella quintana), was reported [29]. its 108 mb genome is the smallest among the known insect genomes. subsequently, more vector genomes have been published [30-32]. genome-sequencing projects for other important human disease vectors are in progress [33,34]. these include culex pipiens (mosquito vector of west nile virus) and ixodes scapularis (tick vector of lyme disease, babesia and anaplasma). the challenge of sequencing the genome of an insect vector is much greater than that of sequencing a microbe. for example, the genome of ticks was estimated to be between 1 and 7 gb and may have a significant proportion of repetitive dna sequences, which may be a problem for genome assembly [35]. furthermore, the evolutionary distances among insect species may also affect homology-based gene predictions.
it is as important to understand the sequence diversity within a species as to perform a de novo sequencing of a reference genome from the perspective of human health. this is true for both hosts and pathogens [36,37]. the goal of the 1000 genomes project is to find most genetic variants that have frequencies of at least 1% in the human populations studied [38]. one of the similar efforts for human pathogens is the nih influenza genome sequencing project.
when this project began in november 2004, only seven human influenza h3n2 isolates had been completely sequenced and deposited in the genbank database [39,40]. as of may 2010, more than 5000 human and avian isolates had been completely sequenced, including the 1918 "spanish" influenza virus [41]. databases for human immunodeficiency virus (hiv) and hepatitis c virus have also been established. while most human studies of microbes have focused on the disease-causing organisms, interest in resident microorganisms has also been growing. in fact, it has been estimated that the human body is colonized by at least 10 times more prokaryotic and eukaryotic microorganisms than the number of human cells [42]. it was suggested to have "the 2nd human genome project" to sequence the human microbiome [43]. highly variable intestinal microbial flora among normal individuals has been well documented [44-46]. therefore, the human microbiome project (hmp) was initiated by the nih in late 2007 [47]. the analysis and data of 242 healthy adults at 15 (for males) or 18 (for females) body sites over 22 months were published in 2012 [48].
the completed or ongoing genome projects (table 10.1) provide enormous opportunities for the discovery of novel vaccines and drug targets against human pathogens as well as the improvement of diagnosis and discovery of infectious agents and the development of new strategies for invertebrate vector control. specific examples are provided to illustrate how the information provided by various genome projects may help achieve the goal of promoting human health.
meningococcal isolates produce one of 13 antigenically distinct capsular polysaccharides, but only five (a, b, c, w135, and y) are commonly associated with disease [49]. the polysaccharide capsule is important for meningococci to escape from complement-mediated killing.
while conventional vaccines consisting of the conjugation of capsular polysaccharides to carrier proteins for meningococcus serogroups a, c, y, and w-135 have been clinically successful, the same approach failed to produce a clinically useful vaccine for serogroup b (menb). the capsule polysaccharide (α2-8-n-acetylneuraminic acid) of menb is identical to human polysialic acid and is therefore poorly immunogenic [50]. alternatively, vaccines consisting of outer-membrane vesicles (omvs) have been successfully developed to control menb outbreaks in areas where epidemics are dominated by one particular strain [51-54]. the most significant limitation of this type of vaccine is that the immune response is strain specific, mostly directed against the porin protein, pora, which varies substantially in both expression level and sequence across strains [55,56]. with the completion of the genome sequence of a virulent menb strain, a "reverse vaccinology" approach was applied for the development of a universal menb vaccine by novartis [55,57,58]. through bioinformatic searching for surface-exposed antigens, which may be the most suitable vaccine candidates due to their potential to be readily recognized by the immune system, 570 open reading frames (orfs) were selected from a total of 2158 orfs of the mc58 genome. eventually, five antigens were chosen as the vaccine components based on a series of criteria including the ability of candidates to be expressed in escherichia coli as recombinant proteins (350 candidates), the confirmation of surface exposure by immunological analyses, the ability to induce protective antibodies in experimental animals (28 candidates), and the conservation of antigens within a panel of diverse meningococcal strains, primarily the disease-associated menb strains.
[55,58,59] the vaccine formulation consists of an fhbp-gna2091 fusion protein, a gna2132-gna1030 fusion protein, nada, and omvs from the new zealand menzb vaccine strain, which contains the immunogenic pora. initial phase ii clinical results in adults and infants showed that this vaccine could induce a protective immune response against three diverse menb strains in 89-96% of subjects following three vaccinations and 93-100% after four vaccinations [59]. this vaccine (bexsero) has been approved in the usa and in more than 30 other countries [60].
natural products, especially microbial secondary metabolites, are an important source of bioactive compounds. actinomycetes have been a main source of natural-product discovery in bacteria. consequently, a high rediscovery rate of known compounds and scaffolds was inevitable with activity-based screening. genome mining of gene clusters that produce secondary metabolites has been a new approach to overcome this problem. for example, an antibiotic, clostrubin, was discovered through searching for novel compounds from clostridium beijerinckii due to the presence of several cryptic gene clusters for secondary metabolite biosynthesis [61]. genome mining starts with a genome-wide search for highly conserved members of the required biosynthesis gene cluster. computational programs that support the prediction of operons help to assign boundaries of newly identified biosynthesis gene clusters. a large-scale, high-throughput genome mining for the genetic potential for producing phosphonic acids by screening more than 10,000 actinomycetes was achieved in 2015 [62]. it was believed that phosphonates would have greater potential to become pharmaceuticals, with a past commercialization rate of 15% (3/20), such as fosfomycin, compared to the 0.1% average for natural products as a whole.
[63,64] in addition, bioinformatic discovery of phosphonate biosynthetic loci has been well established, as all but two previously characterized phosphonate biosynthetic pathways start with phosphoenolpyruvate (pep) mutase that is encoded by pepm. among 10,000 actinomycetes, only 278 strains were confirmed to have pepm by polymerase chain reaction (pcr) screening and genome sequencing. a diverse collection of phosphonate biosynthetic gene clusters was identified within these strains. remarkably, 55 out of the 64 distinct clusters would direct the synthesis of unknown compounds. characterization of strains within five of these groups resulted in discovery of argolaphos and other interesting compounds, including valinophos and phosphonocystoximate. argolaphos showed broad-spectrum antibacterial activity against salmonella typhimurium, e. coli, and staphylococcus aureus.
targeting an essential pathway is a necessary but not sufficient requirement for an effective antimicrobial agent [65]. identification of essential genes in a completely sequenced genome has been actively pursued with various approaches [66,67]. the indispensable fatty acid synthase (fas) pathway in bacteria has been regarded as a promising target for the development of antimicrobial agents [68]. the subcellular organization of the fatty acid biosynthesis components is different between mammals (type i fas) and bacteria (dissociated type ii fas), which raises the likelihood of host specificity of the targeting drugs. comparison of the available genome sequences of various species of prokaryotes reveals highly conserved fas ii systems, suggesting that the antimicrobial agent can be broad spectrum [69]. in addition, through computational analyses, new members of the fas ii system have been discovered in different bacterial species [70,71]. one of the protein components in this system, fabi, is the target of an antituberculosis drug, isoniazid, and a general antibacterial and antifungal agent, triclosan.
[72-74] through a systematic screening of 250,000 natural product extracts, a merck team identified a potent and broad-spectrum antibiotic, platensimycin, which is derived from streptomyces platensis and is a selective fabf/b inhibitor in the fas ii system [75]. treatment with platensimycin eradicated s. aureus infection in mice. platensimycin showed no cross-resistance with other antibiotic-resistant strains in vitro, including methicillin-resistant s. aureus, vancomycin-intermediate s. aureus, and vancomycin-resistant enterococci. no toxicity was observed using a cultured human cell line and the activity of platensimycin was not affected by the presence of human serum in this study. however, the fas ii system appears to be dispensable for another gram-positive bacterium, streptococcus agalactiae, when exogenous fatty acids are available, such as in human serum [65,76]. the susceptibility to inhibitors targeting the fas ii system indicates heterogeneity in fatty acid synthesis or in acquiring exogenous fatty acids among gram-positive pathogens [76]. comparative genomic approaches may be useful to identify and develop a strategy to target the salvage pathway for s. agalactiae. alternatively, similar approaches as described earlier for the menb vaccine may also be applied for s. agalactiae (group b streptococcus) [77].
malaria resistant to chloroquine in the 1950s and to sulfadoxine-pyrimethamine in the 1960s emerged in western cambodia and spread through the greater mekong subregion (gmsr, including cambodia, lao, myanmar, thailand, and vietnam) and to africa. the finding of artemisinin-resistant malaria in cambodia and gmsr raised a concern regarding the global spread of these parasites. while a number of studies, including population genetics and laboratory-based investigations, were conducted, no reliable molecular marker was identified until the major breakthrough reported in early 2014.
[78] clinical artemisinin resistance has been defined as a reduction of parasite-clearance rate, which is expressed as an increase of parasite-clearance half-life, or a persistence of microscopically detectable parasites 3 days after artemisinin-based combination therapy (act). although artemisinin was thought to have broad-stage specificity against malaria throughout the life cycle, it was shown that artemisinin-resistant parasites only had decreased artemisinin susceptibility at ring stages, which was demonstrated by the ring-stage survival assay (rsa 0-3 h) [79]. an in vitro laboratory-based approach was conducted at a time when population-based genome-wide association studies (gwas) did not clearly identify the genes responsible for artemisinin resistance [78]. for 5 years, an artemisinin-resistant f32-art5 parasite line was selected by culturing an artemisinin-sensitive f32-tanzania clone under a dose-escalating, 125-cycle regimen of artemisinin. eight mutations in seven genes were eventually selected from the result based on whole-genome sequence analysis of f32-art5 and f32-tem (its sibling clone cultured without artemisinin) at 460× and 500× average nucleotide coverage, respectively. to examine whether these in vitro selected mutations were associated with artemisinin resistance in cambodia, sequence polymorphisms in all seven genes were analyzed from 49 culture-adapted clinical isolates in relation to their rsa 0-3 h survival rates. only polymorphisms of one gene, k13-propeller, showed a significant association with rsa 0-3 h survival rates. in total, four mutant alleles, each harboring a single nonsynonymous snp (y493h, r539t, i543t, and c580y) within a kelch repeat of the c-terminal k13-propeller domain, were identified. to confirm that k13-propeller polymorphism is a molecular marker of clinical artemisinin resistance, parasite-clearance half-lives in patients were correlated with their k13 alleles.
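the parasite-clearance half-life mentioned above can be illustrated with a small calculation. the following is a minimal sketch, assuming that parasite density declines log-linearly after treatment, so that the half-life equals ln(2) divided by the clearance rate constant. the function name, the sampling times, and the counts are hypothetical and are not taken from the cited studies; estimators used in clinical work also handle lag phases and detection limits, which this sketch ignores.

```python
import math

def clearance_half_life(hours, densities):
    """Estimate parasite-clearance half-life (hours) from serial counts.

    Fits ln(density) = a - k * t by ordinary least squares and returns
    ln(2) / k. Assumes a purely log-linear decline (no lag phase, no tail).
    """
    if len(hours) != len(densities) or len(hours) < 2:
        raise ValueError("need at least two paired observations")
    logs = [math.log(d) for d in densities]
    n = len(hours)
    mean_t = sum(hours) / n
    mean_y = sum(logs) / n
    sxx = sum((t - mean_t) ** 2 for t in hours)
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in zip(hours, logs))
    k = -sxy / sxx  # clearance rate constant (per hour)
    if k <= 0:
        raise ValueError("parasite density is not declining")
    return math.log(2) / k

# Hypothetical patient: counts (parasites per microliter) every 6 h,
# generated with a true half-life of 3.3 h purely for illustration.
times = [0, 6, 12, 18, 24]
counts = [100000 * 0.5 ** (t / 3.3) for t in times]
half_life = clearance_half_life(times, counts)
print(round(half_life, 2))
```

under this simplification, a patient whose parasite density halves more slowly yields a longer half-life, which is the quantity compared between k13 alleles in the text.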
of the 150 patients, 72 carried parasites with a wild-type allele and the others carried parasites with only one of the three single nonsynonymous snps in the k13-propeller: c580y (n = 51), r539t (n = 6), and y493h (n = 21). the parasite-clearance half-life in patients with wild-type parasites was significantly shorter (median 3.30 h) than in those with these three mutant alleles (median 6.28-7.19 h). subsequently, clinical studies have validated the association between k13-propeller mutations and artemisinin resistance [80-82].
an early mathematical model for malaria control suggested that the most vulnerable element in the malaria cycle was survivorship of adult female mosquitoes [83,84]. therefore, insect control is an important part of reducing transmission. the use of ddt as an indoor residual spray in the global malaria eradication program from 1957 to 1969 reduced the population at risk of malaria to about 50% by 1975 compared with 77% in 1900 [83,85]. engineering genetically modified mosquitoes refractory to malaria infection appeared to be an alternative approach [86], given the environmental impact of ddt and the emergence of insecticide-resistant insects. the vector biology network (vbn) was formed in 1989 and had proposed a 20-year plan with the who in 2001 to achieve three major goals: (1) to develop basic tools for the stable transformation of anopheline mosquitoes by the year 2000, (2) to engineer a mosquito incapable of carrying the malaria parasite by 2005, and (3) to run controlled experiments to test how to drive the engineered genotype into wild mosquito populations by 2010 [87-89]. while some proof-of-concept experiments had been achieved for the first two aims by 2002, when the a. gambiae genome was completely sequenced [90,91], the progress has been relatively slow [92]. genomic loci of a. gambiae responsible for p. falciparum resistance have been identified through surveying a mosquito population in a west african malaria transmission zone.
[93] a candidate gene, anopheles plasmodium-responsive leucine-rich repeat 1 (apl1), was discovered. subsequently, other resistance genes have also been identified [94,95]. studying the genetic basis of resistance to malaria parasites and immunity of the mosquito vector will be important to control malaria transmission [96].
perhaps the most immediate impact of a completely sequenced pathogen genome is for infectious disease diagnosis. the information may be of great importance to public health when a newly emerged or reemerged pathogen is discovered. a few examples are described here. a novel swine-origin influenza a virus (s-oiv) emerged in the spring of 2009 in mexico and subsequently was discovered in specimens from two unrelated children in the san diego area in mid-april 2009 [97,98]. those samples were positive for influenza a but negative for both human h1 and h3 subtypes. the complete genome sequence and a real-time pcr-based diagnostic assay were released to the public in late april. the outbreak evolved rapidly and who declared the highest phase 6 worldwide pandemic alert on june 11, 2009. s-oiv has three genome segments (ha, np, and ns) from the classic north american swine (h1n1) lineage, two segments (pb2 and pa) from the north american avian lineage, one segment (pb1) from the seasonal h3n2, and most notably, two segments (na and m) from the eurasian swine (h1n1) lineage [98]. with the available influenza genome database, diagnostic assays to distinguish previous seasonal h1n1, h3n2, and s-oiv can be easily accomplished [99].
a comprehensive pathogen genome database is not only useful for infectious disease diagnosis but also for novel pathogen discovery [100]. homologous sequences within the same family or among different family members are important for new pathogen identification even with the advent of third-generation sequencing technology.
[101] de novo pathogen discovery may also be complicated by coexisting microorganisms, such as commensal bacteria in the human body. without prior knowledge of these microorganisms, one may be misled. in 2003, a microarray-based assay, designated virochip, was used to help discover the sars coronavirus [102]. the virochip contained the most highly conserved 70mer sequences from every fully sequenced reference viral genome in genbank. the computational search for conservation was performed across all known viral families. a microarray hybridized with a reaction derived from a viral isolate cultivated from a sars (severe acute respiratory syndrome) patient revealed that the strongest hybridizing array elements belonged to the families astroviridae and coronaviridae. alignment of the oligonucleotide probes having the highest signals showed that all four hybridizing oligonucleotides from the astroviridae and one oligonucleotide from avian infectious bronchitis virus, an avian coronavirus, shared a core consensus motif spanning 33 nucleotides. interestingly, it had been known previously through bioinformatics analyses that this sequence is present in the 3' utr of all astroviruses, avian infectious bronchitis virus, and an equine rhinovirus [103]. therefore, a new member of the coronavirus family was identified through the unique hybridizing pattern and subsequent confirmations.
the finding of the seventh human oncogenic virus, merkel cell polyomavirus (mcv) [104], in 2008 is another example of why conserved sequences are important for novel pathogen discovery. mcv is the etiological agent of merkel cell carcinoma (mcc), which is a rare but aggressive skin cancer of neuroendocrine origin. two cdna libraries derived from mcc tumors were subjected to high-throughput sequencing by a next-generation roche/454 sequencer. nearly 400,000 sequence reads were generated. the majority (99.4%) of the sequences derived from human origin were removed from further analyses.
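the subtraction step described above, in which reads of human origin are discarded before the search for viral sequences, can be sketched in a few lines. this is only an illustrative toy, assuming exact k-mer matching against a small host sequence as a stand-in for comparison against human sequence databases; it is not the procedure used in the cited study. the function names, the k-mer length, the threshold, and all sequences are hypothetical.

```python
def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def subtract_host_reads(reads, host_sequences, k=8, max_host_fraction=0.5):
    """Keep reads whose k-mers mostly do not occur in the host sequences."""
    host_index = set()
    for seq in host_sequences:
        host_index |= kmers(seq, k)
    kept = []
    for read in reads:
        read_kmers = kmers(read, k)
        if not read_kmers:
            continue  # read shorter than k: cannot be classified, so drop it
        host_fraction = len(read_kmers & host_index) / len(read_kmers)
        if host_fraction <= max_host_fraction:
            kept.append(read)
    return kept

# Toy data (hypothetical): one read copied from the host, one unrelated read.
host = "ATGGCGTACGTTAGCCTAGGATCCGATTACAGGCTTAACGGA"
host_read = host[5:30]
viral_like_read = "TTTTCCCCAAAAGGGGTTTTCCCCA"
kept = subtract_host_reads([host_read, viral_like_read], [host])
print(kept)
```

only the reads that survive such a filter need to be compared against known viral sequences, which is why removing 99.4% of nearly 400,000 reads leaves a few thousand candidates to examine.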
only one of the remaining 2395 cdnas was homologous to the t antigen of two known polyomaviruses. one additional cdna was subsequently identified to be part of the mcv sequence when the complete viral sequence was known. later analyses showed that 80% (8/10) of the mcc tumors had mcv integrated in the human genome. monoclonal viral integration was revealed by the patterns of southern blot analysis. only 8-16% of control tissues had low copy number of mcv infection.
in 2015, an interesting and unexpected discovery of the malignant transformation of hymenolepis nana, a human tapeworm, in a human host was reported using conventional and next-generation sequencing approaches [104a]. initially, examination of a 41-year-old hiv-infected man revealed extensive lymphadenopathy. h. nana eggs and blastocystis hominis cysts were found in stool. the disease progressed to death despite antiparasitic and antiretroviral treatment. histological examination of biopsied lymph nodes revealed proliferative cells with overt malignant features. they were monomorphic with morphologic features characteristic of stem cells (a high nucleus-to-cytoplasm ratio). however, the small cell size (<10) suggested infection with an unfamiliar, possibly unicellular, eukaryotic organism. infection with a plasmodial slime mold rather than h. nana was considered because of the prominent syncytia formation and the primitive appearance of the atypical cells and the lack of architecture identifiable as tapeworm tissue. pcr screening suggested that these cells were h. nana. next-generation genome sequencing and comparative analysis revealed h. nana variants harboring mutations typically found in cancer.
as of 2016, next-generation sequencing technologies are gradually being applied for diagnosis and monitoring of infectious diseases, including genotypic resistance testing, direct detection of unknown disease-associated pathogens without culture, investigation of microbial population diversity in the host, and strain typing [105]. however promising, next-generation sequencing approaches for clinical diagnosis require further improvements in automation, standardization of technical and bioinformatic procedures, and other practical issues, such as costs and turnaround time. while we can expect that the efforts of a variety of genome projects may improve human health, the socioeconomic issues that are not discussed in this chapter may be substantial. in addition, the tremendous amount of information derived from these projects will also pose a challenge for scientists as well as nonscientists to follow and understand.
the human genome project: past, present, and future a turning point in cancer research: sequencing the human genome mapping and sequencing the human genome mapping our genes - genome projects: how big? how fast understanding our genetic inheritance, the u.s.
human genome project: the first five years: fiscal years 1991-1995 the $1,000 genome nucleotide sequence of bacteriophage phi x174 dna the genome of simian virus 40 complete nucleotide sequence of sv40 dna dna sequence and expression of the b95-8 epstein-barr virus genome whole-genome random sequencing and assembly of haemophilus influenzae rd history of microbial genomics microbial genome program mutation of the pik3ca gene in ovarian and breast cancer life with 6000 genes the genomes on line database (gold) in 2009: status of genomic and metagenomic projects and their associated metadata the genomes online database (gold) v.5: a metadata management system based on a four level (meta)genome project classification genome sequence of the human malaria parasite plasmodium falciparum the genome sequence of the malaria mosquito anopheles gambiae funding for malaria genome sequencing the genome sequence of trypanosoma cruzi, etiologic agent of chagas disease the genome of the african trypanosome trypanosoma brucei the genome of the kinetoplastid parasite, leishmania major the genome of the blood fluke schistosoma mansoni the schistosoma japonicum genome reveals features of host-parasite interplay helminth genomics: the implications for human health eupathdb: a portal to eukaryotic pathogen databases genome sequence of aedes aegypti, a major arbovirus vector genome sequences of the human body louse and its primary endosymbiont provide insights into the permanent parasitic lifestyle highly evolvable malaria vectors: the genomes of 16 anopheles mosquitoes genome of rhodnius prolixus, an insect vector of chagas disease, reveals unique adaptations to hematophagy and parasite infection genome sequence of the tsetse fly (glossina morsitans): vector of african trypanosomiasis vectorbase: a data resource for invertebrate vector genomics genomic resources for invertebrate vectors of human pathogens, and the role of vectorbase tick genomics: the ixodes genome project and beyond 
the genome gets personal - almost human genetics of infectious diseases: between proof of principle and paradigm a plan to capture human diversity in 1000 genomes race against time large-scale sequencing of human influenza reveals the dynamic nature of viral genome evolution characterization of the 1918 influenza virus polymerase genes microbial ecology of the gastrointestinal tract the meaning and impact of the human genome sequence for microbiology bacterial community variation in human body habitats across space and time diversity of the human intestinal microbial flora a core gut microbiome in obese and lean twins microbiology learning about who we are human microbiome project c. structure, function and diversity of the healthy human microbiome mechanisms of avoidance of host immunity by neisseria meningitidis and its effect on vaccine development an igg monoclonal antibody to group b meningococci cross-reacts with developmentally regulated polysialic acid units of glycoproteins in neural and extraneural tissues effect of outer membrane vesicle vaccine against group b meningococcal disease in norway vaccine against group b neisseria meningitidis: protection trial and mass vaccination results in cuba phase ii meningococcal b vesicle vaccine trial in new zealand infants efficacy, safety, and immunogenicity of a meningococcal group b (15:p1.3) outer membrane protein vaccine in iquique, chile. 
chilean national committee for meningococcal disease identification of vaccine candidates against serogroup b meningococcus by whole-genome sequencing effect of sequence variation in meningococcal pora outer membrane protein on the effectiveness of a hexavalent pora outer membrane vesicle vaccine complete genome sequence of neisseria meningitidis serogroup b strain mc58 a universal vaccine for serogroup b meningococcus vaccinology in the genome era lessons from reverse vaccinology for viral vaccine design discovery of clostrubin, an exceptional polyphenolic polyketide antibiotic from a strictly anaerobic bacterium discovery of phosphonic acid natural products by mining the genomes of 10,000 actinomycetes thoughts and facts about antibiotics: where we are now and where we are heading biosynthesis of phosphonic and phosphinic acid natural products type ii fatty acid synthesis is not a suitable antibiotic target for gram-positive pathogens identification of critical staphylococcal genes using conditional phenotypes generated by antisense rna global transposon mutagenesis and a minimal mycoplasma genome antibacterial targets in fatty acid biosynthesis the application of computational methods to explore the diversity and structure of bacterial fatty acid synthase a triclosan-resistant bacterial enzyme a new mechanism for anaerobic unsaturated fatty acid formation in streptococcus pneumoniae molecular basis of triclosan activity inhibiting bacterial fatty acid synthesis mycobacterium tuberculosis platensimycin is a selective fabf inhibitor with potent antibiotic properties essentiality of fasii pathway for staphylococcus aureus identification of a universal group b streptococcus vaccine by multiple genome screen a molecular marker of artemisinin-resistant plasmodium falciparum malaria reduced artemisinin susceptibility of plasmodium falciparum ring stages in western cambodia spread of artemisinin resistance in plasmodium falciparum malaria genetic architecture of 
artemisinin-resistant plasmodium falciparum spread of artemisinin-resistant plasmodium falciparum in myanmar: a cross-sectional survey of the k13 molecular marker malaria management: past, present, and future the epidemiology and control of malaria the global distribution and population at risk of malaria: past, present, and future possible use of translocations to fix desirable genes in insect pest populations from tucson to genomics and transgenics: the vector biology network and the emergence of modern vector biology the mosquito genomeea breakthrough for public health malaria control with genetically manipulated insect vectors stable germline transformation of the malaria mosquito anopheles stephensi transgenic anopheline mosquitoes impaired in transmission of a malaria parasite malaria control with transgenic mosquitoes natural malaria infection in anopheles gambiae is regulated by a single genomic control region leucine-rich repeat protein complex activates mosquito complement in defense against plasmodium parasites dissecting the genetic basis of resistance to malaria parasites in anopheles gambiae mosquito defenses against plasmodium parasites swine influenza a (h1n1) infection in two childrenesouthern california, marcheapril emergence of a novel swine-origin influenza a (h1n1) virus in humans detection in 2009 of the swine origin influenza a (h1n1) virus by a subtyping microarray a technological update of molecular diagnostics for infectious diseases third-generation sequencing fireworks at marco island viral discovery and sequence recovery using dna microarrays a common rna motif in the 3' end of the genomes of astroviruses, avian infectious bronchitis virus and an equine rhinovirus clonal integration of a polyomavirus in human merkel cell carcinoma malignant transformation of hymenolepis nana in a human host next-generation sequencing for infectious disease diagnosis and management: a report of the association for molecular pathology database resources 
of the national center for biotechnology information ensembl genomes: extending ensembl across the taxonomic space the comprehensive microbial resource the microbial rosetta stone database: a compilation of global and emerging infectious microorganisms and bioterrorist threat agents genomic metadata for infectious agents, a geospatial surveillance pathogen database a catalog of reference genomes from the human microbiome the influenza virus resource at the national center for biotechnology information key: cord-338207-60vrlrim authors: lefkowitz, e.j.; odom, m.r.; upton, c. title: virus databases date: 2008-07-30 journal: encyclopedia of virology doi: 10.1016/b978-012374410-4.00719-6 sha: doc_id: 338207 cord_uid: 60vrlrim as tools and technologies for the analysis of biological organisms (including viruses) have improved, the amount of raw data generated by these technologies has increased exponentially. today's challenge, therefore, is to provide computational systems that support data storage, retrieval, display, and analysis in a manner that allows the average researcher to mine this information for knowledge pertinent to his or her work. every article in this encyclopedia contains knowledge that has been derived in part from the analysis of such large data sets, which in turn are directly dependent on the databases that are used to organize this information. fortunately, continual improvements in data-intensive biological technologies have been matched by the development of computational technologies, including those related to databases. this work forms the basis of many of the technologies that encompass the field of bioinformatics. this article provides an overview of database structure and how that structure supports the storage of biological information. 
the different types of data associated with the analysis of viruses are discussed, followed by a review of some of the various online databases that store general biological, as well as virus-specific, information. in 1955, niu and fraenkel-conrat published the c-terminal amino acid sequence of tobacco mosaic virus capsid protein. the complete 158-amino-acid sequence of this protein was published in 1960. the first completely sequenced viral genome published was that of bacteriophage ms2 in 1976 (genbank accession number v00642). sanger used dna from bacteriophage phix174 (j02482) in developing the dideoxy sequencing method, while the first animal viral genome, sv40 (j02400), was sequenced using the maxam and gilbert method and published in 1978. viruses therefore played a pivotal role in the development of modern-day sequencing methods, and viral sequence information (both protein and nucleotide) formed a substantial subset of the earliest available biological databases. in 1965, margaret o. dayhoff published the first publicly available database of biological sequence information. this atlas of protein sequence and structure was available only in printed form and contained the sequences of approximately 50 proteins. establishment of a database of nucleic acid sequences began in 1979 through the efforts of walter goad at the us department of energy's los alamos national laboratory (lanl) and separately at the european molecular biology laboratories (embl) in the early 1980s. in 1982, the lanl database received funding from the national institutes of health (nih) and was christened genbank. in december of 1981, the los alamos sequence library contained 263 sequences of which 50 were from eukaryotic viruses and 12 were from bacteriophages. by its tenth release in 1983, genbank contained 1865 sequences (1 827 214 nucleotides) of which 449 (457 721 nucleotides) were viral. 
in august of 2006, genbank (release 154) contained approximately 59 000 000 records, including 367 000 viral sequences. the number of available sequences has increased exponentially as sequencing technology has improved. in addition, other high-throughput technologies have been developed in recent years, such as those for gene expression and proteomic studies. all of these technologies generate enormous new data sets at ever-increasing rates. the challenge, therefore, has been to provide computational systems that support the storage, retrieval, analysis, and display of this information so that the research scientist can take advantage of this wealth of resources to ask and answer questions relevant to his or her work. every article in this encyclopedia contains knowledge that has been derived in part from the analysis of large data sets. the ability to effectively and efficiently utilize these data sets is directly dependent on the databases that have been developed to support storage of this information. fortunately, the continual development and improvement of data-intensive biological technologies has been matched by the development and improvement of computational technologies. this work, which includes both the development and utilization of databases as well as tools for storage and analysis of biological information, forms a very important part of the bioinformatics field. this article provides an overview of database structure and how that structure supports the storage of biological information. the different types of data associated with the analysis of viruses are discussed, followed by a review of some of the various online databases that store general biological information as well as virus-specific information.
definition: a database is simply a collection of information, including the means to store, manipulate, retrieve, and share that information. for many of us, the lab notebook fulfilled our initial need for a 'database'. 
however, this information storage vehicle did not prove to be an ideal place to archive our data. backups were difficult, and retrieval more so. the advent of computers, especially the desktop computer, provided a new solution to the problem of data storage. though initially this innovation took the form of spreadsheets and electronic notebooks, the subsequent development of both personal and large-scale database systems provided a much more robust solution to the problems of data storage, retrieval, and manipulation. the computer program supplying this functionality is called a 'database management system' (dbms). such systems provide at least four things: (1) the necessary computer code to guide a user through the process of database design; (2) a computer language that can be used to insert, manipulate, and query the data; (3) tools that allow the data to be exported in a variety of formats for sharing and distribution; and (4) the administrative functions necessary to ensure data integrity, security, and backup. however, regardless of the sophistication and diverse functions available in a typical modern dbms, it is still up to the user to provide the proper context for data storage. the database must be properly designed to ensure that it supports the structure of the data being stored and also supports the types of queries and manipulations necessary to fully understand and efficiently analyze the properties of the data. the development of a database begins with a description of the data to be stored, all of the parameters associated with the data, and frequently a diagram of the format that will be used. the format used to store the data is called the database schema. the schema provides a detailed picture of the internal format of the database that includes specific containers to store each individual piece of data. 
while databases can store data in any number of different formats, the design of the particular schema used for a project is dependent on the data and the needs and expertise of the individuals creating, maintaining, and using the database. as an example, we will explore some of the possible formats for storing viral sequence data and provide examples of the database schema that could be used for such a project. figure 1 (a) provides an example of a genbank sequence record that is familiar to most biologists. these records are provided in a 'flat file' format in which all of the information associated with this particular sequence is provided in a human-readable form and in which all of the information is connected in some manner to the original sequence. in this format, the relationships between each piece of information and every other piece of information are only implicitly defined, that is, each line starts with a label that describes the information in the rest of the line, but it is up to the investigator reading the record to make all of the proper connections between each of the data fields (lines). the proper connections are not explicitly defined in this record. as trained scientists, we are able to read the record in figure 1 (a) and discern that this particular amino acid sequence is derived from a strain of ebola virus that was studied by a group in germany, and that this sequence codes for a protein that functions as the virus rna polymerase. the format of this record was carefully designed to allow us, or a computer, to pull out each individual type of information. however, as trained scientists, we already understand the proper connections between the different information fields in this file. the computer does not. therefore, to analyze the data using a computer, a custom software program must be written to provide access to the data. extensible markup language (xml) is another widely used format for storing database information. 
figure 1 (b) shows an example of part of the xml record for the ebola virus polymerase protein. in this format, each data field can be many lines long; the start and end of a data record contained within a particular field are indicated by tags made of a label between two angle brackets ('<' and '>'). unlike the lines in the genbank record in figure 1 (a), a field in an xml record can be placed inside of another, defining a structure and a relationship between them. for example, the tseq_orgname is placed inside of the tseq record to show that this organism name applies only to that sequence record. if the file contained multiple sequences, each tseq field would have its own tseq_orgname subfield, and the relationship between them would be very clear. this self-describing hierarchical structure makes xml very powerful for expressing many types of data that are hard to express in a single table, such as that used in a spreadsheet. however, in order to find any piece of information in the xml file, a user (with an appropriate search program) needs to traverse the whole file in order to pull out the particular items of data that are of interest. therefore, while an xml file may be an excellent format for defining and exchanging data, it is often not the best vehicle for efficiently storing and querying that data. that is still the realm of the relational database. 'relational database management systems' (rdbmss) are designed to do two things extremely well: (1) store and update structured data with high integrity, and (2) provide powerful tools to search, summarize, and analyze the data. the format used for storing the data is to divide it into several tables, each of which is equivalent to a single spreadsheet. the relationships between the data in the tables are then defined, and the rdbms ensures that all data follow the rules laid out by this design. this set of tables and relationships is called the schema. 
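as a minimal sketch of the nested xml structure described above, the following python fragment reads a small, made-up document in which each tseq record carries its own organism name. the element capitalization (TSeq, TSeq_orgname, TSeq_sequence), the TSeqSet wrapper, and all values are assumptions for illustration and are not taken from figure 1 (b).

```python
import xml.etree.ElementTree as ET

# Hypothetical two-record document; names and sequences are placeholders.
XML_TEXT = """
<TSeqSet>
  <TSeq>
    <TSeq_orgname>virus A</TSeq_orgname>
    <TSeq_sequence>MATQ</TSeq_sequence>
  </TSeq>
  <TSeq>
    <TSeq_orgname>virus B</TSeq_orgname>
    <TSeq_sequence>MKLV</TSeq_sequence>
  </TSeq>
</TSeqSet>
"""


def organism_sequences(xml_text):
    """Return (organism name, sequence) pairs, one per TSeq record."""
    root = ET.fromstring(xml_text.strip())
    pairs = []
    # Each TSeq_orgname is nested inside its own TSeq element, so the
    # link between a name and a sequence is explicit in the structure.
    for record in root.iter("TSeq"):
        pairs.append((record.findtext("TSeq_orgname"),
                      record.findtext("TSeq_sequence")))
    return pairs


pairs = organism_sequences(XML_TEXT)
```

the parser still has to walk the whole document to collect the records, which is the querying cost the text attributes to xml.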
an example diagram of a relational database schema is provided in figure 2 . this viral genome database (vgd) schema is an idealized version of a database used to store viral genome sequences, their associated gene sequences, and associated descriptive and analytical information. each box in figure 2 represents a single object or concept, such as a genome, gene, or virus, about which we want to store data and is contained in a single table in the rdbms. the names listed in the box are the columns of that table, which hold the various types of data about the object. the 'gene' table therefore contains columns holding data such as the name of the gene, its coding strand, and a description of its function. the rdbms is able to enforce a series of rules for tables that are linked by defining relationships that ensure data integrity and accuracy. these relationships are defined by a foreign key in one table that links to corresponding data in another table defined by a primary key. in this example, the rdbms can check that every gene in the 'gene' table refers to an existing genome in the 'genome' table, by ensuring that each of these tables contains a matching 'genome_id'. since any one genome can code for many genes, many genes may contain the same 'genome_id'. this defines what is called a one-to-many relationship between the 'genome' and 'gene' tables. all of these relationships are identified in figure 2 by arrows connecting the tables. 
figure 2 (caption): lines and arrows display the relationships between fields as defined by the foreign key (fk) and primary key (pk) that connect two tables. (each arrow points to the table containing the primary key.) tables are color-coded according to the source of the information they contain: yellow, data obtained from the original genbank sequence record and the ictv eighth report; pink, data obtained from automated annotation or manual curation; blue, controlled vocabularies to ensure data consistency; green, administrative data. 
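the one-to-many relationship between the 'genome' and 'gene' tables can be sketched with python's built-in sqlite3 module. this is an illustrative fragment rather than the actual vgd schema: only 'genome_id' and the gene 'name' come from the text, while the other column and all values are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite enforces foreign keys only when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE genome (
    genome_id   INTEGER PRIMARY KEY,
    genome_name TEXT NOT NULL            -- hypothetical column
);
CREATE TABLE gene (
    gene_id   INTEGER PRIMARY KEY,
    genome_id INTEGER NOT NULL REFERENCES genome(genome_id),
    name      TEXT NOT NULL
);
""")

conn.execute("INSERT INTO genome VALUES (1, 'example genome')")
# Many genes may share one genome_id (one-to-many).
conn.executemany("INSERT INTO gene VALUES (?, ?, ?)",
                 [(1, 1, "gene A"), (2, 1, "gene B")])

# A gene that points at a nonexistent genome violates the foreign key.
try:
    conn.execute("INSERT INTO gene VALUES (3, 99, 'orphan gene')")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

gene_count = conn.execute(
    "SELECT COUNT(*) FROM gene WHERE genome_id = 1").fetchone()[0]
```

the rejected insert is the integrity check described above: the database itself, not the application, refuses a gene without a matching genome.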
because viruses have evolved a variety of alternative coding strategies such as splicing and rna editing, it is necessary to design the database so that these processes can be formally described. the 'gene_segment' table specifies the genomic location of the nucleotides that code for each gene. if a gene is coded in the traditional manner (one orf, one protein), then that gene would have one record in the 'gene_segment' table. however, as described above, if a gene is translated from a spliced transcript, it would be represented in the 'gene_segment' table by two or more records, each of which specifies the location of a single exon. if an rna transcript is edited by stuttering of the polymerase at a particular run of nucleotides, resulting in the addition of one or more nontemplated nucleotides, then that gene will also have at least two records in the 'gene_segment' table. in this case, the second 'gene_segment' record may overlap the last base of the first record for that gene. in this manner, an extra, nontemplated base becomes part of the final gene transcript. other more complex coding schemes can also be identified using this, or similar, database structures. the tables in figure 2 are grouped according to the type of information they contain. though the database itself does not formally group tables in this manner, database schema diagrams are created to benefit database designers and users by enhancing their ability to understand the structure of the database. these diagrams make it easier to both populate the database with data and query the database for information. the core tables hold basic biological information about each viral strain and its genomic sequence (or sequences if the virus contains segmented genomes) as well as the genes coded for by each genome. the taxonomy tables provide the taxonomic classification of each virus. 
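the 'gene_segment' idea described above can be sketched as a list of coordinate pairs that are concatenated to rebuild a transcript. the function name, the 1-based inclusive coordinate convention, and the toy sequence are assumptions for illustration; strand handling is deliberately left out.

```python
def assemble_gene(genome_seq, segments):
    """Concatenate segments given as 1-based inclusive (start, end) pairs."""
    return "".join(genome_seq[start - 1:end] for start, end in segments)


genome = "ATGAAACCCGGGTTTTAG"  # toy 18-base genome

# One orf, one protein: a single segment record.
single = assemble_gene(genome, [(1, 18)])

# Spliced transcript: one record per exon, intron (7-12) skipped.
spliced = assemble_gene(genome, [(1, 6), (13, 18)])

# Edited transcript: the second record overlaps the last base of the
# first (position 6), so that base appears twice in the product.
edited = assemble_gene(genome, [(1, 6), (6, 18)])
```

with this convention the edited product is exactly one base longer than the template, mirroring the single nontemplated nucleotide described in the text.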
taxonomic designations are taken directly from the eighth report of the international committee on taxonomy of viruses (ictv). the 'gene properties' tables provide information related to the properties of each gene in the database. gene properties may be generated from computational analyses such as calculations of molecular weight and isoelectric point (pi) that are derived from the amino acid sequence. gene properties may also be derived from a manual curation process in which an investigator might identify, for example, functional attributes of a sequence based on evidence provided from a literature search. assignment of 'gene ontology' terms (see below) is another example of information provided during manual curation. the blast tables store the results of similarity searches of every gene and genome in the vgd searched against a variety of sequence databases using the national center for biotechnology information (ncbi) blast program. examples of search databases might include the complete genbank nonredundant protein database and/or a database comprised of all the protein sequences in the vgd itself. while most of us store our blast search results as files on our desktop computers, it is useful to store this information within the database to provide rapid access to similarity results for comparative purposes; to use these results to assign genes to orthologous families of related sequences; and to use these results in applications that analyze data in the database and, for example, display the results of an analysis between two or more types of viruses showing shared sets of common genes. finally, the 'admin' tables provide information on each new data release, an archive of old data records that have been subsequently updated, and a log detailing updates to the database schema itself. it is useful for database designers, managers, and data submitters to understand the types of information that each table contains and the source of that information. 
therefore, the database schema provided in figure 2 is color-coded according to the type and source of information each table provides. yellow tables contain basic biological data obtained either directly from the genbank record or from other sources such as the ictv. pink tables contain data obtained as the result of either computational analyses (blast searches, calculations of molecular weight, functional motif similarities, etc.) or from manual curation. blue tables provide a controlled vocabulary that is used to populate fields in other tables. this ensures that a descriptive term used to describe some property of a virus has been approved for use by a human curator, is spelled correctly, and when multiple terms or aliases exist for the same descriptor, the same one is always chosen. while the use of a controlled vocabulary may appear trivial, in fact, misuse of terms, or even misspellings, can result in severe problems in computer-based databases. the computer does not know that the terms 'negative-sense rna virus' and 'negative-strand rna virus' may both be referring to the same type of virus. the provision and use of a controlled vocabulary increases the likelihood that these terms will be used properly, and ensures that the fields containing these terms will be easily comparable. for example, the 'genome_molecule' table contains the following permissible values for 'molecule_type': 'ambisense ssrna', 'dsrna', 'negative-sense ssrna', 'positive-sense ssrna', 'ssdna', and 'dsdna'. a particular viral genome must then have one of these values entered into the 'molecule_type' field of the 'genome' table, since this field is a foreign key to the 'molecule_type' primary key of the 'genome_molecule' table. entering 'double-stranded dna' would not be permissible. 
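the controlled-vocabulary constraint just described can be sketched with python's sqlite3 module. the table and column names ('genome_molecule', 'genome', 'molecule_type') and the six permitted values come from the text; the helper function and the reduced two-column 'genome' table are hypothetical simplifications.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # required for enforcement in SQLite
conn.executescript("""
CREATE TABLE genome_molecule (
    molecule_type TEXT PRIMARY KEY
);
CREATE TABLE genome (
    genome_id     INTEGER PRIMARY KEY,
    molecule_type TEXT NOT NULL
        REFERENCES genome_molecule(molecule_type)
);
""")

PERMITTED = ["ambisense ssrna", "dsrna", "negative-sense ssrna",
             "positive-sense ssrna", "ssdna", "dsdna"]
conn.executemany("INSERT INTO genome_molecule VALUES (?)",
                 [(value,) for value in PERMITTED])


def try_insert(genome_id, molecule_type):
    """Return True if the genome row is accepted, False if rejected."""
    try:
        conn.execute("INSERT INTO genome VALUES (?, ?)",
                     (genome_id, molecule_type))
        return True
    except sqlite3.IntegrityError:
        return False


accepted = try_insert(1, "dsdna")
rejected = not try_insert(2, "double-stranded dna")
```

because the vocabulary lives in its own table, adding an approved term is a data change rather than a schema change.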
raw data obtained directly from high-throughput analytical techniques such as automated sequencing, protein interaction, or microarray experiments contain little-to-no information as to the content or meaning. the process of adding value to the raw data to increase the knowledge content is known as annotation and curation. as an example, the results of a microarray experiment may provide an indication that individual genes are up- or downregulated under certain experimental conditions. by annotating the properties of those genes, we are able to see that certain sets of genes showing coordinated regulation are a part of common biological pathways. an important pattern then emerges that was not discernable solely by inspection of the original data. the annotation process consists of a semiautomated analysis of the information content of the data and provides a variety of descriptive features that aid the process of assigning meaning to the data. the investigator is then able to use this analytical information to more closely inspect the data during a manual curation process that might support the reconstruction of gene expression or protein interaction pathways, or allow for the inference of functional attributes of each identified gene. all of this curated information can then be stored back in the database and associated with each particular gene. for each piece of information associated with a gene (or other biological entity) during the process of annotation and curation, it is always important to provide the evidence used to support each assignment. this evidence may be described in a standard operating procedure (sop) document which, much like an experimental protocol, details the annotation process and includes a description of the computer algorithms, programs, and analysis pipelines that were used to compile that information. 
each piece of information annotated by the use of this pipeline might then be coded, for example, 'iea: inferred from electronic annotation'. for information obtained from the literature during manual curation, the literature reference from which the information was obtained should always be provided along with a code that describes the source of the information. some of the possible evidence codes include 'ida: inferred from direct assay', 'igi: inferred from genetic interaction', 'imp: inferred from mutant phenotype', or 'iss: inferred from sequence or structural similarity'. these evidence codes are taken from a list provided by the gene ontology (go) consortium (see below) and as such represent a controlled vocabulary that any data curator can use and that will be understood by anyone familiar with the go database. this controlled evidence vocabulary is stored in the 'evidence' table, and each record in every one of the gene properties tables is assigned an evidence code noting the source of the annotation/curation data. as indicated above, the use of controlled vocabularies (ontologies) to describe the attributes of biological data is extremely important. it is only through the use of these controlled vocabularies that a consistent, documented approach can be taken during the annotation/curation process. and while there may be instances where creating your own ontology may be necessary, the use of already available, community-developed ontologies ensures that the ontological descriptions assigned to your database will be understood by anyone familiar with the public ontology. use of these public ontologies also ensures that they support comparative analyses with other available databases that also make use of the same ontological descriptions. the go consortium provides one of the most extensive and widely used controlled vocabularies available for biological systems. 
go describes biological systems in terms of their biological processes, cellular components, and molecular functions. the go effort is community-driven, and any scientist can participate in the development and refinement of the go vocabulary. currently, go contains a number of terms specific to viral processes, but these tend to be oriented toward particular viral families, and may not necessarily be the same terms used by investigators in other areas of virology. therefore it is important that work continues in the virus community to expand the availability and use of go terms relevant to all viruses. go is not intended to cover all things biological. therefore, other ontologies exist and are actively being developed to support the description of many other biological processes and entities. for example, go does not describe disease-related processes or mutants; it does not cover protein structure or protein interactions; and it does not cover evolutionary processes. a complementary effort is under way to better organize existing ontologies, and to provide tools and mechanisms to develop and catalog new ontologies. this work is being undertaken by the national center for biomedical ontologies, located at stanford university, with participants worldwide. the most comprehensive, well-designed database is useless if no method has been provided to access that database, or if access is difficult due to a poorly designed application. therefore, providing a search interface that meets the needs of intended users is critical to fully realizing the potential of any effort at developing a comprehensive database. access can be provided using a number of different methods ranging from direct query of the database using the relatively standardized 'structured query language' (sql), to customized applications designed to provide the ability to ask sophisticated questions regarding the data contained in the database and mine the data for meaningful patterns. 
web pages may be designed to provide simple-to-use forms to access and query data stored in an rdbms. using the vgd schema as a data source, one example of an sql query might be to find the gene_id and name of all the proteins in the database that have a molecular weight between 20 000 and 30 000, and also have at least one transmembrane region. many database providers also provide users with the ability to download copies of the database so that these users may analyze the data using their own set of analytical tools. when a user queries a database using any of the available access methods, the results of that query are generally provided in the form of a table where columns represent fields in the database and the rows represent the data from individual database records. tabular output can be easily imported into spreadsheet applications, sorted, manipulated, and reformatted for use in other applications. but while extremely flexible, tabular output is not always the best format to use to fully understand the underlying data and the biological implications. therefore, many applications that connect to databases provide a variety of visualization tools that display the data graphically, showing patterns in the data that may be difficult to discern using text-based output. an example of one such visual display is provided in figure 3 and shows conservation of synteny between the genes of two different poxvirus species. the information used to generate this figure comes directly from the data provided in the vgd. every gene in the two viruses (in this case crocodilepox virus and molluscum contagiosum virus) has been compared to every other gene using the blast search program. the results of this search are stored in the blast tables of the vgd. in addition, the location of each gene within its respective genomic sequence is stored in the 'gene_segment' table. 
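the example query mentioned above (proteins with a molecular weight between 20 000 and 30 000 and at least one transmembrane region) might look like the following sketch, run here through python's sqlite3 module. the text names only 'gene_id' and 'name'; the 'molecular_weight' column, the 'transmembrane_region' table, and the sample rows are hypothetical stand-ins for whatever the real vgd tables contain.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE gene (
    gene_id          INTEGER PRIMARY KEY,
    name             TEXT NOT NULL,
    molecular_weight REAL NOT NULL          -- hypothetical column
);
CREATE TABLE transmembrane_region (          -- hypothetical table
    region_id INTEGER PRIMARY KEY,
    gene_id   INTEGER NOT NULL REFERENCES gene(gene_id),
    tm_start  INTEGER NOT NULL,
    tm_end    INTEGER NOT NULL
);
""")
conn.executemany("INSERT INTO gene VALUES (?, ?, ?)", [
    (1, "gene A", 25000.0),   # in range, has transmembrane regions
    (2, "gene B", 25000.0),   # in range, no transmembrane region
    (3, "gene C", 45000.0),   # out of range, has a transmembrane region
])
conn.executemany("INSERT INTO transmembrane_region VALUES (?, ?, ?, ?)", [
    (1, 1, 10, 32), (2, 1, 50, 72), (3, 3, 5, 27),
])

QUERY = """
SELECT g.gene_id, g.name
FROM gene AS g
WHERE g.molecular_weight BETWEEN 20000 AND 30000
  AND EXISTS (SELECT 1 FROM transmembrane_region AS t
              WHERE t.gene_id = g.gene_id)
ORDER BY g.gene_id
"""
rows = conn.execute(QUERY).fetchall()
```

using EXISTS rather than a join keeps each gene to one output row even when it has several transmembrane regions.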
this information, once extracted from the database server, is initially text but it is then submitted to a program running on the server that reformats the data and creates a graph. in this manner, it is much easier to visualize the series of points formed along a diagonal when there are a series of similar genes with similar genomic locations present in each of the two viruses. these data sets may contain gene synteny patterns that display deletion, insertion, or recombination events during the course of viral evolution. these patterns can be difficult to detect with text-based tables, but are easy to discern using visual displays of the data. information provided to a user as the result of a database query may contain data derived from a combination of sources, and displayed using both visual and textual feedback. figure 4 shows the web-based output of a query designed to display information related to a particular virus gene. the top of this web page displays the location of the gene on the genome visually showing surrounding genes on a partial map of the viral genome. basic gene information such as genome coordinates, gene name, and the nucleotide and amino acid sequence are also provided. this information was originally obtained from the original genbank record and then stored in the vgd database. data added as the result of an automated annotation pipeline are also displayed. this includes calculated values for molecular weight and pi; amino acid composition; functional motifs; blast similarity searches; and predicted protein structural properties such as transmembrane domains, coiled-coil regions, and signal sequences. finally, information obtained from a manual curation of the gene through an extensive literature search is also displayed. 
curated information includes a mini review of gene function; experimentally determined gene properties such as molecular weight, pi, and protein structure; alternative names and aliases used in the literature; assignment of ontological terms describing gene function; the availability of reagents such as antibodies and clones; and also, as available, information on the functional effects of mutations. all of the information to construct the web page for this gene is directly provided as the result of a single database query. (the tables storing the manually curated gene information are not shown in figure 2 .) obviously, compiling the data and entering it into the database required a substantial amount of effort, both computationally and manually; however, the information is now much more available and useful to the research scientist. no discussion of databases would be complete without considering errors. as in any other scientific endeavor, the data we generate, the knowledge we derive from the data, and the inferences we make as a result of the analysis of the data are all subject to error. these errors can be introduced at many points in the analytical chain. the original data may be faulty: using sequence data as one example, nucleotides in a dna sequence may have been misread or miscalled, or someone may even have mistyped the sequence. the database may have been poorly designed; a field in a table designed to hold sequence information may have been set to hold only 2000 characters, whereas the sequences imported into that field may be longer than 2000 nucleotides. the sequences would have then been automatically truncated to 2000 characters, resulting in the loss of data. the curator may have mistyped an enzyme commission (ec) number for an rna polymerase, or may have incorrectly assigned a genomic sequence to the wrong taxonomic classification. 
or even more insidious, the curator may have been using annotations provided by other groups that had justified their own annotations on the basis of matches to annotations provided by yet another group. such chains of evidence may extend far back, and the chance of propagating an early error increases with time. such error propagation can be widespread indeed, affecting the work of multiple sequencing centers and database creators and providers. this is especially true given the dependencies of genomic sequence annotations on previously published annotations. the possible sources of errors are numerous, and it is the responsibility of both the database provider and the user to be aware of, and on the lookout for, errors. the database provider can, with careful database and application design, apply error-checking routines to many aspects of the data storage and analysis pipeline. the code can check for truncated sequences, interrupted open reading frames, and nonsense data, as well as data annotations that do not match a provided controlled vocabulary. but the user should always approach any database or the output of any application with a little healthy skepticism. the user is the final arbiter of the accuracy of the information, and it is their responsibility to look out for inconsistent or erroneous results that may indicate either a random or systemic error at some point in the process of data collection and analysis. it is not feasible to provide a comprehensive and current list of all available databases that contain virus-related information or information of use to virus researchers. new databases appear on a regular basis; existing databases either disappear or become stagnant and outdated; or databases may change focus and domains of interest. any resource published in book format attempting to provide an up-to-date list would be out-of-date on the day of publication. 
even web-based lists of database resources quickly become out-of-date due to the rapidity with which available resources change, and the difficulty and extensive effort required to keep an online list current and inclusive. therefore, our approach in this article is to provide an overview of the types of data that are obtainable from available biological databases, and to list some of the more important database resources that have been available for extended periods of time and, importantly, remain current through a process of continual updating and refinement. we should also emphasize that the use of web-based search tools such as google, various web logs (blogs), and news groups, can provide some of the best means of locating existing and newly available web-based information sources. information contained in databases can be used to address a wide variety of problems. a sampling of the areas of research facilitated by virus databases includes taxonomy and classification; host range, distribution, and ecology; evolutionary biology; pathogenesis; host-pathogen interaction; epidemiology; disease surveillance; detection; prevention; prophylaxis; diagnosis; and treatment. addressing these problems involves mining the data in an appropriate database in order to detect patterns that allow certain associations, generalizations, cause-effect relationships, or structure-function relationships to be discerned. table 1 provides a list of some of the more useful and stable database resources of possible interest to virus researchers. below, we expand on some of this information and provide a brief discussion concerning the sources and intended uses of these data sets. the two major, overarching collections of biological databases are at the ncbi, supported by the national library of medicine at the nih, and the embl, part of the european bioinformatics institute.
these large data repositories try to be all-inclusive, acting as the primary source of publicly available molecular biological data for the scientific community. in fact, most journals require that, prior to publication, investigators submit their original sequence data to one of these repositories. in addition to sequence data, ncbi and embl (along with many other data repositories) include a large variety of other data types, such as that obtained from gene-expression experiments and studies investigating biological structures. journals are also extending the requirement for data deposition to some of these other data types. note that while much of the data available from these repositories is raw data obtained directly as the result of experimental investigation in the laboratory, a variety of 'value-added' secondary databases are also available that take primary data records and manipulate or annotate them in some fashion in order to derive additional useful information. when an investigator is unsure about the existence or source of some biological data, the ncbi and embl websites should serve as the starting point for locating such information. the ncbi entrez search engine provides a powerful interface to access all information contained in the various ncbi databases, including all available sequence records. a search engine such as google might also be used if ncbi and embl fail to locate the desired information. of course pubmed, the repository of literature citations maintained at ncbi, also represents a major reference site for locating biological information. finally, the journal nucleic acids research (nar) publishes an annual 'database' issue and an annual 'web server' issue that are excellent references for finding new biological databases and websites. and while the most recent nar database or web server issue may contain articles on a variety of new and interesting databases and websites, be sure to also look at issues from previous years.
older issues contain articles on many existing sites that may not necessarily be represented in the latest issue. there are several websites that serve to provide general virus-specific information and links of use to virus researchers. one of these is the ncbi viral genomes project, which provides an overview of all virus-related ncbi resources including taxonomy, sequence, and reference information. links to other sources of viral data are provided, as well as a number of analytical tools that have been developed to support viral taxonomic classification and sequence clustering. another useful site is the all the virology on the www website. this site provides numerous links to other virus-specific websites, databases, information, news, and analytical resources. it is updated on a regular basis and is therefore as current as any site of this scope can be. one of the strengths of storing information within a database is that information derived from different sources or different data sets can be compared so that important common and distinguishing features can be recognized. such comparative analyses are greatly aided by having a rigorous classification scheme for the information being studied. the international union of microbiological societies has designated the international committee on taxonomy of viruses (ictv) as the official body that determines taxonomic classifications for viruses. through a series of subcommittees and associated study groups, scientists with expertise on each viral species participate in the establishment of new taxonomic groups, assignment of new isolates to existing or newly established taxonomic groups, and reassessment of existing assignments as additional research data become available. the ictv uses more than 2600 individual characteristics for classification, though sequence homology has gained increasing importance over the years as one of the major classifiers of taxonomic position.
currently, as described in its eighth report, the ictv recognizes 3 orders, 73 families, 287 genera, and 1950 species of viruses. the ictv officially classifies viral isolates only to the species level. divisions within species, such as clades, subgroups, strains, isolates, types, etc., are left to others. the ictv classifications are available in book form as well as from an online database. this database, the ictvdb, contains the complete taxonomic hierarchy, and assigns each known viral isolate to its appropriate place in that hierarchy. descriptive information on each viral species is also available. the ncbi also provides a web-based taxonomy browser for access to taxonomically specified sets of sequence records. ncbi's viral taxonomy is not completely congruent with that of ictv, but efforts have been under way to ensure congruency with the official ictv classification. the primary repositories of existing sequence information come from the three organizations that comprise the international nucleotide sequence database collaboration. these three sites are genbank (maintained at ncbi), embl, and the dna data bank of japan (ddbj). because all sequence information submitted to any one of these entities is shared with the others, a researcher need query only one of these sites to get the most up-to-date set of available sequences. genbank stores all publicly available nucleotide sequences for all organisms, as well as viruses. this includes whole-genome sequences as well as partial-genome and individual coding sequences. sequences are also available from large-scale sequencing projects, such as those from shotgun sequencing of environmental samples (including viruses), and high-throughput low- and high-coverage genomic sequencing projects. ncbi provides separate database divisions for access to these sequence datasets.
the sequence provided in each genbank record is the distillation of the raw data generated by (in most cases these days) automated sequencing machines. the trace files and base calls provided by the sequencers are then assembled into a collection of contiguous sequences (contigs) until the final sequence has been assembled. in recognition of the fact that there is useful information contained in these trace files and sequence assemblies (especially if one would like to look for possible sequencing errors or polymorphisms), ncbi now provides separate trace file and assembly archives for genbank sequences when the laboratory responsible for generating the sequence submits these files. currently, the only viruses represented in these archives are influenza a, chlorella, and a few bacteriophages. an important caveat in using data obtained from genbank or other sources is that no sequence data can be considered to be 100% accurate. furthermore, the annotation associated with the sequence, as provided in the genbank record, may also contain inaccuracies or be out-of-date. genbank records are provided and maintained by the group originally submitting the sequence to genbank. genbank may review these records for obvious errors and formatting mistakes (such as the lack of an open reading frame where one is indicated), but given the large numbers of sequences being submitted, it is impossible to verify all of the information in these records. in addition, the submitter of a sequence essentially 'owns' that sequence record and is thus responsible for all updates and corrections. ncbi generally will not change any of the information in the genbank record unless the sequence submitter provides the changes. in some cases, sequence annotations will be updated and expanded, but many, if not most, records never change following their initial submission.
(these facts emphasize the responsibility that submitters of sequence data have to ensure the accuracy of their original submission and to update their sequence data and annotations as necessary.) therefore, the user of the information has the responsibility to ensure, to the extent possible, its accuracy is sufficient to support any conclusions derived from that information. in recognition of these problems, ncbi established the reference sequence (refseq) database project, which attempts to provide reference sequences for genomes, genes, mrnas, proteins, and rna sequences that can be used, in ncbi's words, as ''a stable reference for gene characterization, mutation analysis, expression studies, and polymorphism discovery''. refseq records are manually curated by ncbi staff, and therefore should provide more current (and hopefully more accurate) sequence annotations to support the needs of the research community. for viruses, refseq provides a complete genomic sequence and annotation for one representative isolate of each viral species. ncbi solicits members of the research community to participate as advisors for each viral family represented in refseq, in an effort to ensure the accuracy of the refseq effort. in addition to the nucleotide sequence databases mentioned above, uniprot provides a general, all-inclusive protein sequence database that adds value through annotation and analysis of all the available protein sequences. uniprot represents a collaborative effort of three groups that previously maintained separate protein databases (pir, swissprot, and trembl). these groups, the national biomedical research foundation at georgetown university, the swiss institute of bioinformatics, and the european bioinformatics institute, formed a consortium in 2002 to merge each of their individual databases into one comprehensive database, uniprot. 
uniprot data can be queried by searching for similarity to a query sequence, or by identifying useful records based on the text annotations. sequences are also grouped into clusters based on sequence similarity. similarity of a query sequence to a particular cluster may be useful in assigning functional characteristics to sequences of unknown function. ncbi also provides a protein sequence database (with corresponding refseq records) consisting of all protein-coding sequences that have been annotated within all genbank nucleotide sequence records. the above-mentioned sequence databases are not limited to viral data, but rather store sequence information for all biological organisms. in many cases, access to nonviral sequences is necessary for comparative purposes, or to study virus-host interactions. but it is frequently easier to use virus-specific databases when they exist, to provide a more focused view of the data that may simplify many of the analyses of interest. table 1 lists many of these virus-specific sites. sites of note include the nih-supported bioinformatics resource centers for biodefense and emerging and reemerging infectious diseases (brcs). the brcs concentrate on providing databases, annotations, and analytical resources on nih priority pathogens, a list that includes many viruses. in addition, the lanl has developed a variety of viral databases and analytical resources including databases focusing on hiv and influenza. for plant virologists, the descriptions of plant viruses (dpv) website contains a comprehensive database of sequence and other information on plant viruses. the three-dimensional structures for quite a few viral proteins and virion particles have been determined. these structures are available in the primary database for experimentally determined structures, the protein data bank (pdb). the pdb currently contains the structures for more than 650 viral proteins and viral protein complexes out of 38 000 total structures. 
several virus-specific structure databases also exist. these include the viperdb database of icosahedral viral capsid structures, which provides analytical and visualization tools for the study of viral capsid structures; virus world at the institute for molecular virology at the university of wisconsin, which contains a variety of structural images of viruses; and the big picture book of viruses, which provides a catalog of images of viruses, along with descriptive information. ultimately, the biology of viruses is determined by genomic sequence (with a little help from the host and the environment). nucleotide sequences may be structural, functional, regulatory, or protein coding. protein sequences may be structural, functional, and/or regulatory, as well. patterns specified in nucleotide or amino acid sequences can be identified and associated with many of these biological roles. both general and virus-specific databases exist that map these roles to specific sequence motifs. most also provide tools that allow investigators to search their own sequences for the presence of particular patterns or motifs characteristic of function. general databases include the ncbi conserved domain database; the pfam (protein family) database of multiple sequence alignments and hidden markov models; and the prosite database of protein families and domains. each of these databases and associated search algorithms differ in how they detect a particular search motif or define a particular protein family. it can therefore be useful to employ multiple databases and search methods when analyzing a new sequence (though in many cases they will each detect a similar set of putative functional motifs). interpro is a database of protein families, domains, and functional sites that combines many other existing motif databases. interpro provides a search tool, interproscan, which is able to utilize several different search algorithms dependent on the database to be searched. 
it allows users to choose which of the available databases and search tools to use when analyzing their own sequences of interest. a comprehensive report is provided that not only summarizes the results of the search, but also provides a comprehensive annotation derived from similarities to known functional domains. all of the above databases define functional attributes based on similarities in amino acid sequence. these amino acid similarities can be used to classify proteins into functional families. placing proteins into common functional families is also frequently performed by grouping the proteins into orthologous families based on the overall similarity of their amino acid sequence as determined by pairwise blast comparisons. two virus-specific databases of orthologous gene families are the viral clusters of orthologous groups database (vogs) at ncbi, and the viral orthologous clusters database (vocs) at the viral bioinformatics resource center and viral bioinformatics, canada. many other types of useful information, both general and virus-specific, have been collected into databases that are available to researchers. these include databases of gene-expression experiments (ncbi gene expression omnibus -geo); protein-protein interaction databases, such as the ncbi hiv protein-interaction database; the immune epitope database and analysis resource (iedb) at the la jolla institute for allergy and immunology; and databases and resources for defining and visualizing biological pathways, such as metabolic, regulatory, and signaling pathways. these pathway databases include reactome at the cold spring harbor laboratory, new york; biocyc at sri international, menlo park, california; and the kyoto encyclopedia of genes and genomes (kegg) at kyoto university in japan. as indicated above, the information contained in a database is useless unless there is some way to retrieve that information from the database. 
in addition, having access to all of the information in every existing database would be meaningless unless tools are available that allow one to process and understand the data contained within those databases. therefore, a discussion of virus databases would not be complete without at least a passing reference to the tools that are available for analysis. to populate a database such as the vgd with sequence and analytical information, and to utilize this information for subsequent analyses, requires a variety of analytical tools including programs for sequence record reformatting, database import and export, sequence similarity comparison, gene prediction and identification, detection of functional motifs, comparative analysis, multiple sequence alignment, phylogenetic inference, structural prediction, and visualization. sources for some of these tools have already been mentioned, and many other tools are available from the same websites that provide many of the databases listed in table 1. the goal of all of these sites that make available data and analytical tools is to provide, or enable the discovery of, knowledge, rather than simply providing access to data. only in this manner can the ultimate goal of biological understanding be fully realized. see also: evolution of viruses; phylogeny of viruses; taxonomy, classification and nomenclature of viruses; virus classification by pairwise sequence comparison (pasc).
key: cord-340907-j9i1wlak authors: zarai, yoram; zafrir, zohar; siridechadilok, bunpote; suphatrakul, amporn; roopin, modi; julander, justin; tuller, tamir title: evolutionary selection against short nucleotide sequences in viruses and their related hosts date: 2020-04-27 journal: dna res doi: 10.1093/dnares/dsaa008 sha: doc_id: 340907 cord_uid: j9i1wlak viruses are under constant evolutionary pressure to effectively interact with the host intracellular factors, while evading its immune system. understanding how viruses co-evolve with their hosts is a fundamental topic in molecular evolution and may also aid in developing novel viral based applications such as vaccines, oncologic therapies, and anti-bacterial treatments. here, based on a novel statistical framework and a large-scale genomic analysis of 2,625 viruses from all classes infecting 439 host organisms from all kingdoms of life, we identify short nucleotide sequences that are under-represented in the coding regions of viruses and their hosts. these sequences cannot be explained by the coding regions’ amino acid content, codon, and dinucleotide frequencies.
we specifically show that short homooligonucleotide and palindromic sequences tend to be under-represented in many viruses probably due to their effect on gene expression regulation and the interaction with the host immune system. in addition, we show that more sequences tend to be under-represented in dsdna viruses than in other viral groups. finally, we demonstrate, based on in vitro and in vivo experiments, how under-represented sequences can be used to attenuate zika virus strains. viruses, the most abundant type of biological entity, are small infectious agents that can only replicate inside the living cells of other organisms (hosts). 1 the viral genetic material is composed of either an rna or a dna molecule, single or double stranded. viral genomes typically encode three types of protein: proteins for replicating the genome, proteins for packing the genome, and proteins for modifying the function of the host's cell to enhance the replication of the virus's material. 2, 3 viruses are believed to play a central role in evolution, 4 (e.g. via horizontal gene transfer 2,5-7 ), be responsible for various human diseases (e.g. aids and respiratory diseases 8, 9 ), and also have important applications to biotechnology 10 and nanotechnology. 11 for instance, the recent zika virus (zikv) epidemic in the americas has led the world health organization to declare a 'public health emergency of international concern', 12, 13 and just recently the novel coronavirus (2019-ncov) outbreak in china was declared a pandemic by the same organization. 14 due to their complete reliance on the host gene expression machinery, viruses are under constant evolutionary pressure to effectively interact with the host intracellular factors, and at the same time effectively evade its immune system.
3, 15 thus, understanding how viruses co-evolve with their hosts to ensure their fitness may help in developing novel viral based applications such as vaccines, oncologic therapies, and anti-bacterial treatments. it is natural to expect that virus-host co-evolution patterns are also encrypted in the viral genome. for example, it was shown that a high correlation of gc content exists between bacteriophage and related hosts, 16 that a pattern of cpg dinucleotides is suppressed in vertebrate hosts and in their related rna viruses, 17, 18 that the frequency of tpa dinucleotides is suppressed in invertebrate hosts and in their related rna viruses, 19 and that many long sequences are shared between hosts and their related viruses. 20, 21 short dna sequences that are under-represented (also referred to as suppressed or avoided) in genomes of different species have been identified and analysed in the past. 22, 33 for example, in ref. 22, markov chain models were used to analyse short sequences in the dna of two hosts: escherichia coli and bacillus subtilis. markovian models were used in ref. 23 to predict the frequencies of short sequences and were applied to many prokaryotic species, and the authors of ref. 24 introduced an efficient algorithm to identify sequences that are avoided. in this paper, we analyse under-represented nucleotide sequences in the coding regions of all types of viruses and in the coding regions of their corresponding hosts using a novel statistical framework. these sequences are analysed separately in each of the three reading frames. we provide a large database of these sequences, identify unique and interesting patterns within these sequences, and demonstrate how these sequences can be utilized to attenuate the zikv via in vitro and in vivo experiments. in this section, we briefly describe the main steps of our methodology. a detailed description appears in the supplementary document. the general flow of our analysis is depicted in fig. 1a.
the dataset of virus-host associations was retrieved from previously published data. 34 these include 2,625 unique viruses and 439 corresponding hosts, where all the corresponding coding sequences were downloaded and processed. randomization models were used to generate many random variants of the host and virus coding sequences. two different randomization models were used, each controlling for different biases. a dinucleotide randomization model preserves both amino-acid order and content and the distribution of all 16 possible pairs of nucleotides, whereas a synonymous codon randomization model preserves both amino-acid order and content, and the codon usage bias. these were then used to statistically infer short nucleotide sequences that are under-represented within both the original host and virus genome coding regions, in each reading frame, and those that are common to all three reading frames. these under-represented sequences were analysed and compared among different viral groups and viral proteins, revealing some interesting evolutionary patterns that will be discussed later on. based on this analysis, an attenuated variant of the zikv was engineered and its attenuation was demonstrated in cell lines and in mice. the virus and host coding sequences and association information were retrieved from a published database. 21 in brief, the association between viruses and hosts was derived from the genomenet virus-host database. 34 the database contains 2,625 unique viruses and 439 corresponding unique hosts from all kingdoms of life (see supplementary table s1 ). figure 1b protists), where we specify for each host domain the portion of the corresponding viruses belonging to each virus type. the virus types in the database are reverse-transcribing (retro), double-stranded dna (dsdna), double-stranded rna (dsrna), single-stranded dna (ssdna), single-stranded rna (ssrna, positive and negative sense), and other (unclassified).
the question that we must first address is: what constitutes an under-represented sequence in a coding region? to detect sequences that are statistically under-represented in the coding regions, our statistical background model must capture well-understood coding region features, which are known to be under selection. for example, selection for codon usage bias may cause few short sequences to be in low abundance in the coding regions (as opposed, for example, to regions that are not translated). this, however, does not imply that these short sequences were directly selected against by evolutionary forces. our definition of under-represented short nucleotide sequences in the coding region must then be formulated with respect to all known coding region features (i.e. amino-acids content and order, codon usage bias, and dinucleotide distribution), to suggest possibly new evolutionary forces acting on the viral coding regions. to that end, two randomization models were used to evaluate our hypothesis for short, under-represented nucleotide sequences in the coding regions of the viruses and in the coding regions of their corresponding hosts. the first, called dinucleotide randomization, preserves both amino acid order and content (and thus the resulting protein), and the frequencies of the 16 possible pairs of adjacent nucleotides (dinucleotides). the second, called synonymous codon randomization preserves both amino-acids order and content (and thus the resulting protein) and the codon usage bias. figure 1c depicts a schematic description of both randomization methods. a selection against short nucleotide sequences that cannot be explained by the canonical genomic features that are preserved by both randomization models implies that these sequences will appear more frequently in the random variants (generated by the above randomization models) than in the original genome. 
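to make the second model concrete, the following is a minimal sketch of one way a synonymous codon randomization could be realised: codons are permuted only among positions that encode the same amino acid, so the protein and the exact codon counts (and hence the codon usage bias) are unchanged while the nucleotide context across codon boundaries is reshuffled. the permutation scheme, the function names synonymous_codon_shuffle and translate, and the use of the standard genetic code are assumptions of this sketch; the authors' actual procedure is the one defined in their supplementary document. the dinucleotide-preserving model is more involved and is not sketched here.

```python
import random
from collections import defaultdict
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order ("*" = stop).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def split_codons(cds):
    """Split an in-frame coding sequence (upper-case A/C/G/T) into codons."""
    if len(cds) % 3 != 0:
        raise ValueError("coding sequence length must be a multiple of three")
    return [cds[i:i + 3] for i in range(0, len(cds), 3)]


def translate(cds):
    """Translate a coding sequence with the standard genetic code."""
    return "".join(CODON_TABLE[codon] for codon in split_codons(cds))


def synonymous_codon_shuffle(cds, rng):
    """Return a random variant with the same protein and the same codon counts."""
    codons = split_codons(cds)

    # Group codon positions by the amino acid they encode.
    positions = defaultdict(list)
    for index, codon in enumerate(codons):
        positions[CODON_TABLE[codon]].append(index)

    # Permute the codons within each amino-acid group.
    shuffled = list(codons)
    for indices in positions.values():
        pool = [codons[i] for i in indices]
        rng.shuffle(pool)
        for i, codon in zip(indices, pool):
            shuffled[i] = codon
    return "".join(shuffled)


if __name__ == "__main__":
    original = "ATGGCTGCAGCGCTTTTATAA"
    variant = synonymous_codon_shuffle(original, random.Random(0))
    print(translate(original) == translate(variant))
```

passing an explicit random.Random instance keeps the variants reproducible, which matters when many variants per genome are generated and compared.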
empirical p-values were derived from the empirical null model defined by the above two randomization models. the p-value estimates the probability of obtaining a random value (i.e. the number of occurrences of a sequence in the coding regions) that is the same or larger than the observed value in the original genome. this was performed separately in each of the three reading frames. a sequence was declared under-represented if its p-values corresponding to the two randomization models both met the 0.05 significance threshold. note that in the case of synonymous codon randomization, no under-represented sequence of size three nucleotides can be identified in the first reading frame. specifically, when analysing under-represented sequences in the viruses, we compared the original genome to 1,000 corresponding randomization variants generated by each of the randomization models described above. under-represented sequences were then identified separately in each reading frame. in addition, common under-represented nucleotide sequences were identified (i.e. sequences that are under-represented in all three reading frames-see supplementary document, section 1.4.1). this may indicate selection against sequences that may 'interfere' with the process of mrna translation. see supplementary document, sections 1.4 and 2.3 for an additional method of identifying under-represented sequences in the viruses based on the corresponding hosts (i.e. host-based as opposed to random-based analysis). due to the large size of the host genome, the analysis of under-represented sequences in the hosts was performed differently than in the viruses. instead, the hosts were analysed relative to their corresponding viruses. recall that a host can be infected by several viruses. specifically, for each pair of a host and a corresponding virus (i.e. a virus that infects that host), we randomly sampled the host coding sequences with a sample size equal to the total size of the virus coding sequences.
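the per-frame counting and the empirical test can be sketched as follows. this is an illustration under stated assumptions rather than the authors' implementation: a sequence is counted in a frame when its start offset within the coding sequence is congruent to that frame modulo three; the sketch counts the random variants in which the sequence occurs no more often than in the original genome, so that a small value flags depletion; and a p-value of at most alpha under both models is taken as meeting the threshold. the function names and the alpha parameter are hypothetical.

```python
def count_in_frame(cds, kmer, frame):
    """Count occurrences of kmer whose start offset is frame (0, 1 or 2) mod 3."""
    k = len(kmer)
    return sum(
        1
        for start in range(frame, len(cds) - k + 1, 3)
        if cds[start:start + k] == kmer
    )


def empirical_p_value(observed, null_counts):
    """Fraction of random variants whose count does not exceed the observed count.

    Assumption of this sketch: the lower tail is the one of interest, because
    an under-represented sequence is rarer in the genome than in the variants.
    """
    if not null_counts:
        raise ValueError("at least one random variant is required")
    hits = sum(1 for count in null_counts if count <= observed)
    return hits / len(null_counts)


def is_under_represented(p_dinucleotide, p_codon, alpha=0.05):
    """Require significance under both randomization models."""
    return p_dinucleotide <= alpha and p_codon <= alpha


if __name__ == "__main__":
    cds = "ATGAAATGA"
    observed = count_in_frame(cds, "ATG", 0)
    # Hypothetical counts of the same k-mer in four random variants.
    print(observed, empirical_p_value(observed, [3, 4, 2, 5]))
```

with 1,000 variants per model, the smallest non-zero value this estimator can return is 0.001; a common variant adds one to both the numerator and the denominator so that the estimate is never exactly zero.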
twenty host samples were used for each host-virus pair. each sample was compared with 1,000 corresponding randomization variants generated by each of the randomization models. thus, twenty sets of under-represented sequences were identified in the host, for each reading frame, given a corresponding virus. a sequence that is under-represented in at least ten of the twenty samples, per reading frame, is then considered as under-represented in the host, given the corresponding virus. this is referred to as the sampled majority under-represented set of the host given a corresponding virus (see supplementary document, section 1.4.2). the final set of sequences that are under-represented in the host was defined by the intersection over all the corresponding viruses. see more details in supplementary document, section 1.4. the genome of a thai-strain zikv from an infectious-clone plasmid 35 was evaluated to uncover under-represented sequences (see supplementary document, section 1.8 for more details on the zikv strain). first, the two randomization models (dinucleotides and synonymous codons) were used on the zikv coding sequence to identify short sequences that are under-represented. next, oligos of five nucleotides (5-mers) that were identified by both models and showed significant p-values were selected and ranked according to their significance level (see the list of oligos detected in supplementary document, section 2.7). then, the sequence of the thai-strain zikv ns5 protein was systematically scanned at the nucleotide level (according to the significance in the relevant frame) to identify locations that can be modified with each 5-mer without affecting the amino acid sequence of the protein (fig. 2). specifically, we were able to identify and introduce 29 synonymous codon changes in the first reading frame, and 70 synonymous codon changes in the second reading frame. figure 2. a general scheme of engineering a synthetic sequence.
specifically, in the case of the synthetic zikv ur99 sequence, we introduced different under-represented 5-mer oligos in the first two reading frames (identified using both randomization models), replacing the original nucleotide sequence while verifying that the protein aa sequence remains unchanged. the modified ns5 sequence (hereafter named ur99) was later synthesized as plasmid dna, amplified by pcr, and used to build the zikv-ur99 strain by gibson assembly. 36 the first-passage stock virus was produced using vero cells. synthetic strain preparation: the infectious-clone plasmid of the thai-strain zikv was constructed from pcr products of viral cdna. the transfection of the plasmid into mammalian cells generated infectious virus with replication kinetics similar to those of the original virus. the sequence of the infectious-clone plasmid was verified. 35 the viral sequence from this infectious-clone plasmid was evaluated to uncover under-represented sequences as discussed above. cell lines: bhk21 with rtta3 was used to generate virus from assembled dna. 37 the supernatant from the transfected bhk21 was then used to infect vero cells to prepare the virus stock for subsequent experiments. replication kinetics of the wild-type (wt) virus and the ur99 virus were characterized in vero cells with moi = 0.01. the infectious titre was quantitated with vero cells using immunostaining against e protein by 4g2 monoclonal antibodies. animals: 45 male and female ag129 mice produced by an in-house colony were used. groups of animals of both genders were randomly assigned to experimental groups and individually marked with ear tags. animals were challenged with malaysian zikv, zikv wt synthetic, ur99, or vehicle. serum was collected from all mice at 14 dpi for assessment of neutralizing antibodies (neutabs) via prnt assay. mice were monitored for mortality and disease signs daily. individual weights were recorded daily throughout the course of the study.
virus: wt zikv (malaysian strain, p6-740) was prepared by two passages in vero cells. a challenge dose of 100 ccid50 was administered via s.c. injection in a volume of 0.1 ml. the virus was generated from the same infectious-clone plasmid as the designed variants. quantification of neutab: neutab was quantified using a 50% plaque reduction neutralization titre (prnt50) assay. serum samples were heat inactivated at 56 °c for 30 min in a water bath. one-half serial dilutions of test sera, starting at a 1/10 dilution, were made. dilutions were then mixed 1:1 with an appropriate titre of zikv in mem containing 2% fetal bovine serum (fbs) and incubated at 4 °c overnight. the virus-serum mixture was then added to individual wells of a 12-well tissue culture plate with vero76 cells (4e5 cells/well). viral adsorption proceeded for 1 h at 37 °c and 5% co2, followed by addition of 1.7% (4,000 cps) methylcellulose overlay medium containing 10% fbs to each well. plates were incubated for 4 days, and then stained with crystal violet [with 1% (wt/vol) crystal violet in 10% (vol/vol) ethanol] for 20 min. the reciprocal of the dilution of test serum that resulted in >50% reduction in average plaques from virus control was recorded as the prnt50 value. to identify short under-represented nucleotide sequences, we compared the number of appearances of each sequence of 3, 4, and 5 nucleotides in each reading frame of the original genome with many corresponding randomization variants. our randomization models preserve the basic canonical features of the coding sequences, i.e. amino-acid composition, codon usage bias, and dinucleotide distribution (see section 2.3). thus, an under-represented sequence cannot be explained by these canonical features and may be selected against by other evolutionary forces. to estimate the false discovery rate, we performed two separate evaluations.
first, we generated 10,000 randomizations (instead of 1,000) for a few randomly selected viruses and verified that under-represented sequences that were detected using 1,000 randomizations were also detected using 10,000 randomizations. in the second evaluation, we performed identifications of under-represented sequences in random variants of the viruses (rather than in the original genome). specifically, a random variant of each virus was randomly selected, and the p-value was evaluated relative to this (random) variant (see supplementary document, section 1.4.1.1). comparing the number of under-represented sequences identified in the original viruses and the randomized variants of the viruses yields an estimated false discovery rate of 1.38% (for m = 3), 1.39% (for m = 4), and 1.43% (for m = 5). the under-represented sequences identified were further processed by analysing different virus and host groups. specifically, we analysed under-represented sequences for each virus group, for each host domain, for all viruses that correspond to the same host, and for different combinations of host domains and virus groups (see supplementary document, section 1.5). a complete list of the most abundant under-represented sequences among the different virus groups is available in supplementary table s2. in addition, we refined our analysis of under-represented sequences in the viruses by analysing different protein groups. we classified all viral genes into five mutually exclusive functional groups [surface, structural, enzymatic, unknown (unclassified genes), and other (hypothetical genes)] and showed that the selection against short nucleotide sequences depends on the viral protein function. finally, we performed a test study using zikv, where we engineered under-represented sequences into the genome of an asian zikv and studied their effect both in vitro and in vivo.
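the second false discovery evaluation described above can be sketched in a few lines of python. this is only an illustration under stated assumptions: a random variant is treated as if it were the original genome and scored against the remaining variants, and the false discovery rate is estimated as the ratio between the number of sequences reported in such variants and the number reported in the original genomes. the function names, the direction of the p-value, and the counts are hypothetical and are not taken from the study.

```python
def p_value_for_variant(random_counts, index):
    """Score one random variant as if it were the original genome.

    The variant at `index` is held out and compared with the remaining
    variants (assumed direction: counts at or below the held-out count).
    """
    pseudo_observed = random_counts[index]
    others = random_counts[:index] + random_counts[index + 1:]
    return sum(1 for c in others if c <= pseudo_observed) / len(others)


def estimated_false_discovery_rate(n_in_random_variants, n_in_original):
    """Ratio of detections in random variants to detections in real genomes."""
    if n_in_original <= 0:
        raise ValueError("no under-represented sequences in the original genomes")
    return n_in_random_variants / n_in_original


# Hypothetical counts, chosen only to illustrate the arithmetic.
fdr_example = estimated_false_discovery_rate(14, 1000)
```

with these made-up counts the estimate is 1.4%, the same order of magnitude as the rates reported above; the real values depend on the full data set.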
figure 3a and b depicts the average number of under-represented sequences of size m = 3, 4, and 5 nucleotides, identified in a few subsets of viruses in both the original and random variants of the virus. see supplementary document, section 1.5 for details about the different subsets, and supplementary document, section 1.4.1.1 for generating random variants of viruses. as shown in the figures, the average number does indeed increase with the sequence size. also, many under-represented sequences are found in dsdna viruses that infect bacteria and vertebrate hosts. the average number of under-represented sequences found in the random variants of the viruses is between 1 and 2% of the average number found in the original genome, suggesting a false discovery rate <2%. since the genomes of dsdna viruses tend to be on average larger than the genomes of rna viruses, we aimed at evaluating if the larger number of under-represented sequences identified can be simply attributed to a better statistical signal due to the larger nucleotide size of these viruses. a sampling analysis that we performed (see supplementary document, section 2.8) suggests that the number of under-represented sequences identified in dsdna viruses matches their genomic size, when compared with rna viruses. a complete list of under-represented sequences of sizes m = 3, 4, and 5 nucleotides in all viruses in the database is available in supplementary table s3 (random-based) and in supplementary table s4 (host-based). our analysis suggests that among the most abundant common under-represented nucleotide sequences (i.e. sequences that are under-represented in all three reading frames) are homooligonucleotide repeats, specifically in viruses. these are sequences of the form xx…x, where every x is the same nucleotide. figure 4a note that among these, specifically in viruses, are sequences containing the same nucleotide repeated m = 3, 4, or 5 times (i.e.
sequences that correspond to the same colour repeating m times in the figure). a finer resolution of these common under-represented sequences is provided in fig. 4b, where we depict these sequences separately for different subsets of hosts (left figure) and subsets of viruses (right figure). see supplementary document, section 1.5 for more details of the different subsets. table 1 lists the six most abundant common under-represented nucleotide sequences of size m = 3, 4, and 5 in dsdna viruses. all homooligonucleotide sequences (shown in red coloured text) are among these most abundant sequences. one possible reason for this general selection against homooligonucleotides (in all three reading frames) in both viruses and hosts is to reduce erroneous frame shifts as ribosomes traverse the mrna while decoding it codon by codon. a sequence containing a repetition of the same nucleotide in the coding sequence may cause the ribosome to miss the codon boundary, resulting in a frame shift and thus a non-functional and most likely deleterious protein. 2, 38 such a protein must be recognized and degraded by energy-consuming intracellular proteolytic mechanisms. since translation is the most energy-consuming process in the cell, it is believed that transcripts undergo selection to minimize this energy cost. [39] [40] [41] [42] [43] selection against sequences of repetitive nucleotides reduces faulty translation, thus minimizing the overall translation cost. it is possible that this selection against homooligonucleotide repeats is more pronounced in viruses than in hosts because viruses are under much stronger evolutionary selection: they have a larger effective population size, and thus these types of mutations have a stronger effect on their fitness. another possible reason may be related to different host immune evasion mechanisms used by viruses (see section 3.2).
we also evaluated the sequence overlap between common under-represented sequences in viruses and transcription factor binding sites and again found a general selection against homooligonucleotide repeats. these are reported in supplementary document, section 2.4. a nucleotide sequence is called palindromic if it is identical to its reverse complement. obviously, palindromic sequences are of even length. our analysis reveals that 32.5% of all common under-represented sequences of size m = 4 nucleotides in viruses are palindromes. excluding homooligonucleotide repeats, this becomes 51%. note that only 6.25% of all possible sequences of size m = 4 nucleotides are palindromes. we also evaluated the number of palindromes in random variants of the viruses. these random variants preserve basic transcript features such as amino-acid order and content, codon usage bias and dinucleotide distributions. only 5.7% of all common under-represented sequences of size m = 4 in the random variants of the viruses were found to be palindromes. these findings suggest that short palindromic sequences are indeed selected against in the coding regions of viruses. figure 5a and b depicts the percentage of palindromic sequences of size m = 4 nucleotides that are common under-represented sequences in subsets of hosts and viruses. it was found that palindromic sequences are selected against only in one subset of hosts: bacterial hosts that are infected by dsdna viruses. in addition, palindromic sequences were found to be selected against in dsdna viruses that infect either bacteria (i.e. bacteriophage) or vertebrate hosts. 44, 45 as depicted in fig. 5a 32 ). figure 5c and d depicts the total number of occurrences of each palindrome as under-represented sequence in dsdna viruses that infect bacteria and vertebrate hosts, respectively. in these sub-figures we analysed under-represented sequences regardless of reading frames.
two cases are shown: the case where the real virus genome is used (shown in blue colour), and the case where a randomized variant of the virus genome is used (shown in red colour). note the scale difference in the y-axis between the real and the randomized results. the results in the figures imply that dsdna viruses undergo selection against short palindrome sequences. it has been proposed that the principal underlying reason for the apparent avoidance of short palindromes in dsdna viruses is that they are targets for many restriction-modification systems and possibly for general recombination systems as well. 25, 29, 31, 46, 47 restriction-modification systems protect bacteria and archaea from attacks by bacteriophages and archaeal viruses. a restriction-modification system specifically recognizes short sites in foreign dna and cleaves it, while such sites in the host dna are protected by methylation. to evaluate the hypothesis of palindrome avoidance in viruses due to restriction-modification systems, we downloaded all restriction enzyme patterns from the rebase 48 database (we used version 811, which contains information for 952 different restriction enzymes) and evaluated the overlap between the common under-represented nucleotide sequences we identified and the restriction sites from rebase. figure 5e depicts the number of exact matches between the most abundant common under-represented palindrome sequences of size m = 4 nucleotides in dsdna viruses and restriction sites. the figure also depicts the corresponding enzyme name and the p-value for each common under-represented sequence. the p-value was computed by evaluating the match between common under-represented sequences of random variants of the viruses and the restriction sites. figure 5f depicts the number of restriction sites that are supersets of the most abundant common under-represented palindrome sequences. p-values were computed as in the case of an exact match.
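the palindrome definition and the two kinds of overlap with restriction sites (exact match and superset) can be illustrated with a short python sketch. the enzyme list below is a small hand-picked set of well-known recognition sites used only as an example; it is not the rebase data set used in the study, and sites with degenerate bases are not handled. all function names are hypothetical.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.translate(COMPLEMENT)[::-1]


def is_palindrome(seq):
    """A sequence is palindromic if it equals its reverse complement."""
    return seq == reverse_complement(seq)


def palindrome_fraction(k):
    """Fraction of all 4**k sequences of length k that are palindromic."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    return sum(1 for s in kmers if is_palindrome(s)) / len(kmers)


# Illustrative recognition sites only (not the REBASE download).
EXAMPLE_SITES = {
    "EcoRI": "GAATTC",
    "BamHI": "GGATCC",
    "HindIII": "AAGCTT",
    "AluI": "AGCT",
    "HaeIII": "GGCC",
    "MboI": "GATC",
    "TaqI": "TCGA",
}


def exact_matches(kmer, sites):
    """Enzymes whose recognition site equals the k-mer."""
    return sorted(name for name, site in sites.items() if site == kmer)


def superset_matches(kmer, sites):
    """Enzymes whose recognition site strictly contains the k-mer."""
    return sorted(
        name for name, site in sites.items() if kmer in site and site != kmer
    )


fraction_4 = palindrome_fraction(4)  # 16 of 256 sequences
```

for m = 4 the first two bases determine the last two, so 16 of the 256 possible sequences are palindromes, which is the 6.25% baseline quoted above. no sequence of odd length can equal its own reverse complement, which is why palindromes are only considered for m = 4 here.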
to show that the correspondence between selection against short palindromic sequences in viruses and restriction sites cannot be explained by basic coding region features such as amino-acid content and order, codon usage bias and dinucleotide distribution, we also evaluated the overlap between restriction sites and common under-represented sequences of random variants of viruses. this is reported in supplementary document, section 2.5. a complete list of all common under-represented palindromes of size m = 4 is provided in supplementary table s6. common under-represented sequences were only identified in two subsets of hosts. on the other hand, common under-represented sequences were identified in all eight subsets of viruses. our analysis reveals that dsdna viruses infecting bacteria and vertebrate hosts have the largest number of common under-represented sequences among the different virus subsets. this, as suggested above, seems to be due to the size of dsdna viruses when compared with ssdna and rna viruses. on the other hand, bacteria that are infected by dsdna viruses have the largest number of common under-represented sequences among the different host subsets. thus, the stronger selection for under-represented sequences in bacteria may induce stronger selection for under-represented sequences in viruses that utilize this host. in addition, we evaluated the number of under-represented sequences identified in the real genome of the viruses when compared with the randomized genome of the viruses. this is reported in supplementary document, section 2.9. indeed, many more sequences are identified as under-represented in the real genome of the virus.
on average over all viruses and the three sequence sizes, there are 45 standard deviations more under-represented sequences in the real genome in comparison to the random genomes, implying that these cannot be explained by basic coding region features, and suggesting possibly new evolutionary forces acting on the viral coding regions. note that since we analyse each pair of a host and a corresponding virus separately, the set of under-represented sequences in a host above is the sampled majority under-represented set. for obvious reasons, sequences that are not under-represented in both host and virus coding regions constitute the majority of the sequences and are thus not reported here. a complete list of all under-represented sequences within the three classes above for all hosts and viruses in our database is available in supplementary table s5. in general, an under-represented sequence of m nucleotides may contain sub-sequences that are themselves under-represented. thus, it may be interesting to identify unique under-represented sequences, i.e. sequences that do not contain any sub-sequences that are under-represented. for each pair of a host and a corresponding virus, a sequence belonging to one of the three classes above is referred to as a unique under-represented sequence if it does not contain any sub-sequence that is under-represented in that class. specifically, a unique common under-represented sequence of size m = 4 (m = 5) nucleotides does not contain any sub-sequence of size m = 3 (of size m = 3 and of size m = 4) nucleotides that is itself a common under-represented sequence. a complete list of all unique common under-represented sequences within the three classes above for all hosts and viruses in the database is available in supplementary table s7. the correspondence of the most abundant under-represented sequences between viruses and their related hosts is depicted in fig. 7 for different host and virus subsets.
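the definition of unique under-represented sequences given above amounts to a simple filter, sketched here in python. the input layout (a dictionary from sequence size to a set of sequences), the function name, and the example sequences are hypothetical and serve only to make the definition concrete.

```python
def unique_under_represented(under_by_size):
    """Keep sequences that contain no shorter under-represented sub-sequence.

    under_by_size maps a size m to the set of under-represented sequences
    of that size within one class (hypothetical data layout).
    """
    unique = {}
    for m, seqs in under_by_size.items():
        # All under-represented sequences that are shorter than m.
        shorter = set().union(
            *(s for size, s in under_by_size.items() if size < m)
        )
        unique[m] = {
            seq for seq in seqs if not any(sub in seq for sub in shorter)
        }
    return unique


# Made-up example sets, not results from the study.
example_sets = {
    3: {"AAA", "GGC"},
    4: {"AAAA", "CGAT", "TGGC"},
    5: {"CGATA", "TTACG"},
}
unique_sets = unique_under_represented(example_sets)
```

in this toy input, aaaa and tggc are dropped because they contain aaa and ggc, and cgata is dropped because it contains cgat, so only cgat (m = 4) and ttacg (m = 5) remain unique beyond the 3-mers.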
each panel depicts both the most abundant common under-represented sequences (left) and the most abundant unique common under-represented sequences (right), where the panel names correspond to the class names. our first observation is that many under-represented sequences are indeed unique. for example, comparing the cases of m = 4 and m = 5 of class a (left sub-figure middle and bottom rows, respectively) with the corresponding unique set (right sub-figure top and bottom rows, respectively) reveals that the majority of the most abundant sequences are unique. second, homooligonucleotide repeats are among the most abundant sequences in all three classes. in addition, more sequences were identified in class b over the different subsets than in the other two classes. for example, table 2 lists the most abundant unique sequence of classes b and c in all the different subsets of hosts and viruses. as shown in the table, unique sequences were identified in all subsets in class b, as opposed to class c. the viral genome encodes different types of proteins that are necessary for the life cycle of viruses in their respective hosts. these, in general, include surface proteins that interact with the host receptors and enable attachment and entry to the host cell, structural proteins that serve as the building blocks of the virus, and replicating enzymes, such as rna and dna polymerase, that are required for the replication of the virus. in addition, many other proteins, some of which are uncharacterized, are diversely involved in different regulatory and accessory functions. here, our aim is to refine the analysis of under-represented sequences in viruses by analysing, separately, different protein groups. to that end, and similarly to 21, we classified all viral genes into five mutually exclusive functional groups (functional sets): surface, structural, enzymatic, unknown (unclassified genes), and other (hypothetical genes).
specifically, for each virus in the database, we divided its genome into the five gene sets defined above. each gene set contains all the virus genes of the same functional group. for example, the surface gene set of a virus contains all the genes that encode surface proteins in the virus's genome. a set might be empty for a particular virus if no genes of the corresponding functional group exist in that virus. see supplementary document, table s2 for a list of the total number of sets and genes of each functional group in the database. the analysis of under-represented sequences was then performed separately in each of the five gene sets for each of the viruses in the database (see more details in supplementary document, section 1.6). a complete list of all under-represented sequences in each viral functional group over all viruses in the database is available in supplementary table s8. we first analysed the average number of under-represented sequences identified in each gene set. to control for the difference in the average gene size and the number of genes in each set, we randomly selected 1,500, 1,240, 1,450, 3,300, and 2,210 genes from each of the surface, structural, enzymatic, unknown, and hypothetical functional groups, respectively. this means that the number of identified under-represented sequences is analysed over similar region sizes, and the differences between the different sets cannot be explained by the genes' nucleotide size in each set. figure 8a depicts the average number of under-represented sequences (over all three reading frames) identified in each of the gene sets over the (randomly selected) subset of genes. a relatively small number of under-represented sequences were identified in surface genes (that participate in the recognition of the host receptors), when compared with the other gene sets. at least twice as many were identified in many of the enzymatic genes.
these proteins interact closely with the host cell machinery, are essential for the viral replication cycle, and thus must use mechanisms that guarantee their function. figure 8b depicts the most abundant common under-represented sequences within each viral functional group. these differ between the different functional groups; however, homooligonucleotide sequences appear among the most abundant common under-represented sequences in all groups. we designed an attenuated zikv variant based on the under-representation analysis we performed. such variants may be useful in the future for generating a live-attenuated vaccine. specifically, we introduced synonymous mutations to the ns5 nucleotide sequence, which includes under-represented sequences, and named the new variant ur99 (see details in section 2). infection studies in vero cells demonstrated fractional variant attenuation of the ur99 virus, which correlated with our model predictions (see foci size in fig. 9a, right bottom). in addition, infectious virus collected and evaluated from the ur99 variant showed substantial attenuation relative to wt zikv (fig. 9a). there is evidence that ag129 mice lacking ifn-α/β and ifn-γ (types i and ii interferon) receptors can be valuable for evaluating the efficacy of new vaccines and anti-viral treatments for zikv. 49, 50 because these mice are immunocompromised, various strains of zikv cause lethal infection and disease, and will typically cause morbidity and mortality. depending on the strain, severe disease is observed between 1 and 2 weeks after virus challenge. 50, 51 thus, to further test the synthetic vaccine attenuation level in vivo, ag129 mice were challenged with attenuated zikv preparations as well as synthetic wt zikv. these inoculations were done in parallel with the original virus grown in cell culture. infection with the synthetically attenuated zikv strains was lethal in all inoculated ag129 mice.
however, the mortality curve of mice infected with ur99 was delayed when compared with that of wt malaysia and wt synthetic zikv (average of 20.4 days in ur99 vs. 15 and 17.5 in wt malaysia/synthetic zikv, respectively; see fig. 9b). no mortality was observed in unvaccinated controls or in mice vaccinated with vehicle (fig. 9b). weight loss was also observed in all the infected mice (30-40%; see fig. 9c). normal control mice experienced general weight gain throughout the experimental period (fig. 9c). weight loss corresponded well with mortality, and mice typically lost substantial weight, requiring humane euthanasia. neutab is the primary mediator of protection in vaccine studies in this model. 52, 53 therefore, serum samples were taken to determine the presence of neutab in infected mice. the neutab titre was evaluated in vaccinated mice 2 weeks after vaccination. mice vaccinated with synthetic wt or ur99 had significantly (p < 0.0001) elevated neutab titres when compared with vehicle controls (see fig. 9d). as expected, no neutab was detected in mice vaccinated with vehicle or in normal control groups (see fig. 9d). the virulence levels of ur99 were somewhat lower than the levels of the malaysian and synthetic wt strains, thus demonstrating that under-represented sequences can potentially be used in the design of live attenuated zikv strains. accordingly, additional attenuation of this variant (e.g. by introducing similar changes to other zikv proteins) may further decrease the lethality in mice infected by it. since ag129 mice are very susceptible to zikv infection, 49 this mouse model might be too stringent to test these live attenuated vaccine candidates; as human infection is generally sub-clinical after natural zikv infection, the attenuated strain might be effective in an immunocompetent model. we compared the average number of under-represented sequences identified in each pair of a virus and its corresponding host.
see supplementary document, section 2.10 for more details. we found that in 75% of the cases the average number was larger in the hosts. we believe that this is due to the fact that the viral genome is usually populated with many overlapping codes and genes, when compared with cellular organisms. [54] [55] [56] this introduces many constraints along the viral genome, which can decrease the number of under-represented sequences in the virus. for example, a sub-optimal codon within the host coding region may be synonymously replaced by evolution without affecting the host fitness. however, due to overlapping codes, replacing a sub-optimal codon within the viral coding region may affect multiple proteins and genes, and thus be deleterious to the virus. in this study, we analyse sequences three, four, and five nucleotides long that are under-represented in the coding regions of viruses of all types and in their corresponding host coding regions. this study is based on a novel statistical evaluation that controls for classical coding region features and is performed separately in each of the three reading frames. we provide various novel discoveries that may shed light on the evolution of viral dna sequences and on the virus co-evolution with its respective hosts. it is important to emphasize that the observed patterns may be related to various variables and their complex interactions, including gene expression optimizations, various mechanisms for escaping the host immune system, and co-evolution with the corresponding hosts. for example, it was reported that suppression of cg dinucleotides in hiv-1 is due to coevolution with its vertebrate host to avoid the host defence mechanisms.
18 in general, our analysis reveals that under-represented viral sequences are related to different mechanisms such as restriction-modification systems and possibly to alternative or unknown immune escape mechanisms, as these sequences cannot be explained by canonical mechanisms that may suggest, for example, classical viral recognition using antibodies. we show that homooligonucleotide repeats are the most abundant under-represented sequences in both viruses and hosts. a possible explanation for this avoidance is to reduce erroneous ribosomal frame shifts and thus reduce faulty translation and consequently the overall translation cost. although this motif is shared between hosts and viruses, our analysis also indicates that a stronger selection pressure against these sequences exists in viruses. this again can be attributed to escape mechanisms from the host immune system, as the virus nucleotide composition evolves to be similar to the host, and it is certainly possible that an excess avoidance of homooligonucleotide repeats reduces viral recognition by classical host immune mechanisms. there may be other relevant explanations such as interaction with small rna genes (e.g. mirnas). it is possible, for example, that these sequences may increase the efficiency of mirna and mrna interactions and thus decrease expression levels. this should be studied further. in addition to homooligonucleotide repeats, we show that palindromes are among the most abundant under-represented sequences in viruses. specifically, excluding homooligonucleotide repeats, our analysis reveals that 51% of all under-represented sequences four nucleotides long in viruses are palindromes (whereas only 6.25% of all possible sequences of that size are palindromes). indeed, analysis of palindrome avoidance in viruses was performed previously. it was shown that palindromes are the most under-represented short sequences in a prokaryotic genome.
[25] [26] [27] for example, it was reported that short palindromic sequences are avoided at a statistically significant level in the genomes of several bacteria. 25 palindromic sequences of four and six nucleotides that are avoided were reported for a few viruses and hosts in 57, and avoidance of palindromes in several dozen phage genomes was reported in 26. these analyses are based on statistical counts of certain sequences in the given dna and thus do not control for canonical coding region features (codon usage bias, amino acid order and content and dinucleotide distribution) as was done in this study. in addition, our analysis is performed over a larger set of viruses of all types and their corresponding hosts, and at a reading frame resolution. thus, we believe that the results reported here may be more accurate, and should provide a better understanding of this phenomenon. one plausible explanation for avoidance of palindromes in viruses is that they are targets for many restriction-modification systems and possibly for general recombination systems as well. we statistically show a high overlap between under-represented palindromes in viruses and restriction enzyme patterns. this overlap cannot be explained by classical coding region features. restriction of recognition sites has been observed in genomes of prokaryotic organisms. 26, [28] [29] [30] 46 the authors in 29 analysed the avoidance of restriction sites in a few bacteriophages, and concentrated on sites containing six nucleotides. rusinov et al. 46 studied most known recognition sites (both palindromic and asymmetric) in thousands of prokaryotic genomes and found factors that influence their avoidance. it was also shown that the recognition site avoidance correlates with the lifespan of restriction-and-modification systems. (table footnote: the numbers in parentheses indicate the frequency of occurrences in percentage; x indicates that no corresponding sequence was identified.) recently, the authors in 31
analysed avoidance of recognition sites of restriction-modification systems in the genomes of prokaryotic viruses and found it to be a widespread but not a universal anti-restriction strategy of these viruses. the method used by the authors is based on a compositional bias calculation, which is the ratio of the observed to the expected frequency of a sequence, where the expected frequency is estimated based on the observed frequencies of all sub-sites of a given sequence. the compositional bias measure was originally used in 32 for analysing over-and under-represented sequences in dna viruses. since the compositional bias measure doesn't account for a statistical background that preserves know evolutionary forces, we believe that a more accurate and comprehensive procedure of identifying underrepresented sequences is the one used here. in addition, we analyse the distribution of these underrepresented sequences among various viral and host groups. we show, for example, that dsdna viruses infecting bacteria or vertebrate hosts contain a larger set of under-represented sequences than other viral types and that this may be related to their larger genome size. furthermore, we show that on average the set of sequences that are under-represented in viruses but are not under-represented in their related hosts is the largest set among different host-virus underrepresented correspondence. we also show that the selection against under-represented sequences in viruses depends upon the protein function. for example, larger number of sequences is shown to be under-represented in enzyme genes than in surface genes. moreover, even larger number of sequences is found to be under-represented in genes with (currently) unknown functionality, prompting further investigation into the nature of these genes. the differences between these groups may also be related to the expression levels of the different proteins. 
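Two of the quantities discussed above can be illustrated with a short sketch. The first function checks the 6.25% baseline for four-nucleotide palindromes by enumeration, taking "palindrome" to mean a sequence equal to its own reverse complement. The second computes an observed/expected ratio in the spirit of the compositional bias measure. The expected count used here is a simplifying assumption (a maximal-order Markov estimate from the two longest sub-words and their overlap), not necessarily the estimator used in 31 or 32, and all function names are hypothetical.

```python
from collections import Counter
from itertools import product

COMPLEMENT = str.maketrans("acgt", "tgca")


def is_palindrome(seq: str) -> bool:
    """True if seq equals its own reverse complement (e.g. 'gatc')."""
    return seq == seq.translate(COMPLEMENT)[::-1]


def palindrome_fraction(k: int) -> float:
    """Fraction of all 4**k possible k-mers that are palindromes."""
    words = ["".join(p) for p in product("acgt", repeat=k)]
    return sum(is_palindrome(w) for w in words) / len(words)


def count_words(genome: str, k: int) -> Counter:
    """Counts of all overlapping k-mers in the genome."""
    return Counter(genome[i:i + k] for i in range(len(genome) - k + 1))


def compositional_bias(genome: str, word: str) -> float:
    """Observed / expected count of `word` (length >= 3).

    Assumption: expected = N(prefix) * N(suffix) / N(core), where prefix and
    suffix are the word minus its last / first base and core is the overlap.
    """
    k = len(word)
    if k < 3:
        raise ValueError("word must have at least three bases")
    observed = count_words(genome, k)[word]
    prefix = count_words(genome, k - 1)[word[:-1]]
    suffix = count_words(genome, k - 1)[word[1:]]
    core = count_words(genome, k - 2)[word[1:-1]]
    if prefix == 0 or suffix == 0 or core == 0:
        return float("nan")  # expected count undefined or zero
    return observed / (prefix * suffix / core)


if __name__ == "__main__":
    print(palindrome_fraction(4))                      # 16 of 256 four-mers
    print(compositional_bias("gatcgatcgatc", "gatc"))  # toy sequence only
```

A ratio well below 1 would flag a word as under-represented under this simple background. The study above argues for a background that also preserves codon usage, amino acid order and dinucleotide distribution, which this sketch does not attempt.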
if, for example, surface genes tend to have low expression levels then they may be under weaker selection for features such as under-represented sequences. vaccines are a topic of a singular importance in present day biomedical science. however, the discovery of vaccines has so far been primarily empirical in nature requiring considerable investments of time, efforts, and resources. 58 to overcome the numerous pitfalls attributed to the classical vaccine design strategies, more efficient and robust rational approaches are highly desirable. one direction in designing in silico vaccine candidates may be based on exploiting the synonymous information, encoded in the viral genomes and related to gene expression, for attenuating the viral replication cycle while retaining its genotype and structure. the analysis and results reported here may have important implications in vaccine synthesis. specifically, the outcomes of this study may provide clues and guidance into practical design of efficient and safe viral vaccines via attenuated viral material. furthermore, it may also prove to be beneficial for other biotechnological objectives related to viral based products such as developing oncolytic viruses and engineering phages to fight bacteria. [59] [60] [61] [62] [63] [64] indeed, we demonstrate, both in vitro and in vivo, how under-represented sequences can be utilized to obtain an attenuated zikv. the aim of these experiments is an initial proof of concept. of course, additional experiments with more variants and controls are needed to better understand the effect of these under-represented sequences on the viral growth rate and fitness. for example, it will be helpful to study additional mutants that do not possess underrepresented sequences but include other types of mutation. however, it is important to emphasize an interesting and a non-obvious aspect of these experiments. the introduced mutations are silent and thus did not alter the encoded protein. 
based on our experience, in many cases silent mutations may not affect the viral fitness, and furthermore, there are cases where they may even improve its growth rate. also, it is important to emphasize that in these experiments both the wild-type and the mutant variants were generated by the same process and from the same infectious-clone plasmid. finally, the randomization models used in this study may not completely preserve the viral rna secondary structure, and thus the selection for under-represented sequences may be partially due to alterations in secondary structures.
fields virology
tinkering with translation: protein synthesis in virus-infected cells
the role played by viruses in the evolution of their hosts: a view based on informational protein phylogenies
horizontal gene transfer in prokaryotes: quantification and classification
evolution of complexity in the viral world: the dawn of a new vision
giant viruses, giant chimeras: the multiple evolutionary histories of mimivirus genes
rates of hospitalizations for respiratory syncytial virus, human metapneumovirus, and influenza virus in older adults
viral infectious disease and natural products with antiviral activity
negative-strand rna viruses: applications to biotechnology
viruses and their uses in nanotechnology
zika virus outbreak
rapid spread of emerging zika virus in the pacific area
the evolutionary genetics of viral emergence
viral adaptation to host: a proteome-based analysis of codon usage and amino acid preferences
patterns of evolution and host gene mimicry in influenza and other rna viruses
cg dinucleotide suppression enables antiviral defence targeting non-self rna
virus-host coevolution: common patterns of nucleotide motif usage in flaviviridae and their hosts
evidence of a direct evolutionary selection for strong folding and mutational robustness within hiv coding regions
universal evolutionary selection for high dimensional silent patterns of information hidden in the redundancy of viral genetic code
exceptional motifs in different markov chain models for a statistical analysis of dna sequences
comparison of methods of detection of exceptional sequences in prokaryotic genomes
on avoided words, absent words, and their application to biological sequence analysis, algor
avoidance of palindromic words in bacterial and archaeal genomes: a close connection with restriction enzymes
evolutionary role of restriction/modification systems as revealed by comparative genome analysis
computational dna sequence analysis
restriction-modification systems interplay causes avoidance of gatc site in prokaryotic genomes
molecular evolution of bacteriophages: evidence of selection against the recognition sites of host restriction enzymes
the significance of distance and orientation of restriction endonuclease recognition sites in viral dna genomes
avoidance of recognition sites of restriction-modification systems is a widespread but not universal anti-restriction strategy of prokaryotic viruses
over- and under-representation of short oligonucleotides in dna sequences
forbidden penta-peptides
linking virus genomes with host taxonomy
infectious clone plasmid of a thai-strain zika virus and its fluorescent reporter system for high-throughput assay and vaccine development
a simplified positive-sense-rna virus construction approach that enhances analysis throughput
multi-color fluorescent reporter dengue viruses with improved stability for analysis of a multi-virus infection
ribosomal frameshifting and transcriptional slippage: from genetic steganography and cryptography to adventitious use
translational accuracy and the fitness of bacteria
an integrated approach reveals regulatory controls on bacterial translation elongation
gradients in nucleotide and codon usage along escherichia coli genes
selection for reduced translation costs at the intronic 5' end in fungi
multiple roles of the coding sequence 5' end in gene expression regulation
non-ltr retrotransposons encoding a restriction enzyme-like endonuclease in vertebrates
predicting rad-seq marker numbers across the eukaryotic tree of life
lifespan of restriction-modification systems critically affects avoidance of their recognition sites in host genomes
one recognition sequence, seven restriction enzymes, five reaction mechanisms
rebase-a database for dna restriction and modification: enzymes, genes and genomes
protection from secondary dengue virus infection in a mouse model reveals the role of serotype cross-reactive b and t cells
characterization of lethal zika virus infection in ag129 mice
characterization of a novel murine model to study zika virus
protective efficacy of zika vaccine in ag129 mouse model
a zika vaccine targeting ns1 protein protects immunocompetent adult mice in a lethal challenge model
pacing a small cage: mutation and rna viruses
evolution of viral proteins originated de novo by overprinting
hidden silent codes in viral genomes
statistical analyses of counts and distributions of restriction sites in dna sequences
rna virus attenuation by codon pair deoptimisation is an artefact of increases in cpg/upa dinucleotide frequencies
live attenuated influenza virus vaccines by computer-aided rational design
changes in codon-pair bias of human immunodeficiency virus type 1 have profound effects on virus replication in cell culture
bacteriophages and their implications on future biotechnology: a review
bacteriophages and biotechnology: vaccines, gene therapy and antibacterials
taking aim on bacterial pathogens: from phage therapy to enzybiotics
experimental molecular evolution of bacteriophage t7
we are grateful to the anonymous referees for comments that greatly helped in improving this paper. the work of y.z. was supported by the israeli ministry of science, technology and space and by the edmond j. safra center for bioinformatics at tel-aviv university. the animal research ethics committee at utah state university approved this research.
supplementary data are available at dnares online. key: cord-343863-q1y8uscj authors: whitney, joe; esteban, david j; upton, chris title: recent hits acquired by blast (rehab): a tool to identify new hits in sequence similarity searches date: 2005-02-08 journal: bmc bioinformatics doi: 10.1186/1471-2105-6-23 sha: doc_id: 343863 cord_uid: q1y8uscj background: sequence similarity searching is a powerful tool to help develop hypotheses in the quest to assign functional, structural and evolutionary information to dna and protein sequences. as sequence databases continue to grow exponentially, it becomes increasingly important to repeat searches at frequent intervals, and similarity searches retrieve larger and larger sets of results. new and potentially significant results may be buried in a long list of previously obtained sequence hits from past searches. results: rehab (recent hits acquired from blast) is a tool for finding new protein hits in repeated psi-blast searches. rehab compares results from psi-blast searches performed with two versions of a protein sequence database and highlights hits that are present only in the updated database. results are presented in an easily comprehended table, or in a blast-like report, using colors to highlight the new hits. rehab is designed to handle large numbers of query sequences, such as whole genomes or sets of genomes. advanced computer skills are not needed to use rehab; the graphics interface is simple to use and was designed with the bench biologist in mind. conclusions: this software greatly simplifies the problem of evaluating the output of large numbers of protein database searches. advances in technology have increased the speed and reduced the cost of dna sequencing. this has resulted in a dramatic increase in the number of sequences contributed by both large sequencing centres and individual laboratories to sequence databases. 
public biological sequence databases are growing at an ever-increasing rate, with 9 million new sequences being added to genbank from august 2002 to august 2003 alone [1] . currently, the genbank database has almost 42 billion nucleotides from over 32 million sequences. the number of whole genome sequences of eukaryotes, prokaryotes and viruses is also increasing rapidly. accordingly, tools like ncbi blast, which search those databases for sequences similar to a given query sequence, return larger and larger sets of results. sequence similarity searching is a powerful tool to help develop testable hypotheses in the quest to characterize genes and other dna sequences and infer structural, functional or evolutionary relationships. researchers interested in identifying new matches to query sequences, which may be a few genes or even whole genomes, must search through massive amounts of alignment data to retrieve new and interesting matches. in order to keep up with the growing databases, the researcher must submit the same queries periodically. however, the new results, no matter how significant, are often buried in a long list of results that were previously obtained on past searches. rehab (recent hits acquired from blast) is a new software package that was developed to address these problems. rehab performs psi-blast [2] searches of a protein sequence database and keeps a database of all significant alignments ("hits") obtained; these searches are performed on a regular schedule against updated versions of the sequence database. it then compares the sequences in the new psi-blast result with the rehab hits database to identify new matches resulting from recently deposited sequences. the complete rehab hits database can then be queried by date using a simple gui to allow the researcher to easily identify new hits; these are highlighted, and pairwise or multiple alignments can be performed to assess the quality of the match. 
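The core idea described above, keeping every hit ever seen and flagging only those that first appear in a later search, can be sketched as a keyed lookup. This is a minimal illustration rather than rehab's actual implementation (which the text says uses java and mysql); the `Hit` class, the in-memory dictionary and the function names are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Hit:
    query_id: str
    subject_id: str
    bit_score: float


def record_search(hits_db: dict, results: list, search_date: date) -> list:
    """Store unseen (query, subject) pairs and return only the new hits."""
    new_hits = []
    for hit in results:
        key = (hit.query_id, hit.subject_id)
        if key not in hits_db:
            # remember when this pair was first observed
            hits_db[key] = (hit, search_date)
            new_hits.append(hit)
    return new_hits


def hits_since(hits_db: dict, cutoff: date) -> list:
    """All stored hits first observed after the cutoff date."""
    return [hit for hit, first_seen in hits_db.values() if first_seen > cutoff]


if __name__ == "__main__":
    db = {}
    record_search(db, [Hit("q1", "s1", 50.0)], date(2004, 1, 1))
    later = [Hit("q1", "s1", 50.0), Hit("q1", "s2", 80.0)]
    print(record_search(db, later, date(2004, 6, 1)))  # only the q1/s2 pair
    print(hits_since(db, date(2004, 3, 1)))
```

Keying on the query/subject pair rather than on the subject alone lets the same database sequence count as new for one query while being already known for another.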
as well as filtering out results that have been found previously, the rehab browser can filter out hits against sequences that are identical to the sequences being submitted as queries (such as orthologs of the query sequence). rehab is designed to be a practical tool for searching ncbi database updates with large numbers of query sequences. for example, our laboratory uses it with all open reading frames (orfs) from fully sequenced poxvirus genomes (over 7000 query sequences). as the number of sequenced virus genomes continues to increase, the number of hypothetical orfs of unknown function also expands. this is particularly true for large viruses like poxviruses, baculoviruses, and herpesviruses that possess many virulence genes that are not part of the core set of genes that define a virus family [3] . there are also numerous core genes for which no known function has yet been identified; for example, of 49 completely conserved protein families in poxviruses, there are 11 with completely unknown function and at least 5 others with only poorly defined function. other programs have been previously created to deal with this particular issue, including dbwatcher [4] , seals [5] , swiss-shop [6] , sequence alerting system [7] and blast search updater [8] . however, www-based programs are not well suited to searching with large numbers of query sequences, and there may be concerns with a shut-down of service (as occurred with sequence alerting system) or allowing proprietary data out of a secure network. other programs may be complicated to use, or require users to directly interact with unix operating systems. rehab is specifically designed for searching with large numbers of query sequences and can support a number of research groups; it also provides a user-friendly graphical interface. the client will run on most major operating systems including mac os x, windows, linux and solaris. 
rehab was implemented for the java platform to simplify the support of multiple operating systems including linux, microsoft windows, solaris, and mac os x. users initially access and launch the application (client) from a web page using java web start, which also automatically downloads updated versions as they become available. this ensures users are taking advantage of improvements or added features in the latest software version. furthermore, coding in java allows interoperability with existing applications developed in our laboratory, including base-by-base [9] . our choice was reinforced by past successes with the java platform and java web start for implementation and distribution of programs such as vocs [10] , vgo [11] , and base-by-base [9] . rehab consists of four main components ( figure 1 ): (1) a mysql relational database that stores information about hits, including biological sequences, alignments between them, and other categorization and annotation data; (2) a java server that provides access to programs which cannot be run locally by the client on arbitrary user workstations, such as ncbi blast and emboss [12] utilities; (3) a java swing graphical client, downloaded and launched on client machines using java web start; (4) and a back-end java program which runs alignment programs and compiles results in the database. each of these components is described in more detail below. although all components can be run on a single machine, it is envisioned that a single server will support a variety of users dispersed on an intranet or the internet; if required, it is simple to offload the batch database searching to a more powerful cluster or grid system. 
there are four types of information stored in the rehab database: (1) biological sequences and their annotations, both those used as queries in blast searches and those which have been returned as hit subjects; (2) information on each query/subject pair (hit), gathered from individual search results and alignment programs (including bitscore, date entered, and percent identity); (3) organizing information, such as which query sequences belong to which organisms; and (4) other caching information, used to speed performance of server-side program functions. to reduce the amount of required storage space, actual alignments are not stored, but are regenerated for presentation when the user selects the specific query sequence or query-target pair to be viewed. query sequences, which are entered using a simple fasta-like format that includes additional annotation information in the identifying line, need only be submitted once to rehab since they are stored for future search cycles. the work of running psi-blast searches is done in batch mode by the ncbi blastpgp program against a local copy of the ncbi non-redundant (nr) protein database. psi-blast is performed for three iterations without filtering procedures (such as for low-complexity regions). hits with an e-value less than 0.001 are used to generate the scoring matrix for the subsequent cycle. to increase speed, searches with query sequences that result in no new hits are terminated after the first cycle. those with new hits scoring below the threshold continue to the third cycle. psi-blast was chosen because it is a more sensitive search method than blastp. the searches do not need to be performed on the same machine on which the database or server components are installed. xml output from blastpgp is parsed and relevant information about each hit is stored in the rehab database. 
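As a rough illustration of the parsing step, the sketch below reads hits from a blast xml report and derives the per-hit values the text mentions (bit-score and percent identity), then applies the 0.001 e-value cut-off. The element names follow the ncbi blast xml output format; the sample report, the identifiers in it and the function names are made up for the example, and using only the first hsp of each hit is an assumption.

```python
import xml.etree.ElementTree as ET

E_VALUE_CUTOFF = 0.001  # inclusion threshold quoted in the text

# Tiny made-up report; real blastpgp output carries many more fields.
SAMPLE = """<BlastOutput><BlastOutput_iterations><Iteration><Iteration_hits>
<Hit><Hit_id>example_1</Hit_id><Hit_hsps><Hsp>
<Hsp_bit-score>120.5</Hsp_bit-score><Hsp_evalue>1e-30</Hsp_evalue>
<Hsp_identity>45</Hsp_identity><Hsp_align-len>90</Hsp_align-len>
</Hsp></Hit_hsps></Hit>
<Hit><Hit_id>example_2</Hit_id><Hit_hsps><Hsp>
<Hsp_bit-score>28.1</Hsp_bit-score><Hsp_evalue>0.5</Hsp_evalue>
<Hsp_identity>12</Hsp_identity><Hsp_align-len>40</Hsp_align-len>
</Hsp></Hit_hsps></Hit>
</Iteration_hits></Iteration></BlastOutput_iterations></BlastOutput>"""


def parse_hits(xml_text: str) -> list:
    """Return one record per <Hit>, taken from its first HSP."""
    root = ET.fromstring(xml_text)
    hits = []
    for hit in root.iter("Hit"):
        hsp = hit.find("Hit_hsps/Hsp")  # assumed to be the top-scoring HSP
        if hsp is None:
            continue
        identity = int(hsp.findtext("Hsp_identity"))
        align_len = int(hsp.findtext("Hsp_align-len"))
        hits.append({
            "subject_id": hit.findtext("Hit_id"),
            "bit_score": float(hsp.findtext("Hsp_bit-score")),
            "e_value": float(hsp.findtext("Hsp_evalue")),
            "percent_identity": 100.0 * identity / align_len,
        })
    return hits


def significant(hits: list) -> list:
    """Hits whose e-value is below the cut-off."""
    return [h for h in hits if h["e_value"] < E_VALUE_CUTOFF]


if __name__ == "__main__":
    for h in significant(parse_hits(SAMPLE)):
        print(h)
```

A multi-iteration psi-blast report contains one `Iteration` element per round, so a fuller version would probably keep the iteration number alongside each record.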
in addition to scores and identifying information, target sequences are copied into the rehab database to ensure that they are available for analysis in the future, even if they are no longer available from ncbi. this is important because, although ncbi does not actually remove sequences from its database, it may change the identifier of a sequence if it is corrected, updated, or merged with another identical entry. any changes to a pre-existing entry are added to the database, but it is not registered as a new hit. the server component consists of java rmi classes that provide remote access to local facilities, and a loading program that registers those classes with an rmi registry installed on the server from which the client will be downloaded. requirement for the server is a system that can support java 1.4.1 and mysql 4.0. the java swing client component allows a user to browse the information collected in the database by the back-end program. when the client is downloaded and launched from a website, it connects to the server and database specified in its configuration file. the client program visually presents summary information about hits added to the database, and allows the user to quickly locate new, relevant hits and the sequences involved. there are five main views available in the client: rehab is a tool that works with blast to identify new hits in updated versions of sequence databases. it allows the researcher to ask the question: "what new sequences match my sequences since the last time i searched?" in the example of our work, the query sequences are all the orfs of the fully sequenced poxvirus genomes (36 genomes, 7075 query sequences). these sequences are used to query the nr ncbi protein database and a mysql database of all hits is generated and stored ( figure 2 ). 
databases of hits for other virus families maintained by the virus bioinformatics resource (tvbr; herpesviruses, baculoviruses, coronaviruses, and adenoviruses) will also be available in the near future. the hits database can then be accessed by double-clicking on the database name or by selecting "browse by organism" in the action menu. this opens a new window to present browsing, sorting, and highlighting options (figure 3). the user can browse by organism name, such as "variola virus strain bangladesh-1975". to highlight recent hits, the "date option" is chosen to define the date after which the hits are considered new. the available dates are those on which the query sequences were searched against a then current nr ncbi database. the output can be sorted based on three criteria (name, new hit date, or maximum new hit bit-score) by selecting the appropriate radio button. the results are presented in a new window, using colors to indicate new hits (figure 4).
figure 2. rehab management console. a database is selected from the list on the left, and statistics are displayed on the right. more hits than target sequences are displayed because query sequences can match multiple targets. double-clicking on the database or selecting an option from the action menu allows users to browse the selected database.
since not all new hits are necessarily significant, results are highlighted in different colors depending on the bit-score. the user can change the default threshold of the minimum bit-score, to show new hits scoring above this cut-off in red and new hits scoring below it in yellow. since all query sequences that have new hits are highlighted, any that remain unhighlighted do not have new hits. however, unhighlighted queries may have significant hits from previous searches. the "latest hit" column indicates this fact: query sequences showing hits only from previous searches show an older hit date, and a bit-score of "0" in the "new hit score" column.
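The highlighting and sorting rules described above can be summarised in a few lines. This is a sketch under stated assumptions, not the rehab client code: the `QueryRow` fields and function names are hypothetical, a score exactly at the threshold is treated as red (the threshold being the minimum bit-score highlighted in red), and the descending order for dates and scores is a guess at a sensible default.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class QueryRow:
    name: str
    latest_hit: Optional[date]  # None when the query has no hits at all
    new_hit_score: float        # 0 when no hit is newer than the chosen date


def highlight_colour(row: QueryRow, threshold: float) -> Optional[str]:
    """'red' for new hits at or above the threshold, 'yellow' below, else None."""
    if row.new_hit_score <= 0:
        return None  # no new hits: row stays unhighlighted
    return "red" if row.new_hit_score >= threshold else "yellow"


def sort_rows(rows: list, criterion: str) -> list:
    """Sort by one of the three criteria named in the text."""
    if criterion == "name":
        return sorted(rows, key=lambda r: r.name)
    if criterion == "new_hit_date":
        return sorted(rows, key=lambda r: r.latest_hit or date.min, reverse=True)
    if criterion == "new_hit_score":
        return sorted(rows, key=lambda r: r.new_hit_score, reverse=True)
    raise ValueError(f"unknown criterion: {criterion}")


if __name__ == "__main__":
    rows = [
        QueryRow("orf-b", date(2004, 6, 1), 35.0),
        QueryRow("orf-a", date(2003, 1, 1), 0.0),
        QueryRow("orf-c", date(2004, 6, 1), 210.0),
    ]
    for row in sort_rows(rows, "new_hit_score"):
        print(row.name, highlight_colour(row, threshold=100.0))
```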
unhighlighted sequences with no information in the "latest hit" column do not have any hits in the database or they have been filtered out (see below).
figure 3. hits browser window. the rehab database is searched by selecting an organism name, then choosing the desired highlighting and filtering options. clicking on "show summary" opens a new window to display the results.
information about the hits can be viewed in two ways. selecting "html report" launches the user's default web browser and displays the "hit-list" in familiar blast-style (figure 5a). hits are displayed in descending order of bit-score; however, a key feature of this program is that new hits are highlighted in red or yellow. the pairwise alignment can be displayed rapidly by clicking on the score (a hyperlink). in contrast to the usual blast output, which presents the local alignment found by blast, a global alignment produced by needle [12, 13] is shown. more information can be obtained about the target sequence by clicking on the link to the ncbi file for that entry. alternatively, a full list of hits can be viewed in the "hits manager" window (figure 5b). here, sequences can be sorted by highlight, and pairwise or multiple alignments can be performed. pairwise alignments are displayed by selecting a single target sequence and clicking the "global" button for global alignments produced by the emboss program needle or the "local" button for local alignments generated by the emboss program water [12, 14]. multiple alignments are generated by selecting more than one target sequence and clicking the "base-by-base" button; the software automatically retrieves the appropriate sequences from the rehab hits database, performs a clustalw alignment and passes the resulting multiple alignment to base-by-base, which functions as an alignment viewer and editor [9]. finally, sequences in the hits manager can be viewed in fasta format by clicking the "show" button and can be copied from this window using standard keystrokes.
figure 4. query sequences with new hits are highlighted. a user defined threshold (in the browser window) is used to define the minimum bit-score that is highlighted in red, and all new hits with lower scores are highlighted in yellow. the latest hit column indicates the date of the most recent hit. those with no entry in this column have no hits in the database (for example, varv-bsh-a33.5l). sorting of the entries can be changed by clicking on the column heading. details about the hits can be obtained by right-clicking on the entry or selecting an option in the action menu.
figure 5. analysis of hits. hits can be viewed in a) html output, showing all hits listed in order of descending score, followed by a pairwise needle alignment of the query and target sequence. the info hyperlink links to the ncbi entry for the target sequence, and the score hyperlink takes the user to the needle alignment. b) the hits manager window, which allows the user to sort hits and view pairwise or multiple alignments, or view selected sequences in fasta format. a global alignment is shown between the query sequence and the top scoring new hit.
figure 6. rehab set up for other users. a) different laboratories in a department could have different query databases, which can be accessed as described in the text. b) the sequences within a lab's database could be annotated with individual lab member's names, or other identifying information, permitting individuals to view results for their own sequences of interest. in this way, large numbers of sequences of interest to a lab can be run simultaneously and frequently, and individuals can then browse results.
unless a sequence has not been deposited in the public database, a sequence similarity search will return results including the query sequence itself, as well as nearly identical sequences that are orthologs of the query. rehab can block the highlighting of hit sequences that are also present in the query database when the "don't show my own sequences" option is selected; such sequences will not be shown or highlighted in the hits results window. however, these sequences and their alignments with the query can still be visualized in the html report and hits manager windows, thus allowing the user to access all the available information. this feature becomes essential when new poxvirus genomes are added to the public database, since a large fraction of the queries will hit proteins in the new genome and signal a "new hit" report when there may be no other new hits in the database. although these are clearly high scoring matches, they are expected and therefore must be masked in the analysis if the full value of rehab is to be realized. in the browser window (figure 3), the user can choose to browse by the annotation included in each sequence's information line. in the case of our poxvirus sequences, useful annotations are organism name and protein family (as determined in pocs [10]). selecting an item from the "group by annotation" list loads the new category in the list on the left side of the window. this sorting allows the user to quickly find query sequences of particular interest. for example, one may be interested in looking at only sequences from the ankyrin family. results can then be viewed and analyzed as described above. researchers can use rehab to search databases with their own set of query sequences. in the example of our research, it is most practical to organize the query sequences by organism and protein family. other researchers, however, may find other naming schemes to be more useful; no changes to the program or database are required.
for example, a research group could organize query sequences and the hits results databases by laboratory name, and browsing of results could be by the researcher's name ( figure 6 ). individual laboratory members would add query sequences to the database including their name in the identifying information line. in this example, the laboratory name would replace virus family, and user names would replace organism names. all query sequences would be searched in the same batch process, and each individual could then browse their sequences of interest. users interested in establishing their own rehab database should contact the authors for assistance. the goal of this project was to build a software package to aid in the identification of new results returned from sequence similarity searches. to this end, we developed rehab, a tool that highlights new hits by comparing results from previously run searches to those with a recently updated database. rehab allows researchers to query the nr protein database with large numbers of sequences and can highlight, sort, and analyze results in a user-friendly graphical interface. it can also be used to rapidly create multiple alignments with any set of sequences returned by a blast search. this enables researchers to recognize new significant sequence matches in the mass of results generated by high throughput database search protocols. 
genbank: update
lipman dj: gapped blast and psi-blast: a new generation of protein database search programs
poxvirus orthologous clusters: toward defining the minimum essential poxvirus genome
dbwatcher
seals
sequence alerting system
blast search updater: a notification system for new database matches
base-by-base: single nucleotide-level analysis of whole viral genome alignments
poxvirus orthologous clusters (pocs)
viral genome organizer: a system for analyzing complete viral genomes
emboss: the european molecular biology software suite
a general method applicable to the search for similarities in the amino acid sequences of two proteins
identification of common molecular subsequences
this work was funded by niaid/darpa grant u01 ai48653-02 and canadian nserc strategic grant stpgp 269665-03. we would like to thank angelika ehlers for systems administration and dr. rachel roper for helpful insights.
cu described and specified the features of and problems to be solved by rehab, tested the program and provided usage examples. jw implemented the software, both the java components and the database schemata used to store alignment results. dje tested the program and provided usage examples. all authors contributed to writing of the manuscript.
key: cord-344782-ond1ziu5 authors: zhang, jing; finlaison, deborah s.; frost, melinda j.; gestier, sarah; gu, xingnian; hall, jane; jenkins, cheryl; parrish, kate; read, andrew j.; srivastava, mukesh; rose, karrie; kirkland, peter d. title: identification of a novel nidovirus as a potential cause of large scale mortalities in the endangered bellinger river snapping turtle (myuchelys georgesi) date: 2018-10-24 journal: plos one doi: 10.1371/journal.pone.0205209 sha: doc_id: 344782 cord_uid: ond1ziu5 in mid-february 2015, a large number of deaths were observed in the sole extant population of an endangered species of freshwater snapping turtle, myuchelys georgesi, in a coastal river in new south wales, australia.
mortalities continued for approximately 7 weeks and affected mostly adult animals. more than 400 dead or dying animals were observed and population surveys conducted after the outbreak had ceased indicated that only a very small proportion of the population had survived, severely threatening the viability of the wild population. at necropsy, animals were in poor body condition, had bilateral swollen eyelids and some animals had tan foci on the skin of the ventral thighs. histological examination revealed peri-orbital, splenic and nephric inflammation and necrosis. a virus was isolated in cell culture from a range of tissues. nucleic acid sequencing of the virus isolate has identified the entire genome and indicates that this is a novel nidovirus that has a low level of nucleotide similarity to recognised nidoviruses. its closest relatives are nidoviruses that have recently been described in pythons and lizards, usually in association with respiratory disease. in contrast, in the affected turtles, the most significant pathological changes were in the kidneys. real time pcr assays developed to detect this virus demonstrated very high virus loads in affected tissues. in situ hybridisation studies confirmed the presence of viral nucleic acid in tissues in association with pathological changes. collectively these data suggest that this virus is the likely cause of the mortalities that now threaten the survival of this species. bellinger river virus is the name proposed for this new virus. the bellinger river snapping turtle, myuchelys georgesi, is a species of freshwater turtle that, prior to this outbreak, was rare and has a very restricted habitat. it is confined solely to a 60 kilometre section of the bellinger river, and a short section of the adjacent kalang river in northern coastal new south wales (nsw), australia.
while the turtle had been described as "locally abundant" it was also described as "meets the criteria for being listed as a vulnerable species under the threatened species conservation act of nsw" [1] . in 2014 it was estimated that there were approximately 2500 of this species in the wild [2] . commencing in mid-february 2015, a number of people using and managing the environment surrounding the bellinger river observed a large number of deaths in m. georgesi. a multi-agency investigation was undertaken to establish the cause and extent of the outbreak. mortalities continued for 7 weeks and involved mostly adult animals. more than 400 dead or dying animals were observed and population surveys conducted after the outbreak had ceased indicated that only a small proportion of the total population had survived with very few adults remaining. no other species, including the sympatric murray river turtle (emydura macquarii), appeared to be affected. full details of the prevailing environmental conditions and the extent of the investigation have been described by others [2] . a number of moribund and dead m. georgesi were collected for post mortem examination at the australian registry of wildlife health, taronga conservation society australia mosman, nsw. tissue samples were referred to a number of different laboratories to test for a wide range of potential pathogens, toxins and water quality assessment. no consistent results were obtained from bacterial cultures and tests for mycoplasma, trichomonas, chlamydia and toxins gave negative results. on the basis of the histological changes and a failure to identify any other causative agent, a viral aetiology was suspected. testing was undertaken at several other laboratories and ranaviruses, adenoviruses, paramyxoviruses (ferlavirus) and herpesviruses were excluded [2] . samples were also referred to the elizabeth macarthur agriculture institute, menangle nsw. 
here we report the isolation and characterisation of a novel nidovirus and provide evidence for its involvement as the principal pathogen in this disease outbreak. the nsw office of the environment and heritage (oeh) was consulted during the early stages of the investigation and confirmed that, as this was a diagnostic investigation, no animal care and ethics (aec) or other approvals were necessary. when sample collection was undertaken as part of the follow-up epidemiology investigations, all procedures were approved by the aec of the taronga conservation society of australia (aec approval number 3e/10/ 15). twenty one turtles that were either dead or moribund and then euthanased were submitted to the australian registry of wildlife health, taronga conservation society of australia where post mortem examinations were conducted. various samples from each of the 21 m. georgesi were submitted to the virology laboratory at the elizabeth macarthur agriculture institute. swabs had been collected from the buccal cavity (n = 5), conjunctiva (n = 5) and cloaca (n = 5) and placed in 3ml of phosphate buffered gelatin saline (pbgs, ph 7.3) and held at 4˚c. samples of brain (n = 4), conjunctiva (n = 5), kidney (n = 10), liver (n = 10), lung (n = 9), myocardium (n = 5), ovary (n = 2), spleen (n = 10), urinary bladder (n = 5) and serum or plasma (n = 11) (either fresh or after holding frozen at -80˚c) were sent on frozen blocks for 'same day' delivery to the elizabeth macarthur agriculture institute. samples of frozen kidney (n = 5) or liver (n = 3) were also submitted from an archival collection of m. georgesi tissues that had been held at the australian museum, sydney since collection in 1991. suspensions (approximately 20%, w/v) of fresh tissues from affected animals were prepared by homogenising in serum free minimal essential medium (mem), containing penicillin (230μg/ml), streptomycin (250μg/ml) and amphotericin-b (5μg/ml), using a bead beater. 
the supernatant was collected after centrifugation at approximately 3000g for 20 min at 4˚c and filtered using a 0.45μm syringe filter. a wide range of tissues was also collected and fixed for 48 hours in 10% neutral buffered formalin solution. sections of formalin fixed tissue were cut and stained with haematoxylin and eosin, giemsa and periodic acid-schiff stains using standard methods and examined by light microscopy. sub-confluent monolayer cultures of buffalo african green monkey kidney (bgm) cells [3] initially grown at 37˚c in 10ml cell culture tubes containing mem supplemented with 10% (v/v) foetal bovine serum (fbs), penicillin (115μ/ml), streptomycin (125μg/ml) and amphotericin-b (2μg/ml) were used for the first virus isolation attempts. immediately before use the culture medium was replaced with fresh maintenance medium (mem, 2% fbs and antibiotics) and 200μl of filtered supernatant from tissue homogenate was added to the medium. based on experience with other aquatic pathogens, all cultures were maintained at 25˚c. cells were observed frequently and passaged after 7 days by scraping the cells off the tube surface and adding 200μl of suspension to new sub-confluent monolayers. after changes were observed in the morphology of bgm cell monolayers, tissue culture fluids were also passaged onto sub-confluent monolayers of other mammalian, avian, fish, reptile or mosquito cell cultures. full details of the cell cultures used and the relevant culture conditions are summarised in table 1 . cell culture supernatants were examined by electron microscopy by placing 10μl of sample on parlodion/carbon coated 400 mesh copper grids. after washing briefly with de-ionised water, negative staining was achieved by the addition of 2% aqueous uranyl acetate. the stained specimens were examined in a philips 208 transmission electron microscope. 
culture supernatant from bgm cell cultures showing advanced cytopathology was clarified by centrifugation at 3,000g for 30 min and passed through a 0.45μm filter. the virus was then pelleted at 25,000g for 2 hours at 4˚c and resuspended in 500μl of nuclease free water. host cell nucleic acids were removed by incubating with dnase i (250u; stratagene) and an rnase cocktail (0.5u rnase a and 20u rnase t1; ambion) at 37˚c for 2 hours. total ribonucleic acids were then extracted and eluted in 20μl of rnase free water using an rneasy minikit (qiagen). a dna library was prepared using a truseq stranded mrna sample prep kit, omitting the poly (a) mrna purification step, and sequencing with 150bp paired end reads was completed on an illumina miseq platform at the australian genome research facility (brisbane, australia). raw ngs data has been submitted to the sequence read archive (sra) (accession srp158959) [6] . trimmomatic software [7] was used to identify and remove low quality bases and adapter sequences using a minimum quality score of 20. scaffolds (nodes) were then assembled in velvet optimiser [8] and run in the megablast 2.2.26 software [9] to identify homology with sequences in the genbank database [10] . some scaffolds were identified as having homology with ball python nidovirus sequences. to remove extraneous sequence data, only the trimmed sequences that aligned to the ball python nidovirus genome (kj541759), using burrows-wheeler aligner [10] , were assembled into scaffolds using velvet optimiser [8] . the scaffolds were then aligned in sequencher [11] to form a single contig. additional sanger sequencing was completed after further amplification using the primers listed in s1 table. open reading frames (orfs) were determined from the full sequence using open reading frame viewer software [12] .
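as an illustration of the orf determination step, the following python sketch scans the three forward reading frames of a nucleotide sequence for atg-initiated orfs that end at a stop codon. it is a minimal sketch under stated assumptions and not the open reading frame viewer software cited above: the function name `find_orfs` and the `min_codons` cut-off are hypothetical, only the forward strand is scanned (assuming a positive-sense genome), and orfs expressed through ribosomal frameshifting, which need not begin with atg, would not be reported as such.

```python
# Minimal forward-strand ORF scan (illustrative sketch only; not the tool
# used in the study). Names and the length cut-off are hypothetical.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=30):
    """Return sorted (start, end, frame) tuples for ATG...stop ORFs.

    start is 0-based; end is exclusive and includes the stop codon.
    min_codons counts the codons from ATG up to, but excluding, the stop.
    """
    seq = seq.upper().replace("U", "T")  # accept RNA or DNA letters
    orfs = []
    for frame in range(3):
        start = None  # position of the first ATG of the ORF being read
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None  # resume searching after this stop codon
    return sorted(orfs)
```

for a genome of about 30 kb a larger `min_codons` value would normally be chosen to suppress short spurious orfs; the default here is arbitrary.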
similarity to other viruses for each of the orfs and their predicted amino acid sequences were determined by searches using blastn and blastp [13] algorithms through the ncbi server (http://blast.ncbi.nlm.nih.gov/blast.cgi). a region of the orf1b which was predicted to code for an rna-dependant rna polymerase (rdrp) was selected for comparative studies because this region was found to be the most highly conserved across the torovirinae subfamily. the amino acids of this conserved region of the orf1b, were aligned against other members of the torovirinae and representatives from the other families within the order nidovirales. a phylogenetic tree was produced using the maximum likelihood method based on the jtt matrix-based model [14] with 1000 bootstrap replicates. initial trees for the heuristic search were obtained automatically by applying the neighbor-joining method to a matrix of pairwise distances estimated using a jtt model. the analysis involved amino acid sequences from 44 virus strains. all positions containing gaps and missing data were excluded. there was a total of 268 amino acid positions in the final dataset. phylogenetic analyses and percentage similarities were calculated using clustal w alignment in mega6 [15] . using the methods described above, similar comparisons were made for the conserved putative helicase (342 positions) and m pro c-like (302 positions) domains of polyprotein 1ab, as well as the whole length polyprotein 1a and spike protein. after the detection of a novel rna viral sequence by ngs, qrt-pcr taqman assays were designed to detect nucleic acid sequences that were specific for the novel virus and directed at the sequence encoding the presumptive polyprotein 1a (replicase 1a). the resulting assay was used to test serum and a range of tissue homogenates from affected animals. later, a selection of samples was also tested in an assay directed at the region encoding the 'spike' protein. 
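the complete-deletion step used in the phylogenetic analysis (excluding every alignment position that contains a gap or missing data) and a simple percentage similarity calculation can be sketched in python as follows. this is a minimal sketch, not the mega6 implementation: the function names are hypothetical, the symbols treated as gaps or missing data (`-` and `?`) are an assumption, and percentage similarity is taken here to mean percent identity over the retained columns, which may differ from the measure reported in the paper.

```python
# Illustrative sketch of complete deletion and percent identity on an
# already aligned set of sequences. All names here are hypothetical.

def complete_deletion(alignment, missing="-?"):
    """Drop every column in which any sequence has a gap/missing symbol.

    alignment maps sequence name -> aligned sequence (equal lengths).
    """
    names = list(alignment)
    rows = [alignment[name] for name in names]
    if len({len(row) for row in rows}) > 1:
        raise ValueError("aligned sequences must have equal length")
    # zip(*rows) iterates over alignment columns
    kept = [col for col in zip(*rows) if not any(ch in missing for ch in col)]
    cleaned = ["".join(col[i] for col in kept) for i in range(len(rows))]
    return dict(zip(names, cleaned))


def percent_identity(a, b):
    """Percentage of positions at which two aligned sequences agree."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)
```

applied to an alignment of the conserved rdrp region, `complete_deletion` would return the gap-free columns (268 positions in the dataset described above), and `percent_identity` could then be evaluated for each pair of sequences.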
for testing of samples in these assays, total nucleic acid was extracted using a magnetic bead based kit and a magnetic particle handling system [16] . the details of primers and probe are as follows: the qrt-pcr assays, both of which produced a 77bp product, used agpath mastermix (life technologies) and were run on an abi 7500 thermocycler for 45 cycles under standard conditions as specified by the mastermix manufacturer. on each occasion the assay was run, included on the plate were two positive control samples, representing approximately 10000 (pc1) and 1000 (pc2) copies of viral rna, a negative control (nc, a trna solution) and a 'no template' control (ntc), the latter consisting of nuclease free water. the positive and negative controls were included throughout the procedure from extraction through to the completion of the pcr reactions while the ntc was only added during the pcr setup. an exogenous rna control [16] was also included to monitor the efficiency of the nucleic acid extraction and pcr reaction and to detect the presence of inhibitors. any evidence of rna amplification was recorded and results were expressed as cycle threshold (ct) values as described previously [16] . a conventional rt-pcr was run to produce template for in situ hybridisation (ish) probe production. pcr template was created using conventional pcr primers targeting the gene encoding a putative viral membrane protein (m). primer details were as follows: brvmp forward: 5' atggagtccacctcga 3', brvmp reverse: 5' ttatggtaggatggctgtt 3'. rt-pcr reactions were prepared using the mytaq one-step rt-pcr kit (bioline) according to the manufacturer's instructions and pfuultra ii fusion dna polymerase (agilent) was added to the reaction at a final dilution of 1/100. the template used (5μl) was purified from the cell culture isolate as described previously.
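to help interpret the ct values reported for the qrt-pcr assays, the following python sketch shows the standard relationship between ct and template quantity: with an amplification efficiency e (e = 1 meaning perfect doubling each cycle), an n-fold difference in starting copies corresponds to log(n)/log(1 + e) cycles, so the tenfold difference between the two positive controls (approximately 10000 and 1000 copies) would be expected to separate them by about 3.3 cycles. this is a sketch under stated assumptions and not a calculation performed in the paper: the function names, the assumed efficiency and the reference ct used in the usage note are hypothetical.

```python
import math

# Illustrative Ct arithmetic (hypothetical helper names; the study reports
# Ct values only and does not describe this calculation).

def ct_shift(fold_change, efficiency=1.0):
    """Cycles separating two samples that differ fold_change in template."""
    if fold_change <= 0 or not 0 < efficiency <= 1:
        raise ValueError("fold_change must be > 0 and 0 < efficiency <= 1")
    return math.log(fold_change) / math.log(1.0 + efficiency)


def estimate_copies(ct, ref_ct, ref_copies, efficiency=1.0):
    """Estimate copies from a Ct, relative to a reference of known copies."""
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must satisfy 0 < efficiency <= 1")
    # each cycle earlier than the reference means (1 + efficiency) times more
    return ref_copies * (1.0 + efficiency) ** (ref_ct - ct)
```

for example, if a 1000-copy control gave a ct of 30 in a given run (a made-up value for illustration), a sample with a ct of 27 would correspond to roughly 8000 copies under perfect doubling; real assays rarely reach e = 1, so such figures are order-of-magnitude estimates.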
cycling parameters for the pcr assay were a 20 min initial reverse transcription step at 45˚c and a 1 min denaturation step at 95˚c, followed by 40 cycles of denaturation at 95˚c for 10s, annealing at 56˚c for 30s, extension at 72˚c for 30s and a final extension step at 72˚c for 7 min. the resulting template (of approximately 650 base pairs) was visualised using electrophoresis in a 1.5% agarose gel stained with gel red nucleic acid gel stain (biotium). approximately 1ng of pcr product was used in the pcr digoxigenin (dig) ish probe synthesis kit (roche) with thermocycling parameters as described in the kit instructions. the size of the dig-labelled probe was determined using electrophoresis (as described above) and was quantified using a nanodrop spectrophotometer (thermo fisher). a hybaid omnislide thermal cycling system (thermo scientific) was used to maintain humidity and temperature control during all incubation steps. hybrislip coverslips (sigma) were used at every step. sections of tissue 4μm thick were cut onto superfrost plus slides (menzel gläser) and dried at 65˚c for 30 min. slides were dewaxed in xylene and rehydrated in an ethanol series. slides were treated with 100 μl proteinase k (dako) for 15 min at 37˚c, followed by a 3 min tris buffer wash (0.1 m, ph 8.0) and pre-hybridisation for 1h at 37˚c in 100 μl prehybridisation buffer (50% formamide, 1 x denhardt's solution, 4 x ssc, 0.25 mg yeast trna/ml, 10% dextrans sulfate). an exchange was made with 100 μl hybridization buffer containing prehybridisation solution with 5 ng/μl ish probe and slides were heated at 95˚c for 5 min. slides were cooled on ice for 5 min then incubated overnight at 42˚c. the next day slides were washed in wash buffer (0.1 m maleic acid, 0.15 m nacl and 0.3% tween20, ph 7.5) (roche) at 40˚c for 10 min. sections were then blocked with 500 μl blocking buffer (1% roche blocking reagent in wash buffer) at room temperature (rt) for 30 min. 
blocking buffer was exchanged with a 1/200 dilution of anti-dig antibody (roche) in blocking buffer and incubated for 1 hr. sections were washed for 30 min at rt in wash buffer and equilibrated for 5 min at rt in detection buffer (roche) and incubated with 500 μl bcip/nbt in the dark at rt for 5 h. slides were then rinsed in tap water and counterstained with 0.2% fast green (australian biostain) for 20 sec and mounted in dpx mounting medium (sigma-aldrich). negative control slides included additional tissue sections processed without the ish probe and sections with ish probe applied that were taken from an eastern long neck turtle (chelodina longicollis) that had given negative results by pcr. the latter samples were included in the absence of fixed tissues from presumptively uninfected m. georgesi. following the detection of the novel virus, in november 2015 (about 6 months after the cessation of the outbreak) an intensive survey of the parts of the river where affected turtles had been detected [2] was undertaken by groups of biologists and ecologists and samples collected from a wide range of aquatic species and some terrestrial animals (n = 360) to establish the size of the remaining population and whether any other animals were carrying this virus. a total of 502 samples were collected, consisting of various amphibians, arthropods, fish and reptiles. for animals of sufficient size, swabs were collected separately from mucosal surfaces (usually conjunctiva, oral and cloaca), placed in pbgs and held at approximately 4˚c for up to 10 days before being sent to the laboratory. blood samples were also collected from larger animals (n = 83) and serum was submitted for testing. small invertebrates and small vertebrates were preserved in absolute ethanol. pooled tissues or whole bodies from ethanol fixed animals were prepared for nucleic acid extraction by first digesting in proteinase k solution as described previously [17] . 
full details of the species, the samples collected and the numbers examined are listed in table 2 . when examined at necropsy, animals were usually in poor body condition, had bilateral swollen eyelids and conjunctivitis. some animals had tan foci on the skin of the ventral thighs. consistent histological findings included severe peri-orbital, splenic and nephric inflammation and necrosis. a proportion of affected animals also had evidence of a fibrinoid vasculopathy. full details of the gross and histopathology will be described elsewhere. when bgm cells were inoculated with pooled homogenates of spleen and lung tissues from 5 animals, after 2 passages cytopathology consisting of lytic destruction of cells in small foci was in subsequent studies, this virus replicated in mdbk cells but without cytopathology. replication could only be detected by pcr and virus loads were lower than those obtained in cv-1 cells. examination of the culture supernatant from cv-1 cells by electron microscopy identified bacilliform particles approximately 170 nm long and 25 nm wide (fig 1) . the name "bellinger river virus" (brv) is proposed for this virus. subsequent to the successful detection of virus in the pool of tissues, further virus isolation attempts were undertaken on a number of individual tissue samples. viruses were isolated in cell culture from heart (10/10 samples), kidney (11/14) , plasma (2/4) and spleen (4/5). the nucleic acid sequencing analysis identified a sequence of 30742 nucleotides of a virus with a genome organisation most closely related to viruses from the family coronaviridae and subfamily torovirinae (fig 2) . this sequence was subsequently confirmed by primer walking and further sanger sequencing to establish what is considered to be the full length sequence (genbank reference mf685025) of a nidovirus. the pattern of orfs of this virus closely matches the pattern observed for other nidoviruses. 
however, this appears to be a novel nidovirus with less than 50% nucleotide similarity to recognised nidoviruses. phylogenetic analysis using aligned amino acid sequences from the conserved rdrp site of polyprotein 1b (pp1b) confirmed the classification of the virus and places it in the proposed barnivirus (bacilliform reptile nidovirus) genus [18] (fig 3) . very similar results were obtained from comparisons involving the conserved putative helicase and m pro c-like domains of polyprotein 1ab, as well as the whole length polyprotein 1a and spike protein (s1-s4 figs). this new nidovirus appears most closely related to ball python nidovirus 1 [18, 19] , python nidovirus [20] and morelia viridus nidovirus [21] , however these viruses have yet to be assigned to a genus. the highest amino acid sequence similarity was found in the pp1b region (68-69%), while all other proteins had lower levels of similarity (table 3) . blast search also identified lower levels of homology with shingleback nidovirus (sbnv) [22] , a virus of australian shingleback lizards, two virus sequences that were identified from intestinal nematodes in asian snake species [23] , and four virus sequences identified in asian snake species [24] . these viruses are likely to also be classified within the proposed genus barnivirus. in common with most other members of the coronaviridae [25-27] a ribosomal frameshift sequence was identified within the first orf of the turtle nidovirus sequence. an alignment of this region across the possible members of the proposed barnivirus genus is shown in fig 4. watson-crick base pairing of the rna sequence shows a common pseudo-knot motif downstream to the ribosomal frameshift sequence. this pseudoknot secondary structure is postulated to play an important role in the discontinuous synthesis of subgenomic rnas [28] . 
the structure of this pseudo-knot differs from most other similar structures in coronaviruses by the presence of an additional stem in the region between the first stem and the "kissing stem loop". this predicted structure is common to all of the putative members of the proposed barnivirus genus [18, 29] (fig 5) . the qrt-pcr assay that was developed to detect the presumptive replicase 1a protein coding region of this virus allowed the detection of viral rna in tissues from affected animals. samples of kidney, liver, lung and spleen from the 5 animals that had contributed to the virus isolation pool were tested. very high virus loads were detected in kidneys and spleen (ct range 15.9-16.4) of 2 of the 5 animals but variable and moderately high levels of viral rna were also detected in other tissues of these and the other 3 animals (table 4 , animals 1-5). testing of a wider range of tissues from another 5 dead animals from the outbreak demonstrated high virus loads in heart, kidney, lung and spleen while virus was present in all other tissues at moderate levels ( table 4 , animals 6-10). relatively high levels of viral rna were also detected in oral, conjunctival and cloacal swabs of these animals (table 5 ). serum samples from another 11 animals gave strong to moderate reactivity (ct range 19.4-35.6) ( table 6 ). very similar results were obtained for all samples when tested in the spike protein assay. as the reactivity in both assays was extremely similar, the extracts of the archival samples were only tested in the replicase 1a qrt-pcr and gave negative results. ish was conducted on a selection of affected tissues and confirmed the presence of viral nucleic acid in association with severely necrotic histological lesions. within a severely affected lacrimal gland there was staining of residual glandular epithelial cells and staining within areas of necrotising inflammation. 
the kidney lesions from two affected turtles showed similar staining in areas containing intense inflammatory infiltrates and necrotic cellular debris, within degenerate or necrotic renal tubule epithelial cells (fig 6a and 6b ) and within foci of vasculitis. staining was also detected within dense foci of necrotising cystitis, as well as in scattered granulocytes in the multifocally oedematous urothelium (fig 6c and 6d) . occasional granulocytes stained within the myocardial interstitium. no staining was observed in the negative control preparations, which included tissue sections processed without the probe and tissues from an eastern long neck turtle (chelodina longicollis) that had given negative results by pcr. samples of all animals collected (table 2) were tested in the replicase 1a qrt-pcr assay. of the myuchelys georgesi turtles tested (n = 31), viral rna was detected in conjunctival swabs from eight animals and also in oral swabs from 2 of these animals. ct values were consistently high (ct >31), with most ct >35. low levels of viral rna were also detected in conjunctival swabs from two of the 49 emydura macquarii turtles tested and in 2 of 13 samples of egg casings of unknown species (ct >37 in each instance) ( table 2) . these casings had been adherent to the plastron of 2 m. georgesi. the reactivity detected in each of these samples collected during this field survey was confirmed with similar results obtained when tested in the replicase 1a qrt-pcr. the virus described in this study, for which the name "bellinger river virus" (brv) is proposed, is believed to have had a profound impact on the wild myuchelys georgesi population. the data presented provide strong, indirect evidence that this virus is the principal aetiological agent involved in the deaths of these m. georgesi. unfortunately, as this is now a critically endangered species of turtle [30], it is not possible to undertake experimental transmission studies to fulfil koch's postulates.
nevertheless, the criteria for disease causation defined by fredericks and relman [31] have been met. brv, as a novel nidovirus, was isolated from tissues of diseased animals, very high levels of viral rna were detected in tissues with marked pathological changes and in situ hybridisation assays demonstrated the presence of specific viral rna in lesions in kidneys and eye tissue, two of the main affected organs. no or very low levels of viral rna were detected in normal animals tested at the time of the outbreak. collectively these data suggest that this virus is the likely cause of these mortalities. the high levels of viral rna in several different organ systems would suggest that this virus is actively replicating in these organs and these detections are not an incidental finding or due to contamination as a result of either ingestion or inhalation of virus from the environment. nevertheless, although we believe that brv is the principal pathogen, it is likely that other factors contributed to the onset and severity of disease. the turtles were probably already in a stressed and potentially immunosuppressed state as they had lost considerable body condition [2] . the higher water temperatures may have supported and perhaps enhanced virus replication, a phenomenon that is well known for a number of aquatic viruses [32, 33] and, although other pathogens have not been identified, it is likely that other microbes, even commensals, may have contributed to disease severity and ultimately the death of these turtles. the m. georgesi population has been reduced to such an extent that the species has now been classified as "critically endangered" [30], with perhaps fewer than 100-200 animals present in the wild. the survival of the species may be dependent on a captive breeding population [2] due to the very small number of mature adults (possibly 10-15) that have survived in the wild (chessman and jones, pers comm [s1 file]).
the impact of bellinger river virus, the constrained geographical location of m. georgesi, the potential for hybridisation with emydura macquarii and perhaps a range of environmental factors [2] have combined to seriously threaten the survival of this species of turtle. there are few documented instances where a pathogen has been directly incriminated in the potential extinction of an animal species [2] and has had such an impact in a very short time period. for example, while the chytrid fungus batrachochytrium dendrobatidis has resulted in significant declines in amphibian populations in a number of countries and on different continents [34, 35] , its impact on amphibian populations has taken place over almost two decades, admittedly on a broad geographical scale.
table 5 . qrt-pcr results for viral transport medium containing swabs collected from mucosal surfaces of affected m. georgesi turtles using an assay detecting a segment of the putative replicase 1a gene. results for samples with very high rna loads (ct<20) shown in bold. these animals correspond to the first 5 animals from which tissues were collected (table 4) .
at the time of its identification, this nidovirus was the first of the members of the proposed barnivirus genus (family coronaviridae, subfamily torovirinae) to be associated with disease in an aquatic reptile. other barniviruses have been linked to infection and often disease in terrestrial reptiles. these include infections in ball pythons (python regius) [18, 19, 36] , indian rock python (python molurus) [20, 36] , burmese python (python bivittatus) [36] , green tree python (morelia viridis) [21] , carpet python (morelia spilota) [36] , boa constrictor (boa constrictor) [36] and shingleback lizards (tiliqua rugosa) [22] . in the terrestrial reptiles, the predominant clinical presentation has been respiratory disease [18] [19] [20] [21] [22] 37] .
experimental infections undertaken in ball pythons [37] have also demonstrated a tropism for the respiratory tract. in this respect, brv differs in that, while there are pathological changes in several organ systems, the most severe changes are in the kidneys, corresponding to the high viral loads detected. it is interesting to note that there was only evidence of brv replication in vitro in kidney cell cultures. while there was evidence of visible cytopathology in primate kidney cell cultures, virus replication was also detected in mdbk cells when culture supernatants were tested by qrt-pcr. the tropism of this virus for kidney cells would suggest that renal disease was a key factor contributing to the death of these turtles. chelonian species have been reported to be infected by a wide range of viral taxa including adenoviruses, bunyaviruses, flaviviruses, herpesviruses, paramyxoviruses, picornaviruses, ranaviruses, retroviruses and togaviruses [38, 39] . a virus in the order nidovirales, family arteriviridae, has been identified in the chinese softshell turtle, pelodiscus sinensis, during an outbreak of severe haemorrhagic disease affecting multiple organs including gonads, intestine, kidney, liver, lung, and spleen [40] and, recently, partial sequence of another virus from the arteriviridae was detected by nucleic acid sequencing of the gut, liver and lungs of an apparently normal chinese broad headed pond turtle (mauremys megalocephala) [24] . based on the extent of the epizootic in the bellinger river, it would appear that brv was introduced into a naive population. the origin of brv is yet to be determined as there was no evidence of virus in other species sampled from the affected area, except for the detections of low levels of brv rna in emydura macquarii, a closely related species that can interbreed with m. georgesi and in 2 clusters of egg casings. 
in each instance the viral rna levels were close to the limit of detection and are of uncertain significance. the possibility that these are the result of superficial contamination from the environment cannot be excluded, and is likely for the egg casings as they were removed from the carapace of m. georgesi turtles in the bellinger river. brv is the first nidovirus in the proposed barnivirus genus that has been isolated from a non-squamatid reptile and is phylogenetically placed between the recognised python nidoviruses and the shingleback lizard nidovirus. this may indicate that the barniviruses are quite widespread among both aquatic and terrestrial reptile species. the apparent abundance of reptilian nidoviruses would suggest that this turtle virus may have originated in a snake, lizard or other reptile and that m. georgesi could be an incidental host. however, as detections of viruses from the order nidovirales have also been reported in fish [41] [42] [43] [44] [45] and more distantly related viruses even in insects [46] [47] [48] , there are many potential reservoirs for this virus. finally, from a practical perspective, the detection of viral rna in cloacal and ocular swabs, serum and plasma indicates that these are samples that could be collected from live endangered animals for future screening or surveillance. these sample types have already been used to select a group of presumptively virus-free animals to establish a captive breeding colony [2] that remains healthy and apparently free of virus after 2 years in isolation. serum, conjunctival and cloacal swabs from the animals in this captive colony have given negative results in the brv qrt-pcr assays on several occasions over this 2 year period. however, the development of serological assays to detect antibodies to this virus in various animal species would also be advantageous to provide additional confidence in the presumptive virus-free status of captive breeding colonies. 
these assays would also be invaluable to support epidemiological studies and to assist the search for the origins of this novel virus. the risk of inter-specific competition in australian short-necked turtles profiling a possible rapid extinction event in a long-lived species characteristics of the bgm line of cells from african green monkey kidney spontaneous malignant transformation of hamster lung cells in tissue culture growth of lymphocystis virus in a sea bass, lates calcarifer (bloch), cell line. singapore vet the sequence read archive trimmomatic: a flexible trimmer for illumina sequence data velvet: algorithms for de novo short read assembly using de bruijn graphs database indexing for production megablast searches dna sequence analysis software database resources of the national center for biotechnology gapped blast and psi-blast: a new generation of protein database search programs the rapid generation of mutation data matrices from protein sequences mega6: molecular evolutionary genetics analysis version 6.0 longitudinal study of the detection of bluetongue virus in bull semen and comparison of real-time polymerase chain reaction assays application of an embryonated chicken egg model to assess the vector competence of australian culicoides midges for bluetongue viruses ball python nidovirus: a candidate etiologic agent for severe respiratory disease in python regius identification of a novel nidovirus in an outbreak of fatal respiratory disease in ball pythons (python regius) novel divergent nidovirus in a python with pneumonia nidovirus-associated proliferative pneumonia in the green tree python (morelia viridis) discovery and partial genomic characterisation of a novel nidovirus associated with respiratory disease in wild shingleback lizards (tiliqua rugosa) redefining the invertebrate rna virosphere the evolutionary history of vertebrate rna viruses programmed translational frameshifting ribosomal frameshifting on viral rnas the primary structure and 
expression of the second open reading frame of the polymerase gene of the coronavirus mhv-a59 a highly conserved polymerase is expressed by an efficient ribosomal frameshifting mechanism an rna pseudoknot in the 3' end of the arterivirus genome has a critical role in regulating viral rna synthesis changes to taxonomy and the international code of virus classification and nomenclature ratified by the international committee on taxonomy of viruses sequence-based identification of microbial pathogens: a reconsideration of koch's postulates molecular comparison of isolates of an emerging fish pathogen, koi herpesvirus, and the effect of water temperature on mortality of experimentally infected koi is horizontal transmission of the ostreid herpesvirus oshv-1 in crassostrea gigas affected by unselected or selected survival status in adults to juveniles? chytridiomycosis causes amphibian mortality associated with population declines in the rain forests of australia and central america global emergence of batrachochytrium dendrobatidis and amphibian chytridiomycosis in space, time, and host detection of nidoviruses in live pythons and boas: tierärztliche praxis ausgabe k: kleintiere / heimtiere respiratory disease in ball pythons (python regius) experimentally infected with ball python nidovirus viruses of lower vertebrates j vet med b infect dis vet public health partial sequence of a novel virus isolated from pelodiscus sinensis hemorrhagic disease isolation of the fathead minnow nidovirus from muskellunge experiencing lingering mortality genetic analysis of a novel nidovirus from fathead minnows characterization of white bream virus reveals a novel genetic cluster of nidoviruses identification and ultrastructural characterization of a novel virus from fish fathead minnow nidovirus infects spotfin shiner cyprinella spiloptera and golden shiner notemigonus crysoleucas identification and characterization of genetically divergent members of the newly established family 
mesoniviridae discovery of the first insect nidovirus, a missing evolutionary link in the emergence of the largest rna virus genomes an insect nidovirus emerging from a primary tropical rainforest we acknowledge staff of the nsw office of environment and heritage, particularly gerry mcgilvray and shane ruming for their coordination of the disease response. the nsw office of environment and heritage, nsw department of primary industries, nsw local lands services and taronga conservation society australia provided considerable financial and logistical support. we also thank bellingen shire council, dr mark crane and his staff at bellingen veterinary hospital, adam skidmore, and paul thompson and herpetology staff from taronga, and ecologist bruce chessman for their considerable contributions to the disease response. the willingness of scott ginn and colleagues from the australian museum, sydney to provide valuable tissues from archived m. georgesi is greatly appreciated. we are indebted to the staff of virology laboratory who assisted with sample preparation, cell culture and virus isolation studies and for assistance with the pcr assays. key: cord-328259-3g4klpyg authors: guajardo-leiva, sergio; chnaiderman, jonás; gaggero, aldo; díez, beatriz title: metagenomic insights into the sewage rna virosphere of a large city date: 2020-09-21 journal: viruses doi: 10.3390/v12091050 sha: doc_id: 328259 cord_uid: 3g4klpyg sewage-associated viruses can cause several human and animal diseases, such as gastroenteritis, hepatitis, and respiratory infections. therefore, their detection in wastewater can reflect current infections within the source population. to date, no viral study has been performed using the sewage of any large south american city. in this study, we used viral metagenomics to obtain a single sample snapshot of the rna virosphere in the wastewater from santiago de chile, the seventh largest city in the americas. 
despite the overrepresentation of dsrna viruses, our results show that santiago’s sewage rna virosphere was composed mostly of unknown sequences (88%), while known viral sequences were dominated by viruses that infect bacteria (60%), invertebrates (37%) and humans (2.4%). interestingly, we discovered three novel genogroups within the picobirnaviridae family that can fill major gaps in this taxon’s evolutionary history. we also demonstrated the dominance of emerging rotavirus genotypes, such as g8 and g6, that have displaced other classical genotypes, which is consistent with recent clinical reports. this study supports the usefulness of sewage viral metagenomics for public health surveillance. moreover, it demonstrates the need to monitor the viral component during the wastewater treatment and recycling process, where this virome can constitute a reservoir of human pathogens. viruses are the most abundant biological entities on earth, with an estimated 10^31 particles worldwide [1] . urban environments impacted by human activity, such as wastewater treatment plants (wwtps) and sewage, are no exception. sewage and wwtps form an ecosystem that supports thriving microbial communities (prokaryotic and eukaryotic), plants and animals, such as rodents, birds and bats [2] . in these environments, viruses associated with the biological waste of a city are mixed with viruses from all the organisms living in the wwtp, thus forming an untapped source of viral diversity [2, 3] . in general, the sewage virosphere can be considered a mixture of human viruses excreted in the feces, urine and skin peeling, and viruses from animals, invertebrates, plants, fungi and bacteria [4, 5] . sewage has been historically used to monitor known human viral pathogens, such as noroviruses, hepatitis viruses, enteroviruses, rotaviruses and adenoviruses [6, 7] . 
the presence of viral pathogens in wastewater reflects ongoing infections being transmitted in the human population served by the given wwtp [2] [3] [4] . likewise, sewage can reveal new and unknown viral genomes that could, in the future, be associated with idiopathic human diseases [3] . nowadays, decreased water availability due to global warming and increased human water consumption have turned water scarcity into a cyclical problem. to solve this, recycled water derived from wwtp effluent has been intensively used for industrial operations, agricultural irrigation and even recreational activities [8] . in this way, recycling of treated sewage generates a potential public health risk, as well as a critical risk for agricultural and animal production industries, due to insufficient removal of pathogenic viruses [6, 8, 9] . current regulations have promoted improved treatment guidelines, which now combine mechanical, biological and chemical processes, such as flocculation, sedimentation, filtration, chlorination and uv-radiation [6, 8, 9] . these treatments have significantly reduced microbiological contamination by inactivating and removing bacteria and protozoa, but they have little effect on human viruses, such as adenoviruses and enteroviruses, which are later dispersed in effluent waters [8] [9] [10] . in latin america, the sanitary systems, which are still under development, have a low sewage treatment coverage (30%), generating a potential health risk for human and animal populations [11] . chile is a middle-income country with a population of 19 million inhabitants, of which approximately 7 million inhabitants are concentrated in the capital, santiago de chile, making it the seventh largest city in the americas. most of santiago's sewage is decontaminated by three wwtps: el trebal, la florida and la farfana. 
el trebal serves 3.2 million equivalent inhabitants, and water decontaminated by this wwtp provides irrigation for 57,800 agricultural hectares of land that is mainly used for the production of fruits and vegetables consumed in santiago and exported according to international environmental standards. the sewage virosphere has been monitored world-wide using molecular techniques, such as pcr and quantitative pcr (qpcr) [5, 7] . these methods can only provide information about the presence and abundance of known and characterized viruses because there is no universal marker for viruses, such as the 16s rrna gene for bacteria [4, 5] . despite the increasing application of high-throughput sequencing techniques (hts) for viral metagenomics, their use in identifying viruses in sewage has not been well explored [4, 5, 11] . however, the few existing studies have demonstrated that the study of the viral metagenomics of sewage is an excellent tool to monitor, identify and explore the diversity of the viral communities circulating among human and livestock populations [2, 4, 5, 7, 11, 12] . this is especially important for rna viruses that, in the last decade and along with viral metagenomics, have undergone a revolution in terms of their discovery, thereby contributing to our understanding of viral diversity [13] . to our knowledge, few studies [4, 7] have used viral metagenomics to investigate rna viral communities in sewage around the world, none of which have been conducted in latin america. with this in mind, we conducted a pilot study to investigate the rna virosphere using a single sample equivalent to the sewage treated during one day in santiago (el trebal wwtp) using viral metagenomics. our main results, despite bias related to the overrepresentation of dsrna viruses (in comparison to other studies), show that the rna virosphere of el trebal is mainly composed of unknown sequences (microbial and viral dark-matter). 
the known viral sequence fraction was dominated by viruses that infect bacteria and invertebrate hosts. a high diversity and novelty were discovered within the picobirnaviridae family, which can fill major gaps in the evolutionary history of this group. likewise, we unveiled abundant and emergent rotavirus a genotypes never before recorded in chile, thus representing changes in the prevalence of the classical genotypes. the latter discovery provides evidence for the benefits of using viral metagenomics to aid public health surveillance based on excreted viruses in sewage. additionally, this study reveals the importance of analyzing viral dark matter through self-clustering of the sequences independent of their direct comparison to databases, which can result in the discovery and classification of new viral sequences. el trebal (hereafter referred to as trebal) is a wwtp (33°32′28.5″ s 70°50′08.2″ w) located in santiago, chile (figure 1), that serves a population equivalent to 3.2 million inhabitants. a composite sample (1 l), representing 24 h of raw sewage, was obtained on 5 jun 2017. the sample was sequentially filtered through 8- and 3-µm pore size polycarbonate filters (isopore, 47 mm diameter, millipore, milford, ma, usa) using a swinnex filter holder (millipore) and then a 0.22-µm pore size filter (sterivex pes, millipore). particles in the 0.22-µm filtrate were concentrated by ultracentrifugation to a final volume of approximately 1 ml, as described in [14] . briefly, the 0.22-µm filtered sample was centrifuged at 100,000× g for 1 h. the pellet containing viral particles was resuspended in glycine buffer and then incubated on ice for 30 min. finally, after an additional ultracentrifugation at 100,000× g for 1 h, viruses were recovered by resuspending the viral pellet in 1 ml of pbs. 
the resuspended viral particles (1 ml) were treated with dnase i (600 u) to remove the remaining free dna from the cellular fraction. the mixture was incubated for 1 h at 37 °c, followed by inactivation at 75 °c for 10 min. viral rna was extracted using the high pure viral rna kit (roche, basel, switzerland) according to the manufacturer's instructions, but without the use of the poly(a) carrier. bacterial dna contamination was checked by 16s rrna gene pcr amplification using a universal bacterial primer set (515f: 5'-gtgycagcmgccgcggtaa-3' and 806r: 5'-ggactacnvgggtwtctaat-3') https://earthmicrobiome.org/protocols-and-standards/16s/. bacterial (e. coli jm109) dna was used as a pcr spike control to check for pcr inhibition of viral rna. the purified rna sample was then sequenced using illumina hiseq technology (roy j. carver biotechnology center, urbana, il, usa). 
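the 515f and 806r primers quoted above contain degenerate iupac bases (y, m, n, v, w), so each primer stands for a small set of concrete sequences. the following is a minimal python sketch, not part of the original protocol, of how such a primer can be expanded into a regular expression and how its degeneracy can be counted; the function names `primer_to_regex` and `degeneracy` are hypothetical helpers introduced only for illustration.

```python
import re

# IUPAC nucleotide ambiguity codes mapped to the bases they stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT",
    "K": "GT", "M": "AC", "B": "CGT", "D": "AGT",
    "H": "ACT", "V": "ACG", "N": "ACGT",
}

# Primer sequences as given in the text (upper-cased).
PRIMER_515F = "GTGYCAGCMGCCGCGGTAA"
PRIMER_806R = "GGACTACNVGGGTWTCTAAT"


def primer_to_regex(primer):
    """Translate a degenerate primer into a compiled regular expression."""
    parts = []
    for base in primer.upper():
        options = IUPAC[base]
        # Unambiguous bases stay literal; ambiguous ones become a character class.
        parts.append(options if len(options) == 1 else "[" + options + "]")
    return re.compile("".join(parts))


def degeneracy(primer):
    """Count the distinct concrete sequences a degenerate primer represents."""
    total = 1
    for base in primer.upper():
        total *= len(IUPAC[base])
    return total
```

for example, `primer_to_regex(PRIMER_515F).search(template)` returns a match object when one of the 515f variants occurs in the upper-case string `template`, and `None` otherwise. the sketch only handles exact matches on the given strand; mismatch tolerance and reverse-complement handling are left out.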
briefly, the rnaseq library was prepared with the illumina truseq stranded mrna sample prep kit (illumina, san diego, ca, usa). the library was quantitated by qpcr and sequenced from one end of the fragment in a single lane for 151 cycles on a hiseq 4000. fastq files were generated and demultiplexed with the bcl2fastq v2.17.1.14 conversion software (illumina). 
raw metagenomic reads were quality filtered using cutadapt v2.1 [15] , leaving only sequences longer than 50 bp (-m 50), and conducting 3′ end trimming for bases with a quality below 30 (-q 30) and hard clipping of the first five leftmost bases (-u 5). finally, sequences representing simple repetitions, which are usually due to sequencing errors, were removed using prinseq v0.20.4 [16] at a dust threshold of 7 (-lc_method dust, -lc_threshold 7). details of the obtained sequences are shown in supplementary table s1 . viral rna metagenomes were assembled using de bruijn graphs, as implemented in the megahit v1.2.9 [17] and idba-ud v1.13 assemblers [18] in metagenomic mode. only contig sequences >200 bp were further analyzed. both assemblies were merged by clustering contigs at 100% identity and 100% coverage of the shortest sequence, leaving the largest contig using the nucmer algorithm implemented in mummer v3 [19] . after assembly, prodigal software [20] was used to predict protein-coding regions, with options (-p meta -n). the predicted proteins were aligned against the ncbi nr database using diamond v2.0.4 [21] (--evalue 0.00001) and parsed using the lowest common ancestor algorithm in megan 6 [22] (lca score = 50) using the ncbi taxonomy tree to obtain the taxonomic annotation of each viral protein. species classification of viral proteins was used to infer putative hosts based on the virus-host db [23] , as described in [24] . the abundance of mapped proteins was quantified through read recruitment via bowtie2 [25] , with parameters (--end-to-end --very-sensitive -N 1). the resulting sam file was parsed by the bbmap pileup script (bushnell b., sourceforge.net/projects/bbmap/) and the relative protein abundances were normalized by gene length. the predicted proteins were functionally annotated using the pfam v32 database [26] through hmmscan with the option (--cut_ga) implemented in hmmer3 [27] . 
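the lowest common ancestor step mentioned above can be illustrated with a naive python sketch: a protein is assigned to the deepest taxon shared by all of its sufficiently good database hits. this is an assumption-laden toy, not megan's implementation. the `min_score` cutoff is only a stand-in for megan's score settings, the function names are hypothetical, and the lineages used in any example are simplified root-to-leaf lists rather than full ncbi taxonomy paths.

```python
def lowest_common_ancestor(lineages):
    """Return the deepest taxon shared by every lineage, or None if nothing is shared.

    Each lineage is a list of taxon names ordered from root to leaf.
    """
    shared = None
    # zip() walks the lineages rank by rank and stops at the shortest one.
    for taxa_at_rank in zip(*lineages):
        if len(set(taxa_at_rank)) != 1:
            break
        shared = taxa_at_rank[0]
    return shared


def assign_taxon(hits, min_score=50.0):
    """Assign a protein to the LCA of all hits scoring at least min_score.

    hits is a list of (bit_score, lineage) pairs; min_score is an assumed
    cutoff used here only for illustration.
    """
    kept = [lineage for score, lineage in hits if score >= min_score]
    if not kept:
        return None  # no hit is good enough: the protein stays unassigned
    return lowest_common_ancestor(kept)
```

with two hits to different species of the same genus the protein is placed at the genus; adding a passing hit from another family pushes the assignment up to the first rank the families share. this conservative behaviour is why many proteins in poorly sampled groups end up annotated only at high taxonomic ranks.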
proteins annotated as rna-dependent rna polymerase (rdrp; 2266 predicted proteins) were taxonomically identified as described before, and the species classification was used to infer putative hosts. pairwise genetic distances were calculated for trebal rdrp nucleotide sequences and refseq (release 97) using word-based alignment-free distances implemented in the alfree software [28] , with options (-s 2 -d braycurtis -v counts). the pairwise distances between samples were analyzed by hierarchical clustering (hclust function in r) using a minimal increase in the sum of squares method (ward's method). the resulting dendrogram was cut (cutree function in r) into 12 groups that represent the main clusters based on the "unrooted" dendrogram representation. the trebal proteins annotated as rdrp (>450 aa) using the pfam database [26] and classified as picobirnaviridae were aligned by mafft v7 [29] using default parameters, with the option -globalpair. a phylogenetic rdrp gene tree was constructed using the maximum likelihood method implemented with iqtree (-bb 10000 -nm 10000 -alrt 1000 -abayes) and 10,000 ultrafast bootstraps to evaluate branch robustness [30] . the amino acid substitution model used in the phylogenetic analysis (vt + f + i + g4) was determined by modelfinder [31] . reference sequences were obtained from ncbi refseq based on the phylogenetic analysis of the picobirnaviridae family from the international committee on taxonomy of viruses (ictv) [32] , and a reference genome of the genus alphapartitivirus was used as an outgroup. a ribosomal binding site (rbs) prediction of the 31 rdrp-predicted proteins used in the phylogenetic analysis was performed using prodigal software [20] , as described before. nucleotide sequences of trebal proteins annotated as rotavirus vp4, vp6 and vp7 using the pfam database [26] were aligned against the ncbi nt database using blastn [33] . 
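the word-based, alignment-free distance described above (word size 2, bray-curtis dissimilarity on word counts) can be sketched in plain python as follows. this is a minimal illustration under stated assumptions, not the alfree implementation: the function names are hypothetical, overlapping words are counted without any normalization, and the clustering step (ward's method and tree cutting, done in r in the text) is not reproduced.

```python
from collections import Counter


def kmer_counts(seq, k=2):
    """Count overlapping words of length k in a sequence."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between two word-count vectors (Counters)."""
    words = set(a) | set(b)
    # Counter returns 0 for words that are absent from one of the sequences.
    numerator = sum(abs(a[w] - b[w]) for w in words)
    denominator = sum(a[w] + b[w] for w in words)
    return numerator / denominator if denominator else 0.0


def distance_matrix(seqs, k=2):
    """Symmetric matrix of pairwise Bray-Curtis distances for a list of sequences."""
    counts = [kmer_counts(s, k) for s in seqs]
    n = len(counts)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            matrix[i][j] = matrix[j][i] = bray_curtis(counts[i], counts[j])
    return matrix
```

the distance is 0 for sequences with identical word counts and 1 for sequences that share no word. because no alignment is needed, the matrix can be computed for thousands of divergent sequences and then handed to a hierarchical clustering routine.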
the best hit (e-value < 0.00000001) classification of each nucleotide sequence was used to assign rotavirus species and type (g and p). raw sequences are available under ncbi sra bioproject prjna648644. assembled contigs and their annotation are available in the mg-rast server under the accession number mgm4904696.3 and can be accessed at the following link https://www.mg-rast.org/linkin.cgi?project=mgp95699. the trebal rna viral fraction, sequenced on the illumina hiseq platform, yielded ~45 m high-quality reads (supplementary table s1). assembly of the rna metagenomic reads resulted in 62,164 contigs, which included 34.8 m reads and 72,313 predicted proteins. as with most viral metagenomes, only a small fraction (~12%) of the viral proteins matched to large databases [34, 35] such as ncbi nr, representing ~15% of the assembled reads. the high number of unmapped contigs (85% of the assembled reads) are most likely derived from novel and uncharacterized microbial and viral sequences, as has been observed in other wastewater studies [2, 7] . a taxonomic survey of the predicted proteins shows that the known trebal reads (figure 2a) were assigned mostly to the virus domain (72%), followed by bacteria (24%) and eukarya (2%). cellular contamination of viral enriched fractions is a common feature of viral metagenomes [36] [37] [38] . in this study, this could correspond to cellular transcripts, probably due to the lack of rnase treatment during nucleic acid preparation, or the presence of some residual dna after the dnase treatment, although the 16s rrna gene was not amplified by pcr. viral sequences can also be misannotated to homologous cellular genes [36, 39] , which relies on the low number and diversity of viral sequences in the databases. additionally, horizontal gene transfer between viral and host genomes can lead to incorrect annotation based on the closest homologous sequences [36, 39] . 
misleading annotation is a frequent phenomenon in underexplored communities, such as environmental rna viruses where new schemes of classification are needed [13] . in further sections, we only analyze proteins classified as having viral origin. 
most of the viral proteins in the trebal rna viral metagenome (figure 2b) were classified as belonging to the picobirnaviridae family (55%), followed by partitiviridae (~25%) and other families, such as totiviridae (~7%), cystoviridae (~4%) and reoviridae (~2%). in summary, most of the classified viral sequences (97%) belong to families comprising group iii (dsrna viruses) of the baltimore classification. nevertheless, ssrna (e.g., virgaviridae and leviviridae), dsdna (e.g., myoviridae and siphoviridae), and ssdna (microviridae) viral families were detected, but their relative abundances were below 1% (supplementary table s2). it is important to emphasize that since we did not perform any amplification step (e.g., mda) before library construction, it is not expected that relative viral abundances were affected by amplification bias in any steps before library preparation. thus, a possible explanation for the low abundance of ssrna viruses is that the inclusion of an inactivation step using dnase at 75 °c could potentially enhance the effect of natural rnases present in the sample, as has been described before [7] . however, this must be confirmed experimentally using an ssrna virus as spike control subjected to the indicated conditions in the same type of matrix. despite possibly missing some viral types during the extraction procedure, the characterized viral rna metagenome still harbors a vast diversity of viruses within each family. in general, the rna viral families identified here have been identified in previous studies that describe the rna virus diversity of wastewater [2, 5, 7, 40, 41, 42] . however, the specific taxonomic composition of the viral community in trebal has not been reported in other sewage studies. picobirnaviruses (pbvs) were dominant in sewage influent samples from wales, uk [7] . likewise, pbvs were prevalent in sewage samples across the usa [40] and have been proposed as a potential marker of fecal pollution [41] . 
viral sequences identified as partitiviridae-like viruses included in the "unclassified rna viruses shim-2016" category in the ncbi taxonomy (~25% abundance; figure 2b) and the totiviridae family were also highly abundant in treated and untreated sewage samples from the eu [5, 7] . the cystoviridae family (~4% abundance; figure 2b) is the only ictv-recognized bacteriophage family detected at more than 1% abundance. these bacteriophages are known to be abundant in the gastrointestinal tract of vertebrates and as part of raw sewage samples [42] . only recently was this family reported in metagenomic assessments of sewage samples using previously published viral metagenomes from pittsburgh, barcelona and addis ababa [2, 42] . finally, the reoviridae family, represented by the genus rotavirus, accounted for ~2.4% abundance. these human pathogenic viruses are routinely found in raw sewage samples using amplification techniques such as rt-pcr and rt-qpcr [14, 40] ; however, they have been detected by hts in a recent investigation in wales, uk [7] . since some human pathogenic viruses are seasonal (e.g., rotavirus and norovirus), their presence in wastewater can also vary, which could explain the absence of rotavirus in previous metagenomic surveys of sewage [8] . the assignment of viral species through the lowest common ancestor allowed us to classify the known viral sequences by their putative host using the virus-host database [23] , as described in [24] . most of the sequences belong to viruses putatively infecting bacteria (60%) and invertebrates (~37%) (figure 2c). other relevant groups belong to known viruses that infect humans (~2.4%). finally, there was a small contribution of viruses infecting plants, fungi, unicellular eukaryotes, non-mammal vertebrates and non-human mammals (figure 2c). 
most bacterial viruses belong to the picobirnaviridae (55%), cystoviridae (~4.4%) and leviviridae (~0.2%) families, while invertebrate viruses are associated with the partitiviridae-like sequences in the "unclassified rna viruses shim-2016" group and totiviridae family. human viruses were composed exclusively of sequences from the reoviridae family. as discussed above, most of these viral families and their hosts have been reported in previous metagenomic wastewater studies [2, 5, 7, 11, 40] . interestingly, the most abundant viruses in these studies belong to the virgaviridae family and the caudovirales order that infect plants and prokaryotes, respectively; however, these viruses appeared in low abundance in our sample [2, 11] . the only rna-based metagenomic study that was methodologically similar to our study also reported a high abundance of pbv-related sequences and rotaviruses [7] . the presence of pbv sequences in the trebal wastewater, which were genetically close to those found in animal feces, could be related to farm runoff, since the location of this wwtp (figure 1) is outside the urban zone of santiago and surrounded by many irrigation channels. even though farm waste should not end up in trebal, whose exclusive purpose is the treatment of sewage from santiago city, negligent handling of this waste could cause this result. the presence and high abundance of bacterial viruses (bacteriophages) is frequent in wwtps. their numbers are in part due to the release of phages that infect intestinal bacteria by human or animal defecation, but also from new infections of bacteria whose natural niche is the sewage ponds and sludge [42] . bacterial rna viruses are poorly understood in comparison with their dna counterparts that are commonly found in sewage viral metagenomic studies [2, 11] . 
the international committee on taxonomy of viruses (ictv) has only recognized two rna bacteriophage families, the ssrna family leviviridae and the dsrna family cystoviridae [41] . both families are represented by only 67 genomes in the ncbi refseq database (january 2020), which is low when compared to the 9661 viral dna genomes in the same database. however, it has been proposed that the picobirnaviridae is a new family of rna bacteriophages based on the presence of bacterial ribosome binding sites (rbss) upstream of the coding sequences, and also due to the lack of any consistent epidemiologic association with animal and human diseases [7, 43, 44] . in contrast to previous studies [2, 41] , most of the sequences associated with rna bacteriophages that we found correspond to the ictv-unrecognized picobirnaviridae family and the cystoviridae family, with only a small fraction associated with the leviviridae family. all the known members of the cystoviridae family infect pseudomonas species, which are commonly present in eutrophic environments, such as sewage and the human body [45] . therefore, the abundance of these viruses in the trebal metagenome can expand the known sequence space associated with this family (only 10 genomes are currently available in the ncbi database) and contribute to a better understanding of the bacteriophage biology related to rna genomes. invertebrate virus categories mostly include sequences of those that putatively infect annelids and arthropods, which was expected since these phyla have high densities in sewage stabilization ponds and aquatic environments of wwtps [46, 47] . in previous studies, a high prevalence of these invertebrate rna viruses was not reported since only recently was their vast diversity discovered and their sequences made available in the databases [40] . 
human viruses (which in trebal correspond to rotavirus a and c) are common in sewage and come from the feces, urine, and respiratory secretions of infected hosts [9]. the most commonly identified viral pathogens in wastewater are adenovirus, enterovirus, hepatitis a and e viruses, norovirus, sapovirus, and rotavirus a [48]. these viruses are considered a potential public health risk because they are usually found at high concentrations in raw sewage, and their removal efficiency in wwtps is not commonly assessed [8, 49]. moreover, since wastewater is composed of the excreta of thousands to millions of inhabitants, it is a representative sample that can be used for epidemiological surveillance purposes [5]. therefore, metagenomic analysis of wastewater can highlight the presence of viral strains that circulate within a population, while enabling the discovery of new viruses that spread between humans and that are outside surveillance programs. to uncover the untapped viral diversity present in the trebal metagenome, which could not be recovered through direct mapping of viral sequences against databases, we searched for sequences homologous to the rna-dependent rna polymerase (rdrp) using hidden markov models (hmms) of the protein. subsequently, we estimated the genetic diversity of the rna viruses using an alignment-free comparison of the retrieved rdrp sequences with those available in the ncbi refseq database (figure 3). the rdrp is the most essential and conserved protein in rna viruses [50]. it catalyzes rna synthesis from rna templates and is responsible for viral genome replication and transcription processes [49].

figure 3. hierarchical clustering analysis of 2266 rna-dependent rna polymerase (rdrp) predicted protein sequences from trebal and the ncbi refseq database based on bray-curtis amino acid distance (k = 2). the dendrogram was divided into 12 main clusters based on the "unrooted" dendrogram.
pie charts represent the frequency of sequences in each cluster classified by putative host through the lca algorithm. bar charts represent the source (ncbi or trebal) from which sequences were retrieved inside each cluster.
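The alignment-free comparison behind the figure 3 clustering can be illustrated with a short sketch. It assumes that each protein is summarised by counts of overlapping amino acid 2-mers (k = 2, as in the caption) and that the Bray-Curtis dissimilarity is computed on those counts; whether the study used raw counts or normalised frequencies is not stated here, and the function names and toy sequences below are hypothetical.

```python
from collections import Counter
from itertools import combinations


def kmer_counts(seq, k=2):
    """Count overlapping k-mers in a protein sequence."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between two k-mer count profiles.

    Defined as sum(|a_i - b_i|) / sum(a_i + b_i); 0 means identical
    profiles and 1 means no shared k-mers.
    """
    keys = set(a) | set(b)
    total = sum(a[x] + b[x] for x in keys)
    if total == 0:
        return 0.0
    return sum(abs(a[x] - b[x]) for x in keys) / total


def pairwise_distances(seqs, k=2):
    """Return {(name1, name2): distance} for every pair of sequences."""
    profiles = {name: kmer_counts(s, k) for name, s in seqs.items()}
    return {
        (p, q): bray_curtis(profiles[p], profiles[q])
        for p, q in combinations(sorted(profiles), 2)
    }


if __name__ == "__main__":
    # Toy sequences, for illustration only (not real rdrp sequences).
    toy = {"a": "MKKLAGG", "b": "MKKLAGD", "c": "WWYYWWY"}
    for pair, dist in pairwise_distances(toy).items():
        print(pair, round(dist, 3))
```

The resulting pairwise distances could then be passed to any hierarchical clustering routine to build a dendrogram such as the one described in the caption.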
we identified 2180 predicted proteins as rdrps using protein hmms in the pfam database [26]. most of the rdrp sequences (65%) were classified as unknown since they do not align with any known protein in the ncbi nr database under standard cutoff parameters (e-value ≤ 1 × 10^−5 and score ≥ 50). this is expected since the software that was used is designed to detect remote homologs in the most sensitive way, based on the strength of the underlying probability models (hmms) [27]. the remaining 35% of the rdrps corresponded to known viruses that putatively infect invertebrates (13%), bacteria (11%), unicellular eukaryotes (4%), fungi (3%), humans (2%), and plants (2%). hierarchical clustering analysis based on the genetic distances of rdrps showed 12 well-defined genetic clusters, five of which were formed exclusively by ncbi sequences and five of which were formed mostly by the trebal sequences (figure 3). moreover, the sequences were clustered by a single host in only three of the 12 clusters (e.g., human viruses of cluster c05, and plant viruses of clusters c08-c09). additionally, most of the rdrps formed a continuous sequence space represented by highly heterogeneous clusters (e.g., c01-c03 and c06-c07). this feature was expected due to the orthologous nature of viral rdrps and their degree of structural conservation inside the riboviria realm [50, 51].
despite this, a closer inspection of the clusters with stricter cutoffs could reveal more specific associations. the animal virus cluster c04 was formed exclusively by ncbi reference sequences inside the astroviridae and coronaviridae families, but no further precision regarding the host was feasible. astroviruses are a commonly known cause of viral gastroenteritis in animals and humans [52] . specifically, sequences of cluster c04 corresponded to avian astrovirus associated with poultry and wild aquatic birds from a 2012 study in asia [52] . coronaviruses have a global distribution and infect a wide range of mammals and birds. they can cause respiratory and enteric infections that are usually mild, but severe infections of the respiratory system can develop, such as severe acute respiratory syndrome (sars) and the infection currently responsible for a global pandemic, sars-cov-2 [53] . coronaviridae sequences from cluster c04 were described in two studies from 2017 that investigated the viral population in bats in china and vietnam [54, 55] . plant virus clusters c08-c09 exclusively represent ncbi reference sequences of the luteoviridae (c08) and bromoviridae (c09) families. both viral families have a global distribution and are transmitted by specific aphid vectors [56, 57] . these two viral families have a broad host range of genera within many plant families, causing necrosis in most of their hosts [56, 57] . finally, the human virus cluster c05 was formed exclusively by ncbi sequences of the norovirus genogroup ii (nov gii). noroviruses are a genetically diverse genus within the caliciviridae family that can cause acute gastroenteritis in mammalian hosts [58] . most of the human noroviruses belong to genogroups gi and gii, where nov gii is usually the causal agent of acute gastroenteritis outbreaks [57, 58] . 
these pandemic characteristics are probably related to the nov gii epidemiology, which resembles that of influenza a viruses, with the emergence of new variants every 2-3 years that replace the previously established variant [59] . interestingly, bacterial viruses form two clusters, c11 and c12, which group picobirnaviridae reference sequences from the ncbi, recovered from animal stools and trebal new pbvs, together with unknown sequences that escaped our analyses using direct mapping against ncbi databases. this reflects the great diversity within the picobirnaviridae family, which is not represented in current databases. taken together, our results show that metagenomic surveys of rna viruses in sewage samples and the use of hmms could uncover extraordinary viral diversity through the detection of remote homologs in these human-impacted environments. additionally, the use of alignment-free genetic distances, such as the bray-curtis distance used here [28] , can provide a reliable method for clusterization and classification based on related sequences for a large number of sequences, such as those generated by metagenomics methods. to assess the novelty of the most abundant viral sequences observed in trebal, namely those of the picobirnaviridae family, we performed a phylogenetic analysis. first, we identified the rdrp protein sequences inside the picobirnaviridae family and then filtered them by size to include only proteins ≥450 aa, which is the size of the smaller full-length pbv rdrp in the ncbi nr database. next, we reconstructed a phylogenetic gene tree ( figure 4 ) using 31 rdrps that met the filtering criteria and 25 reference pbv sequences. our results show that seven trebal sequences were associated with the known pbv genogroups [32] i and ii associated with vertebrate stools. interestingly, the rest of the environmental sequences (24) formed three separate monophyletic clades. 
two of these exclusive trebal monophyletic groups, tg1 and tg2 (13 sequences), could be considered sister clades of the known pbv genogroup three (giii) associated with invertebrate samples [32]. in contrast, the third monophyletic group, tg3, which contains ten trebal sequences, lies between the pbv genogroups one (gi) and two (gii) [32], but closer to gii. therefore, tg3 may represent a highly divergent version of viruses that infect bacteria of the gastrointestinal tract of other vertebrates, such as domestic, farm or wild animals. this is highly probable since pbvs are ubiquitous in the feces of a vast range of animal species worldwide [39, 60, 61], including cattle, monkeys, dogs, cats, bats, horses, poultry and chickens [61]. for instance, pbv rdrp sequences recovered from a metagenomic survey of bat stools in cameroon showed a large group of highly divergent sequences that were closely related to the pbv giii [62], as is the case for tg1 and tg2. this observation suggests a probable origin for tg1 and tg2, since sewage ponds are known to be a feeding area for insectivorous bats [47].
new evidence of pbv sequences from sewage samples shows that sewage-recovered rdrps were spuriously distributed between many pbv genogroups [7]. this reinforces the idea that pbvs do not infect mammals but are a new family of rna bacteriophages, given the consistent presence of bacterial rbss upstream of the coding sequences [7, 43, 44]. to test this hypothesis, we searched for prokaryotic rbs motifs in the 31 rdrp sequences from the pbvs (supplementary table s3). we found that all except one of the full sequences (those not predicted in contig edges) contain an rbs, with aggagg and aggag being the most frequent motifs. this matches other sewage pbvs, which have the aggagg motif in 100% of the full rdrp sequences [7]. this result is relevant because only prokaryotic viral families contain species whose genomes are highly enriched in rbs sequences [43, 44]. finally, pbvs have been assumed to be animal pathogens based on inferences from a few studies reporting the virus in diarrhea stool samples. however, they have not been cultured in animal cell lines, nor do they have any consistent epidemiologic association with diarrhea [43].

rotavirus species are classically defined by their major capsid protein vp6, whereas rotavirus genotypes are defined based on their outer capsid proteins vp4 and vp7 [63]. in chile, over the last ten years, globally common rotavirus genotypes, such as g1p(8), g4p(8), g2p(4) and g9p(8), have alternated in their dominance, while emerging genotypes, such as g8p(8), have only recently been reported [63]. here, using pfam annotation of the trebal predicted proteins, we recovered 30 vp4, 29 vp6 and 27 vp7 human rotavirus proteins. local alignment-based classification (figure 5) shows that the most abundant rotavirus species is human rotavirus a, which is the most common cause of hospitalization due to viral gastroenteritis worldwide [63].
however, it is interesting to note the presence (in low abundance; 0.4%) of human rotavirus c sequences that are closely associated with asian strains; to date, rotavirus c has only been reported in chile through personal communication. likewise, the most abundant rotavirus genotypes were g8 (51%), g6 (21%) and g3 (12%) for the g type, and p8 (78%) and p9 (15%) for the p type. the segmented nature of rotavirus genomes does not allow us to infer the g-p genotype classification because we do not know which combinations of vp4 and vp7 were inside the same viral particle. however, it is highly probable that the most abundant g genotypes were combined with the most abundant p genotypes, for example g8p(8), g6p(8) or g6p(9). of these genotypes, g8p(8) was recently reported as an emergent rotavirus strain in a medium-sized city near santiago between 2016 and 2018 [63]. this strain has not been previously reported in south or north america, but is highly similar to sequences detected between 2010 and 2016 in asia [63]. in the same line of evidence, the g8 sequences found in trebal also shared 99% nucleotide identity with sequences detected in 2013 in thailand [64] and between 2017 and 2018 in japan [65]. the rotavirus g6 genotype is also an emergent strain, with most of the reports concentrated in europe between 2010 and 2013, but spurious circulation has been reported since the 2000s [66, 67]. in our analysis, the g6 sequences recovered from trebal are closely related (99% nucleotide identity) to a g6p(9) strain detected in germany in 2014 and to a g6p(6) strain detected in belgium in 2002; yet, to our knowledge, this is the first report of rotavirus g6 in chile. these results emphasize the relevance of sewage viromes as epidemiological surveillance tools. likewise, our study demonstrates the advantage of using viral metagenomics for this task, which, despite using short sequences, can deliver reliable results that can later be confirmed by other methodologies.
the latter point is especially relevant for rotavirus since surveillance based on pcr has shown serious bias related to primer specificity [63] . these results are especially significant for chile, as it is one of the few south american countries that has not implemented a national rotavirus vaccination program [63] . therefore, identification of genotypes that have not been previously reported in chile represents a first step in rotavirus prevention. furthermore, this can be used to generate valuable information to improve or implement new vaccination programs against this disease worldwide. this study explored the use of viral metagenomics to discover rna viruses in sewage and it is the first insight into the wastewater virosphere from a large city in south america. we have demonstrated the utility of metaviromics for the discovery of new groups and viral genotypes associated with known families. this is especially important due to the underrepresentation of pbvs in databases and the presence of uncommon rotavirus genotypes that are usually beyond the range of pcr-based surveillance. consequently, viral metagenomics can be used for the exploration and surveillance of emergent viruses, to better design new viral markers for diagnoses and routine epidemiological work, and for the implementation of national vaccination programs. 
here a virus, there a virus, everywhere the same virus raw sewage harbors diverse viral populations high variety of known and new rna and dna viruses of diverse origins in untreated sewage characterisation of the sewage virome: comparison of ngs tools and occurrence of significant pathogens metagenomics for the study of viruses in urban sewage as a tool for public health surveillance seasonal and spatial dynamics of enteric viruses in wastewater and in riverine and estuarine receiving waters viromic analysis of wastewater input to a river catchment reveals a diverse assemblage of rna viruses relative abundance and treatment reduction of viruses during wastewater treatment processes-identification of potential viral indicators differential removal of human pathogenic viruses from sewage by conventional and ozone treatments risk management of viral infectious diseases in wastewater reclamation and reuse quito's virome: metagenomic analysis of viral diversity in urban streams of ecuador's capital city k-mer content, correlation, and position analysis of genome dna sequences for the identification of function and evolutionary features a decade of rna virus metagenomics is (not) enough detection of rotavirus a in sewage samples using multiplex qpcr and an evaluation of the ultracentrifugation and adsorption-elution methods for virus concentration cutadapt removes adapter sequences from high-throughput sequencing reads kenkyuhi hojokin gan rinsho kenkyu jigyo quality control and preprocessing of metagenomic datasets 0: a fast and scalable metagenome assembler driven by advanced methodologies and community practices idba-ud: a de novo assembler for single-cell and metagenomic sequencing data with highly uneven depth versatile and open software for comparing large genomes prodigal: prokaryotic gene recognition and translation initiation site identification fast and sensitive protein alignment using diamond megan community edition-interactive exploration and analysis of 
large-scale microbiome sequencing data linking virus genomes with host taxonomy active crossfire between cyanobacteria and cyanophages in phototrophic mat communities within hot springs fast gapped-read alignment with bowtie 2 the protein families database a new generation of homology search tools based on probabilistic inference alignment-free sequence comparison: benefits, applications, and tools a novel method for rapid multiple sequence alignment based on fast fourier transform w-iq-tree: a fast online phylogenetic tool for maximum likelihood analysis fast model selection for accurate phylogenetic estimates ictv virus taxonomy profile: picobirnaviridae assessing the diversity and specificity of two freshwater viral communities through metagenomics viromes as genetic reservoir for the microbial communities in aquatic environments: a focus on antimicrobial-resistance genes phages rarely encode antibiotic resistance genes: a cautionary tale for virome analyses viral metagenomics eukaryotic viruses in wastewater samples from the united states redefining the invertebrate rna virosphere hyperexpansion of rna bacteriophage diversity bacteriophages and genetic mobilization in sewage and faecally polluted environments extensive conservation of prokaryotic ribosomal binding sites in known and novel picobirnaviruses multiple divergent picobirnaviruses with functional prokaryotic shine-dalgarno ribosome binding sites present in cloacal sample of a diarrheic chicken tracking down antibiotic-resistant pseudomonas aeruginosa isolates in a wastewater network wading bird use of wastewater treatment wetlands in central florida, usa. 
colon use of sewage treatment works as foraging sites by insectivorous bats microbial contamination of drinking water and human health from community water systems comparative enteric viruses and coliphage removal during wastewater treatment processes in a sub-tropical environment a structure-function diversity survey of the rna-dependent rna polymerases from the positive-strand rna viruses common and unique features of viral rna-dependent polymerases a novel group of avian astroviruses in wild aquatic birds coronavirus in water environments: occurrence, persistence and concentration methods-a scoping review discovery and genetic analysis of novel coronaviruses in least horseshoe bats in southwestern china detection of potentially novel paramyxovirus and coronavirus viral rna in bats and rats in the mekong delta region of southern viet nam luteovirus-aphid interactions ictv virus taxonomy profile: bromoviridae proposal for a unified norovirus nomenclature and genotyping human norovirus transmission and evolution in a changing world epidemiology, phylogeny, and evolution of emerging enteric picobirnaviruses of animal origin and their relationship to human strains novel picobirnaviruses in respiratory and alimentary tracts of cattle and monkeys with large intra-and inter-host diversity cameroonian fruit bats harbor divergent viruses, including rotavirus h, bastroviruses, and picobirnaviruses using an alternative genetic code predominance of rotavirus g8p(8) in a city in chile, a country without rotavirus vaccination full genome characterization of novel ds-1-like g8p(8) rotavirus strains that have emerged in thailand: reassortment of bovine and human rotavirus gene segments in emerging ds-1-like intergenogroup reassortant strains molecular characteristics of novel mono-reassortant g9p(8) rotavirus a strains possessing the nsp4 gene of the e2 genotype detected in tokyo detection of unusual g6 rotavirus strains in italian children with diarrhoea during the 2011 
surveillance season increased detection of g3p(9) and g6p(9) rotavirus strains in hospitalized children with acute diarrhea in bulgaria this article is an open access article distributed under the terms and conditions of the creative commons attribution (cc by) license we want to thank to christina ridley for her help in the language editing of this manuscript and for her valuable opinion on it. we are grateful to aguas andinas staff, marcela etcheberrigaray, jacqueline pizarro, and christian sepulveda, for their invaluable support in obtaining the sewage samples. the authors declare no conflict of interest. key: cord-341879-vubszdp2 authors: li, lucy m; grassly, nicholas c; fraser, christophe title: genomic analysis of emerging pathogens: methods, application and future trends date: 2014-11-22 journal: genome biol doi: 10.1186/s13059-014-0541-9 sha: doc_id: 341879 cord_uid: vubszdp2 the number of emerging infectious diseases is increasing. characterizing novel or re-emerging infections is aided by the availability of pathogen genomes. in this review, we evaluate methods that exploit pathogen sequences and the contribution of genomic analysis to understand the epidemiology of recently emerged infectious diseases. when a pathogen crosses over from animals to humans, or an existing human disease suddenly increases in incidence, the infectious disease is said to be 'emerging'. the number of emerging infectious diseases (eids) has increased over the last few decades, driven by both anthropogenic and environmental factors [1] . these include the expansion of agricultural land, which increases the exposure of livestock and humans to infections in wildlife [2] ; a greater volume of air traffic, enabling eids to rapidly spread across the world [3, 4] ; and climate change, which alters the ecology and density of animal vectors, thereby introducing diseases to new geographic locations [5] . novel strains of existing pathogens also have the potential to cause large epidemics. 
the over-and misuse of antimicrobial drugs have contributed to the growing number of drug-resistant pathogen strains [6, 7] . detecting, characterizing and responding to an eid requires co-ordination and collaboration between multiple sectors and disciplines. laboratory-based research helps to characterize the pathogen and its interactions with host cells, but is less useful for quantitative understanding of population-level disease dynamics. modeling approaches enable a large number of hypotheses to be tested, which might not be logistically or ethically feasible in laboratory and field experiments. in addition to characterizing past disease dynamics, modeling future trends informs decisions regarding outbreak response and resource allocation [8] . modeling plays an especially important role in epidemiological studies of infectious disease spread, because the transmission of infectious disease between individuals is not directly observable. at the individual level, transmission times and who infected whom are typically unknown. and at the population level, disease burden needs to be inferred from observable data. important public health questions such as how quickly an epidemic spreads and how many people will be infected are hard to quantify without a mechanistic understanding of underlying factors driving disease transmission. by expressing disease spread in mathematical terms, statistical properties of epidemics can be estimated to help address specific questions regarding disease spread and control efforts [9] . another discipline contributing to the study of eids is pathogen genomics. as sequencing technology has become more accessible and affordable, genetic analysis has played an increasingly important role in infectious disease research. sequencing pathogens can confirm suspected cases of an infectious disease, discriminate between different strains, and classify novel pathogens. 
in addition to examining individual pathogen sequences, multiple sequences can be analyzed together using phylogenetic methods to elucidate evolutionary [10] and transmission [11] history. just as mathematical models of disease transmission help to capture the epidemiological properties of an infectious disease, modeling the molecular evolution of pathogen genomes is important for phylogenetic methods. besides characterizing the genetics and evolution of a pathogen, mathematical models used in population genetics link demographic and evolutionary processes to temporal changes in population-level genetic diversity. the coalescent population genetics framework was developed so that demographic history could be inferred from the shape of the genealogy linking sampled individuals [12, 13] . more recently, the birth-death model has been applied to infectious diseases to infer epidemiological history from a genealogy [14, 15] . given the link between pathogen evolution and disease transmission, there is a trend towards integrating both epidemiologic and genetic data in the same analytical framework [16] [17] [18] . in this review, we provide an overview of recent developments in genomic methods in the context of infectious diseases, evaluate integrative methods that incorporate genetic data in epidemiological analysis, and discuss the application of these methods to eids. over the last two decades, sequence data have increased in quality, length and volume due to improvements in the underlying technology and decreasing costs. as a result, pathogen sequences are regularly collected during routine surveillance and clinical studies. just as mathematical modeling can be used to analyze surveillance data to reveal details of disease transmission (box 1), analysis of pathogen genomes employs mathematical frameworks to elucidate pathogen biology, evolution and ecology (figure 1 ). 
at the most basic level, mathematical models are used to find the optimal alignment of pathogen sequences. multiple sequence alignment is useful for finding highly conserved or variable regions, shedding light on the molecular biology of the pathogen. furthermore, coupling sequences with clinical information can help identify the contribution of polymorphic sites to disease. revealing the evolutionary history of a pathogen requires a quantitative description of relatedness. based on polymorphic sites in the sequence alignment, a model of sequence evolution is then used to reconstruct the phylogeny [19] . often, there is insufficient genetic diversity in the sample to fully infer the phylogeny without ambiguity. in such a case, it is useful to consider a tree as an unknown set of parameters and obtain its posterior probability distribution using a bayesian framework, such as the markov chain monte carlo (mcmc) approaches [20, 21] . biological samples from which pathogen genetic material is sequenced are usually associated with geographic or temporal information (figure 1b ). when this additional information is available, phylogenetic methods can reveal the spatiotemporal spread of the pathogen in the population. if an outbreak is densely sampled, then the pathogen phylogeny provides information about the underlying transmission network and helps to uncover who infected whom [22, 23] , though phylogenetic clustering alone is usually not sufficient to prove direct transmission or direction of infection ( figure 1b) . incorporating sampling times helps to convert a phylogeny specified in units of nucleotide substitutions to a phylogeny specified in units of time [24] . the conversion is straightforward if sequence evolution follows a strict molecular clock, whereby the rate of substitution remains constant over time. however, selection pressure and population bottlenecks can lead to changes in the rate of substitution [25] . 
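To make the strict-clock conversion concrete, the sketch below uses root-to-tip regression, a simple exploratory technique that is swapped in here for illustration and is not described in the text above. Under a strict clock, the genetic distance from the root to each tip grows linearly with sampling date, so the slope of a least-squares line estimates the substitution rate and the x-intercept estimates the date of the root. The function name and the toy data are hypothetical.

```python
def root_to_tip_regression(dates, distances):
    """Least-squares fit of root-to-tip distance against sampling date.

    dates:      sampling dates in decimal years
    distances:  root-to-tip distances in substitutions per site
    Returns (rate, root_date): the slope in substitutions/site/year and
    the x-intercept, i.e. the date at which the fitted distance is zero.
    """
    n = len(dates)
    if n != len(distances) or n < 2:
        raise ValueError("need at least two (date, distance) pairs")
    mean_t = sum(dates) / n
    mean_d = sum(distances) / n
    sxx = sum((t - mean_t) ** 2 for t in dates)
    if sxx == 0:
        raise ValueError("sampling dates must not all be identical")
    sxy = sum((t - mean_t) * (d - mean_d) for t, d in zip(dates, distances))
    rate = sxy / sxx
    if rate <= 0:
        raise ValueError("no positive temporal signal in the data")
    root_date = mean_t - mean_d / rate
    return rate, root_date


if __name__ == "__main__":
    # Toy data generated with rate 0.001 and a root in 2008.
    dates = [2010.0, 2011.0, 2012.0, 2013.0]
    dists = [0.001 * (t - 2008.0) for t in dates]
    print(root_to_tip_regression(dates, dists))
```

Because tips share ancestry, the points are not statistically independent, so a fit like this is best treated as a check for temporal signal rather than a formal estimate; the bayesian approaches mentioned earlier handle the uncertainty properly.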
more flexible models have been developed to incorporate time-varying rates of evolution [26, 27]. with branch lengths in units of real time, the start date of an epidemic can be estimated. whereas phylogenetics aims to delineate the relationship between individuals, population genetics aims to link population processes to observed patterns of genetic diversity. inferences regarding pathogen population history are based on the genealogy, or ancestry, of sequences from sampled individuals, and often carried out in a retrospective population genetics framework known as the coalescent [12] (box 2). a genealogy describes the ancestry of sampled individuals. going backwards in time, pairs of lineages coalesce when they share a common ancestor, until the last two lineages coalesce at the time of the most recent common ancestor (tmrca) for the entire sample. since the turn of the century, the coalescent has been increasingly applied to infectious disease research to infer epidemic history from pathogen sequences, thereby linking pathogen evolutionary history to disease epidemiology (figure 1c). the method is especially useful for analyzing infectious diseases with mild or asymptomatic infections, for which case-based surveillance data severely underestimate prevalence, because the coalescent assumes a small sample compared to the population size [28-30]. other approaches have been developed to make epidemiological inferences from genetic data. of particular note is the birth-death model [31], which describes the rates of transmissions, recoveries and deaths, and sampling events in terms of the sample genealogy [14]. just as there are coalescent methods incorporating population structure [32-34] and compartmental models [35-37], similar methods exist in the birth-death framework [38, 39]. unlike the coalescent framework, the birth-death model is still valid for densely sampled populations, which makes it more useful for studying small outbreaks.
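The backwards-in-time coalescent process described above can be sketched in a few lines. The example assumes the simplest case, a haploid population of constant effective size in which k lineages coalesce at rate k(k-1)/2 per effective-population-size generations, so the expected tmrca for n samples is 2·N_e·(1 − 1/n) generations. The function names, the haploid scaling and the parameter values are assumptions made for illustration.

```python
import random


def expected_tmrca(n, ne, gen_time=1.0):
    """Expected tmrca of n samples under the constant-size coalescent.

    ne is the (haploid) effective population size; gen_time converts
    generations into calendar units.
    """
    if n < 2:
        raise ValueError("need at least two sampled lineages")
    return 2.0 * ne * (1.0 - 1.0 / n) * gen_time


def simulate_tmrca(n, ne, gen_time=1.0, rng=random):
    """Draw one tmrca by summing exponential waiting times.

    While k lineages remain, the time to the next coalescence is
    exponential with rate k(k-1)/2 divided by ne (per generation).
    """
    if n < 2:
        raise ValueError("need at least two sampled lineages")
    t = 0.0
    for k in range(n, 1, -1):
        rate = k * (k - 1) / 2.0 / ne
        t += rng.expovariate(rate)
    return t * gen_time


if __name__ == "__main__":
    rng = random.Random(1)
    sims = [simulate_tmrca(10, 1000, rng=rng) for _ in range(20000)]
    print(sum(sims) / len(sims), expected_tmrca(10, 1000))
```

Most of the expected depth of the genealogy comes from the final pair of lineages, which is one reason tmrca estimates from a single locus carry wide uncertainty.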
however, accurately inferring epidemiological parameters depends on correctly specified sampling proportions [40]. although the two approaches are methodologically different, both aim to reconstruct pathogen population history and produce estimates of epidemiological parameters, such as the reproductive number (r 0). the focus on the coalescent framework in this review is due to its more pervasive use in the literature and its greater versatility when integrated with epidemiological models compared to birth-death models. because of the simplistic assumptions of population genetics models, the population size inferred using coalescent-based methods cannot be directly interpreted as pathogen population size (prevalence of infection). it is rather the effective population size, n e (box 2), which refers to the size of a wright-fisher population that would produce the same level of genetic diversity as observed in the sample. in real populations, the variance of the offspring distribution (box 1) is higher than expected in a wright-fisher population due to heterogeneity in host infectiousness, non-random mixing of the population, and migration events. the consequence of a large variance is that there is a greater discrepancy between the effective and census population sizes [41]. accounting for the dispersion of the offspring distribution is especially important when analyzing infectious disease data because of the widespread occurrence of transmission heterogeneity [42]. another statistical property of epidemics affecting the results of modeling studies is the generation time distribution, which describes the time between infection of the primary case and of secondary cases. obtaining an estimate of the generation time is important for two reasons. first, estimates of r 0 from the initial growth rate of an epidemic depend on the generation time distribution [43].
as r 0 is the mean of the offspring distribution, its value affects the relationship between the effective population size, n e , and the census population size, n. second, the coalescent model was originally specified in units of generations, and so estimates in this framework need to be converted to natural units using the generation time, t g . because transmission events are rarely observed, the generation time distribution is often approximated by the distribution of the serial interval, which is the time between onset of symptoms in the primary and secondary cases. the two distributions generally share the same mean but might have different variances [44] . furthermore, the observed generation time decreases as the epidemic grows but increases again after the epidemic peak due to right censoring [45] . as both sequence and surveillance data contain information regarding the transmission process, simultaneously analyzing both datasets should yield more accurate estimates of epidemiological parameters than separate analyses [17] . the recently established discipline of phylodynamics takes an interdisciplinary approach to understand the pathogen phylogenetics and epidemiology in terms of disease transmission. most efforts thus far have focused on enhancing phylogenetic and population genetic analyses by incorporating spatial and temporal information about the sequences. the molecular clock model assumes a constant rate of evolution and thus helps to estimate the time of the most recent common ancestor of the sample, which approximates the start date of an epidemic. molecular clock analysis has been used to date the emergence of a range of emerging pathogens from hiv [46] to multidrug-resistant streptococcus pneumoniae [47] . linking geographic information with sequences can reveal the spatial spread of infectious disease. 
phylogenetic reconstruction of seasonal influenza (h3n2) sequences has revealed the contribution of viral circulation in temperate regions to the global genetic diversity of influenza, and determined that not all epidemics in temperate regions are seeded by strains from south east asia [48, 49] . also using global sequences, hepatitis c virus (hcv) subtypes were shown to spread from developed to developing countries [50] . finally, phylogeographic analysis of methicillin-resistant staphylococcus aureus samples identified england as the source of the emrsa-15 lineage [51] . by contrast, there have been relatively few studies incorporating genetic data into epidemiological frameworks. although genetic analysis plays an important role in elucidating transmission links in disease outbreaks [20, 21, 52] , its integration with epidemiological models to understand population-level disease dynamics has been more limited. in one of the first papers to link coalescent inference to mathematical models in epidemiology, the effective population sizes of hiv-1 subtypes a and b were estimated from the maximum likelihood trees of viral sequences [53] . in addition to revealing population sizes, pybus et al. [54] estimated the r 0 values of hcv subtypes (1a, 1b, 4 and 6) by inferring the epidemic growth rate from viral genealogy. taking integration a step further, the coalescent process has been described for compartmental epidemiological models such as the susceptible-infected-recovered (sir) model, thereby enabling epidemiological parameters to be inferred from the genealogy [35] . to infer demographic history from both pathogen genomes and epidemiological data, rasmussen et al. [17] developed a markovian framework in which the population size at each time step was estimated by taking into account both the surveillance data and the genealogy. the epidemic history reconstructed using both datasets was more accurate than when analyzing each type of data separately. 
in all the above methods, the genealogy of the sampled sequences was fixed. however, there might be great uncertainty regarding the order and the timing of coalescence, especially if the sequences are sampled within a short time period. while genealogical reconstruction using bayesian mcmc approaches allows phylogenetic uncertainty to be incorporated into estimates of population size [13, 31] , an integrative model is lacking in which uncertainties arising from both genetic and epidemiological data are incorporated during demographic reconstruction. models of pathogen evolution and mechanistic models of disease spread have increased in complexity. there is also greater computational power to test these models with data. however, these sophisticated models have mostly been applied to infectious diseases for which abundant data are available. for example, new methods are most often tested on the hiv-1 pandemic [15, 34, 35, 55] , for which data have been extensively collected from various settings and sources since the virus was first characterized three decades ago. it is worthwhile to evaluate how genomic methods have been applied to other diseases that have emerged more recently. in this section, we will present three case studies of recently emerged infectious diseases to illustrate the power and shortcomings of genomic methods discussed in this review. since emerging in guinea in march 2014, ebola virus (ebov) has spread to other countries in western africa, resulting in the largest outbreak of ebola since it was first identified in 1976. the first viral genomes were made available just a month after alarm was raised about a new ebola outbreak in guinea [56] , with further sequences collected in sierra leone [57] . by aligning all the genomes, a number of polymorphic sites were identified, including eight in highly conserved regions of the genome. further association studies are needed to clarify the role of these genetic variants in determining disease outcome. 
using the sampling dates of the sequences and a molecular clock model, phylogenetic analysis of 81 ebov sequences revealed a start date of february 2014 in guinea, spreading to sierra leone by april 2014 [57] . uncovering the relationship between the 2014 ebov lineage and previous ebov outbreaks has proved trickier than understanding the disease dynamics during the 2014 outbreak. initial phylogenetic analysis suggested that lineages causing the present outbreak did not cluster with ebov strains that caused earlier outbreaks in central africa [56] . however, dudas and rambaut [58] noted that the divergence of guinea sequences from those of previous outbreaks was because they were sequenced most recently and had accumulated the highest number of substitutions. assuming that the ebov genome followed a molecular clock model, the authors re-rooted the tree to a lineage that caused an outbreak in 1976 [58] . instead of silently circulating in west africa, the ebov lineage causing the current outbreak likely descended from a lineage that previously caused outbreaks in the democratic republic of congo. these studies highlight two issues. first, correct rooting of a phylogeny is important for accurate inference of past epidemic history. correct rooting can be achieved by using an out-group, but one was not available in the case of this ebov strain. this leads onto the second issue. without sequences from animal hosts, the mechanism by which ebov was sustained between outbreaks remains unknown. middle east respiratory syndrome coronavirus (mers-cov) first appeared in saudi arabia in 2012, and has since been reported in several neighboring countries in the arabian peninsula and on other continents [59] . 
despite the dearth of sequence data, coalescent-based analysis of 10 genomic sequences produced estimates of the tmrca (march 2012; 95% confidence interval (ci): november 2011 to june 2012), r 0 (1.21; 95% ci: 1.08, 1.40), and doubling time (43 days; 95% ci: 23, 104 days) [60]. without further sequencing of the animal reservoirs, the authors could not infer whether these estimates applied to the animal reservoir or the human epidemic, because the methods are agnostic as to where transmission and evolution occur. the credible intervals around the estimates were unsurprisingly large given the small sample size. unlike the 2014 ebov outbreak, which is sustained by human-to-human transmission [57], there appear to have been multiple introductions of mers-cov into the human population. identification of the animal reservoir is therefore crucial for establishing risk factors of infection and planning appropriate interventions to control the disease. since bats are reservoirs for other coronaviruses, they are a plausible reservoir host for mers-cov. a 182-nucleotide-long region of the rna-dependent rna polymerase gene was found to be 100% identical between a viral sample from a patient in saudi arabia and from a bat nearby, though the region is known to be highly conserved [61]. however, antibodies against human mers-cov have been detected in dromedary camels [62], the camel mers-cov genome is similar to human mers-cov [62], and there are reports of close contact between patients and camels [63]. phylogenetic analysis of coronavirus sequences from bats, dromedaries and humans indicates a bat origin, with the dromedary camel as an intermediate host [64]. it is possible that there are other animal reservoirs not yet sampled, which highlights the need to carry out extensive animal surveillance to characterize the emergence of an infection in humans.
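the doubling time and r 0 quoted above are linked through the exponential growth rate and the generation time distribution. the following is a minimal sketch of that link, not the procedure used in the cited study: the function names are hypothetical, the two formulas are the standard limiting cases for exponentially distributed and fixed generation times, and the generation time of 10 days is an illustrative assumption rather than a value from the text.

```python
import math


def growth_rate_from_doubling_time(doubling_time):
    """Exponential growth rate r implied by a doubling time (same time unit)."""
    return math.log(2) / doubling_time


def r0_exponential_generation_time(r, mean_generation_time):
    """R0 when generation times are exponentially distributed: 1 + r * Tg."""
    return 1.0 + r * mean_generation_time


def r0_fixed_generation_time(r, mean_generation_time):
    """R0 when every generation time equals its mean: exp(r * Tg)."""
    return math.exp(r * mean_generation_time)


if __name__ == "__main__":
    r = growth_rate_from_doubling_time(43.0)  # doubling time quoted above, in days
    assumed_tg = 10.0  # hypothetical mean generation time in days (assumption)
    print("growth rate per day:", r)
    print("R0, exponential generation time:", r0_exponential_generation_time(r, assumed_tg))
    print("R0, fixed generation time:", r0_fixed_generation_time(r, assumed_tg))
```

for the same growth rate and mean generation time, the fixed-interval formula always gives the larger value, which illustrates why the assumed shape of the generation time distribution matters when r 0 is derived from a growth rate.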
unraveling the complex evolutionary history of pandemic h1n1 influenza

with sequences collected over three decades from humans, pigs and birds, the origin of the pandemic h1n1 influenza a strain (pdmh1n1 or 'swine flu') was elucidated soon after emergence. within two months of the first reported case of swine flu in humans, genomic analysis of the novel influenza strain had been carried out. a phylogeny was constructed for each of the eight genomic segments with sequences from humans, swine and birds. comparison of these eight phylogenies revealed a complex history of reassortment with a mixture of gene segments from all three groups. the start of the pandemic was estimated to be the end of 2008 or early 2009, and the dates of the reassortment events leading to pdmh1n1 were also obtained [10]. without good surveillance of influenza in the animal reservoir, the origin of the novel strain would have been difficult to uncover. by analyzing 11 hemagglutinin sequences collected over a one-month period, the start date of the epidemic was estimated to be in late january 2009 [65]. repeating the phylogenetic and molecular clock analyses with a further 12 sequences shifted the estimated start date two weeks earlier. fitting an exponential growth model to the sequence data, r 0 was estimated to be 1.22, slightly lower than inferred from epidemiological data but with overlapping confidence intervals. to determine at which point during the pandemic coalescent analysis would have provided accurate and precise estimates of evolutionary rate, r 0 and tmrca, real-time estimates of these parameters were obtained for genomic sequences collected in north america [66]. accurate estimates could have been obtained as early as may, when 100 viral genomes had been sequenced. more precise estimates could have been obtained by the end of june, when 164 had been sequenced. however, inclusion of more sequences of longer length only slightly improved the accuracy of initial estimates [66].
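the start dates in the case studies above rest on a molecular clock, which ties genetic divergence to sampling time. a simple way to see the idea is root-to-tip regression: under a strict clock, the distance from the root to each tip grows linearly with sampling date, so the slope estimates the evolutionary rate and the x-intercept estimates the tmrca. the sketch below is an illustration under that assumption only; the cited analyses need not have used this approach, the function name is hypothetical, and the toy numbers are made up for the example.

```python
def root_to_tip_regression(sampling_times, root_to_tip_distances):
    """Least-squares fit of root-to-tip distance against sampling time.

    Returns (rate, tmrca): the slope in substitutions per site per time unit,
    and the time at which the fitted line crosses zero distance.
    """
    n = len(sampling_times)
    if n < 2 or n != len(root_to_tip_distances):
        raise ValueError("need at least two paired observations")
    mean_t = sum(sampling_times) / n
    mean_d = sum(root_to_tip_distances) / n
    sxx = sum((t - mean_t) ** 2 for t in sampling_times)
    sxy = sum((t - mean_t) * (d - mean_d)
              for t, d in zip(sampling_times, root_to_tip_distances))
    if sxx == 0:
        raise ValueError("all sampling times are identical")
    rate = sxy / sxx
    if rate <= 0:
        raise ValueError("no positive temporal signal in the data")
    intercept = mean_d - rate * mean_t
    return rate, -intercept / rate


if __name__ == "__main__":
    # Toy data (assumption): decimal-year sampling dates and distances that
    # follow a strict clock of 0.002 substitutions/site/year from 2008.9.
    times = [2009.25, 2009.30, 2009.35, 2009.40]
    distances = [0.002 * (t - 2008.9) for t in times]
    print(root_to_tip_regression(times, distances))
```

with sequences sampled over a short window, as in the 11-sequence analysis above, the spread of sampling times is small and the intercept is poorly constrained, which is consistent with the estimated start date shifting when more sequences were added.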
most applications of statistical models in population genetics have focused on viruses, although this bias is perhaps unsurprising given the large proportion of eids caused by viruses [1]. whole-genome sequencing of bacterial isolates is becoming more widespread, and can help to uncover genetic determinants of clinical severity, elucidate pathogen-host interactions, and quantify evolutionary rates at within- and between-host levels [67]. epidemiological investigations using bacterial genomes have also been possible. even though bacteria acquire point mutations at a lower rate per base than viruses, longer bacterial genomes have provided sufficient genetic resolution for phylogenetic analysis. for example, whole-genome sequencing has been used to refine the tuberculosis transmission network built using contact information [21], and to investigate an outbreak of methicillin-resistant staphylococcus aureus in a hospital and surrounding community in near real-time [68]. the need for longer sequences when conducting epidemiological studies of bacterial infections adds to the per-sample cost of sequencing, and more computational resources are required for coalescent-based inference of pathogen history. however, this latter limitation may be overcome by only analyzing polymorphic sites if samples are similar. demographic reconstruction of emerging bacterial pathogens using coalescent-based approaches has been limited compared to work on viral pathogens. in one such study, the temporal changes in genetic diversity of streptococcus pneumoniae in iceland were estimated based on the coalescent model [47]. this study was limited to a single multidrug-resistant lineage in a single location, with data collected over decades. over longer evolutionary time-scales, the accumulation of diversity through recombination can obscure phylogenetic relationships.
more complex evolutionary models would be required to take into account these genomic changes, increasing the uncertainty surrounding demographic estimates from genomic data. in addition to performing analyses with longer sequences, there is also a need to develop methods that exploit as many sequences in the sample as possible. for population studies, available sequences are often subsampled to remove individuals from the same household or in the same close contact network to have a representative sample of the population. furthermore, sequences from the same individuals are often discarded, though these may be informative for within-host evolution. although some effort has been made to link within-host to between-host evolution [52, 69], the effect of within-host evolution on population genetic inference is still not well studied. combining analyses across different scales could improve the accuracy of epidemiological predictions and provide better mechanistic explanations of observed trends. genomic studies have contributed to better understanding of eids and their spatiotemporal spread. sophisticated statistical methods have been developed to uncover the epidemiological features of infectious diseases based on the genealogy of their sequences. there is also growing effort to integrate genomic analysis with analysis of epidemiological data. in recent cases of eids, genomic data have helped to classify and characterize the pathogen, uncover the population history of the disease, and produce estimates of epidemiological parameters.

box 1: [...] when the distribution cannot be computed analytically [70]. obtaining estimates of r 0 and t g is not always sufficient to predict epidemic trajectory if there is significant heterogeneity between individuals. the offspring distribution with mean r and variance σ² describes the probability distribution of the number of secondary infections caused by each infected individual. in compartmental models, the offspring distribution is not explicitly specified but follows from the specification of the model; in the case of the sir model, it follows a geometric distribution. for certain diseases, the offspring distribution is more dispersed than captured by the geometric distribution [42]. in other words, most individuals cause no further infections whereas a few individuals are super-spreaders who cause the majority of infections. an accurate estimate of σ² is important for predicting epidemic outcome and assessing control measures.

box 2: just as compartmental models can be fitted to surveillance data to infer the epidemiological dynamics of an infectious disease (box 1), the coalescent framework allows inference of population history from pathogen sequences. the coalescent model describes the statistical properties of the genealogy underlying a small sample of individuals from a large population. in the simplest case, the forward-time dynamics of the population is assumed to follow the wright-fisher model, in which the haploid population has discrete, non-overlapping generations, undergoes neutral evolution, and remains the same size [71, 72]. extensions to the coalescent have assumed more complex population dynamics described by deterministic population equations [73], compartmental disease models [35], or non-parametric approaches [13, 55, 74, 75]. within this framework, going backwards in time, individuals in the current generation are randomly assigned to parents in the previous generation. if two individuals have the same parent, then a coalescent event has occurred. eventually, all lineages in the sample coalesce to a single individual known as the most recent common ancestor of the sample. the rate of coalescence is inversely related to population size. if the population follows the wright-fisher model, evolutionary changes are selectively neutral, so the shape of the genealogy reflects only demographic changes.
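the coalescent description above can be made concrete with a short simulation. the sketch assumes the simplest case mentioned in the text, a haploid wright-fisher population of constant size, and uses the standard continuous-time approximation in which, with k lineages, the waiting time to the next coalescent event is exponential with rate k(k-1)/(2n) per generation. the function names are hypothetical, and time is measured in generations; multiplying by the generation time t g would convert it to natural units.

```python
import random


def simulate_coalescent_times(sample_size, population_size, rng):
    """Cumulative times (in generations, going backwards) of each coalescent
    event for a sample drawn from a constant-size haploid population."""
    times = []
    elapsed = 0.0
    lineages = sample_size
    while lineages > 1:
        # Any of the k*(k-1)/2 pairs can coalesce, each at rate 1/N.
        rate = lineages * (lineages - 1) / (2.0 * population_size)
        elapsed += rng.expovariate(rate)
        times.append(elapsed)
        lineages -= 1
    return times  # the last entry is the TMRCA of the sample


def expected_tmrca(sample_size, population_size):
    """Expected TMRCA in generations: 2N * (1 - 1/n)."""
    return 2.0 * population_size * (1.0 - 1.0 / sample_size)


if __name__ == "__main__":
    rng = random.Random(2014)  # fixed seed so the run is repeatable
    for n_pop in (500.0, 1000.0):
        tmrcas = [simulate_coalescent_times(10, n_pop, rng)[-1] for _ in range(2000)]
        print(n_pop, sum(tmrcas) / len(tmrcas), expected_tmrca(10, n_pop))
```

doubling the population size doubles the expected time to the most recent common ancestor, which is the sense in which the rate of coalescence is inversely related to population size and why the timing of coalescent events carries information about n e.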
global trends in emerging infectious diseases
agricultural intensification, priming for persistence and the emergence of nipah virus: a lethal bat-borne zoonosis
geographic expansion of dengue: the impact of international travel
spread of a novel influenza a (h1n1) virus via global airline transportation
hantavirus epidemic in europe
emergence of new forms of totally drug-resistant tuberculosis bacilli: super extensively drug-resistant tuberculosis or totally drug-resistant strains in iran
is neisseria gonorrhoeae initiating a future era of untreatable gonorrhea?: detailed characterization of the first strain with high-level resistance to ceftriaxone
mathematical models of infectious disease transmission
origins and evolutionary genomics of the 2009 swine-origin h1n1 influenza a epidemic
molecular epidemiology of the foot-and-mouth disease virus outbreak in the united kingdom in 2001
on the genealogy of large populations
bayesian coalescent inference of past population dynamics from molecular sequences
sampling-through-time in birth-death trees
estimating the basic reproductive number from viral sequence data
unifying the epidemiological and evolutionary dynamics of pathogens
inference for nonlinear epidemiological models using genealogies and time series
relating phylogenetic trees to transmission trees of infectious disease outbreaks
integrating genetic and epidemiological data to determine transmission pathways of foot-and-mouth disease virus
whole-genome sequencing and social-network analysis of a tuberculosis outbreak
measurably evolving populations
time-dependent rates of molecular evolution
estimating the rate of evolution of the rate of molecular evolution
estimation of primate speciation dates using local molecular clocks
beast: bayesian evolutionary analysis by sampling trees
mrbayes 3.2: efficient bayesian phylogenetic inference and model choice across a large model space
the genomic and epidemiological dynamics of human influenza a virus
hiv-1 transmission during early infection in men who have sex with men: a phylodynamic analysis
reconciling phylodynamics with epidemiology: the case of dengue virus in southern vietnam
simultaneous reconstruction of evolutionary history and epidemiological dynamics from viral sequences with the birth-death sir model
complex population dynamics and the coalescent under neutrality
modelling tree shape and structure in viral phylodynamics
phylodynamic inference for structured epidemiological models
phylodynamics of infectious disease epidemics
simple epidemiological dynamics explain phylogenetic clustering of hiv from patients with recent infection
rates of coalescence for common epidemiological models at equilibrium
uncovering epidemiological dynamics in heterogeneous host populations using phylogenetic methods
dating phylogenies with sequentially sampled tips
birth-death skyline plot reveals temporal changes of epidemic spread in hiv and hepatitis c virus (hcv)
hatzakis a: integrating phylodynamics and epidemiology to estimate transmission diversity in viral epidemics
superspreading and the effect of individual variation on disease emergence
how generation intervals shape the relationship between growth rates and reproductive numbers
a note on generation times in epidemic models
generation interval contraction and epidemic data analysis
lemey p: hiv epidemiology. the early spread and epidemic ignition of hiv-1 in human populations
variable recombination dynamics during the emergence, transmission and 'disarming' of a multidrug-resistant pneumococcal clone
global migration dynamics underlie evolution and persistence of human influenza a (h3n2)
temporally structured metapopulation dynamics and persistence of influenza a h3n2 virus in humans
the global spread of hepatitis c virus 1a and 1b: a phylodynamic and phylogeographic analysis
a genomic portrait of the emergence, evolution, and global spread of a methicillin-resistant staphylococcus aureus pandemic
bayesian inference of infectious disease transmission from whole genome sequence data
population dynamics of hiv-1 inferred from gene sequences
the epidemic behavior of the hepatitis c virus
exploring the demographic history of dna sequences using the generalized skyline plot
emergence of zaire ebola virus disease in guinea-preliminary report
genomic surveillance elucidates ebola virus origin and transmission during the 2014 outbreak
phylogenetic analysis of guinea 2014 ebov ebolavirus outbreak
center for disease control and prevention: middle east respiratory virus (mers)
middle east respiratory syndrome coronavirus: quantification of the extent of the epidemic, surveillance biases, and transmissibility
middle east respiratory syndrome coronavirus in bats, saudi arabia
middle east respiratory syndrome coronavirus in dromedary camels: an outbreak investigation
detection of the middle east respiratory syndrome coronavirus genome in an air sample originating from a camel barn owned by an infected patient
rooting the phylogenetic tree of middle east respiratory syndrome coronavirus by characterization of a conspecific virus from an african bat
who rapid pandemic assessment collaboration: pandemic potential of a strain of influenza a (h1n1): early findings
real-time characterization of the molecular epidemiology of an influenza pandemic
insights from genomics into bacterial pathogen populations
whole-genome sequencing for analysis of an outbreak of methicillin-resistant staphylococcus aureus: a descriptive study
the genealogical population dynamics of hiv-1 in a large transmission chain: bridging within and among host evolutionary rates
teller e: equation of state calculations by fast computing machines
the genetical theory of natural selection
evolution in mendelian populations
sampling theory for neutral alleles in a varying environment
an integrated framework for the inference of viral population history from reconstructed genealogies
smooth skyride through a rough skyline: bayesian coalescent-based inference of population dynamics
genomic analysis of emerging pathogens: methods, application and future trends

we would like to thank nick croucher for discussions on bacterial genomics. ll is funded by a medical research council doctoral training partnership studentship.

the authors declare that they have no competing interests.

key: cord-301827-a7hnuxy5 authors: uversky, vladimir n title: a decade and a half of protein intrinsic disorder: biology still waits for physics date: 2013-04-29 journal: protein science doi: 10.1002/pro.2261 sha: doc_id: 301827 cord_uid: a7hnuxy5 the abundant existence of proteins and regions that possess specific functions without being uniquely folded into unique 3d structures has become accepted by a significant number of protein scientists. sequences of these intrinsically disordered proteins (idps) and idp regions (idprs) are characterized by a number of specific features, such as low overall hydrophobicity and high net charge which makes these proteins predictable. idps/idprs possess large hydrodynamic volumes, low contents of ordered secondary structure, and are characterized by high structural heterogeneity. they are very flexible, but some may undergo disorder to order transitions in the presence of natural ligands. the degree of these structural rearrangements varies over a very wide range.
idps/idprs are tightly controlled under the normal conditions and have numerous specific functions that complement functions of ordered proteins and domains. when lacking proper control, they have multiple roles in pathogenesis of various human diseases. gaining structural and functional information about these proteins is a challenge, since they do not typically “freeze” while their “pictures are taken.” however, despite or perhaps because of the experimental challenges, these fuzzy objects with fuzzy structures and fuzzy functions are among the most interesting targets for modern protein research. this review briefly summarizes some of the recent advances in this exciting field and considers some of the basic lessons learned from the analysis of physics, chemistry, and biology of idps. a bit more than ten years ago, protein science published a review entitled "natively unfolded proteins: a point where biology waits for physics" (protein sci 2002 11(4):739-756). 1 the major goal of that article was to bring an intriguing protein family of natively unfolded proteins (which are recognized now to constitute a subset of a very broad class of intrinsically disordered proteins, idps) out of shadow, to emphasize their lack of ordered structure under physiological conditions (at least ordered structure that could be detected by traditional low resolution techniques), to systemize their major structural properties, and to highlight their biological significance. the introduction of such biologically active but essentially unstructured proteins was used to challenge the hitherto dominant structure-centric viewpoint (structure-function paradigm), according to which a specific function of a protein is determined by its unique and rigid three-dimensional (3d) structure. 
the title of the review ("a point where biology waits for physics") was inspired by the observations that many of such "structure-less" proteins analyzed by that time acted as "binders" that did undergo at least partial folding after interaction with their binding partners. these observations provoked an idea that these biologically important proteins with little or no ordered structure have to wait to become more folded (and functional) as a result of binding to their specific partners. in other words, for these proteins, "biology," that is, the ability to have biological functions, seemed to wait for "physics" which is manifested in their ability to undergo binding-induced folding (at least partial), which is necessary to bring the functional state of these proteins to life. 1 at the beginning, the idea that structure-less proteins can be biologically active was taken as a complete heresy by many researchers, since it was absolutely alien to the then-dominant structure-function paradigm, which represented a foundation of the long-standing belief that the specific functionality of a given protein is determined by its unique 3-d structure. this structure-function paradigm, which describes the catalytic behavior of enzymes reasonably well, was based on the "lock-and-key" hypothesis formulated in 1894 by emil fischer. 2 this viewpoint was solidified by the successful solution of x-ray crystallographic structures of many proteins (as of february 26, 2013 there were 81,922 protein structures in the protein data bank, 3 with 72,761 of these structures being determined by x-ray crystallography). these many crystal structures reinforced a static view of a functional protein, where a rigid active site of an enzyme can be viewed as a sturdy lock that provides an exact fit to only one key, a specific and unique substrate. 4 despite numerous limitations, this lock-and-key model was an extremely fruitful concept that was responsible for the creation of modern protein science.
1 figure 1 (a) shows some of the most obvious scientific consequences of the application of the structure-function paradigm, which is deservedly placed at the center of the "big bang" model that gives rise to the protein science universe. 1 obviously, the consideration of a protein as a rigid crystal-like entity is an oversimplification, since even the most stable and well-folded proteins are dynamic systems that possess different degrees of conformational flexibility. this is because of the simple fact that so-called conformational forces, that is, forces stabilizing the secondary structure of a protein and its tertiary fold, are weak and can be broken even at ambient temperatures due to thermal fluctuations. 4 the breaking of these weak interactions releases the groups that were involved in these interactions and gives them the possibility to be involved in the formation of new weak interactions of comparable energy. 4 since these structural rearrangements are of relatively small scale and since they occur typically in a time scale that is faster than the time required for structure determination by x-ray crystallography and many other physical techniques, the 3-d structures of proteins determined by these techniques represent averaged pictures. 6 furthermore, one should keep in mind that not all protein structures deposited in the pdb are structured throughout their entire lengths. instead, many pdb proteins have portions of their sequences missing from the determined structures (so-called regions of missing electron density) 7, 8 due to the failure of the unobserved atom, side chain, residue, or region to scatter x-rays coherently caused by their flexible or disordered nature. such flexible/disordered regions are rather common in the pdb, since only about 30% of protein structures deposited in the pdb do not have such regions of missing electron density.
9 in addition to ordered proteins possessing disordered regions of varying length, the literature contains numerous examples of biologically active proteins with flexible structures. 4 therefore, there is another class of functional proteins and protein regions that contain smaller or larger highly dynamic fragments, and some proteins are even characterized by a complete or almost complete lack of ordered structure under physiological conditions (at least in vitro) which appears to be a critical aspect of these proteins' function in vivo. 4, [10] [11] [12] [13] [14] [15] these proteins and protein regions (which are known now as idps and idp regions (idprs)) have no single, well-defined equilibrium structure and exist as heterogeneous ensembles of conformers such that no single set of coordinates or backbone ramachandran angles is sufficient to describe their conformational properties. these proteins were independently discovered one-by-one over a long period of time and therefore they were considered as rare exceptions to the general rule. although the phenomenon of biological functionality without stable structure was repeatedly observed, for a long time it was unnoticed by a wide audience because the authors frequently invented new terms to describe their protein of interest. 16 in fact, an incomplete list of terms coined in the literature to describe these proteins includes floppy, pliable, rheomorphic, 17 flexible, 18 mobile, 19 partially folded, 20 natively denatured, 21 natively unfolded, 12, 22 natively disordered, 15 intrinsically unstructured, 11, 14 intrinsically denatured, 21 intrinsically unfolded, 22 intrinsically disordered, 13 vulnerable, 23 chameleon, 24 malleable, 25 4d, 26 protein clouds, 27 dancing proteins, 28 proteins waiting for partners, 29 and several other names often representing different combinations of "natively/naturally/inherently/intrinsically" with "unfolded/unstructured/disordered/denatured" among several others. 
therefore, the majority of the names used in the early literature express that the "unfolded, unstructured, disordered, and denatured" state is a "native, natural, inherent, and intrinsic" property of these proteins. 16 although protein intrinsic disorder is now considered an established concept and pubmed contains hundreds and hundreds of papers discussing different aspects of idps/idprs, the route to recognizing these proteins as a novel functional entity was complex and lengthy. as is often the case for new scientific concepts, the idea of structure-less functionality went through the stages of passive ignorance and active denial to scrupulous examination and enthusiastic acceptance. for example, it took me more than a year to publish my first paper dedicated to the systematic analysis of such proteins, and the manuscript was successively rejected by 14 journals before it was finally accepted by proteins. 12 however, time showed that the concept of protein intrinsic disorder was a useful invention and could be considered as a universal lock-pick that helps in solving many of the seemingly unsolvable problems in protein science.
figure 1. a: the protein structure-function paradigm as the "big bang" that created the universe of modern protein science. some major directions based on the consideration of protein function as a lock-and-key mechanism are shown. modified from ref. 1. b: the paradigm shift caused by the introduction of the protein intrinsic disorder concept opened a wide array of new directions in protein science. in essence, introduction of this concept can be considered as a scientific revolution that, according to kuhn, 5 "occurs when scientists encounter anomalies that cannot be explained by the universally accepted paradigm within which scientific progress has thereto been made" (http://en.wikipedia.org/wiki/paradigm_shift).
one could say that this idea gave a new boost to the development of protein science, generating a wide array of principally novel research directions [see fig. 1(b)]. the goals of this review are: (i) to outline some recent advances in the field of idps/idprs; (ii) to illustrate the usefulness of intrinsic disorder for protein function; (iii) to show that intrinsic disorder can affect different levels of protein structural organization; (iv) to indicate the intimate involvement of intrinsic disorder in the pathogenesis of various maladies; (v) to emphasize the exceptional structural heterogeneity of idps/idprs and to show that idps are definitely much more structurally complex than random coil-like polypeptides; (vi) to accentuate that although this structural heterogeneity is very important for protein functionality, it represents a crucial hurdle for structural characterization of idps; (vii) to stress that new experimental and computational approaches and new theories and models are crucially needed for future progress in this field and in protein science in general. these and other points highlight the current state of the field, where further advances in understanding of the "biology" of idps still wait for "physics," with "physics" now being new theories, instrumentation, and analytical approaches. identification of idps as unique entities belonging to a new protein tribe is directly related to the recognition that their amino acid sequences are dramatically different from those of ordered proteins. 10, 12, 13, [30] [31] [32] for example, it has been pointed out that the low content of hydrophobic residues combined with the high load of charged residues, which often gives rise to a high net charge of the polypeptide chain, represents a characteristic feature of some idps (so-called extended idps or natively unfolded proteins with coil-like or close to coil-like structures, see below).
12 therefore, compact proteins and extended idps can be distinguished based only on their net charges and hydropathies using a simple charge-hydropathy (ch) plot, where the idps are specifically localized within a specific region of ch phase space and are reliably separated from compact ordered proteins. 12 more detailed comparison of amino acid sequences revealed that in comparison with ordered proteins and domains, the idps/idprs are significantly depleted in order-promoting amino acids (trp, tyr, phe, ile, leu, val, cys, and asn), 10, 33 being instead enriched in disorder-promoting residues, such as ala, arg, gly, gln, ser, glu, lys, and pro. 13, 31, 32, 34, 35 difference between ordered and disordered proteins goes far beyond these differences in their amino acid compositions. in fact, based on the comparison of the 265 amino acid physico-chemical property-based scales (such as hydropathy, net charge, flexibility index, helix propensities, strand propensities, aromaticity, etc.) 34 and more than 6000 composition-based attributes (e.g., all possible combinations having one to four amino acids in the group) 36 it has been concluded that ordered and disordered proteins and regions can be discriminated using many of these attributes. 13 based on the analysis of 517 amino acid scales, a novel amino acid scale, top-idp (trp, phe, tyr, ile met, leu, val, asn, cys, thr, ala, gly, arg, asp, his, gln, lys, ser, glu, and pro), was built to provide ranking for the tendencies of the amino acid residue to promote order or disorder. 
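The charge-hydropathy (ch) comparison described above can be sketched in a few lines. This is a minimal illustration, not the procedure of any particular study: it assumes the standard Kyte-Doolittle hydropathy scale rescaled to the 0-1 range, treats lys and arg as +1 and asp and glu as -1 at neutral ph (histidine is taken as neutral), and uses a boundary line, <R> = 2.785<H> - 1.151, that comes from the ch-plot literature and is not stated in this section. The function names are hypothetical.

```python
# Kyte-Doolittle hydropathy scale (standard published values).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def mean_hydropathy(seq):
    """Mean hydropathy <H>, each residue rescaled from [-4.5, 4.5] to [0, 1]."""
    seq = seq.upper()
    return sum((KYTE_DOOLITTLE[aa] + 4.5) / 9.0 for aa in seq) / len(seq)


def mean_net_charge(seq):
    """Absolute mean net charge <R>.

    Assumption: K and R count as +1, D and E as -1, His as neutral.
    """
    seq = seq.upper()
    positive = sum(seq.count(aa) for aa in "KR")
    negative = sum(seq.count(aa) for aa in "DE")
    return abs(positive - negative) / len(seq)


def boundary_charge(hydropathy):
    """Assumed ch-plot boundary line (not given in this section)."""
    return 2.785 * hydropathy - 1.151


def looks_like_extended_idp(seq):
    """True if the sequence lies on the high-charge, low-hydropathy side."""
    return mean_net_charge(seq) > boundary_charge(mean_hydropathy(seq))


if __name__ == "__main__":
    for toy in ("EEEEKKEEEE", "ILVILVILVF"):  # toy sequences, not real proteins
        print(toy, mean_hydropathy(toy), mean_net_charge(toy),
              looks_like_extended_idp(toy))
```

Such a two-parameter test only addresses extended disorder; collapsed or hybrid proteins need the richer composition-based attributes discussed in the text.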
30 the fact that the sequences of ordered and disordered proteins and regions are noticeably different suggested that idps clearly constitute a separate entity inside the protein kingdom, that these proteins can be reliably predicted using various computational tools, [37] [38] [39] [40] [41] [42] and that, structurally, idps should be very different from ordered globular proteins, since peculiarities of amino acid sequence determine protein structure.
natural abundance of idps: touching the tip of the iceberg
initial systematic analyses revealed that intrinsic disorder in proteins is a rather common phenomenon. in fact, as of 2002, the list of experimentally validated natively unfolded proteins with chain length greater than 50 amino acid residues contained more than 100 entries. 1 it was also pointed out that this list would probably be doubled if shorter polypeptides 30-50 residues long were included, 1 and that these 100 experimentally validated natively unfolded proteins have at least 250 homologues, which are also expected to be natively unfolded. 1, 12 it turned out that these "large" numbers (which actually were large enough to make the crucial point that biologically active structure-less proteins represent a new rule and not mere rare exceptions) constitute just the small tip of an iceberg. in fact, the wide spread of idps and hybrid proteins containing idprs was convincingly shown using computational tools developed for sequence-based intrinsic disorder prediction. [43] [44] [45] [46] for example, more than 15,000 out of 91,000 proteins in the then-current swiss protein database were identified as having long idprs. 47 an analysis, published in 2000, of 31 whole genomes spanning the 3 kingdoms of life revealed that many proteins contained segments predicted to have 40 or more consecutive disordered residues and that the eukaryotes exhibited more disorder by these measures than either the prokaryotes or the archaea.
43 other studies on the abundance of intrinsic disorder in various evolutionary distant species supported these findings and consistently showed that the eukaryotic proteomes had higher fraction of intrinsic disorder than prokaryotic proteomes. 44, [48] [49] [50] [51] [52] this conclusion is in line with the results of a comprehensive bioinformatics investigation of the disorder distribution in almost 3500 proteomes from viruses and three kingdoms of life, results of which are shown in figure 2 as the correlation between the intrinsic disorder content and proteome size for 3484 species from viruses, archaea, bacteria, and eukaryotes. 46 surprisingly, figure 2 shows that there is a well-defined gap between the prokaryotes and eukaryotes in the plot of fraction of disordered residues on proteome size, where almost all eukaryotes have 32% or more disordered residues, whereas the majority of the prokaryotic species have 27% or fewer disordered residues. 46 therefore, it looks like the fraction of 30% disordered residues serves as a boundary between the prokaryotes and eukaryotes and reflects the existence of a complex step-wise correlation between the increase in the organism complexity and the increase in the amount of intrinsic disorder. a gap in the plot of fraction of disordered residues on proteome size parallels a morphological gap between prokaryotic and eukaryotic cells which contain many complex innovations that seemingly arose all at once. in other words, this sharp jump in the disorder content in proteomes associated with the transition from prokaryotic to eukaryotic cells suggests that the increase in the morphological complexity of the cell paralleled the increased usage of intrinsic disorder. 
46 the variability of disorder content in unicellular eukaryotes and rather weak correlation between disorder status and organism complexity (measured as the number of different cell types) is likely related to the wide variability of their habitats, with especially high levels of disorder being found in parasitic host-changing protozoa, the environment of which changes dramatically during their life-span. 53 the further support for this hypothesis came from the fact that the intrinsic disorder content in multicellular eukaryotes (which are characterized by more stable and less variable environment of individual cells) was noticeably less variable than that in the unicellular eukaryotes. 46 it was pointed out that idps possess noticeable amino acid biases, and many idps/idprs are characterized by sequence redundancy and low sequence complexity, containing long stretches of various repeats and being completely devoid of some (often many) types of amino acid residues. these observations seem to indicate that the sequence space of idps/idprs should be simpler than that of ordered proteins. however, the reality is more complex than conventional wisdom might suggest, and the sequence space attainable by simple idps/idprs is more diversified than that of the structurally more sophisticated ordered proteins. in fact, a 100 residue-long protein in which any of the normally occurring 20 amino acids can be found has a sequence space of 20^100 (approximately 10^130) sequences. 54 obviously, not all random amino acid sequences can fold into unique structures. in other words, a sequence space of a foldable protein (or "foldable" sequence space) is noticeably smaller than the entire sequence space available for a random polypeptide chain.
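The size of the sequence space quoted above (20 to the power 100, roughly 10 to the power 130) follows from a one-line logarithm. The short sketch below reproduces that arithmetic; the function name is hypothetical, and the alphabet size is left as a parameter so the same calculation can be repeated for any smaller alphabet.

```python
import math


def log10_sequence_space(alphabet_size, length):
    """Return log10 of the number of possible sequences (alphabet_size ** length)."""
    return length * math.log10(alphabet_size)


if __name__ == "__main__":
    # 100-residue chain built from the 20 standard amino acids.
    exponent = log10_sequence_space(20, 100)
    print(f"20^100 is about 10^{exponent:.1f}")
```

Working with the exponent rather than the number itself avoids printing a 131-digit integer and makes the effect of shrinking the alphabet easy to compare.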
for decades, the actual size of the "foldable" sequence space has remained an unsolved mystery, despite a large body of theoretical, biochemical, and computational work that aims to unravel the relationship between a protein's primary sequence and its resulting 3d structure. 55 however, the actual number of different amino acid residues in a given foldable sequence can be dramatically reduced, 54 since not all twenty residues are necessary for protein folding and the actual physicochemical identity of most of the amino acids in a protein is irrelevant. [56] [57] [58] [59] [60] [61] [62] [63] in other words, the folding alphabet can be noticeably reduced, 55, 64 and amino acids can be clustered based on some shared features such as homolog substitution frequency, 65 local structural environments, 66 or peculiarities of the tertiary structural environments. 67 this simplified folding code further reduces the available "foldable" sequence space. 68
figure 2. correlation between the intrinsic disorder content and proteome size for 3484 species from viruses, archaea, bacteria, and eukaryotes. each symbol indicates a species. there are six groups of species in total: viruses expressing one polyprotein precursor (small red circles filled with blue), other viruses (small red circles), bacteria (small green circles), archaea (blue circles), unicellular eukaryotes (brown squares), and multicellular eukaryotes (pink triangles). each viral polyprotein was analyzed as a single polypeptide chain, without parsing it into the individual proteins before predictions. the proteome size is the number of proteins in the proteome of that species and is shown on a logarithmic scale. the average fraction of disordered residues is calculated by averaging the fraction of disordered residues of each sequence over all sequences of that species. disorder prediction is evaluated by pondr-vsl2b. modified from ref. 46.
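The caption of figure 2 defines the per-species disorder content as the average of the per-sequence fractions of disordered residues. A minimal sketch of that averaging is given below. It assumes that per-residue disorder calls are already available as lists of 0/1 flags (no predictor is run here), and the function and variable names are hypothetical. A pooled, length-weighted fraction is included only for contrast, since the two numbers differ whenever protein lengths vary.

```python
def mean_disorder_fraction(proteome):
    """Average of per-sequence disorder fractions (the definition in the caption).

    `proteome` maps a protein name to a list of 0/1 flags, one per residue,
    where 1 marks a residue called disordered. Empty entries are skipped.
    """
    fractions = [sum(flags) / len(flags) for flags in proteome.values() if flags]
    return sum(fractions) / len(fractions)


def pooled_disorder_fraction(proteome):
    """Length-weighted alternative: disordered residues over all residues."""
    total = sum(len(flags) for flags in proteome.values())
    disordered = sum(sum(flags) for flags in proteome.values())
    return disordered / total


if __name__ == "__main__":
    toy_proteome = {            # toy data, not real predictions
        "short": [1, 1, 0, 0],  # half disordered
        "long": [0] * 10,       # fully ordered
    }
    print(mean_disorder_fraction(toy_proteome))
    print(pooled_disorder_fraction(toy_proteome))
```

Averaging per sequence gives every protein equal weight, so a proteome with many short, highly disordered proteins scores higher than it would under the pooled measure.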
simply by virtue of their existence, idps/idprs add a new level of complexity to the sequence-structure relationship, dividing the population of protein sequences into two categories: sequences that yield natively ordered proteins, and sequences that code for natively disordered proteins. 55 idps/idprs cannot fold spontaneously and some of them require specific partners to gain more ordered structure. therefore, they do not possess the entire folding code that defines the ability of foldable proteins to fold spontaneously into a unique biologically active structure. the missing portion of the idp folding code (or at least part of it) is supplemented by binding partner(s). this defines a principal difference between structured proteins and idps/idprs: foldable proteins fold first and then bind to their partners, whereas idps/idprs remain disordered until they interact with their partners. 68, 69 furthermore, many idps/idprs do not require folding to be functional, 1, 4, 13, 14, [70] [71] [72] [73] and some of them form fuzzy complexes, in which they preserve a significant amount of disorder. 74, 75 all this suggests that the sequence space of idps (at least those which either do not fold at all or do not completely fold at binding) is noticeably greater than the "foldable" sequence space due to the removal of restrictions posed by the need to gain ordered structure spontaneously. 68 this represents one of the conundrums of intrinsic disorder, where the apparent sequence redundancy and simplicity are combined with the lack of structural restraints, leading to an increase in the dimensions and complexity of the available sequence space. also, the existence of a noticeable sequence-structure heterogeneity of idps should be emphasized. 68 since the unique 3d-structure of an ordered single-domain protein is defined by the interplay between all (or almost all) of its residues, one could expect that the structure-coding potential is homogeneously distributed within its amino acid sequence.
on the other hand, a sequence of an idp/idpr contains multiple, relatively short functional elements and therefore represents a very complex structural and functional mosaic. 68 this important feature defines the known ability of an idp/idpr to interact, regulate, and be controlled by multiple structurally unrelated partners. 76 such functional "anatomy" of idps/idprs is determined by the extremely high level of their sequence heterogeneity, which is further increased due to the ability of a single idpr to bind to multiple partners gaining very different structures in the bound state. 77 one of the crucial consequences of an extended sequence space and non-homogeneous distribution of foldability (or the structure-coding potential) within amino acid sequences of idps and idprs is their astonishing structural heterogeneity. in fact, a typical idp/idpr contains a multitude of elements coding for potentially foldable, partially foldable, differently foldable, or not foldable at all protein segments. 68 as a result, different parts of a molecule are ordered (or disordered) to a different degree. this distribution is constantly changing in time where a given segment of a protein molecule has different structures at different time points. as a result, at any given moment, an idp has a structure which is different from a structure viewed at another moment. 68 another level of structural heterogeneity is determined by the fact that many proteins are hybrids of ordered and disordered domains and regions, and this mosaic structural organization is crucial for their functions. 16 also, even when they do not possess ordered domains, idps are known to have various levels and depth of disorder. 78 over a few past years, an understanding of the available conformational space of idps/idprs underwent significant evolution. in fact, for a long time, idps were considered mostly "unstructured" or "natively unfolded" polypeptide chains. 
this was mostly due to the fact that the majority of idps analyzed at early stages of the field contained very little ordered structure, that is, they were really mostly unstructured or unfolded. finding and characterizing such "structure-less" proteins was important to build up a strong case against the dominant view represented by the classical sequence-to-structure-to-function paradigm, especially since such fully unstructured, yet functional proteins clearly represented the other extreme of the protein structure-function spectrum. 16 the top half of figure 3 illustrates this situation by opposing rock-like ordered proteins and cooked spaghetti-like idps. however, already in some early studies, it was indicated that idps/idprs could be crudely grouped into two major structural classes, proteins with compact and extended disorder. 1, 4, 12, 13, 73 based on these observations, protein functionality was ascribed to at least three major protein conformational states, ordered, molten globular, and coil-like, 13, 79 indicating that functional idps can be less or more compact and possess smaller or larger amounts of flexible secondary/tertiary structure. 1, 4, 12, 13, 73, 79 roughly at the same time, it was emphasized that the extended idps (known as natively unfolded proteins) do not represent a uniform entity but contain two broad structural classes, native coils and native pre-molten globules. 1 currently available data suggest that intrinsic disorder possesses multiple flavors, can have multiple faces, and can affect different levels of protein structural organization, where whole proteins, or various protein regions, can be disordered to a different degree. 68 this new view of the structural space of functional proteins can be visualized as a continuous spectrum of differently disordered conformations extending from fully ordered to completely structure-less proteins, with everything in between (fig. 3, bottom half).
here, functional proteins can be well-folded and be completely devoid of disordered regions (rock-like scenario). other functional proteins may contain limited number of disordered regions (a grass-on-the rock scenario), or have significant amount of disordered regions (a llama/camel hair scenario), or be molten globule-like (a greasy ball scenario), or behave as pre-molten globules (a spaghetti-and-meatballs/sausage scenario), or be mostly unstructured (a hairball scenario). notably, in this representation, there is no boundary between ordered proteins and idps, and the structure-disorder space of a protein is considered as a continuum. it is important to remember that even the most ordered proteins do not resemble "solid rocks" and have some degree of flexibility. in fact, a protein molecule is an inherently flexible entity and the presence of this flexibility (even for the most ordered proteins) is crucial for its biological activity. 80 also, another important point to remember is that due to their heteropolymeric nature, proteins are never random coils and always have some residual structure. 68 protein biophysicists/biochemists working on different aspects of ordered proteins (e.g., analyzing their structural properties, functions, folding, etc.) would find biophysical properties of functional idps/idprs to be rather unusual since these highly dynamic proteins do not follow the well-accepted wisdom that a protein has to be well-folded to be biologically functional. however, the unusualness is a subjective feature, and from the viewpoint of polymer physics the extended idps/idprs possess the expected behavior of flexible and charged polymers, whereas the behavior of an ordered protein is rather unexpected (i.e., due to the existence of the native ensemble that for well-folded, ordered proteins can be approximated as a harmonic well around a unique, well-defined equilibrium structure). therefore, one definitely should keep in mind that the "unusual" biophysics of extended idps/idprs has its roots in the usual polymer physics of highly charged and flexible polypeptides.
figure 3. structural heterogeneity of idps/idprs. top half: bi-colored view of functional proteins which are considered to be either ordered (folded, blue) or completely structure-less (disordered, red). ordered proteins are taken as rigid rocks, whereas idps are considered as completely structure-less entities, kind of cooked noodles. bottom half: a continuous emission spectrum representing the fact that functional proteins can extend from fully ordered to completely structure-less proteins, with everything in between. intrinsic disorder can have multiple faces, can affect different levels of protein structural organization, and whole proteins, or various protein regions can be disordered to a different degree. some illustrative examples include ordered proteins that are completely devoid of disordered regions (rock-like type), ordered proteins with limited number of disordered regions (grass-on-the rock type), ordered proteins with significant amount of disordered regions (llama/camel hair type), molten globule-like collapsed idps (greasy ball type), pre-molten globule-like extended idps (spaghetti-and-sausage type), and unstructured extended idps (hairball type).
each protein is believed to be a unique entity that has quite unique primary sequence which governs its 3d structure (or lack thereof) and ensures specific biological function(s). therefore, understanding the effect of sequence variance on the biological performance presents a challenging task. however, natural polypeptides have originated as random copolymers of amino acids, which were adjusted or "selected" over evolution based on their functional capacities.
56, 81 despite their differences in primary amino acid sequences, protein molecules in a number of conformational states behave as polymer homologues, suggesting that the volume interactions can be considered as a major driving force responsible for the formation of equilibrium structures or structural ensembles. 82 for example, ordered globular proteins and molten globules (both as folding intermediates of globular proteins or as examples of collapsed idps) exhibit key properties of polymer globules, where the fluctuations of the molecular density are expected to be much less than the molecular density itself. extended idps (both intrinsic coils and intrinsic pre-molten globules) and ordered proteins in the pre-molten globule intermediate state possess properties of squeezed coils, since water is a poor solvent for a polypeptide. in fact, even high concentrations of strong denaturants (e.g., urea and gdmcl) are very likely to be bad solvents for protein chains, resulting in the preservation of extensive residual structure even under these harsh denaturing conditions. 82 based on these and related observations, and taking into account the fact that many idps/idprs are characterized by significant amino acid composition biases, the overall polymeric behavior of these proteins and regions can be mimicked reasonably well by the behavior of low-complexity polypeptides (e.g., homopolypeptides and block copolypeptides). following these ideas, it was shown that water is a poor solvent for polypeptide backbone alone and for the idps containing long tracts of polar amino acid residues since polar homo-polypeptides without hydrophobic groups (e.g., polyglutamine or glycine-serine block copolypeptides) were shown to prefer collapsed ensembles in aqueous media. [83] [84] [85] [86] [87] [88] furthermore, even polyglycine was shown to have a tendency to form heterogeneous ensembles of collapsed structures in water.
88 a systematic analysis of the conformational behavior of protamines, arginine-rich idps involved in the condensation of chromatin during spermatogenesis, and protamine-like peptides revealed that there is a charge-driven coil-to-globule transition in these highly charged polypeptides, where the net charge per residue serves as the discriminating order parameter. 89 overall, the increase in the hydrodynamic dimensions of a polypeptide chain with increase in its net charge per residue can be attributed to the increase in the intramolecular electrostatic repulsions between similarly charged sidechains and the favorable solvation of these moieties. 89 based on these premises, at least three different classes of globule-forming polar/charged idps were proposed. the first class is comprised by polar tracts which collapse due to water being a poor solvent for a backbone and non-charged side chains. the second class is represented by weak polyelectrolytes and weak polyampholytes, which have low per residue net charge and low fractions of positively and/or negatively charged residues. these idps/ idprs form collapsed structures since the driving force responsible for the collapse is not overcome by the intramolecular electrostatic repulsion between the charged side-chains and by their favorable free energies of solvation. furthermore, if such idps/ idprs possess polyampholytic nature, their globular state could be additionally stabilized by electrostatic interactions between the oppositely charged sidechains. finally, idps/idprs from the third class are strong polyampholytes characterized by high fractions of positively and/or negatively charged residues but have low per residue net charge. such intrinsically disordered protein can form collapsed structures stabilized mostly by multiple electrostatic interactions between solvated side-chains of opposite sign. 
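The three classes described above are distinguished by two sequence-derived quantities: the fractions of positively and negatively charged residues, and the net charge per residue. The sketch below computes these quantities only. It is an illustration under simple assumptions (lys and arg counted as +1, asp and glu as -1, histidine neutral), the names are hypothetical, and no class thresholds are applied because the text does not state numerical cut-offs.

```python
from collections import namedtuple

# f_plus / f_minus: fractions of positive / negative residues
# fcr: fraction of charged residues; ncpr: signed net charge per residue
ChargeProfile = namedtuple("ChargeProfile", "f_plus f_minus fcr ncpr")


def charge_profile(seq):
    """Return charge fractions for a one-letter amino acid sequence.

    Assumption: K and R are +1, D and E are -1, all other residues neutral.
    """
    seq = seq.upper()
    n = len(seq)
    f_plus = sum(seq.count(aa) for aa in "KR") / n
    f_minus = sum(seq.count(aa) for aa in "DE") / n
    return ChargeProfile(f_plus, f_minus, f_plus + f_minus, f_plus - f_minus)


if __name__ == "__main__":
    # Toy sequences chosen only to show the three regimes discussed in the text.
    for name, seq in [("polar tract", "GSGSGSGSGS"),
                      ("strong polyampholyte", "KEKEKEKEKE"),
                      ("highly charged", "RRRRSRRRRG")]:
        print(name, charge_profile(seq))
```

A strong polyampholyte shows a high fraction of charged residues together with a net charge per residue near zero, whereas a protamine-like sequence shows both values high; the magnitude of the signed net charge corresponds to the "per residue net charge" used in the text.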
89 the extended idps/idprs were used as a model system for the analysis of the effect of electrostatic interactions on conformational properties of unfolded proteins, and for testing the quantitative descriptions and predictions of polymer theory related to the influence of charged amino acids on chain dimensions. 90 for example, based on the analysis of the conformational equilibrium of coarse-grained polypeptides as a function of sequence hydrophobicity, charge, and length, it has been concluded that the variations in sequence hydrophobicity and charge define a coil-to-globule transition comparable to that seen in the empirical ch-plot, 12, 91 suggesting that a minimal, polymer physics-based model can capture the elements of global protein conformation. 92 idps/idprs with very high net charges are expected to be more extended and to behave more like random coils (i.e., similar to conformations adopted by proteins in the denaturant gdmcl). the analysis of the gdmcl-induced expansion of the unfolded states suggested that protein charge density plays a crucial role in defining the hydrodynamic behavior of the unfolded polypeptide chain. 90 here, highly charged proteins can exhibit a prominent expansion at low ionic strength that correlates with their net charges. 90 it has also been hypothesized that the pronounced effect of charges on the dimensions of unfolded proteins might have important implications for their cellular functions. 90 similarly, a comprehensive analysis of the hydrodynamic dimensions of fg-nucleoporins containing large idprs with multiple phenylalanine-glycine repeats (fg-domains) revealed that under physiologic conditions in vitro these domains adopt distinct categories of disordered structures, such as molten globule, pre-molten globule, relaxed-coil, extended-coil (as in urea), or very extended-coil (as in gdmcl).
93 the category of intrinsically disordered structure in a given fg-domain was related to its amino acid composition, namely to the content of charged residues, where more charged fg-domains possessed larger hydrodynamic dimensions. 94 furthermore, fg-nucleoporins with higher charge density were shown to be more dynamic than the collapsed-coil fg-domains, being also prone to repel other fg-domains. on the other hand, the collapsed-coil fg-domains were prone to oligomerize. these observations suggested that different types of fg-domains with different aggregation propensities provide the molecular basis for two different gating mechanisms operating at the nuclear pore complex at distinct locations; one acting as a hydrogel, and the other as an entropic brush. 94 therefore, the abundance and peculiarities of the charged residue distribution within protein sequences might determine physical and biological properties of extended idps and idprs. also, simple polymer physics-based reasoning can give a reasonably well-justified explanation of the conformational behavior of extended idps. in general, the conformational behavior of idps is characterized by the low cooperativity (or the complete lack thereof) of the denaturant-induced unfolding, lack of the measurable excess heat absorption peak(s) characteristic for the melting of ordered proteins, "turned out" response to heat and changes in ph, and the ability to gain structure in the presence of various binding partners. 95 the analysis of the temperature effects on structural properties of several extended idps revealed that native coils and native pre-molten globules partially fold as the temperature is increased. 1, 73, [95] [96] [97] [98] these heating-induced structural changes in extended idps were attributed to the increased strength of the hydrophobic interaction at higher temperatures, leading to a stronger hydrophobic attraction, which is the major driving force for folding.
similarly, extended idps/idprs are characterized by the "turned out" response to changes in ph, 96,99-102 where a decrease (or increase) in ph induces their partial folding due to the minimization of the high net charges they carry at neutral ph, thereby decreasing charge/charge intramolecular repulsion and permitting hydrophobic-driven collapse to the partially folded conformation. 95
every disordered protein is disordered in its own way
data accumulated so far indicate that intrinsic disorder exists at multiple structural levels and might differently affect different regions/domains of idps. this defines the noted structural complexity and heterogeneity of idps/idprs, which are further enhanced by the way different proteins/protein regions respond to their environments. furthermore, since intrinsic disorder is crucial for many biological functions and therefore must prevail in different environments, the amino acid sequences and compositions of idps and idprs are specifically shaped by the peculiarities of their global and local environments. all this makes the protein intrinsic disorder phenomenon so broad that one can even assume that every disordered protein (or at least every family of disordered proteins) is disordered in its own way. this hypothesis has far-reaching consequences since it implies that a general disorder predictor has limited accuracy and cannot predict with equally high accuracy the disorder status of all protein sequences due to their heterogeneity. it also implies that some environmental factors definitely should be taken into account when assessing intrinsic disorder in proteins. several examples are presented below to support the overall validity of these statements. the first example is given by transmembrane (tm) proteins, in which disorder is widely observed (e.g., 40% of human integral plasma proteins were predicted to contain long idprs).
[103] [104] [105] [106] [107] furthermore, disorder is unevenly distributed between the cytoplasmic and the external surfaces of these proteins, with cytoplasmic domains being up to threefold more disordered than extracellular domains. 105 although these analyses gave interesting hints on the abundance of disorder in tm proteins, the obvious weakness of such evaluations is in the fact that they were performed using the disorder predictors developed from structured and disordered regions found in water-soluble proteins. 108 however, the major physico-chemical properties of water-soluble and integral membrane proteins are very different due to the differences in their environments. for example, similar to typical water-soluble proteins, the tm regions of membrane proteins are often highly structured, containing α-helices 109 or β-structure, 110 which are especially likely to occur due to the low dielectric constant values within the membrane bilayers. 111, 112 on the other hand, the exterior regions of tm proteins are much more apolar than the exteriors of water-soluble proteins. [113] [114] [115] therefore, the peculiarities of the membrane environment, with its highly nonpolar nature originating either from lipids or from protein interiors, are especially unfavorable for intrinsic disorder, since propensity for intrinsic disorder is typically encoded in a high content of polar and charged residues. therefore, the idprs found in integral membrane proteins would be expected to be generally localized within the regions external to the membrane bilayer. 108 also, the distinctive environment of the membrane bilayer imposes constraints on the amino acid composition of integral membrane proteins, even on the regions external to the membrane bilayer. 116, 117 comprehensive bioinformatics analysis revealed that integral membrane proteins commonly possess idprs defined as regions of missing electron density in their crystal structures.
108 comparison of the idprs found in the α-helical and the β-barrel bundle integral membrane proteins with the idprs viewed in typical water-soluble proteins revealed the existence of statistically distinct amino acid compositional biases characteristic for these three protein classes. therefore, the use of specific amino acid signatures of idprs found in tm helical bundles and β-barrels can potentially lead to significantly more accurate disorder predictions for these two classes of integral membrane proteins. 108 another illustrative example of the specific disorder-related and environment-dependent sequence features is given by archaeal proteins. 46, 51 based on the levels of predicted disordered residues, archaeal proteins can be grouped into three classes, with ranges of the disordered residue content of 12-21%, 21%-32%, and 32%-38% (see fig. 2 ). the archaeal proteomes with the highest disorder contents are halophiles and methanophiles. 46, 51 similar to tm proteins, the estimation of intrinsic disorder in the extremophilic proteins of the microorganisms surviving under hypersaline conditions using predictors developed for the "normal" non-halophilic proteins existing under the normal physiological conditions of 100-150 mm nacl may not be accurate. 46 in fact, one of the strategies used by the halophilic archaea, which are salt-loving extremophilic organisms that grow optimally at high salt concentrations, to maintain proper osmotic pressure in their cytoplasm is a so-called "salt-in" strategy that involves accumulation of molar concentrations of potassium and chloride in their cytosols. 118 this strategy requires extensive adaptation of the intracellular proteins to the presence of near-saturating salt concentrations.
the proteomes of such "salt-in" organisms are highly acidic, 46, 51 and their proteins are characterized by remarkable instability at conditions of low salt concentrations and by maintaining soluble and active conformations in hypersaline conditions that are generally detrimental to the non-halophilic proteins. [118] [119] [120] [121] [122] [123] [124] [125] [126] [127] finally, peculiarities of disorder distributions in viral proteins can be used to further support the importance of considering environmental factors. 46, 51 here, the comprehensive analysis of intrinsic disorder in various completed proteomes revealed that the viral proteomes have the largest variation of disorder content, which ranges from 7.3% disordered residues in the human coronavirus nl63 to 77.3% disordered residues in the avian carcinoma virus proteome (see fig. 2 ). 46 the high predicted intrinsic disorder content in viruses has multiple functional implications, where some idprs are used in the functioning of viral proteins and help viruses to hijack various pathways of the host cells, others likely have evolved to help viruses accommodate to their hostile habitats, and still others evolved to help viruses in managing their economic usage of genetic material via alternative splicing, overlapping genes, and anti-sense transcription. 128 these findings are in agreement with another study revealing that in comparison with archaea and bacteria, viral and bacteriophagic proteins were significantly more enriched in polar residues and depleted in hydrophobic residues and were close to eukaryotic proteins in terms of their amino acid compositions and the reduced content of the order-promoting residues. 129
functional protein clouds: major functional advantages of being intrinsically disordered
the high natural abundance of idps/idprs and their specific structural features indicate that these proteins and regions might carry out important biological functions.
this hypothesis has been confirmed by several comprehensive studies, 1, [11] [12] [13] [14] [71] [72] [73] 78, [130] [131] [132] [133] [134] which revealed that these structure-less members of the protein kingdom are abundantly involved in numerous biological processes, where they are frequently found to play different roles in regulation of the functions of their binding partners and in promotion of the assembly of supra-molecular complexes. 1, 4, [11] [12] [13] [14] [15] 31, [70] [71] [72] [73] [76] [77] [78] [79] 131, 132, [134] [135] [136] [137] [138] [139] [140] [141] [142] [143] [144] [145] [146] [147] [148] [149] the conformational plasticity of idps/idprs provides them with a wide spectrum of exceptional functional advantages over the functional modes of ordered proteins and domains. 4, 10, 11, 13, 32, 71, 72, 77, 78, 131, 132, 134, 141, 142, 150, 151 some of these advantages are:
1. increased speed of interaction due to greater capture radius and the ability to spatially search through interaction space;
2. increased interaction (surface) area per residue;
3. strengthened encounter complex allows for less stringent spatial orientation requirements;
4. efficient regulation via rapid degradation;
5. the ability to be involved in one-to-many binding, where a single disordered region binds to several structurally diverse partners;
6. the ability to be involved in many-to-one binding, where many distinct (structured) proteins may bind a single disordered region;
7. the ability to overcome steric restrictions, enabling larger interaction surfaces in protein-protein and protein-ligand complexes than those obtained with rigid partners;
8. the ability to fold upon binding (completely or partially);
9. the ability of some idps/idprs to form very stable intertwined complexes;
10. the ability of some idps/idprs to stay substantially disordered in bound state;
11. binding fuzziness, where different binding mechanisms (e.g., via stabilizing the binding-competent secondary structure elements within the contacting region, or by establishing the long-range electrostatic interactions, or being involved in transient physical contacts with the partner, or even without any apparent ordering) can be employed to accommodate peculiarities of interaction with various partners;
12. binding plasticity, where an idpr folds to specific bound conformations (which can be very different) according to the template provided by binding partners;
13. high accessibility of sites targeted for posttranslational modifications (ptms);
14. efficient structural and functional regulation via ptms such as phosphorylation, acetylation, lipidation, ubiquitination, sumoylation, and so forth, allowing for a simple means of modulation of their biological functions;
15. efficient functional control via regulatory proteolysis, the attack sites of which are frequently associated with idprs;
16. ease of regulation/redirection and production of otherwise diverse forms by alternative splicing (given the existence of multiple functions in a single disordered protein, and given that each functional element is typically relatively short, alternative splicing could readily generate a set of protein isoforms with a highly diverse set of regulatory elements 152 );
17. the possibility of overlapping binding sites due to extended linear conformation;
18. decoupled binding affinity and specificity, where, due to the induced folding, idp/idpr can be involved in the formation of specific but weak complexes. in other words, idp/idpr might possess high specificity for given partners combined with high k_on and k_off rates that enable rapid association with the partner without an excessive binding strength. this combination of high specificity with low affinity defines the broad utilization of intrinsic disorder in regulatory interactions where turning a signal off is as important as turning it on;
19. diverse evolutionary rates with some id proteins being highly conserved and other id proteins possessing high evolutionary rates. the latter ones can evolve into sophisticated and complex interaction centers (scaffolds) that can be easily tailored to the needs of divergent organisms;
20. flexibility that allows masking (or not) of interaction sites or that allows interaction between bound partners;
21. the ability to be involved in the cascade interactions, where idp binding to the first partner induces partial folding generating a new binding site suitable for interaction with the second partner, and so forth.
many disorder-related functions (e.g., signaling, control, regulation, and recognition) are incompatible with well-defined, stable 3-d structures. 1, [11] [12] [13] [14] 31, 73, 78, 79, 132, 134, [138] [139] [140] 142, 144, 153 functions of many idps/idprs rely on interactions with specific binding partners, and many idps/idprs tend to undergo disorder-to-order transitions as a result of binding to their specific targets. 12 functionally, idps/idprs were grouped in at least six broad classes based on the mode of action. 14,136 these broad classes included protein and rna chaperones, entropic chains, effectors, scavengers, assemblers and display sites, 14,136 and 28 separate functions, including molecular recognition via binding to other proteins, or to nucleic acids, were assigned for idprs in early studies. 71, 72 later, a rich spectrum of biological functions associated with idps/idprs was found based on a comprehensive computational study of a correlation between the functional annotations in the swiss-prot database and predicted intrinsic disorder.
[138] [139] [140] the approach was based on the hypothesis that if a function described by a given keyword relies on intrinsic disorder, then the keyword-associated protein would be expected to have a greater level of predicted disorder compared to a protein randomly chosen from swiss-prot. this analysis revealed that 44% and 34% of swiss-prot functional keywords were associated with ordered and disordered proteins, respectively, whereas 22% of functional keywords yielded ambiguity in the likely function-structure associations. [138] [139] [140] interestingly, most of the structured protein-associated keywords were shown to be related to enzymatic activities, whereas the majority of the disordered protein-associated keywords were related to signaling and regulation. these results agree well with the notion that enzymatic catalysis requires ordered structure and that effectiveness of signaling is dependent on binding reversibility, a property directly associated with the thermodynamics of disorder-to-order transition induced by binding. [138] [139] [140] many idps and idprs undergo a disorder-to-order transition upon functioning. 11, 13, 15, 71, 72, 78, 79, [130] [131] [132] 134, [154] [155] [156] [157] when disordered regions bind to signaling partners, the free energy required to bring about the disorder-to-order transition takes away from the interfacial, contact free energy, with the net result that a highly specific interaction can be combined with a low net free energy of association. 13, 155 high specificity coupled with low affinity is a useful pair of properties for a reversible signaling interaction. furthermore, a disordered protein can readily bind to multiple partners by changing shape to associate with different targets.
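the free energy argument above can be illustrated numerically. the sketch below assumes a simple additive decomposition (net free energy of association = contact free energy + cost of the disorder-to-order transition) and the standard relation between free energy and dissociation constant at a 1 m standard state; the function name and all numerical values are hypothetical and are not taken from the cited studies.

```python
import math

R_KCAL = 1.987e-3  # gas constant in kcal/(mol*K)


def association(delta_g_contact, delta_g_folding, temperature=298.15):
    """Return (net free energy in kcal/mol, dissociation constant in M).

    Assumption: the folding cost simply adds to the contact free energy,
    and Kd = exp(dG / RT) relative to a 1 M standard state.
    """
    net = delta_g_contact + delta_g_folding
    kd = math.exp(net / (R_KCAL * temperature))
    return net, kd


# Hypothetical values: the same interface (-14 kcal/mol of contact free
# energy) formed by a pre-folded partner and by a disordered region that
# must pay +6 kcal/mol to fold upon binding.
net_rigid, kd_rigid = association(-14.0, 0.0)
net_flex, kd_flex = association(-14.0, 6.0)
```

under these assumptions the interface, and hence the specificity it encodes, is the same in both cases, but the disordered partner binds with a weaker net free energy and a correspondingly larger dissociation constant, which is the combination of high specificity and low affinity described in the text.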
13, 158, 159 all this clearly suggests that there is a new two-pathway protein structure-function paradigm, with sequence-to-structure-to-function for enzymes and membrane transport proteins, and sequence-to-disordered ensemble-to-function for proteins and protein regions involved in signaling, regulation, and control. 1, 13, 71, 73, 79 one of the first generalizations of this concept was given by the protein trinity hypothesis, which suggested that native proteins can be in one of three states, the solid-like ordered state, the liquid-like collapsed-disordered state, or the gas-like extended-disordered state. 79 function is then viewed to arise from any one of the three states or from transitions between them. this model was subsequently expanded to include a fourth state (pre-molten globule) and transitions between all four states. 1 in reality, based on the idea of the continuous spectrum of protein structures outlined above, functional proteins contain various amounts of intrinsic disorder and this continuous structural spectrum of proteins defines their limitless functional variability. among intriguing protein functions relying on intrinsic disorder are moonlighting activities, 137 actions of hub proteins, 78, 93, 134, [160] [161] [162] [163] [164] and scaffolding functions. 141, 165 since all these functions illustrate the notion that the intrinsic disorder concept represents a universal skeleton key (or lock-pick) that helps unlock seemingly unresolvable mysteries of protein science and therefore can be considered as a new ariadne's thread that helps navigate the unusual twists of the sophisticated relationships between protein sequence, structure, and function, they are considered in some detail below. moonlighting proteins. moonlighting is the ability of a protein to fulfill more than one function. often, these functions are unrelated or at least are not obviously related to each other.
137, [166] [167] [168] the capability of a protein to be involved in moonlighting or multi-tasking activities represents one of the solutions used by nature to increase the organism's complexity without the expansion of the genome size, where by acting differently at distinct points of metabolic networks proteins increase network complexity without increasing the actual size of the network. 137, [166] [167] [168] among various molecular mechanisms used by the moonlighting proteins to switch between functions are changes in cellular localization, changes in ligand binding, expression in different cell types, and variations of the oligomerization state. 137 in addition to these mechanisms that can be explained within the frames of the traditional structure-function paradigm, consideration of the intrinsic disorder phenomenon opens new possibilities. 137 in fact, one of the peculiar functional advantages of idps/idprs is their binding promiscuity and ability to be involved in one-to-many signaling, whereby an idp/idpr binds structurally different partners in a template-induced folding process. 11, 77, 132, 169 therefore, idps/idprs can use the same region or overlapping interaction regions/surfaces to exert distinct effects and employ disorder-based mechanisms of function switching that rely on their capability to form different conformations upon binding. 137 such structural malleability of idps/idprs defines their ability to participate in unprecedented moonlighting events, where these disordered moonlighting proteins or regions produce the opposing effects (inhibition and activation) on different partners or even the same partner molecule. 137 hub proteins. signaling interactions inside the cell can be described as specific and complex networks that can be considered as "scale-free" or "small-world" networks, which have hubs, with many connections, and ends, which are connected to just one neighbor.
170, 171 such scale-free networks combine the local clustering of connections characteristic of regular networks with occasional long-range connections between clusters, as can be expected to occur in random networks. in other words, the distance between nodes in these scale-free networks follows a power-law distribution. 172 based on their spatiotemporal peculiarities protein hubs were grouped into two broad categories, "date hubs" that bind their numerous partners sequentially, and "party hubs" simultaneously interacting with their partners. 173 since many idps are known to be involved in interaction with a large number of distinct partners, they clearly can be considered as hubs in the scale-free protein-protein interaction networks. 78, 134 based on the systematic analysis of several known hub proteins 134 followed by a series of robust bioinformatics studies, 93,160-164 it was concluded that hubs commonly use disordered regions to bind to multiple partners and that there are at least two primary mechanisms by which disorder is utilized in protein-protein interaction networks, where one disordered region binds to many partners or many disordered regions bind to one partner. 134 scaffold proteins. scaffold proteins constitute an important subclass of hubs that typically have a modest number of interacting partners and that are commonly found at the central parts of functional complexes, where they interact with most of their partners at the same time and therefore act as party hubs.
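the distinction between hubs and ends drawn above can be sketched by counting interaction partners in a list of pairwise interactions. this is a minimal illustration only: the function names, the toy network, and the threshold of five partners used to call a node a hub are all assumptions, since the text does not specify a cutoff.

```python
from collections import defaultdict


def partner_counts(interactions):
    """Map each protein to the number of distinct partners it binds."""
    partners = defaultdict(set)
    for a, b in interactions:
        if a == b:
            continue  # ignore self-interactions
        partners[a].add(b)
        partners[b].add(a)
    return {protein: len(bound) for protein, bound in partners.items()}


def classify_nodes(interactions, hub_threshold=5):
    """Return (hubs, ends): nodes with many partners and nodes with one."""
    counts = partner_counts(interactions)
    hubs = sorted(p for p, n in counts.items() if n >= hub_threshold)
    ends = sorted(p for p, n in counts.items() if n == 1)
    return hubs, ends


# Hypothetical toy network: protein "H" binds five partners.
toy_network = [("H", "A"), ("H", "B"), ("H", "C"),
               ("H", "D"), ("H", "E"), ("A", "B")]
hubs, ends = classify_nodes(toy_network)
```

in this toy network the only hub is "H", while "C", "D", and "E" are ends. whether a hub acts as a date hub or a party hub cannot be read from partner counts alone, because that distinction depends on when the interactions occur.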
160 besides being responsible for bringing together specific proteins within a signaling pathway and providing selective spatial orientation and temporal coordination to facilitate and promote interactions among interacting proteins, some scaffolds can influence the specificity and kinetics of signaling interactions via simultaneous binding to multiple participants in a particular pathway and facilitating and/or modifying the specificity of pathway interactions; 174 other scaffolds can change conformations of individual proteins and thus modulate their activities; 174 still other scaffold proteins may modulate the activation of alternative pathways by promoting interactions between various signaling proteins. 141 analysis of several well-characterized signaling scaffold proteins revealed that their large idprs are crucial for the successful scaffold function. 141 a more global bioinformatics analysis revealed that a typical design of a scaffold protein includes a set of short globular domains (80 amino acids on average) connected by long linker regions (150 residues on average) with crucial binding functions. 165 this gave further support to the notion that signaling scaffold proteins utilize the various features of highly flexible id regions to obtain more functionality from less structure. 141 disorder and transcription regulation. conformational plasticity and adaptability associated with intrinsic disorder are crucial for various protein functions. among the proteins whose functional life is strongly disorder-dependent are transcription factors (tfs) 175, 176 and other proteins involved in transcriptional regulation, such as the mediator complex, 24,177 core and linker histones, 178 and ribosomal proteins. 179 for example, from 83 to 94% of tfs might possess long idprs, with the degree of disorder in eukaryotic tfs being significantly higher than in prokaryotic tfs.
175, 176 also, tfs were shown to be depleted in order-promoting residues and enriched in disorder-promoting residues, and were characterized by high levels of α-molecular recognition feature (morf). 175 furthermore, disorder is unevenly distributed within the tfs, with the degree of disorder in their activation regions being much higher than that in dna-binding domains. however, the at-hooks (which are dna-binding motifs present in many proteins that bind to the (ataa) and (tatt) repeats of dna) and basic regions of tf dna-binding domains are highly disordered, suggesting that eukaryotes with their well-developed gene transcription machinery require transcription factor flexibility to be more efficient. 175 a number of interesting and important roles were also ascribed to intrinsic disorder in tfs related to the regulation of heat shock response (so-called heat shock factors, hsfs) 180 and in the reprogramming tfs (the yamanaka factors, namely sox2, oct3/4 (pou5f1), klf4, and c-myc, and the thomson factors, namely sox2, oct3, lin28, and nanog), overexpression of which is known to generate induced pluripotent stem (ips) cells from terminally differentiated somatic cells. 181 disorder in the regulation of cellular pathways. of special interest are the vital roles of intrinsic disorder in regulation and orchestration of various cellular pathways. one of the illustrative examples of this regulatory role of intrinsic disorder is the canonical wnt-pathway that involves five proteins, axin, cki-α, gsk-3β, apc (adenomatous polyposis coli, also known as deleted in polyposis 2.5 protein), and β-catenin (all shown to contain long idprs). this pathway is known to play a number of crucial roles in organism development, and its malfunction might lead to various diseases, including cancer. 182 the comprehensive analysis of published data revealed that idprs found in wnt-pathway proteins orchestrate protein-protein interactions, and facilitate ptms and signaling.
182 furthermore, the scaffold protein axin and another large protein, apc, are heavily enriched in disorder and act as flexible concentrators in gathering together all other proteins involved in the wnt-pathway. 182 intriguingly, the multifarious roles of highly disordered apc in regulation of β-catenin function were established by showing that disordered apc helps the collection of β-catenin from cytoplasm, facilitates the β-catenin delivery to the binding sites on axin, and controls the final detachment of β-catenin from axin. 182 another important illustration of the involvement of intrinsic disorder in regulation of a crucial pathway is given by the process of the programmed cell death (pcd), which is one of the most intricate cellular processes where the cell uses specialized cellular machinery and intracellular programs to kill itself and which enables metazoans to control cell numbers and eliminate cells that threaten the animal's survival. 183 pcd includes several specific modules, such as apoptosis, autophagy, and programmed necrosis (necroptosis). these modules are not only tightly regulated but also intimately interconnected and are jointly controlled via a complex set of protein-protein interactions. recently, several large sets of pcd-related proteins across 28 species were analyzed using a wide array of modern bioinformatics tools to understand the role of the intrinsic disorder in controlling and regulating the pcd. 183 this analysis revealed that proteins involved in regulation and execution of pcd possess a substantial amount of intrinsic disorder and that their idprs were implicated in a number of crucial functions, such as protein-protein interactions and interactions with other partners including nucleic acids and other ligands, were shown to be enriched in post-translational modification sites, and were characterized by specific evolutionary patterns. 183
unique catalytic function of a protein is believed to be dictated by its unique 3d structure.
this axiom constitutes a cornerstone of the lock-and-key paradigm and it seemed to be able to sustain the furious attack on protein structure-function relationship initiated by the discovery of idps and hybrid proteins with ordered domains and idprs. in fact, from the vast majority of experimental and computational studies a general conclusion was drawn over and over again, where the functional repertoire of idps complemented the functional arsenal of ordered proteins, with ordered proteins being mostly responsible for catalysis and transport and with idps doing the majority of other jobs in the cell. on the other hand, all proteins (even the most ordered and tightly folded ones) are intrinsically flexible molecules that undergo conformational changes over a wide range of timescales and amplitudes. 184 in fact, the combination of active site reactivity with the dynamic character of proteins allows enzymes to be promiscuous and remarkably efficient at the same time. 185 furthermore, in general, dynamic fluctuations are crucial for enzyme catalysis, since they can influence substrate binding and product release, and may even adjust the effective barriers of the catalyzed reactions. [186] [187] [188] [189] [190] often, dynamic changes in the enzyme during the catalytic reaction can be described using the induced-fit model, where a conversion of one tight conformational ensemble (free enzyme) to another distinct ensemble (bound enzyme) takes place through a series of local substrate-mediated structural rearrangements. 191 despite this crucial role of local flexibility in the enzymatic catalysis, enzymes are still relatively stable molecules whose dynamic character is restricted to a small set of tightly folded conformations and whose unique (albeit locally flexible) structures are needed for efficient catalysis. 
from this viewpoint, the presence of intrinsic disorder is expected to be poorly compatible with enzymatic catalysis, which requires a well-organized environment in the active site of the enzyme in order to facilitate the formation of the transition state of the chemical reaction to be catalyzed. 192 in sharp contrast to this common wisdom supported by a wide array of specific examples, several enzymes were shown to be much more dynamic than the catalytic machines are expected to be, clearly possessing, in their precatalytic states, many characteristic properties of molten globules and retaining unusually high flexibility in structurally defined enzyme-ligand complexes. one of the best characterized examples of such molten globular enzymes is the engineered monomeric form of chorismate mutase from methanococcus jannaschii (mjcm). 184, [193] [194] [195] here, a functional monomer (mmjcm) was created by inserting the hinge-loop sequence into the long, dimer-spanning n-terminal helix. 193 in its unbound form, mmjcm was shown to exist as a native molten globule that was described as a dynamic ensemble of α-helical conformers rapidly interconverting on the millisecond timescale. 193 interaction with natural ligand induced global conformational changes in the molten globular mmjcm, promoting formation of a defined enzyme-ligand complex, which, however, preserved unusually high flexibility. 184 the catalytic mechanism of the molten globular mmjcm was described as follows: "though probably stochastic in nature, internal motions in the complex may generate a collective dynamic matrix that samples catalytically active conformation(s) often enough to achieve rapid turnover in the presence of the true transition state." 184 therefore, some enzymes can represent a highly dynamic heterogeneous conformational ensemble which is still compatible with efficient catalysis.
in agreement with this hypothesis, a molten globular character was described for circularly permuted dihydrofolate reductase (dhfr), 196, 197 and urease g from bacillus pasteurii (bpureg). [198] [199] [200] of these three enzymatic molten globules, ureg is the only natural molten globular enzyme known to date, since both circularly permuted dhfr and monomeric mjcm were obtained as a result of some genetic manipulations. although the number of known native molten globules with enzymatic activity is small, their existence provides an interesting hint on early protein evolution. in fact, simple logic suggests that well-ordered enzymes appear as a result of a long evolutionary process, whose very likely starting point was a partially folded polypeptide with some general properties of the molten globule. idps/idprs can form highly stable complexes, or be involved in signaling interactions where they undergo constant "bound-unbound" transitions, thus acting as dynamic and sensitive "on-off" switches. the ability of these proteins to return to the highly flexible conformations after the completion of a particular function, and their predisposition to gain different conformations depending on the environmental peculiarities, are unique physiological properties of idps which allow them to exert different functions in different cellular contexts according to a specific conformational state. 4 due to their lack of rigid structure, combined with the high level of intrinsic dynamics and almost unrestricted flexibility at various structure levels in the non-bound state, as well as due to the unique capability to adjust to the structure of the binding partner, idps are characterized by a very diverse range of binding modes, creating a multitude of unusual complexes, many of which are not attainable by ordered proteins.
201 some of these complexes are relatively static, resemble complexes of ordered proteins, and, therefore, are suitable for the structure determination by x-ray crystallography. among these static complexes are: morfs, wrappers, chameleons, penetrators, huggers, intertwined strings, long cylindrical containers, connectors, armature, tweezers and forceps, grabbers, tentacles, pullers, and stackers or β-arcs. 201 these binding modes are shown in supporting information figure 1s and briefly described in the supporting information materials. in addition to the static complexes, where bound partners have fixed structures, some idps/idprs do not fold even in their bound state, forming so-called disordered, dynamic, or fuzzy complexes with ordered proteins, 97, [202] [203] [204] [205] [206] other disordered proteins, [207] [208] [209] or biological membranes. 210, 211 in complexes of some of these idps with their binding partners, the disordered regions flanking the interaction interface but not the interface itself remain disordered. such a mode of interaction was recently described as "the flanking fuzziness" in contrast to "the random fuzziness" when the disordered protein remains entirely disordered in the bound state. 75, 212 it is also expected that a similar binding mode can be utilized by disordered proteins while interacting with nucleic acids and other biological macromolecules. 201 physically, binding is considered as joining objects together and suggests spatial and temporal fixation of bound partners. the formation of protein complexes with specific binding partners is expected to bring some fixation (at least at the binding site). therefore, disordered complexes where interaction of a disordered protein with the binding partners is not accompanied by a disorder-to-order transition within the interaction interface clearly cannot be described by the classical binding paradigm.
this contradiction can be resolved assuming that the ordered binding partner and/or disordered protein contain multiple low affinity binding sites. the existence of several similar binding sites combined with a highly flexible and dynamic structure of the disordered protein creates a unique situation where any binding site of the disordered protein can interact with any binding site of its partner with almost equal probability, in a staccato manner. the low affinity of each individual contact implies that each of them is not stable and can be readily broken. therefore, such a disordered or fuzzy complex can be envisioned as a highly dynamic ensemble in which a disordered protein does not present a single binding site to its partner but resembles a "binding cloud," in which multiple identical binding sites are dynamically distributed in a diffuse manner. in other words, in this staccato-type interaction mode, a disordered protein rapidly changes multiple binding sites while probing binding site(s) of its partner. 201 an additional factor which can help holding a dynamic complex together could be a weak long-range attraction between protein molecules. 213 this long-range attraction is universal for all protein solutions and has a range several times that of the diameter of the protein molecule, much greater than the range of the screened electrostatic repulsion. 213
the most common outcome of these function-related structural changes is the overall increase in the amount of ordered structure. however, functions of some ordered proteins require local or even global unfolding of a unique protein structure. 68 among specific features of these structural alterations are their induced nature and transient character combined with a wide range of molecular mechanisms by which they can be promoted.
68 these functional unfolding-activating factors include light; mechanical force; changes in ph, temperature, or redox potential; interaction with membrane, ligands, nucleic acids, and proteins; various ptms; and release of autoinhibition due to the unfolding of autoinhibitory domains induced by their interaction with nucleic acids, proteins, membranes, ptms, and so forth. 68 among the rather unusual factors used by nature to activate proteins via functional unfolding are light and mechanical force. for example, exposure to blue light results in the activation of the photoactive yellow protein (pyp), which is an ordered, water-soluble 14 kda protein that contains a thioester-linked p-coumaric acid cofactor and serves as a photosensor in ectothiorhodospira halophila. 214, 215 pyp is a bacterial blue light sensor that undergoes conformational changes upon signal transduction. the absorption of a photon triggers substantial protein unfolding and leads to the formation of the transient signaling state that interacts with the partner molecules. this allows the swimming bacterium to operate the directional switch that protects it from harmful illumination. comprehensive analysis combining double electron electron resonance spectroscopy (deer), high resolution nmr, and time-resolved pump-probe x-ray solution scattering (tr-saxs/waxs) revealed that the transiently activated and short-lived signaling state of pyp possessed a large degree of disorder and existed as an ensemble of multiple conformers that exchange on a millisecond time scale. 216, 217 this unusual behavior is illustrated in figure 4, which shows structures of inactive folded pyp and its light-activated functional form, which is highly disordered. 68 some proteins undergo local unfolding induced by mechanical force and can therefore serve as force sensors.
68 among these natural force sensors are mechanosensitive ion channels that recognize and respond to membrane tension, that is, the mechanical force applied along the plane of the cell membrane, rather than to the hydrostatic pressure perpendicular to the membrane plane. 220 these ion channels are activated via partial unfolding of some of their functional parts induced by membrane tension. 221

for a long time, the fact that idps/idprs undergo disorder-to-order transitions either during their functions or in order to be functional was used as one of the strongest arguments against the idea of protein intrinsic disorder. it was stated that most idps (those which are not the artifacts of current methods of protein production) are in fact proteins waiting for a partner (pwps) that serve as parts of a multi-component complex and that do not fold correctly in the absence of other components. 29 therefore, when folded after binding to their partners, these proteins are not too different from typical ordered proteins. however, one needs to keep in mind that a portion of the "folding code" that defines the ability of ordered proteins to spontaneously gain a unique biologically active structure is missing for idps/idprs, since they cannot fold spontaneously. this missing portion of the "folding code" (or a part of it) can be supplemented by binding partner(s). as a result, ordered and disordered proteins can be discriminated on the simple basis of the temporal correlation between their folding and binding: ordered proteins fold first and then bind to their partners, while idps/idprs remain disordered until they bind their partners and often preserve substantial disorder in the bound state.
69 furthermore, numerous cases of functional unfolding (or transient disorder, or upside-down functionality) lend further support to the concept of functional disorder by clearly showing that many proteins possess dormant disorder that needs to be awakened in order to make these proteins functional. it is clear now that idps and idprs are real, abundant, diversified, and vital. the highly dynamic nature of idps and idprs is a visual illustration of chaos. however, the evolutionary persistence of these highly dynamic proteins (see below), their unique functionality, and their involvement in all the major cellular processes are evidence that this chaos is tightly controlled. 147

figure 4 (caption fragment). the ground state structure of pyp was determined by multidimensional nmr spectroscopy. 218 this structure is in agreement with an earlier published 1.4 å crystal structure, 219 and with a modeled structure based on combined deer, tr-saxs/waxs, and nmr data. 217 it consists of an open, twisted, 6-stranded, antiparallel β-sheet, which is flanked by four α-helices on both sides. [217] [218] [219] in contrast, the light-activated form is highly disordered. this structure satisfies deer, saxs/waxs, and nmr data simultaneously. 217

to answer the question as to how these proteins are governed and regulated inside the cell, gsponer et al. conducted a detailed study focused on the intricate mechanisms of idp regulation. 222 to this end, all the saccharomyces cerevisiae proteins were grouped into three classes using one of the available disorder predictors, disopred2 44 : (i) 1971 highly ordered proteins containing 0-10% of the predicted disorder; (ii) 2711 moderately disordered proteins with 10-30% predicted disordered residues; and (iii) 2020 highly disordered proteins containing 30-100% of the predicted disorder. then, the correlations between intrinsic disorder and the various regulation steps of protein synthesis and degradation were evaluated.
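the three-way grouping by predicted disorder content described above can be sketched in a few lines of python. this is only an illustration under stated assumptions: the function names, the toy per-residue flags, and the handling of the exact 10% and 30% boundaries (assigned here to the lower class) are hypothetical and are not taken from the original study, which used disopred2 predictions.

```python
from collections import Counter


def disorder_fraction(per_residue_flags):
    """Fraction of residues flagged as disordered (truthy) by some predictor."""
    flags = list(per_residue_flags)
    if not flags:
        raise ValueError("sequence has no residues")
    return sum(1 for f in flags if f) / len(flags)


def disorder_class(fraction):
    """Map a disorder fraction in [0, 1] to one of the three classes.

    Thresholds (10% and 30%) follow the text; assigning the boundary
    values to the lower class is an assumption made for this sketch.
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must lie in [0, 1]")
    if fraction <= 0.10:
        return "highly ordered"
    if fraction <= 0.30:
        return "moderately disordered"
    return "highly disordered"


# Hypothetical toy proteins: 1 = residue predicted disordered, 0 = ordered.
toy_proteome = {
    "protA": [0] * 95 + [1] * 5,   # 5% predicted disorder
    "protB": [0] * 80 + [1] * 20,  # 20% predicted disorder
    "protC": [0] * 40 + [1] * 60,  # 60% predicted disorder
}

classes = {
    name: disorder_class(disorder_fraction(flags))
    for name, flags in toy_proteome.items()
}
class_counts = Counter(classes.values())
```

with real predictor output in place of the toy flags, `class_counts` would give the class sizes that are then compared against transcript and protein abundance data.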
this analysis revealed that the transcriptional rates of mrnas encoding idps and ordered proteins were comparable. however, the idp-encoding transcripts were generally less abundant than transcripts encoding ordered proteins due to the increased decay rates of the transcripts of genes encoding idps. 222 furthermore, idps were shown to be less abundant than ordered proteins due to the lower rate of protein synthesis and shorter protein half-lives. as the abundance and half-life in a cell of certain proteins can be further modulated via their ptms such as phosphorylation, 223 the experimentally determined yeast kinase-substrate network was also analyzed. idps were shown to be substrates of twice as many kinases as were ordered proteins. furthermore, the vast majority of kinases whose substrates were idps were either regulated in a cell-cycle dependent manner, or activated upon exposure to particular stimuli or stress. 222 therefore, ptms may not only serve as an important mechanism for the fine-tuning of idp functions but may also be necessary to tune idp availability under different cellular conditions. 222 in addition to s. cerevisiae, similar regulation trends were also found in schizosaccharomyces pombe and homo sapiens. 222 based on these observations, it has been concluded that both unicellular and multicellular organisms appear to use similar mechanisms for regulating the availability of intrinsically disordered proteins. overall, this study clearly demonstrated that in eukaryotes, there is an evolutionarily conserved tight control of synthesis and clearance of most idps. this tight control is directly related to the major roles of idps in signaling, where it is crucial for them to be available in appropriate amounts and not to be present longer than needed.
222 it has been also pointed out that although the abundance of many idps is under strict control, some idps could be present in cells in large amounts or/and for long periods of time due to either specific ptms or via interactions with other factors, which could promote changes in cellular localization of idps or protect them from the degradation machinery. 13, 70, 138, 223, 224 overall, this study clearly showed that the chaos seemingly introduced into the protein world by the discovery of idps is under the tight control. 147 in an independent study, a global scale relationship between the predicted fraction of protein disorder and protein expression in e. coli was analyzed. 225 this study showed that the fraction of protein disorder was positively correlated with both measured rna expression levels of e. coli genes in three different growth media and with predicted abundance levels of e. coli proteins. 225 when a subset of 216 e. coli proteins that are known to be essential for the survival and growth of this bacterium were analyzed, the correlation between protein disorder and expression level became even more evident. in fact, essential proteins had on average a much higher fraction of disorder (0.36), had a higher number of proteins classified as completely disordered (19% vs. 2% for e. coli proteome), and were expressed at a higher level in all three media than an average e. coli gene. 225 the manual literature mining for a group of e. coli proteins that had high levels of predicted intrinsic disorder revealed that the disorder predictions matched well with the experimentally elucidated regions of protein flexibility and disorder. 225 a direct link between protein disorder and protein level in e. coli cells could be because the idps may carry out the essential control and regulation functions that are needed to respond to the various environmental conditions. 
another possibility is that idps might undergo more rapid degradation compared to structured proteins, which cells can counter by increasing mrna levels of the corresponding genes. in this case, higher synthesis and degradation rates could make the levels of these proteins very sensitive to the environment, with slight changes in either production or degradation leading to significant shifts in protein levels. 225 even more support for the tight control of idps inside the cell came from the analysis of cellular regulation of so-called "vulnerable" proteins. 23 the integrity of the soluble protein functional structures is maintained in part by a precise network of hydrogen bonds linking the backbone amide and carbonyl groups. in a well-ordered protein, hydrogen bonds are shielded from water attack, preventing backbone hydration and the total or partial unfolding of the soluble structure under physiological conditions. 226, 227 since soluble protein structures may be more or less vulnerable to water attack depending on their packing quality, a structural attribute, protein vulnerability, was introduced as the ratio of solvent-exposed backbone hydrogen bonds (which represent local weaknesses of the structure) to the overall number of hydrogen bonds. 23 it has also been pointed out that structural vulnerability can be related to protein intrinsic disorder, as the inability of a particular protein fold to protect intramolecular hydrogen bonds from water attack may result in backbone hydration leading to local or global unfolding. since binding of a partner can help to exclude water molecules from the microenvironment of the preformed bonds, a vulnerable soluble structure gains extra protection of its backbone hydrogen bonds through complex formation.
226 to understand the role of structure vulnerability in transcriptome organization, the relationship between the structural vulnerability of a protein and the extent of co-expression of genes encoding its binding partners was analyzed. this study revealed that structural vulnerability can be considered as a determinant of transcriptome organization across tissues and temporal phases. 23 finally, by interrelating vulnerability, disorder propensity and co-expression patterns, the role of protein intrinsic disorder in transcriptome organization was confirmed, since the correlation between the extent of intrinsic disorder of the most disordered domain in an interacting pair and the expression correlation of the two genes encoding the respective interacting domains was evident. 23 because of the fact that idps are highly abundant and play crucial roles in numerous biological processes, it was not too surprising to find that some of them are involved in human diseases. for example, a number of human diseases originate from the deposition of stable, ordered, filamentous protein aggregates, commonly referred to as amyloid fibrils. in each of these pathological states, a specific protein or protein fragment changes from its natural soluble form into insoluble fibrils, which accumulate in a variety of organs and tissues. [228] [229] [230] [231] [232] [233] [234] several unrelated proteins including many idps are known to be involved in these protein deposition diseases. 
234, 235 illustrative examples of human neurodegenerative diseases associated with idps include alzheimer's disease (deposition of amyloid-β, tau-protein, α-synuclein fragment nac) [236] [237] [238] [239]; various tauopathies (accumulation of tau-protein in the form of neurofibrillary tangles) 238; down's syndrome (nonfilamentous amyloid-β deposits) 240; parkinson's disease and other synucleinopathies (deposition of α-synuclein) 241; prion diseases (deposition of prp sc) 242; and a family of polyq diseases, a group of neurodegenerative disorders caused by expansion of cag trinucleotide repeats coding for polyq in the gene products. 243 furthermore, most mutations in rigid globular proteins associated with accelerated fibrillation and protein deposition diseases have been shown to destabilize the native structure, increasing the steady-state concentration of partially folded (disordered) conformers. [228] [229] [230] [231] [232] [233] [234] the maladies given above have been called conformational diseases, as they are characterized by the conformational changes, misfolding, and aggregation of an underlying protein. however, there is another side to this coin: protein functionality. in fact, many of the proteins associated with the conformational disorders are also involved in recognition, regulation, and cell signaling. for example, functions ascribed to α-synuclein, a protein involved in several neurodegenerative disorders, include binding fatty acids and metal ions; regulation of certain enzymes, transporters, and neurotransmitter vesicles; and regulation of neuronal survival (reviewed in ref. 241). overall, there are about 50 proteins and ligands that interact and/or co-localize with this protein. furthermore, α-synuclein has amazing structural plasticity and adopts a series of different monomeric, oligomeric, and insoluble conformations (reviewed in ref. 24).
the choice between these conformations is determined by the peculiarities of the protein environment, suggesting that α-synuclein has an exceptional ability to fold in a template-dependent manner. therefore, the development of the conformational diseases may originate not only from misfolding but also from the misidentification, misregulation, and missignaling of the related proteins. analysis of so-called polyglutamine diseases gives support to this hypothesis. 244 polyglutamine diseases are a specific group of hereditary neurodegenerative disorders caused by expansion of cag triplet repeats in an exon of disease genes, which leads to the production of a disease protein containing an expanded polyglutamine (polyq) stretch. nine neurodegenerative disorders, including kennedy's disease, huntington's disease, spinocerebellar atrophy 1, 2, 3, 6, 7, and 17, and dentatorubral pallidoluysian atrophy, are known to belong to this class of diseases. [245] [246] [247] [248] in most polyq diseases, expansion to over 40 repeats leads to the onset. 248 it has been emphasized that such molecular processes as unfolded protein response, protein transport, synaptic transmission, and transcription are implicated in the pathology of polyq diseases. 244 importantly, more than 20 transcription-related factors have been reported to interact with pathological polyq proteins. furthermore, these interactions were shown to repress transcription, leading finally to neuronal dysfunction and death (reviewed in ref. 244). these results suggest that polyq diseases represent a kind of transcriptional disorder, 244 supporting our misidentification hypothesis for at least some of the conformational disorders. disorder is very common in cancer-associated proteins too. in a 2002 study, it was found that 79% of cancer-associated and 66% of cell-signaling proteins contain predicted regions of disorder of 30 residues or longer.
130 in contrast, only 13% of a set of proteins with well-defined ordered structures contained such long regions of predicted disorder. 130 in experimental studies, the presence of disorder has been directly observed in several cancer-associated proteins, including p53, 249 p57kip2, 250 bcl-xl and bcl-2, 251 c-fos, 252 a thyroid cancer associated protein tc-1, 253 and the ews-fli1 fusion protein that includes a potent transcriptional activator, the ews domain, alongside the highly conserved dna-binding domain fli1, 254, 255 among many other examples. the best characterized example of an important cancer-related idp is the tumor suppressor protein p53, which occupies the center of a large signaling network. p53 regulates expression of genes involved in numerous cellular processes, including cell cycle progression, apoptosis induction, dna repair, as well as others involved in responding to cellular stress. 256 when p53 function is lost, either directly through mutation or indirectly through several other mechanisms, the cell often undergoes cancerous transformation. 257, 258 cancers showing mutations in p53 are found in colon, lung, esophagus, breast, liver, brain, reticuloendothelial tissues, and hemopoietic tissues. 257 p53 is regulated by several different mechanisms, including inhibition of its activity by interaction with the e3 ubiquitin ligase mdm2, which binds to a short stretch of p53 located within the transactivation domain. mdm2-bound p53 cannot activate or inhibit other genes. mdm2 ubiquitinates p53 and thus targets it for destruction. mdm2 also contains a nuclear export signal that causes p53 to be transported out of the nucleus. 259, 260

the possibility of interrupting the action of disease-associated proteins (including through modulation of protein-protein interactions) presents an extremely attractive objective for the development of new drugs.
since many proteins associated with various human diseases are either completely disordered or contain long disordered regions, 261, 262 and since some of these disease-related idps/idprs are involved in recognition, regulation, and signaling, these proteins/regions clearly represent novel potential drug targets. 27 due to a failure to recognize the important role of disorder in protein function, current and evolving methods of drug discovery suffer from an overly rigid view of protein function. in fact, the rational design of enzyme inhibitors depends on the classical view where 3d-structure is an obligatory prerequisite for function. while generally applicable to many enzymatic domains, this view has persisted to influence thinking concerning all protein functions despite numerous examples to the contrary. this is most apparent in the observation that the vast majority of currently available drugs target the active sites of enzymes, presumably since these are the only proteins for which the "unique structure-unique function" paradigm is generally applicable. idps often bind their partners with relatively short regions that become ordered upon binding. [263] [264] [265] targeting disorder-based interactions should enable the development of more effective drug discovery techniques. there are at least two potential approaches for the inhibition of disorder-based interactions, in which a small molecule either binds to the binding site of the ordered partner to outcompete the idp/idpr or interacts directly with the idp/idpr. the principles of small molecule binding to idprs have not been well studied, but sequence-specific small molecule binding to short peptides has been observed. 266 an interesting twist here is that small molecules can inhibit disorder-based protein-protein interactions via induction of dysfunctional ordered structures in the targeted idpr, that is, via drug-induced misfolding.
in agreement with these concepts, small molecules called "nutlins" have been discovered that inhibit the p53-mdm2 interaction by mimicking the inducible α-helix in p53 (residues 13-29) that binds to mdm2. 259, 260 although x-ray crystallographic studies of the p53-mdm2 complex revealed that the mdm2 binding region of p53 forms an α-helical structure that binds into a deep groove on the surface of mdm2, 267 nmr studies showed that the unbound n-terminal region of p53 lacks fixed structure, although it does possess an amphipathic helix part of the time. 249 a close examination of the interface between the proteins reveals that phe19, trp23, and leu26 of p53 are the major contributors to the interaction, with the side chains of these three amino acids pointing down into a crevice on the mdm2 surface. 259, 260 the structure of nutlin-2 was shown to mimic the crucial residues of p53, with two bromophenyl groups fitting into mdm2 in the same pockets as trp23 and leu26, and an ethyl-ether side chain filling the spot normally taken by phe19. [268] [269] [270] nutlins and related small molecules increased the level of p53 in cancer cell lines. this drastically decreased the viability of these cells, causing most of them to undergo apoptosis. when one of the nutlins was given orally to mice, a 90% inhibition of tumor growth compared to the control was induced. 260, [268] [269] [270] this successful nutlin story marks the potential beginning of a new era, the signaling-modulation era, in targeting drugs to protein-protein interactions. importantly, this druggable p53-mdm2 interaction involves a disorder-to-order transition. principles of such transitions are generally understood and can therefore be used to find similar drug targets, which are inducible α-helices. 271 in addition to nutlins inhibiting the p53-mdm2 interaction, several other small molecules also act by blocking protein-protein interactions.
272, 273 some of these interactions involve one structured partner and one disordered partner, with disordered segments becoming α-helical upon binding. 271 therefore, the p53-mdm2 complex is not a unique exception, and many other disorder-based protein-protein interactions are blocked by a small molecule. all this suggests that there is a cornucopia of new drug targets that would operate by blocking disorder-based protein-protein interactions. for these p53-mdm2-type examples, the drug molecules mimic a critical region of the disordered partner (which folds upon binding) and compete with this region for its binding site on the structured partner. these druggable interaction sites operate by the coupled binding and folding mechanism. they are small enough and compact enough to be easily mimicked by small molecules. 25 methods for predicting such binding sites in disordered regions have been developed, 274 and bioinformatics tools to identify which disordered binding regions can be easily mimicked by small molecules have been elaborated. 271 a complementary approach for small molecules to inhibit disorder-based protein-protein interactions relies on the direct binding of drugs to the idps/idprs, which is illustrated by the c-myc-max story. 275 in order to bind dna, regulate expression of target genes, and function in most biological contexts, the c-myc transcription factor must dimerize with its obligate heterodimerization partner, max, which lacks a transactivation segment. both c-myc and max are intrinsically disordered in their monomeric forms. upon heterodimerization, they undergo coupled binding and folding of their basic-helix-loop-helix-leucine zipper domains (bhlhzips). since the deregulation of c-myc is related to many types of cancer, the disruption of the c-myc-max dimeric complex is one of the approaches for c-myc inhibition. several small molecules were found to inhibit the c-myc-max dimer formation.
275 these molecules were shown to bind to one of the three discrete sites within the 85-residue bhlhzip domain of c-myc, which are composed of short contiguous stretches of amino acids that can selectively and independently bind small molecules. 275 inhibitor binding induces only local conformational changes, preserves the overall disorder of c-myc, and inhibits interaction with max. 275 furthermore, binding of inhibitors to c-myc was shown to occur simultaneously and independently on the three independent sites. based on these observations it has been concluded that a rational and generic approach to the inhibition of protein-protein interactions involving idps may therefore be possible through the targeting of intrinsically disordered sequence. 275 recently, a functional misfolding concept was introduced to describe a mechanism preventing idps from unwanted interactions with non-native partners. 276 idps/idprs are characterized by high conformational dynamics and flexibility, the presence of sticky preformed binding elements, and the ability to morph into differently-shaped bound configurations. however, detailed analyses of the conformational behavior and fine structure of several idps revealed that the preformed binding elements might be involved in a set of non-native intramolecular interactions. based on these observations it was proposed that an intrinsically disordered polypeptide chain in its unbound state can be misfolded to sequester the preformed elements inside the noninteractive or less-interactive cage, therefore preventing these elements from the unnecessary and unwanted interactions with non-native binding partners. 276 it is important to remember, however, that the mentioned functional misfolding is related to the ensemble behavior of transiently populated elements of structure. 
in other words, it describes the behavior of a globally disordered polypeptide chain containing highly dynamic elements of residual structure, so-called interaction-prone preformed fragments, some of which could potentially be related to protein function. 276 this ability of idps/idprs to functionally misfold can be used for finding small molecules that would potentially stabilize different members of the functionally misfolded ensemble and therefore prevent the targeted protein from establishing biological interactions. 277 this approach is very different from the direct targeting of short idprs discussed above, since it is based on a small molecule binding to a highly dynamic surface created via the transient interaction of preformed interaction-prone fragments. in essence, this approach can be considered as an extension of the well-established structure-based rational drug design elaborated for ordered proteins. in fact, if the structure of a member(s) of the functionally misfolded ensemble can be guessed, then this structure can be used to find small molecules that are potentially able to interact with it, utilizing tools originally developed for the rational structure-based drug design for ordered proteins. 277 ideally, a drug that targets a given protein-protein interaction should be tissue specific. although some proteins are unique for a given tissue, many more proteins have very wide distribution, being present in several tissues and organs. how can one develop tissue-specific drugs targeting such abundant proteins? often, tissue specificity for many of the abundant proteins is achieved via the alternative splicing of the corresponding pre-mrnas, which generates two or more protein isoforms from a single gene. estimates indicate that between 35 and 60% of human genes yield protein isoforms by means of alternatively spliced mrna.
278 the added protein diversity from alternative splicing is thought to be important for tissue-specific signaling and regulatory networks in multicellular organisms. the regions of alternative splicing in proteins are enriched in intrinsic disorder, and it was proposed that associating alternative splicing with protein disorder enables the time- and tissue-specific modulation of protein function. 152 since disorder is frequently utilized in protein binding regions, having alternative splicing of pre-mrna coupled to idprs can define tissue-specific signaling and regulatory diversity. 152 these findings open a unique opportunity to develop tissue-specific drugs modulating the function of a given id protein/region (with a unique profile of disorder distribution) in a target tissue and not affecting the functionality of this same protein (with a different disorder distribution profile) in other tissues.

wavy pattern of global evolution of intrinsic disorder

idps/idprs are more common in eukaryotes than in less complex organisms. 43, 44, [48] [49] [50] [51] [52] this suggests that disorder, with its ability to be implemented in various signaling, recognition, and regulation pathways and networks, is important for the maintenance of life in eukaryotic and especially multicellular eukaryotic organisms. 4, 45, 78, 134 also, the finding that alternatively spliced regions of mrna code for idprs much more often than for structured regions suggests that there is a linkage between alternative splicing and signaling by idprs that constitutes a plausible mechanism that could underlie and support cell differentiation, which ultimately gave rise to the multicellular eukaryotic organisms. 152 therefore, one can assume that intrinsic disorder represents a relatively recent evolutionary invention. however, this hypothesis obviously would be wrong if earlier stages of evolution were taken into account.
in fact, the chances that the first polypeptides that appeared in the primordial soup of the primitive earth possessed well-developed and unique 3d structures are minimal. the earth formed about 4.5 billion years ago. scientists dated the first fossils to 3.85 billion years ago. there are still debates and different theories about what happened in those years between the time the earth was cool enough to spawn life and the time the first fossils were formed. at the beginning of the 20th century, oparin 279 and haldane 280 proposed that some organic molecules could have been spontaneously produced from the gases of the primitive earth atmosphere, assuming that this primitive atmosphere was reducing (as opposed to oxygen-rich) and that there was an appropriate supply of energy, such as lightning or ultraviolet light. thirty years later, this hypothesis (which constitutes a cornerstone of the theory of molecular evolution) received strong support from the elegant experiments of stanley l. miller and harold c. urey, who were able to synthesize various organic compounds, including some amino acids, from non-organic compounds that were believed to represent the major components of the early earth's atmosphere (water vapor, hydrogen, methane, and ammonia) by putting them into a closed system and running a continuous electric current through the system to simulate the lightning storms believed to be common on the early earth. 281, 282 however, the miller-urey experiment yielded only about half of the modern amino acids, 281, 282 suggesting that the first proteins on earth may have contained only a few amino acids. these findings go in parallel with the biosynthetic theory of the genetic code evolution, which suggests that the genetic code evolved from a simpler form that encoded fewer amino acids, 283 probably paralleled by the invention of biosynthetic pathways for new and chemically more complex amino acids.
284 furthermore, some additional support of the validity of this hypothesis can be found in the standard genetic code (which consists of 4 × 4 × 4 = 64 triplets of nucleotides, codons), which is redundant (64 codons encode 20 amino acids). in fact, with only two exceptions, codons encoding one amino acid may differ in any of their three positions. however, only the third positions of some codons may be fourfold degenerate, that is, any nucleotide at this position specifies the same amino acid and all nucleotide substitutions at this site are synonymous. using these observations as a reflection of the evolutionary development, it was proposed that there was a period during code evolution when the third position was not needed at all and a doublet code preceded the triplet code, giving rise to 4 × 4 = 16 codons encoding 16 or fewer amino acids, if a termination codon is taken into account. 285 based on these and many other premises, one can discriminate evolutionarily old and new amino acids. in 2000, eduard n. trifonov combined 40 different single-factor criteria into a consensus scale and proposed the following temporal order of addition for the amino acids: g/a, v/d, p, s, e/l, t, r, n, k, q, i, c, h, f, m, y, w. 286 even superficial analysis of this sequence reveals that many of the early amino acids (such as g, d, e, p, and s) are disorder-promoting, as they are very abundant in modern idps. on the other hand, the major order-promoting residues (c, w, y, and f) were added to the genetic code late. this observation is further illustrated by figure 5 (a), which represents the modern genetic code and contains information on the early and late codons (shown by light red and light blue colors, respectively) and on the corresponding disorder- and order-promoting residues (shown by red and blue colors, respectively). codons with intermediate age and disorder-neutral residues are shown by light pink and pink colors, respectively.
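the codon arithmetic above (64 triplets, 20 amino acids, 16 possible doublets, and fourfold-degenerate third positions) can be checked with a short python sketch. it assumes the standard genetic code written in the dna alphabet with the conventional compact tcag ordering; the variable names are illustrative only and the snippet is not part of the cited analyses.

```python
from itertools import product

# Standard genetic code in the conventional compact form: codons are
# enumerated with bases in the order T, C, A, G (first base varies slowest),
# and '*' marks a termination codon.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"  # first base T
    "LLLLPPPPHHQQRRRR"  # first base C
    "IIIMTTTTNNKKSSRR"  # first base A
    "VVVVAAAADDEEGGGG"  # first base G
)

codon_table = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

n_codons = len(codon_table)                    # 4 x 4 x 4 triplets
amino_acids = set(codon_table.values()) - {"*"}  # encoded residues, stops removed
doublets = ["".join(d) for d in product(BASES, repeat=2)]  # 4 x 4 prefixes

# A doublet is fourfold degenerate at the third position when all four
# possible third bases specify the same amino acid.
fourfold = {
    d: codon_table[d + "T"]
    for d in doublets
    if len({codon_table[d + b] for b in BASES}) == 1
}

print(n_codons, len(amino_acids), len(doublets))
print(sorted(fourfold.items()))
```

in this table, the doublets collected in `fourfold` are those for which the first two positions alone already fix the amino acid, which is the property that motivates the proposed doublet-code stage.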
figure 5 illustrates that there is relatively good agreement between the "age" of a residue and its disorder-promoting capacity, with early residues being mostly disorder-promoting and the majority of late residues being mostly order-promoting. this conclusion follows from the abundance of matching colors (light red-red, light blue-blue, and light pink-pink). there are only two noticeable exceptions to this rule, valine and leucine, which are early but order-promoting residues. this strongly suggests that the primordial polypeptides were intrinsically disordered. it is very unlikely that these disordered primordial polypeptides possessed catalytic activity. 287 this hypothesis is in line with the rna world theory, which suggests that during the evolution of enzymatic activity, catalysis was transferred from rna first to ribonucleoprotein (rnp) and only then to protein. 288 therefore, the first proteins in the "breakthrough organism" (the first to have encoded protein synthesis) would have been nonspecific chaperone-like proteins rather than catalysts. 136, 287 such rna chaperone activities of early proteins conferred on their carriers a significant selective advantage in the rna world, where rna, which is especially prone to misfolding, 289, 290 was used for both information storage and catalysis. 291 since the variability of physicochemical properties of amino acids greatly exceeds that of nucleotides, and since protein structures are noticeably more stable than rna structures, the transition from rnas (ribozymes) to proteins as carriers of enzymatic activity was a logical evolutionary step. however, efficient catalysis relies on the proper spatial arrangement of catalytic residues, which requires a stable structure. 292 therefore, grafting of enzymatic activity onto proteins generated strong evolutionary pressure toward well-folded structures. in other words, the global evolution of intrinsic disorder is characterized by a wavy pattern [see fig. 5(b)], where highly disordered primordial proteins with primarily rna-chaperone activities were gradually replaced by well-folded, highly ordered enzymes that evolved to catalyze the production of all the complex "goodies" crucial for the independent existence of the first cellular organisms. due to its specific features crucial for the regulation of complex processes, protein intrinsic disorder was reinvented at the subsequent evolutionary steps leading to the development of more complex organisms from the last universal ancestor (i.e., the most recent organism from which all organisms now living on earth descend 293, 294 ), and culminating in the appearance of highly elaborated eukaryotic cells.

figure 5. peculiarities of disorder evolution. a: modern genetic code with information on the early and late codons (shown in light red and light blue, respectively) and disorder- and order-promoting residues (shown in red and blue, respectively). codons of intermediate age (i.e., those located between early and late codons) are shown in light pink, whereas disorder-neutral residues are shown in pink. b: wavy pattern of the global disorder evolution. the x-axis represents evolutionary time and the y-axis shows the disorder content of proteins at a given evolutionary time point. here, primordial proteins are expected to be mostly disordered (left-hand side of the plot), proteins in the last universal ancestor (lua) are likely mostly structured (center of the plot), whereas many proteins in eukaryotes are either totally disordered or hybrids containing both ordered and disordered regions (right-hand side of the plot).

there is no simple answer to the question of the comparative evolutionary rates of ordered and intrinsically disordered proteins and regions in modern organisms. 
in fact, it looks like everything is possible, and intrinsically disordered sequences may evolve faster than, slower than, or at rates similar to ordered sequences. for example, disordered and ordered domains of the same protein (e.g., papillomavirus e7 oncoprotein) were shown to possess similar degrees of conservation and co-evolution. 295 many other idps/idprs were shown to be characterized by high evolutionary rates 151, 296, 297 determined by the lack of specific structural restrictions. in fact, the analysis of calcineurins, 10 topoisomerase, 298 ribosomal protein s4, 299 β-subunits of the potassium channel kvβ1.1, 300 and many other proteins showed that disordered regions in these proteins contained more amino acid substitutions, insertions, and deletions than the ordered regions of the same proteins. 151, 301 furthermore, based on the observation that a significantly higher degree of positive darwinian selection was observed in idprs of proteins compared to regions of α-helix, β-sheet, or tertiary structure, it was hypothesized that idprs may be required for genetic variation with adaptive potential and that these regions may be of "central significance for the evolvability of the organism or cell in which they occur." 302 on the other hand, some idps and idprs are highly conserved. human α-synuclein (a canonical idp of 140 residues 140, 303 ) differs from its mouse counterpart by merely six residues (4%), and there are just 21 residue differences (12%, which include residue differences at 18 positions and 3 insertions/deletions) between the human and canary α-synucleins. 304 in flagellin, the ordered central region has greater sequence diversity than its disordered termini. 305 functionally important conserved regions of predicted disorder were shown to be rather common in proteins from all kingdoms of life, including viruses. 306, 307 furthermore, many functional domains of a significant size were shown to be intrinsically disordered. 
165 overall, a systematic study of several families of proteins with at least one structurally characterized disordered region revealed that their idprs are characterized by highly heterogeneous evolutionary rates, with some disordered amino acid sequences evolving slowly and others evolving more rapidly than ordered sequences. 151 also, even different parts of the same disordered region can show noticeable variability in their divergence during the evolutionary process. 308 finally, in some disordered proteins, peculiarities of the amino acid composition, and not the amino acid sequence, might be conserved. 309, 310

some future directions

the last 15 years witnessed a real revolution in our understanding of protein structure-function relationships. the fact that there is an entire class of polypeptides that do not have rigid structures but possess crucial biological functions was heavily underappreciated and ignored for a very long time, despite numerous examples scattered in the literature. the work that started in my group as an attempt to understand what is so special about several natively unfolded proteins produced a real explosion of interest in structure-less proteins with biological functions. a new field was created, and a lot of intriguing information was produced on the structures and functions of idps/idprs. there is no need to list once again all the discoveries and findings made in this field; they are the subjects of many recent reviews, and some of them are briefly covered in this article. although the amount of data generated during the past decade and a half on specific features related to the structural properties of idps and idprs, their abundance, distribution, functional repertoire, regulation, involvement in disease pathogenesis, and so forth is vast, it seems that the mass of data produced so far is just a small tip of a humongous iceberg. 
idps/idprs continue to bring discoveries almost on a daily basis, and even more breakthroughs are expected in the future. modern protein science is at a turning point, but biology still waits for physics. new models explaining various functions of idps, their evolution, and their involvement in diseases are in great demand, together with a general theory unifying current knowledge on protein structure and function, and with novel experimental and computational tools for focused studies of idps/idprs.

natively unfolded proteins: a point where biology waits for physics einfluss der configuration auf die wirkung der enzyme the protein data bank: a computer-based archival file for macromolecular structures understanding protein non-folding the structure of scientific revolutions primers in biology. protein structure and function protein disk of tobacco mosaic virus at 2.8 a resolution showing the interactions within and between subunits the transition of bovine trypsinogen to a trypsin-like state upon strong ligand binding. the refined crystal structures of the bovine trypsinogen-pancreatic trypsin inhibitor complex and of its ternary complex with ile-val at 1.9 a resolution intrinsic disorder in the protein data bank protein disorder and the evolution of molecular recognition: theory, predictions and observations intrinsically unstructured proteins: re-assessing the protein structure-function paradigm why are "natively unfolded" proteins unstructured under physiologic conditions? intrinsically disordered protein intrinsically unstructured proteins natively disordered proteins what's in a name? 
why these proteins are intrinsically disordered caseins as rheomorphic proteins: interpretation of primary and secondary structures of the as1-, b-, and k-caseins the relation of polypeptide hormone structure and flexibility to receptor binding: the relevance of x-ray studies on insulins, glucagon and human placental lactogen high-resolution proton-magnetic-resonance studies of chromatin core particles protein structure and enzyme activity structural studies of tau protein and alzheimer paired helical filaments show no evidence for beta-structure nacp, a protein implicated in alzheimer's disease and learning, is natively unfolded protein structure protection commits gene expression patterns a protein-chameleon: conformational plasticity of alpha-synuclein, a disordered protein involved in neurodegenerative disorders malleable machines take shape in eukaryotic transcriptional regulation operational definition of intrinsically unstructured protein sequences based on susceptibility to the 20s proteasome drugs for 'protein clouds': targeting intrinsically disordered transcription factors protein dynamics: dancing on an ever-changing free energy stage protein flexibility, not disorder, is intrinsic to molecular recognition top-idp-scale: a new amino acid scale measuring propensity for intrinsic disorder intrinsic disorder and functional proteomics sequence complexity of disordered protein predicting disordered regions from amino acid sequence: common themes despite differing structural characterization the protein non-folding problem: amino acid determinants of intrinsic order and disorder composition profiler: a tool for discovery and visualization of amino acid composition differences comparing predictors of disordered protein a practical overview of protein disorder prediction methods predicting protein disorder and induced folding: from theoretical principles to practical applications prediction of protein disorder at the domain level prediction of protein disorder 
predicting intrinsic disorder in proteins: an overview inherent relationships among different biophysical prediction methods for intrinsically disordered proteins intrinsic protein disorder in complete genomes prediction and functional analysis of native disorder in proteins from the three kingdoms of life the mysterious unfoldome: structureless, underappreciated, yet vital part of any given proteome orderly order in protein intrinsic disorder distribution: disorder in 3500 proteomes from viruses and the three domains of life thousands of proteins likely to have long disordered regions norton rs (2006) abundance of intrinsically unstructured proteins in p. falciparum and other apicomplexan parasite proteomes prevalent structural disorder in e. coli and s. cerevisiae proteomes large-scale analysis of thermostable, mammalian proteins provides insights into the intrinsically disordered proteome archaic chaos: intrinsically disordered proteins in archaea reduction in structural disorder and functional complexity in the thermal adaptation of prokaryotes intrinsic disorder in pathogenic and non-pathogenic microbes: discovering and analyzing the unfoldomes of early-branching eukaryotes how much of protein sequence space has been explored by life on earth? folding by numbers: primary sequence statistics and their use in studying protein folding theory for protein mutability and biogenesis polymer principles in protein structure and stability sequence space, folding and protein design functional rapidly folding proteins from simplified amino acid sequences simplified proteins: minimalist solutions to the 'protein folding problem thoroughly sampling sequence space: largescale protein design of structural ensembles protein tolerance to random amino acid change high solubility of random-sequence proteins consisting of five kinds of primitive amino acids statistical distribution of hydrophobic residues along the length of protein chains. 
implications for protein folding and evolution simplified amino acid alphabets for protein fold recognition and implications for folding a reduced amino acid alphabet for understanding and designing protein adaptation to mutation a computational approach to simplifying the protein folding alphabet unusual biophysics of intrinsically disordered proteins the case for intrinsically disordered proteins playing contributory roles in molecular recognition without a stable 3d structure the interplay between structure and function in intrinsically unstructured proteins identification and functions of usefully disordered proteins intrinsic disorder and protein function what does it mean to be natively unfolded? fuzzy complexes: a more stochastic view of protein function fuzzy complexes: polymorphism and structural disorder in protein-protein interactions uversky vn (in press) intrinsic disorder-based protein interactions and their modulators flexible nets: disorder and induced fit in the associations of p53 and 14-3-3 with their partners showing your id: intrinsic disorder as an id for recognition, regulation and cell signaling the protein trinitylinking function and disorder folding funnels and binding mechanisms molten globule and protein folding polymeric aspects of protein folding: a brief overview absinth: a new continuum solvation model for simulations of polypeptides in aqueous solutions quantitative characterization of intrinsic disorder in polyglutamine: insights from analysis based on polymer theories atomistic simulations of the effects of polyglutamine chain length and solvent quality on conformational equilibria and spontaneous homodimerization characterizing the conformational ensemble of monomeric polyglutamine modulation of polyglutamine conformations and dimer formation by the n-terminus of huntingtin role of backbonesolvent interactions in determining conformational equilibria of intrinsically disordered proteins net charge per residue modulates 
conformational ensembles of intrinsically disordered proteins from the cover: charge interactions can dominate the dimensions of intrinsically disordered proteins comparing and combining predictors of mostly disordered proteins natively unfolded protein stability as a coil-to-globule transition in charge/hydropathy space disordered domains and high surface charge confer hubs with the ability to interact with multiple proteins in interaction networks a bimodal distribution of two distinct categories of intrinsically disordered structures with separate functions in fg nucleoporins intrinsically disordered proteins and their environment: effects of strong denaturants, temperature, ph, counter ions, membranes, binding partners, osmolytes, and macromolecular crowding evidence for a partially folded intermediate in alpha-synuclein fibril formation natively unfolded c-terminal domain of caldesmon remains substantially unstructured after the effective binding to calmodulin effect of zinc and temperature on the conformation of the gamma subunit of retinal phosphodiesterase: a natively unfolded protein natively unfolded human prothymosin alpha adopts partially folded collapsed conformation at acidic ph a circular dichroism study of preferential hydration and alcohol effects on a denatured protein, pig calpastatin domain i heme binding and polymerization by plasmodium falciparum histidine rich protein. ii. 
influence of ph on activity and conformation conformation-dependent antibacterial activity of the naturally occurring human peptide ll-37 intrinsically disordered c-terminal segments of voltage-activated potassium channels: a possible fishing rod-like mechanism for channel binding to scaffold proteins unraveling the nature of the segmentation clock: intrinsic disorder of clock proteins and their interaction map intrinsically disordered regions of human plasma membrane proteins preferentially occur in the cytoplasmic segment prevalence of intrinsic disorder in the intracellular region of human single-pass type i proteins: the case of the notch ligand delta-4 investigation of transmembrane proteins using a computational approach analysis of structured and intrinsically disordered regions of transmembrane proteins protein conformation in cell membrane preparations as studied by optical rotatory dispersion and circular dichroism characterization of the major envelope protein from escherichia coli. regular arrangement on the peptidoglycan and unusual dodecyl sulfate binding proposed knobs-intoholes packing for several membrane proteins folding patterns of porin and bacteriorhodopsin hydrophobic organization of membrane proteins are membrane proteins "inside-out" proteins? 
turning a reference inside-out: commentary on an article by stevens and arkin entitled the distribution of positively charged residues in bacterial innter membrane proteins correlates with the transmembrane topology prediction of membrane-protein topology from first principles structural adaptation of extreme halophilic proteins through decrease of conserved hydrophobic contact surface halophilic proteins and the influence of solvent on protein stabilization halophilic adaptation of enzymes halophilic enzymes: proteins with a grain of salt the effect of salts on the activity and stability of escherichia coli and haloferax volcanii dihydrofolate reductases halophilic adaptation of protein-dna interactions unique amino acid composition of proteins in halophilic bacteria molecular signature of hypersaline adaptation: insights from genome and proteome composition of halophilic prokaryotes structural basis for the aminoacid composition of proteins from halophilic archea electrostatic contributions to the stability of halophilic proteins viral disorder or disordered viruses: do viral proteins possess unique features? the diversity of physical forces and mechanisms in intermolecular interactions intrinsic disorder in cell-signaling and cancer-associated proteins coupling of folding and binding for unstructured proteins intrinsically unstructured proteins and their functions natively unfolded proteins flexible nets. the roles of intrinsic disorder in protein interaction networks protein folding revisited. a polypeptide chain at the folding-misfolding-nonfolding cross-roads: which way to go? the role of structural disorder in the function of rna and protein chaperones structural disorder throws new light on moonlighting functional anthology of intrinsic disorder. i. biological processes and functions of proteins with long disordered regions functional anthology of intrinsic disorder. iii. 
ligands, post-translational modifications, and diseases associated with intrinsically disordered proteins functional anthology of intrinsic disorder. ii. cellular components, domains, technical terms, developmental processes, and coding sequence diversities correlated with long disordered regions intrinsic disorder in scaffold proteins: getting more from less function and structure of inherently disordered proteins signal transduction via unstructured protein conduits the unfoldomics decade: an update on intrinsically disordered proteins a careful disorderliness in the proteome: sites for interaction and targets for future therapies close encounters of the third kind: disordered domains and the interactions of proteins linking folding and binding intrinsically disordered proteins from a to z on the importance of being disordered evolutionary rate heterogeneity in proteins with long disordered regions alternative splicing in concert with protein intrinsic disorder enables increased functional diversity in multicellular organisms iupred: web server for the prediction of intrinsically unstructured regions of proteins based on estimated energy content coupling of local folding to site-specific binding of proteins to dna molecular mechanism of biological recognition close encounters: why unstructured, polymeric domains can increase rates of specific macromolecular association cell biology. the importance of being unfolded heterogeneity of the binding sites of bovine serum albumin structural studies of p21 waf1/cip1/sdi1 in the free and cdk2-bound state: conformational disorder mediates binding diversity what properties characterize the hub proteins of the protein-protein interaction network of saccharomyces cerevisiae? 
intrinsic disorder is a common feature of hub proteins from four eukaryotic interactomes disorder and sequence repeats in hub proteins and their implications for network evolution role of intrinsic disorder in transient interactions of hub proteins intrinsic disorder in yeast transcriptional regulatory network high levels of structural disorder in scaffold proteins as exemplified by a novel neuronal protein, cask-interactive protein1 moonlighting proteins multifunctional proteins: examples of gene sharing molecular mechanisms for multitasking: recent crystal structures of moonlighting proteins exploring the binding diversity of intrinsically disordered proteins involved in one-to-many binding collective dynamics of 'small-world' networks classification of scale-free networks emergence of scaling in random networks revisiting date and party hubs: novel approaches to role assignment in protein interaction networks axin is a scaffold protein in tgf-beta signaling that promotes degradation of smad7 by arkadia intrinsic disorder in transcription factors human transcription factors contain a high fraction of intrinsically disordered regions essential for transcriptional regulation malleable machines in transcription regulation: the mediator complex more than just tails: intrinsic disorder in histone proteins a creature with hundred of waggly tails: intrinsically disordered proteins in ribosome hsf transcription factor family, heat shock response, and protein intrinsic disorder protein intrinsic disorder and induced pluripotent stem cells the roles of intrinsic disorder in orchestrating the wnt-pathway resilience of death: intrinsic disorder in proteins involved in the programmed cell death structure and dynamics of a molten globular enzyme flexibility and reactivity in promiscuous enzymes dynamics of proteins multiple conformational changes in enzyme catalysis an nmr perspective on enzyme dynamics relating protein motion to catalysis dynamical contributions to enzyme 
catalysis: critical tests of a popular hypothesis the catalytic and regulatory properties of enzymes enzymatic activity in disordered states of proteins an enzymatic molten globule: efficient coupling of folding and catalysis kinetics and thermodynamics of ligand binding to a molten globular enzyme and its native counterpart relative tolerance of an enzymatic molten globule and its thermostable counterpart to point mutation circularly permuted dihydrofolate reductase possesses all the properties of the molten globule state, but can resume functional tertiary structure by interaction with its ligands circularly permuted dihydrofolate reductase of e. coli has functional activity and a destabilized tertiary structure ureg, a chaperone in the urease assembly process, is an intrinsically unstructured gtpase that specifically binds zn21 intrinsically disordered structure of bacillus pasteurii ureg as revealed by steady-state and time-resolved fluorescence spectroscopy insights in the (un)structural organization of bacillus pasteurii ureg, an intrinsically disordered gtpase enzyme multitude of binding modes attainable by intrinsically disordered proteins: a portrait gallery of disorder-based complexes polyelectrostatic interactions of disordered ligands suggest a physical basis for ultrasensitivity dynamic equilibrium engagement of a polyvalent ligand with a single-site receptor protein dynamics and conformational disorder in molecular recognition structure/ function implications in a dynamic complex of the intrinsically disordered sic1 with the cdc4 subunit of an scf ubiquitin ligase the intrinsically disordered cytoplasmic domain of the t cell receptor zeta chain binds to the nef protein of simian immunodeficiency virus without a disorder-toorder transition homooligomerization of the cytoplasmic domain of the t cell receptor zeta chain and of other proteins containing the immunoreceptor tyrosine-based activation motif binding of intrinsically disordered proteins is not 
necessarily accompanied by a structural transition to a folded form quantitative observation of backbone disorder in native elastin membrane binding mode of intrinsically disordered cytoplasmic domains of t cell receptor signaling subunits depends on lipid composition lipid-binding activity of intrinsically unstructured cytoplasmic domains of multichain immune recognition receptor signaling subunits limitations of induced folding in molecular recognition by intrinsically disordered proteins effective long-range attraction between protein molecules in solutions studied by small angle neutron scattering properties of a water-soluble, yellow protein isolated from a halophilic phototrophic bacterium that has photochemical activity analogous to sensory rhodopsin measurement and global analysis of the absorbance changes in the photocycle of the photoactive yellow protein from ectothiorhodospira halophila structural and dynamic changes of photoactive yellow protein during its photocycle in solution the short-lived signaling state of the photoactive yellow protein photoreceptor revealed by combined structural probes solution structure and backbone dynamics of the photoactive yellow protein ) 1.4 a structure of photoactive yellow protein, a cytosolic photoreceptor: unusual fold, active site, and chromophore a mechanosensitive ion channel in the yeast plasma membrane an improved open-channel structure of mscl determined from fret confocal microscopy and simulation tight regulation of unstructured proteins: from transcript synthesis to protein degradation cdk-inhibitory activity and stability of p27kip1 are directly regulated by oncogenic tyrosine kinases the importance of intrinsic disorder for protein phosphorylation protein disorder is positively correlated with gene expression in escherichia coli insufficiently dehydrated hydrogen bonds as determinants of protein interactions keeping dry and crossing membranes the alternative conformations of amyloidogenic proteins and 
their multi-step assembly pathways protein misfolding, evolution and disease biological activity and pathological implications of misfolded proteins protein deposits as the molecular basis of amyloidosis. i. systemic amyloidoses protein deposits as the molecular basis of amyloidosis. ii. localized amyloidosis and neurodegenerative disordres amyloid fibrillogenesis: themes and variations conformational constraints for amyloid fibrillation: the importance of being unfolded pathways to amyloid fibril formation: partially folded intermediates in fibrillation of unfolded proteins alzheimer's disease and down's syndrome: sharing of a unique cerebrovascular amyloid fibril protein neuronal origin of a cerebral amyloid: neurofibrillary tangles of alzheimer's disease contain the same protein as the amyloid of plaque cores and blood vessels a68: a major subunit of paired helical filaments and derivatized forms of normal tau molecular cloning of cdna encoding an unrecognized component of amyloid in alzheimer disease alzheimer's disease in down's syndrome: clinicopathologic studies part ii: alpha-synuclein and its molecular pathophysiological role in neurodegenerative disease shattuck lecture-neurodegenerative diseases and prions polyglutamine diseases: protein cleavage and aggregation polyglutamine diseases: a transcription disorder? 
this work would be impossible without numerous collaborators whose enthusiasm and help drove the studies on intrinsically disordered proteins for many years. over the years i collaborated with more than 550 colleagues around the globe, and i am grateful to all former and current colleagues for their priceless contributions, assistance and support.
key: cord-354465-5nqrrnqr authors: haslinger, christian; stadler, peter f. title: rna structures with pseudo-knots: graph-theoretical, combinatorial, and statistical properties date: 1999 journal: bull math biol doi: 10.1006/bulm.1998.0085 sha: doc_id: 354465 cord_uid: 5nqrrnqr the secondary structures of nucleic acids form a particularly important class of contact structures. many important rna molecules, however, contain pseudo-knots, a structural feature that is excluded explicitly from the conventional definition of secondary structures.
we propose here a generalization of secondary structures incorporating 'non-nested' pseudo-knots, which we call bi-secondary structures, and discuss measures for the complexity of more general contact structures based on their graph-theoretical properties. bi-secondary structures are planar trivalent graphs that are characterized by special embedding properties. we derive exact upper bounds on their number (as a function of the chain length n) implying that there are fewer different structures than sequences. computational results show that the number of bi-secondary structures grows approximately like $2.35^n$. numerical studies based on kinetic folding and a simple extension of the standard energy model show that the global features of the sequence-structure map of rna do not change when pseudo-knots are introduced into the secondary structure picture. we find a large fraction of neutral mutations and, in particular, networks of sequences that fold into the same shape. these neutral networks percolate through the entire sequence space. © 1999 society for mathematical biology
presumably the most important problem and the greatest challenge in present day theoretical biophysics is deciphering the code that transforms sequences of biopolymers into spatial molecular structures. a sequence is properly visualized as a string of symbols which together with the environment encodes the molecular architecture of the biopolymer. in the case of one particular class of biopolymers, the ribonucleic acid (rna) molecules, decoding of the information stored in the sequence can be properly decomposed into two steps: (i) formation of the secondary structure, that is, of the pattern of watson-crick (and gu) base pairs, and (ii) the embedding of the contact structure in three-dimensional space. the sequence-structure relation of rna was studied in detail in a series of papers (fontana et al., 1991, 1993a; bonhoeffer et al., 1993; tacker et al., 1994; grüner et al., 1996a,b; tacker et al., 1996) at the level of secondary structures. the most salient findings of these investigations are: (i) there are many more sequences than (secondary) structures. (ii) there are few frequent and many rare structures. almost all sequences fold into frequent or 'common' structures. (iii) sequences that fold into a 'common' structure are distributed nearly uniformly in sequence space. (iv) a sequence folding into a 'common' structure has a large number of neutral neighbors (folding into the same structure) and a large number of neighboring sequences that fold into very different secondary structures.
(v) neutral paths, along which all sequences fold into the same secondary structure, percolate sequence space. in fact, there are extended neutral networks of sequences folding into the same 'common' structure (grüner et al., 1996b). (vi) almost all 'common' structures can be found close to any point in sequence space. this property is called shape space covering. the impact of these features on evolutionary dynamics is discussed in schuster (1995): a population explores sequence space in a diffusion-like manner along the neutral network of a viable structure. along the fringes of the population novel structures are produced by mutation at a constant rate (huynen, 1996). fast diffusion together with perpetual innovation makes these landscapes ideal for evolutionary adaptation (fontana and schuster, 1998). the 'classical' definition of secondary structures incorporates a quite restrictive condition on the set of base pairs that implies a tree-like arrangement of the double-helical regions, see fig. 1. additional interactions between different branches of this tree are referred to as pseudo-knots (for an exact definition see section 2). pseudo-knots are excluded from many studies for a mostly technical reason (waterman and smith, 1978a, b): the folding problem for rna can be solved efficiently by dynamic programming (waterman and smith, 1978b; zuker and sankoff, 1984) in their absence. on the other hand, an increasing number of experimental findings, as well as results from comparative sequence analysis, suggest that pseudo-knots are important structural elements in many rna molecules (westhof and jaeger, 1992). notably, functional rnas such as rnasep rna (loria and pan, 1996) and ribosomal rna (konings and gutell, 1995) contain pseudo-knots.
figure 1. the contact structure of escherichia coli rnase p rna contains two pseudo-knots [http://jwbrown.mbio.ncsu.edu/rnasep/home.html]. the conventional secondary structure is drawn on the left-hand side; the (four) regions forming the pseudo-knots are marked by braces, and interacting regions are connected. the arc diagram of the same structure is obtained by arranging the backbone along a line and indicating base pairs by arcs connecting the corresponding bases. the base pairs of the conventional secondary structure are drawn above the line, the two pseudo-knot stems are shown below the backbone. for details see section 2.
the diversity of molecular biological functions performed by pseudo-knots can be subdivided into three groups. pseudo-knots at the 5'-end of mrnas appear to adopt a role in the control of mrna translation. for instance, the expression of replicase is controlled in several viruses either by ribosomal frame shifting (ten dam et al., 1990; brierley et al., 1991; dinman et al., 1991; chamorro et al., 1992; tzeng et al., 1992) or by in-frame readthrough of stop codons (wills et al., 1991). both mechanisms involve pseudo-knots. core pseudo-knots are necessary to form the reaction center of ribozymes. most of the enzymatic rnas with core pseudo-knots, such as rnasep, are involved in cleavage or self-cleavage reactions (michel and westhof, 1990; forster and altman, 1990; brown, 1991; haas et al., 1991). pseudo-knots in the trna-like motifs at the 3'-end of the genomic rna mediate replication control in several groups of plant viral rna (mans et al., 1991). it is important, therefore, to include pseudo-knotted structures into investigations of rna sequence-structure relationships. in particular, we need to know whether the findings (i) through (vi) described above remain true when pseudo-knots are taken into account. assertion (i), the existence of more sequences than structures, is a necessary prerequisite for all subsequent statements concerning the sequence-structure map of rna.
it is necessary therefore to estimate the number of rna structures with pseudo-knots in order to decide whether the results quoted above can in fact be true for 'real' rna molecules. in the following two sections we give a detailed mathematical analysis of what we call bi-secondary structures. in a nutshell, bi-secondary structures generalize the notion of secondary structures to include pseudo-knots without allowing overly involved knotted structures or nested pseudo-knots. in fact, almost all known pseudo-knotted structures, with the notable exception of the e. coli α-mrna, fall into this class. in section 2 we review a variety of equivalent graph-theoretical characterizations of bi-secondary structures and provide a way of efficiently determining whether a list of base pairs corresponds to a bi-secondary structure. then we briefly review a few graph invariants that might be useful for determining the complexity of higher-order structures beyond the realm of bi-secondary structures. at the end of section 2 we show that a convenient distance measure for comparing secondary structures can be used also in the presence of pseudo-knots (section 2.7). in subsection 2.8 we argue that the intersection theorem is valid for general nucleic acid contact structures. we say that an rna sequence is compatible with a structure s if it can in principle form this structure irrespective of energetic constraints. this means that for each base pair (i, j) in s the bases $x_i$ and $x_j$ at the corresponding sequence positions form one of the six possible rna base pairs au, ua, gc, cg, gu, or ug. the set of sequences that actually fold into a given structure s is therefore a subset of the set of compatible sequences. the intersection theorem (reidys et al., 1997) now states that for any two structures s and s' there are sequences which are compatible with both of them. this result is the reason why very different structures with very closely related sequences can exist.
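the notion of compatibility used above is easy to check mechanically. the following is a minimal sketch, not code from the paper: the function name `is_compatible`, the 1-based pair convention and the example sequence are assumptions made for illustration only.

```python
# Allowed RNA base pairs as listed in the text (Watson-Crick pairs plus GU wobble).
ALLOWED_PAIRS = {("A", "U"), ("U", "A"), ("G", "C"),
                 ("C", "G"), ("G", "U"), ("U", "G")}


def is_compatible(sequence, base_pairs):
    """Return True if every pair (i, j), with 1-based positions, joins two
    bases that can in principle pair. Energetics are ignored on purpose."""
    seq = sequence.upper()
    return all((seq[i - 1], seq[j - 1]) in ALLOWED_PAIRS for i, j in base_pairs)


# Two quite different (hypothetical) structures on eight positions.
structure_a = [(1, 8), (2, 7)]
structure_b = [(1, 4), (5, 8)]

# One sequence that is compatible with both, in the spirit of the
# intersection theorem.
shared = "GGAUGAUU"
print(is_compatible(shared, structure_a), is_compatible(shared, structure_b))
```

compatibility is only a necessary condition: whether the sequence actually folds into either structure depends on the energy model, which this sketch deliberately leaves out.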
the fact that the intersection theorem holds for structures with pseudo-knots means that we have to expect shape space covering provided the fraction of neutral mutations is large enough (reidys et al., 1997). in section 3 we determine the number of different structures with pseudo-knots. combinatorial aspects of rna secondary structures have been studied in detail by waterman and co-workers (stein and waterman, 1978; waterman, 1978; waterman and smith, 1978a, b; penner and waterman, 1993; schmitt and waterman, 1994; waterman, 1995) and hofacker et al. (1999). using different techniques we give analytical upper bounds on the number of different bi-secondary structures showing that their number does not increase much faster than the number of secondary structures. the analytical results are complemented by numerical data (see table 2 at the end of section 3) indicating that the number $s_n$ of 'reasonable' bi-secondary structures with chain length n grows approximately as $s_n \sim 2.35^n$. 'reasonable' means here that the structures have no isolated base pairs (i.e., the minimum stack size is $l = 2$) and that hairpin loops contain at least $m = 3$ unpaired bases. for comparison, the number of secondary structures without pseudo-knots grows like $1.86^n$. exhaustive enumeration for short sequences suggests that only $1.65^n$ different secondary structures appear as minimum energy structures of sequences of length n (grüner et al., 1996a). hence the number $4^n$ of rna sequences of length n is much larger than the number of possible structures, independent of whether or not one takes pseudo-knots into account. this observation raises the question of how the sequences that fold into a given structure are distributed in sequence space. in section 4 we describe a set of numerical experiments strongly suggesting that the inclusion of pseudo-knots does not alter the qualitative picture [properties (i) through (vi) above] of the rna sequence-structure map.
a short discussion (section 5) concludes this contribution. readers who are not interested in the mathematical details of defining, characterizing, and counting contact structures of various types might want to skip sections 2 and 3. the three-dimensional structure of a linear biopolymer, such as rna, dna, or a protein, can be approximated by its contact structure, i.e., by the list of all pairs of monomers that are spatial neighbors. contact structures of polypeptides have been introduced by ken dill and co-workers in the context of lattice models of protein folding (chan and dill, 1988; chen and dill, 1995). they arise implicitly in knowledge-based potentials for polypeptides such as the delaunay tessellation potential described in singh et al. (1996). last but not least, rna secondary structures form a special class of contact structures. the purpose of this section is to bring together different mathematical approaches that can be used to describe biopolymer structures: contact graphs, linked diagrams, book embeddings, and graph colorings. a contact structure is represented by the contact matrix c with the entries $c_{ij} = 1$ if the monomers i and j are spatial neighbors without being adjacent along the backbone, and $c_{ij} = 0$ otherwise. hence $c_{ij} = 0$ if $|i - j| \le 1$. we shall use the notation $[n] = \{1, \dots, n\}$. we define a diagram $([n], \Omega)$ to consist of n vertices labeled 1 to n and a set $\Omega$ of arcs that connect non-consecutive vertices. a closely related class of diagrams which also allow arcs between consecutive vertices are the linked diagrams introduced by touchard (1952). these are studied in some detail in hsieh (1973), kleitman (1970), stein (1978) and stein and everett (1978). it is customary to arrange the vertices along the x-axis and to draw the arcs in such a way that they are confined to either the upper or the lower half-plane. the diagram of a contact structure with contact matrix c has the set of arcs
$\Omega = \{\{i, j\} \mid c_{ij} = 1\}$. (1)
the contact matrix is thus the adjacency matrix of the corresponding diagram. with each diagram we may associate a diagram graph $\Gamma(\Omega)$ as follows: let b be the adjacency matrix of the backbone, i.e., the matrix with the entries $b_{i,i+1} = b_{i+1,i} = 1$, $i = 0, \dots, n - 1$, and $b_{0n} = b_{n0} = 1$. then the adjacency matrix of a diagram graph with n + 1 vertices is of the form
$a = b + c$, (2)
where the contact matrix c is extended by a zero row and column for the additional vertex 0. equation (2) establishes the 1-1 correspondence of diagrams and the associated diagram graphs. essentially the same construction can be used for contact structures of molecules with a circular backbone, i.e., for circular ssrna or ssdna. the only restriction is that {1, n} cannot be an arc in the case of a circular molecule. it is convenient in this case to define the corresponding diagram graph without the artificial root 0. each graph with a hamiltonian cycle is then the diagram graph of a contact structure with a circular backbone. the results in the following discussion hold for both linear and circular nucleic acids. a diagram is a 1-diagram if for any two arcs $\alpha, \beta \in \Omega$ either $\alpha \cap \beta = \emptyset$ or $\alpha = \beta$ holds. a diagram is a 1-diagram if and only if the associated diagram graph $\Gamma(\Omega)$ has vertex degrees less than or equal to 3. such graphs are often called sub-cubic or trivalent. the diagram graphs of 1-diagrams are closely related to cubic hamiltonian graphs. the latter are studied in detail in section 9.4 of wagner and bodendiek (1990): a graph s is homeomorphic from a graph g if s can be produced from g by inserting vertices of degree 2 into some edges of g. s is also called a subdivision of g. obviously each cubic hamiltonian graph gives rise to a diagram graph on n vertices by subdividing the edges of a hamiltonian cycle. on the other hand, not all diagram graphs are homeomorphic from a cubic hamiltonian graph: suppose {1, 3} is an arc and 2 is an unpaired vertex. the corresponding diagram graph cannot be homeomorphic from a cubic graph because the triangle 1, 2, 3 cannot be obtained by subdividing a cubic graph.
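the degree criterion for 1-diagrams can be illustrated with a short sketch. it assumes that the diagram graph is the backbone cycle 0, 1, ..., n, 0 plus one edge per arc, which is our reading of the construction above; the function names are hypothetical and arcs are written as pairs (i, j) with i < j and 1-based labels.

```python
def arcs_from_contact_matrix(c):
    """Extract the arc set from a symmetric 0/1 contact matrix (list of lists).
    Entries with |i - j| <= 1 are ignored, as required for a diagram."""
    n = len(c)
    return {(i + 1, j + 1)
            for i in range(n) for j in range(i + 2, n) if c[i][j] == 1}


def diagram_graph_degrees(n, arcs):
    """Vertex degrees of the diagram graph on the vertices 0..n (n >= 2).
    The backbone cycle contributes degree 2 to every vertex, and each arc
    adds one to both of its endpoints."""
    deg = [2] * (n + 1)
    for i, j in arcs:
        deg[i] += 1
        deg[j] += 1
    return deg


def is_one_diagram(n, arcs):
    """A diagram is a 1-diagram iff its diagram graph is sub-cubic."""
    return max(diagram_graph_degrees(n, arcs)) <= 3


print(is_one_diagram(6, {(1, 4), (2, 5)}))   # every vertex lies on at most one arc
print(is_one_diagram(6, {(1, 3), (1, 5)}))   # vertex 1 lies on two arcs
```

for a circular backbone the root vertex 0 would be dropped, as described in the text; the sketch covers the linear case only.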
the classical definition of secondary structures (waterman, 1978) requires that each base interacts with at most one other nucleotide. thus nucleic acid secondary structures are special types of 1-diagrams. the second defining condition is that arcs do not cross. in terms of the contact matrix this means: if c i j = c kl = 1 and i < k < j then i < l < j. with the following notation we will find a simpler formulation of condition 2: let α = {i, j} with i < j be an arc of a diagram. we writeᾱ = [i, j] ⊂ r for the associated interval. two arcs of a diagram are consistent if they can be drawn in the same half-plane without crossing each other. equivalently, two arcs α, β ∈ of a diagram are consistent if either one of the following four conditions is satisfied: case (iv) is ruled out by definition in 1-diagrams. the non-crossing condition thus may be expressed as follows: whenever the intervals of two arcs {i, j} and {k, l} have non-empty intersection then one is contained in the other (schmitt and waterman, 1994) . equivalently, we may simply define that a secondary structure is a 1-diagram in which any two arcs are consistent. as a consequence, each secondary structure can be encoded as a string s of length n in the following way: if the vertex i is unpaired, then s i = '.'. each arc α = {p, q} with p < q translates to s p = '(' and s q = ')'. as the arcs are consistent their corresponding parentheses are either nested, (( )), or next to each other, ()(). as there are no arcs between neighboring vertices in a 1diagram there is at least one dot contained within each parenthesis. a variant of this notation is the mountain representation of rna secondary structures (hogeweg and hesper, 1984) . the 'dot-parenthesis' notation is used as a convenient notation in input and output of the vienna rna package, a piece of public domain software for folding and comparing rna molecules . a graph that can be embedded in the plane (or, equivalently on the sphere) is called planar. 
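the non-crossing condition and the dot-parenthesis encoding described above can be sketched as follows. this is an illustrative sketch rather than the vienna rna package implementation; the function names are hypothetical, and arcs are pairs (i, j) with i < j and 1-based labels.

```python
def crosses(a, b):
    """True if the arcs a = (i, j) and b = (k, l) cannot be drawn in the
    same half-plane, i.e. their intervals overlap without nesting."""
    (i, j), (k, l) = a, b
    return i < k < j < l or k < i < l < j


def is_secondary_structure(arcs):
    """Check the two defining conditions: each vertex lies on at most one
    arc (1-diagram) and no two arcs cross."""
    arcs = sorted(arcs)
    endpoints = [v for arc in arcs for v in arc]
    if len(endpoints) != len(set(endpoints)):
        return False                      # some vertex lies on two arcs
    if any(j - i < 2 for i, j in arcs):
        return False                      # arc between backbone neighbours
    return not any(crosses(a, b)
                   for x, a in enumerate(arcs) for b in arcs[x + 1:])


def to_dot_parenthesis(n, arcs):
    """Encode a secondary structure on n vertices as a string."""
    if not is_secondary_structure(arcs):
        raise ValueError("not a secondary structure")
    s = ["."] * n
    for i, j in arcs:
        s[i - 1], s[j - 1] = "(", ")"
    return "".join(s)


print(to_dot_parenthesis(9, [(1, 9), (2, 5), (6, 8)]))   # ((..)(.))
```

because arcs never join backbone neighbours, every matching pair of parentheses in the output encloses at least one dot, as noted in the text.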
if it can be embedded in the plane in such a way that all its vertices lie on the exterior region it is called outer-planar. this class of graphs was introduced and characterized in terms of subgraphs in chartrand and harary (1967) and sys lo (1979). clearly, a 1-diagram is a secondary structure if and only if its diagram graph ( ) is outer-planar. the outer-planar embedding corresponds to the 'circle representation' of secondary structures. a similar procedure leads to book-embeddings. a p-book is a set of p distinct half-planes (the pages of the book) that share a common boundary line , called the spine of the book. an embedding of a graph into a book b consists of an ordering of the vertices along the spine of the book together with an assignment of each edge to a page of the book, in which edges assigned to the same page do not cross. the book-thickness (sometimes also called the page-number) bt( ) of a graph is the minimal number p of pages of a book into which it can be embedded (bernhart and kainen, 1979) . book-embeddings have a practical application in the context of vlsi design. for an overview see chung et al. (1987) and heath et al. (1992) . not surprisingly, the book thickness is closely related to other embedding properties of graphs. below we list a few important results: (v) bt(k n ) = n/2 , where k n is the complete graph with n vertices (bernhart and kainen, 1979) . n + 6 for sub-cubic graphs (chung et al., 1987) . of a graph is the minimum number of 'handles' one needs to add to a sphere so that the graph can be embedded on the resulting surface without crossing edges.) the book thickness of a variety of other graph classes has been studied in detail, among them hypercubes (chung et al., 1987) , de bruijn graphs (obrenić, 1993) , and various types of network graphs of practical interest (games, 1986). essentially the same construction is used for the investigation of cubic hamiltonian graphs in wagner and bodendiek (1990) . 
we shall see that the inconsistency graph is a useful construction for characterizing embedding properties of diagram graphs. theorem 1. let be a diagram. then the following statements are equivalent. proof. (i ⇐⇒ ii) can be drawn without intersection arcs if and only if ( ) is planar because the hamiltonian cycle h of ( ) divides the plane into the interior and the exterior of h which correspond to the upper and lower half-plane of the diagram , respectively. (ii ⇐⇒ iii) can be shown in the same way as the analogous result for cubic hamiltonian graphs in wagner and bodendiek (1990) , see also even and itai (1971) . (ii ⇐⇒ iv) follows immediate from bernhart and kainen (1979, theorem 2.5) as a planar diagram graph is by construction hamiltonian. as noted in even and itai (1971) , the determination of the book thickness of a is equivalent to finding a minimal vertex-coloring of a certain circle graph, which in our case is the intersection graph ( ). this problem is in general np-complete. the following observation simplifies the task by reducing the number of arcs that have to be considered. two it is easy to show that the arcs of a stem of a 1-diagram are either all isolated vertices or they are contained in the same component of the inconsistency graph ( ). furthermore, all arcs of a stem have the same adjacent vertices in ( ). we may therefore use a reduced intersection graphˆ ( ), the vertices of which are the stems. (in addition, we may recursively remove vertices of degree 2 that are not contained in a triangle before forming the intersection graph. this has the effect of removing bulges and interior loops that interrupt stems.) examples of reduced intersection graphs are given in figs 3 and 4. most of the literature on linked diagrams deals with complete diagrams, that is, each vertex x ∈ [n] is incident with an arc (touchard, 1952; kleitman, 1970; stein, 1978) . 
it is straightforward to extend touchard's definition of reducible diagrams to the incomplete diagrams considered here: the following equivalence is proved in haslinger (1997): reducible diagrams can therefore be viewed as being composed of substructures. these substructures do not in general conform to the conventional decomposition into stems and loops of an rna that forms the basis of the standard energy model of nucleic acid secondary structures (freier et al., 1986) . the contact structure of the proposed srv-1 frameshift signal contains a pseudoknot, see ref. ten dam et al. (1994) . pseudo-knots such as this one belong to the class of bi-secondary structures. knots such as the one in the lower part of the figure do not belong to the class of bi-secondary structures. knots, in contrast to pseudo-knots, may contain parallel stranded helices which so far have not been described for rna. we may draw the arcs in the upper or lower half-plane, but they are not allowed to intersect the x-axis. in other words, it can be embedded in 2-page book. bi-secondary structures are therefore 'superpositions' of two secondary structures. the virtue of bi-secondary structures is that they capture a wide variety of rna pseudo-knots, [figs 1 and 2 (upper part)], while at the same time they exclude true knots. knotted rnas could in principle arise either from parallel stranded helices (fig. 2) , or in very large molecules from sufficiently complicated crosslinking patterns. parallel-stranded rna has not been observed (so far), see, however, fortsch et al. (1996) on parallel-stranded dna. wollenzien cantor et al. (1980) have searched unsuccessfully for knots in large rnas. the definition of bi-secondary structures, by allowing a planar drawing of the structure, rules out both possibilities. among the rna structures with pseudo-knots, almost all are bi-secondary structures. 
our examples include several viral rnas such as coronavirus (brierley et al., 1991) , luteovirus (ten dam et al., 1990) , and retrovirus rna (chamorro et al., 1992) , as well as catalytic rnas such as rnasep rna (loria and pan, 1996), tmrna (vlassov et al., 1995; felden et al., 1997) , and ribosomal rnas (gutell et al., 1994) . we have encountered only a single exception, namely αmrna (tang and draper, 1990) . theorem 2. let be a 1-diagram. then the following statements are equivalent: (v) among any three arcs of at least two are consistent. proof. the equivalence of (i), through (iv) is proved in theorem 1 for all diagrams. the equivalence of (v) and (vi) follows immediately from the definition of ( ). the implication (iii ⇒v) is obvious. finally, it is possible to show that ¬(ii) implies ¬(v) based on kuratowski's (1930) theorem. for the details we refer to haslinger (1997). the practical importance of theorem 2 lies in the fact that existence or nonexistence of triangles in ( ) can be checked very easily, and hence we have a very efficient (polynomial time) method for deciding whether a diagram is a bi-secondary structure or not. note that the equivalence of (iii) and (vi) does not hold for general diagrams. a counterexample is shown in fig. 3 . being the union of the two secondary structures ([n], u ) and ([n], l ) we can represent each bi-secondary structure as a string s using two types of parentheses: as in a secondary structure we write a dot '.' for all unpaired vertices. a pair { p, q} ∈ u becomes s p = '(' and s q = ')', while an arc { p, q} ∈ l becomes s p = '[' and s q = ']'. unfortunately, the decomposition of a bi-secondary structure into two secondary structures is in general not unique, see fig. 4 . the fact that ( ) is bipartite allows us to define a normal form for this representation by means of the following rule: the leftmost arc of each connected component of ( ) belongs to u . in particular, all isolated vertices of ( ) are contained in u . 
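the two-bracket representation and its normal form can be sketched by 2-coloring the inconsistency graph, whose vertices are the arcs and whose edges join crossing arcs. this is a minimal sketch under stated assumptions: the input is taken to be a 1-diagram with arcs (i, j), i < j, 'leftmost arc' is read as the arc with the smallest left endpoint, and the function names are hypothetical.

```python
from collections import deque


def crosses(a, b):
    """True if the arcs a = (i, j) and b = (k, l) are inconsistent."""
    (i, j), (k, l) = a, b
    return i < k < j < l or k < i < l < j


def bisecondary_string(n, arcs):
    """Return the normal-form string of a bi-secondary structure, or None
    if the inconsistency graph is not bipartite."""
    arcs = sorted(arcs)                   # ordered by left endpoint
    color = {}
    for start in arcs:
        if start in color:
            continue
        # 'start' is the leftmost arc of a new component: it goes to the
        # upper half-plane, written with round parentheses.
        color[start] = 0
        queue = deque([start])
        while queue:
            a = queue.popleft()
            for b in arcs:
                if not crosses(a, b):
                    continue
                if b not in color:
                    color[b] = 1 - color[a]
                    queue.append(b)
                elif color[b] == color[a]:
                    return None           # odd cycle, e.g. a triangle
    s = ["."] * n
    for (i, j), c in color.items():
        s[i - 1], s[j - 1] = ("(", ")") if c == 0 else ("[", "]")
    return "".join(s)


print(bisecondary_string(10, [(1, 7), (2, 6), (4, 10), (5, 9)]))  # ((.[[)).]]
print(bisecondary_string(6, [(1, 4), (2, 5), (3, 6)]))            # None
```

the quadratic scan over all arcs keeps the sketch short; it is polynomial, in line with the efficiency remark after theorem 2, but not tuned for long chains.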
the normal form of a secondary structure therefore contains only dots and (round) parentheses. within each non-trivial connected component of the inconsistency graph $\Theta(\Omega)$ the distribution of arcs between u and l is unique because the component is bipartite. all arcs in a stack have a common neighboring vertex in $\Theta(\Omega)$, hence they all belong to the same class of the partition. therefore, in normal form, all arcs belonging to the same stack are written with the same type of brackets. the following example shows that there are natural rna structures that are more complicated than bi-secondary structures. the escherichia coli α-operon mrna folds into a structure that is required for allosteric control of translational initiation (tang and draper, 1990). compensatory mutations have defined an unusual pseudo-knotted structure (tang and draper, 1989), the thermodynamics of which were subsequently investigated in detail (gluick and draper, 1994). the diagram of its contact structure cannot be drawn without intersections, see fig. 5. to our knowledge it is the only known rna structure that cannot be embedded in a 2-page book.
figure 5. diagram of the contact structure of e. coli α-mrna. the structure contains five stems, labeled by uppercase greek letters. we may choose the color partition of $\Theta(\Omega)$ such that all arcs in a stem have the same color. it therefore suffices to draw the inconsistency graph for stems (r.h.s. of the figure). it contains triangles, thus the diagram of this rna structure is not a bi-secondary structure. it is easy to check that $\chi(\Theta(\Omega)) = 3$.
in this subsection we briefly discuss a few graph properties that could be used for a classification of polymer structure complexity beyond the realm of bi-secondary structures. clearly, one may use the book thickness. a closely related quantity is the chromatic number of the intersection graph: a color partition of a graph g is a partition $v = v_1 \cup v_2 \cup \dots \cup v_c$ of its vertex set into c subsets $v_i$ such that no two vertices in $v_i$ are adjacent. the chromatic number $\chi(g)$ is the smallest number c of colors for which a color partition of g can be found. an arbitrary diagram can be decomposed into substructures by means of the following obvious result: let $([n], \Omega)$ be a diagram and let $\Omega = \Omega_1 \cup \dots \cup \Omega_c$ be a color partition of the inconsistency graph $\Theta(\Omega)$. noticing that $\chi(g) = 1$ if g contains no edges and $\chi(g) = 2$ if g is bipartite with non-empty edge set, the following characterization follows immediately: $\Omega$ is a bi-secondary structure iff $\chi(\Theta(\Omega)) \le 2$. clearly, $\chi(\Theta(\Omega))$ equals the minimum number of pages of all book embeddings in which the ordering of the vertices along the spine coincides with the natural ordering along the backbone. in general, we have $bt(\Gamma(\Omega)) \le \chi(\Theta(\Omega))$ for all diagrams. we remark that graphs with moderate chromatic numbers can be characterized by results similar to kuratowski's theorem for planar graphs. for instance, one can show for $k \le 4$ that a graph with chromatic number $\chi(g) \ge k$ contains a subdivision of the complete graph $k_k$ (dirac, 1952). the generalization of this proposition to $k > 4$ is known as hajós' conjecture. it is false for $k \ge 7$ and unsolved for k = 5 and k = 6 (holton and sheehan, 1993). it seems that $\chi(\Theta(\Omega))$ is in fact the more useful quantity, as there are no efficient algorithms to determine the book-thickness of a given graph, and $\chi(\Theta(\Omega))$ accounts for the immutable ordering of the backbone vertices, whereas the book-thickness might decrease by changing this ordering. a quite different algebraic graph invariant µ, introduced by de verdière (1990), leads to the same classification of structures for small µ:
µ = 1: $\Gamma(\Omega)$ is a circle, $\Omega$ has no arcs.
µ ≤ 2: $\Gamma(\Omega)$ is outer-planar, $\Omega$ is a secondary structure.
µ ≤ 3: $\Gamma(\Omega)$ is planar, $\Omega$ is a bi-secondary structure.
the graphs with µ ≤ 4 have recently been identified as the flat or linklessly embeddable graphs (lovász and schrijver, 1996).
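for small diagrams the chromatic number of the inconsistency graph discussed above can be found by exhaustive search. the sketch below is illustrative only: it assumes arcs are given as pairs (i, j) with i < j, the function names are hypothetical, and the brute-force search is exponential in the number of arcs, so it is meant for a handful of arcs or stems.

```python
from itertools import product


def crosses(a, b):
    """True if the arcs a = (i, j) and b = (k, l) are inconsistent."""
    (i, j), (k, l) = a, b
    return i < k < j < l or k < i < l < j


def inconsistency_chromatic_number(arcs):
    """Smallest number of colors such that crossing arcs get different
    colors. Returns 0 for a diagram without arcs."""
    arcs = sorted(arcs)
    m = len(arcs)
    edges = [(x, y) for x in range(m) for y in range(x + 1, m)
             if crosses(arcs[x], arcs[y])]
    for c in range(1, m + 1):
        for coloring in product(range(c), repeat=m):
            if all(coloring[x] != coloring[y] for x, y in edges):
                return c
    return 0


print(inconsistency_chromatic_number([(1, 9), (2, 5), (6, 8)]))           # 1
print(inconsistency_chromatic_number([(1, 7), (2, 6), (4, 10), (5, 9)]))  # 2
print(inconsistency_chromatic_number([(1, 4), (2, 5), (3, 6)]))           # 3
```

a value of 1 corresponds to a secondary structure, a value of at most 2 to a bi-secondary structure, and three mutually crossing arcs already force a third page.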
a useful characterization of this class of graphs is proved in robertson et al. (1995, b): 'a graph is flat if and only if it has no minor in the so-called petersen family'. the graph $v_8^*$, fig. 6, is a valid diagram graph. it is easy to check that $v_8^*$ is flat and that its inconsistency graph is $\Theta(v_8^*) = k_4$. hence there are flat diagram graphs for which $\chi(\Theta(\Omega)) \ge 4$. thus there is no direct correspondence between $\chi(\Theta(\Omega))$ and µ, not even for 1-diagrams.
figure 6. the graph $v_8^*$ and its inconsistency graph.
interpreting each arc {i, j} as a transposition (i, j) on [n] we may assign the permutation
$\pi(\Omega) = \prod_{\{i,j\} \in \Omega} (i, j)$ (3)
to each diagram $\Omega$. one observes: (i) if $\Omega$ is a 1-diagram then $\pi(\Omega)$ is an involution. (ii) an involution π is the permutation representation of a 1-diagram if and only if its cycle decomposition does not contain a canonical transposition, i.e., a transposition of the form (i, i + 1). (iii) different 1-diagrams give rise to different involutions. a natural set of generators for the symmetric group $s_n$ is the set t of all transpositions. the corresponding length function is $\ell(\pi) = n - \mathrm{cyc}(\pi)$, where cyc(π) is the number of cycles into which π decomposes. we have $\ell(\tau) = 1$ if and only if $\tau \in t$ is a transposition. the associated metric $d(\pi, \sigma) = \ell(\pi \sigma^{-1})$ is the canonical metric on the cayley graph $(s_n, t)$, see reidys and stadler (1996) for a detailed discussion. as the involutions form a subset of $s_n$ we have theorem 3. the function $d(\Omega, \Omega') = \ell(\pi(\Omega) \pi(\Omega')^{-1})$, where $\pi(\Omega)$ denotes the permutation representation of a diagram $\Omega$, is a metric on the set of all 1-diagrams with n vertices. in particular, two 1-diagrams $\Omega$ and $\Omega'$ have distance $d(\Omega, \Omega') = 1$ if and only if they differ by a single arc. metrics on 'shape space' are necessary for a detailed quantitative study of sequence-structure maps. applications to rna secondary structures are reported for instance in fontana et al. (1993a). the permutation representation (3) is not limited to defining a metric on the set of structures.
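the metric on 1-diagrams can be sketched in a few lines. the sketch assumes the standard length function on the cayley graph generated by all transpositions, namely n minus the number of cycles, and the distance given by the length of the product of the first permutation with the inverse of the second; both formulas are our reading of the construction above, and the function names are hypothetical.

```python
def permutation(n, arcs):
    """Involution of a 1-diagram as a 0-based list: each arc {i, j}
    (1-based) swaps i and j, unpaired vertices are fixed points."""
    p = list(range(n))
    for i, j in arcs:
        p[i - 1], p[j - 1] = j - 1, i - 1
    return p


def cycle_count(p):
    """Number of cycles of a permutation, fixed points included."""
    seen = [False] * len(p)
    count = 0
    for start in range(len(p)):
        if not seen[start]:
            count += 1
            x = start
            while not seen[x]:
                seen[x] = True
                x = p[x]
    return count


def diagram_distance(n, arcs1, arcs2):
    """Length of pi(arcs1) * pi(arcs2)^(-1), i.e. n minus its cycle count.
    The second factor is an involution, so it is its own inverse."""
    p, q = permutation(n, arcs1), permutation(n, arcs2)
    composed = [p[q[x]] for x in range(n)]
    return n - cycle_count(composed)


print(diagram_distance(6, [(1, 4)], [(1, 4), (2, 6)]))   # differ by one arc
print(diagram_distance(5, [(1, 3)], [(1, 4)]))           # one arc shifted
```

shifting one endpoint of an arc costs two steps in this metric because the product of the two transpositions is a 3-cycle.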
suppose we are given an alphabet of monomers (for instance {a, u, g, c} in the case of rna) and a rule that determines which pairs of monomers may form a base pair (au, ua, gc, cg, gu, ug in the case of rna). the proof of the intersection theorem in reidys et al. (1997) is valid for all 1-diagrams, not only for secondary structures. the intersection theorem sets the stage for shape space covering: it allows close-by sequences to fold into structures that are as different as desired, given a suitable folding potential. further applications of equation (3) can be found in weber (1997).

the number $x_n$ of all diagrams on n vertices is $x_n = 2^{(n-1)(n-2)/2}$, as there are (n − 1)(n − 2)/2 possible arcs (söler and jankowski, 1991), which can be arbitrarily combined to form a diagram. in section 2.7 we have shown that all 1-diagrams correspond to involutions, therefore the number $t_n$ of involutions on [n] is an upper bound for the number $d_n$ of 1-diagrams on [n]. the combinatorics of involutions is discussed for instance in the book by wilf (1994): $t_n$ satisfies the recursion $t_n = t_{n-1} + (n-1)\,t_{n-2}$ for n ≥ 2 with $t_0 = t_1 = 1$, and has the asymptotic form $t_n \sim \frac{1}{\sqrt{2}}\, n^{n/2} \exp\!\left(-\frac{n}{2} + \sqrt{n} - \frac{1}{4}\right)$. the number of involutions $t_n$ therefore grows faster than exponentially in the sense that $\sqrt[n]{t_n} \to \infty$. 1-diagrams can be counted by a very similar recursion as the following result shows.

proof. the first few values of $d_n$ are obvious, $d_0 = 1$ is a convenient definition. the recursion is derived as follows: a 1-diagram on n + 2 vertices can be formed either by adding a lone vertex to a 1-diagram on n + 1 vertices or by adding an arc {1, k} to a structure Ω′ on n vertices, inserting the vertex labeled k between the (k − 1)st and the kth vertex of Ω′. note, however, that Ω′ must be a 1-diagram except that it may in addition contain an arc {k − 1, k}, as these vertices are separated by the endpoint of the newly introduced arc in the new structure.
viewing this differently, we may either add the arc {1, k} or the -like structure consisting of the arcs {1, k} and {k − 1, k + 1}, which leaves us with a 1-diagram on n − 2 vertices and the same problem. repeating this argument we arrive at the following expansion: hence we have d n+2 = d n+1 + n d n + (n − 1)d n−2 + (n − 3)d n−4 + · · · . observing that d n+1 can of course be written in the same form and substituting into the above equations yields d n+2 = (n +1)d n +n d n−1 +(n −1)d n−2 +(n −2)d n−3 +· · ·+2d 1 + d 0 − d n−1 . subtracting the corresponding expansion for d n+1 yields a simple rearrangement now completes the proof. proof. the series d n is obviously monotonically increasing. hence the series a n+2 = (n +1)a n , a 0 = a 1 = 1 is a lower bound. it is well known that a n = (n −1)!! grows faster than exponentially. a very similar formula is obtained for the case of a circular backbone. there are d n−2 diagrams with arc {1, n} on n vertices. thus the number of 1-diagrams with circular backbone is d n = d n − d n−2 . an exponential upper bound can be found, however, on the numbers d n (c) of 1-diagrams whose inconsistency graph has chromatic χ( ( )) ≤ c. we find theorem 6. d n (c) ≤ (2c + 1) n . then there is a color partition of with c colors. as ([n] , i ) is a secondary structure, it can be encoded in dot-parenthesis notation. coloring the parenthesis with a different color for each class i of the color partition hence yields a unique representation of . this representation can be interpreted as a string of length n over an alphabet consisting of '.' and c different pairs of brackets, i.e., with 2c + 1 letters. theorem 6 is not a very good estimate as we shall see in section 3.3. a secondary structure on n+1 digits may be obtained from a structure on n digits either by adding a free end at the right-hand end or by inserting a base pair 1 ≡ (k + 2). 
in the second case the substructure enclosed by this pair is an arbitrary structure on k digits, and the remaining part of length n − k − 1 is also an arbitrary valid secondary structure. therefore, we obtain the following recursion formula for the number $S_n$ of secondary structures:

$S_{n+1} = S_n + \sum_{k=1}^{n-1} S_k\, S_{n-k-1}, \qquad S_0 = S_1 = S_2 = 1. \qquad (6)$

this expression has first been derived by waterman (1978). the most important numbers are collected in table 1. a more detailed table can be found in hofacker et al. (1999). detailed combinatorial studies on various aspects of secondary structure graphs are based on equation (6), see for instance penner and waterman (1993), stein and waterman (1978), waterman (1978), smith (1978a, b) and hofacker et al. (1999). in the following we shall make use of the number $S(n, k) = \frac{1}{k}\binom{n-k}{k+1}\binom{n-k-1}{k-1}$ (for k ≥ 1) of secondary structures of length n with k base pairs. this closed formula was recently derived in schmitt and waterman (1994).

a first naive upper bound is $d_n(2) \le S_n^2$, because on each side of the x-axis we have a secondary structure on n vertices. theorem 5 implies $d_n(2) \le 5^n$. a slightly better bound can be derived using the enumeration of secondary structures.

proof. we start with the S(n, k) secondary structures with k arcs. in order to produce a bi-secondary structure we use 2l of the n − 2k unpaired positions for introducing l additional arcs. there are $\binom{n-2k}{2l}$ possible choices for these additional pairs, which may form any of the $C_l = \frac{1}{l+1}\binom{2l}{l}$ possible configurations of l matched parentheses. $C_l$ is a catalan number. without losing generality we may assume that l ≤ k, i.e., the partial secondary structure with the larger number of pairs is drawn above the x-axis. thus replacing the sums by appropriate multiples of the maximum entry is trivial. note that this bound is still a gross overestimate: (i) it contains all the redundancy of the ().[]-representation.
(ii) the number c l also counts conformations of square brackets of the form [], which do not correspond to a graph at all, and it counts conformations in which not all square brackets are inconsistent with an arc that is represented by a round bracket. these latter configurations are counted more than once. proof. let a n (k, l) denote argument of the maximum in lemma 2. it is straightforward to compute solving the optimization problem that defines a is straightforward. a short computation shows thatŷ = 1/ √ 21 andx = (7 − √ 21)/14 is the only local maximum with x, y ≤ 1/2. it violates the condition y ≤ x, however. the solution thus lies on the boundary of the triangle (0, 0), (1/2, 0) and (1/4, 1/4). setting y = 0 one obtains the maximumx = 1/2 − 1/ √ 20. along the edge x + y = 1/2 we findŷ = 1/ √ 12 violating the condition y ≤ x. with x = y we arrive at the cubic equation 31x 3 − 31x 2 + 10x − 1 = 0 which has a single real solution x ≈ 0.1942. we find a(x,x) ≈ 1.5605329 = a, because this value is much larger than the values of a(x, y) at the three corners of the triangle. more sophisticated models of rna take into account that (i) base pairs must enclose at least m = 3 other bases, and (ii) that isolated base pairs are energetically disfavored. in hofacker et al. (1999) the numbers (m,l) n of secondary structures with stack size at least l base pairs and separation of the vertices incident with an arc at least m is derived. we define (m,l;κ) n to be the number of 1-diagrams with χ ( ( )) ≤ κ and with the same restrictions, and set clearly we have (m,l;2) n ≤ [ (m,l) n ] κ because the 1-diagram is a superposition of at most κ secondary structures. in particular, we find the upper bound a (2) 3,2 ≤ 3.418 for the biophysical case. we have not been able to derive an exact counting series for bi-secondary structures. hence we resorted to a numerical survey. 
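for very small n the counts discussed above can be checked by brute force. the sketch below enumerates all 1-diagrams on n vertices and classifies them, assuming that two arcs are inconsistent exactly when they cross along the backbone, that secondary structures are the diagrams without crossings, and that bi-secondary structures are those whose inconsistency graph is bipartite. the function names are hypothetical and the enumeration is only practical for short chains.

```python
from collections import deque
from itertools import combinations


def one_diagrams(n):
    """Yield every arc set on vertices 1..n with at most one arc per vertex
    and no arc between backbone neighbours i and i + 1."""
    def extend(free, arcs):
        if not free:
            yield list(arcs)
            return
        i, rest = free[0], free[1:]
        yield from extend(rest, arcs)  # vertex i stays unpaired
        for idx, j in enumerate(rest):
            if j - i >= 2:
                yield from extend(rest[:idx] + rest[idx + 1:], arcs + [(i, j)])
    yield from extend(list(range(1, n + 1)), [])


def crossing_graph(arcs):
    """Assumed inconsistency graph: one node per arc, one edge per crossing pair."""
    adj = {a: set() for a in arcs}
    for (i, j), (k, l) in combinations(arcs, 2):
        if i < k < j < l or k < i < l < j:
            adj[(i, j)].add((k, l))
            adj[(k, l)].add((i, j))
    return adj


def is_bipartite(adj):
    """Breadth-first 2-coloring; fails exactly when an odd cycle exists."""
    color = {}
    for start in adj:
        if start in color:
            continue
        color[start] = 0
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in color:
                    color[v] = 1 - color[u]
                    queue.append(v)
                elif color[v] == color[u]:
                    return False
    return True


def classify_counts(n):
    """Return (secondary, bi-secondary, all 1-diagrams) on n vertices."""
    secondary = bisecondary = total = 0
    for arcs in one_diagrams(n):
        adj = crossing_graph(arcs)
        total += 1
        secondary += not any(adj.values())  # no crossing pair at all
        bisecondary += is_bipartite(adj)
    return secondary, bisecondary, total


if __name__ == "__main__":
    for n in range(9):
        print(n, classify_counts(n))
```

the first column can be compared with the recursion for secondary structures, which gives an independent check on the enumeration.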
we pursued three different strategies for estimating the number of bi-secondary structures: (1) complete enumeration is feasible only for very small values of n because the number of structures grows faster than $2^n$. (2) as an alternative we produce random strings from the alphabet ().[] and check for each string whether it is the normal form of a bi-secondary structure. the number of bi-secondary structures is then estimated by $5^n \times n_{\mathrm{nf}} / n_{\mathrm{sample}}$, where $n_{\mathrm{sample}}$ is the size of the random sample and $n_{\mathrm{nf}}$ is the number of detected normal forms in the sample. our best estimates are compiled in table 2. in the biologically interesting case, m = 3 and l = 2, we find $a^{(2)}_{3,2} \approx 2.35$. judging from the exhaustive enumeration data (grüner et al., 1996a) we should expect that the number of structures that actually occur as minimum energy structures is still smaller.

in order to incorporate pseudo-knots into secondary structure computations we first have to devise an energy model. naturally, we require that this energy function extends the standard model for rna secondary structures without pseudo-knots. the standard energy model is based on decomposing a secondary structure into its 'loops' (zuker and sankoff, 1984). for secondary structures without pseudo-knots this decomposition is unique and coincides with the so-called minimum cycle basis of the secondary structure graph (leydold and stadler, 1998). the free energy of a particular secondary structure is computed as the sum of the contributions of the individual loops. these contributions depend on the type of the loop (stacked base pairs, hairpin loop, bulge, interior loop, or multi-branch loop), its size, and on the sequence of nucleotides, see e.g., walter et al. (1994). we emphasize that the energy model for pseudo-knotted structures introduced in this section is not intended as an accurate potential for predicting pseudo-knots in particular (biologically relevant) sequences.
it is intended as a simplified model that allows us to investigate the likelihood of pseudo-knots in an ensemble of sequences and the stability of pseudo-knots against point mutations of the sequence. it is shown in tacker et al. (1996) for (pseudo-knot-free) rna secondary structures that such statistical properties are surprisingly robust against changes in the parameter set and the choice of the folding algorithm. for instance, most global properties of rna folding are already present in the 'maximum matching' model, which, instead of an elaborate energy model, simply seeks to maximize the number of base pairs (tacker et al., 1996). a potential function that captures the most salient features of pseudo-knots is therefore sufficient for our purposes.

very little experimental information is available on the thermodynamics of pseudo-knots, see, however, wyatt et al. (1990). on the other hand, the geometric constraints of rna structures are well understood (saenger, 1984; pleij et al., 1985). hence we start from the following three principles: (i) loops that are not involved in pseudo-knots have the same energy contributions as in pseudo-knot-free rna secondary structures. (ii) the stacking energies of base pairs are not affected by pseudo-knot formation even in stems that are part of pseudo-knots. (iii) steric hindrance is the major contribution to the pseudo-knot energies. the energy parameters detailed in walter et al. (1994), and implemented in release 1.2 of the vienna rna package, are used in this study for the non-pseudo-knot contributions.

figure 7. schematic drawing of an rna structure with pseudo-knots. the three loops a, b, c and the four stems 1, 2, 3, and 6 are involved in pseudo-knots. the evaluation of loops a and c is straightforward as they contain only a single paired region, namely stack 3. three stems are contained in loop b; we assume that stack 6 is the longest one. the ν-parameters of the three pseudo-knotted loops are listed on the r.h.s. the energy contributions of base pair stacking and the contributions of all unmarked loops are evaluated according to the standard model.

the basic idea for parameterizing the pseudo-knot contributions rests on two simplifications: (i) rna stacks are viewed as stiff rods and (ii) unpaired regions are assumed to be very flexible. within a loop that is involved in pseudo-knot formation, we assume that each of the stacks formed by the pseudo-knotted base pairs is a stiff helix. this reasoning leads to an ansatz based upon the following quantities:
u = number of unpaired bases in the loop.
$l_{\max}$ = number of base pairs in the longest pseudo-knot stack.
$l_i$ = number of bases in pseudo-knot stack i.
k = number of stacked base pairs that can be bridged by one unpaired base.

first we define a measure for the sterical hindrance in the pseudo-knotted loop: this expression assumes that all other parts of a loop can be used to meet the constraint introduced by the longest stacked region $l_{\max}$ within the loop, see fig. 7. the free energy contributions of the unpaired regions can be estimated from a theory by jacobson and stockmayer (1950). the same approach is used for long loops in the standard energy model for rna secondary structures. if the free energy needed to join the ends of an unrestricted, zero volume polymer is known, the theory predicts the free energy needed to form a similar but larger loop. the minimum length of an rna loop that behaves according to the jacobson-stockmayer theory is not known. we therefore introduce a parameter ν and define the energy function as follows: our energy model therefore has four free parameters that need to be estimated from the available experimental data, namely k, ν, $e_{ps}$ and α. for simplicity we fixed α at the same value that is used for all non-pseudo-knotted loops: α = 1078.56 cal mol$^{-1}$ (at 37 °C).
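the role of α can be illustrated with the logarithmic extrapolation that is commonly associated with the jacobson-stockmayer theory. the sketch below is an assumption about the functional form, not the energy function of this paper: it extrapolates from a reference loop of known energy, and the reference size and energy used in the example are purely hypothetical.

```python
import math

ALPHA = 1078.56  # cal/mol at 37 degrees C, the value quoted in the text


def extrapolated_loop_energy(size, ref_size, ref_energy, alpha=ALPHA):
    """Assumed Jacobson-Stockmayer form: E(size) = E(ref) + alpha * ln(size / ref).

    size, ref_size : loop sizes in unpaired bases (size must be >= ref_size)
    ref_energy     : free energy of the reference loop in cal/mol
    """
    if ref_size <= 0 or size < ref_size:
        raise ValueError("size must be at least ref_size, which must be positive")
    return ref_energy + alpha * math.log(size / ref_size)


if __name__ == "__main__":
    # hypothetical reference loop: 30 unpaired bases costing 6000 cal/mol
    for size in (30, 45, 60, 120):
        print(size, round(extrapolated_loop_energy(size, 30, 6000.0), 1))

    # the quoted alpha agrees with 1.75 * R * T at 37 degrees C
    gas_constant = 1.98717  # cal / (mol K)
    print(round(1.75 * gas_constant * 310.15, 2))
```

under this form, doubling the loop size always adds the same increment α ln 2, independent of the reference loop.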
given the sequence, one can compute the secondary structure with the minimum energy by means of dynamic programming (waterman, 1978; zuker and sankoff, 1984) . in the presence of pseudo-knots this is no longer true. in the present study we use tacker's kinetic folding algorithm (tacker et al., 1996) which is based on (martinez, 1984) . it first produces a list of all possible stems of a given sequence and then determines the free energies of the loops and stacks. the most stable stem is the first one added to the folding structure. using this as a constraint, we compile a list of the remaining possible stems and add the most stable one to the growing structure. this procedure is repeated until the free energy of the structure cannot be decreased anymore. the parameters k ,ν, and e ps are adjusted by predicting the structures of a sample of sequences that are known to form pseudo-knots. this set includes seven fragments with about 80 nt from bacteriophages that form h-type pseudo-knots, e. coli tmrna containing five pseudo-knots, and rnase p sequences from several different species [for details see haslinger (1997) ]. the best results were obtained using k = 4,ν = 9, e ps = 4.2 kcal mol −1 . the same value of e ps was used in abrahams et al. (1990) . in order to check the influence of these parameters on the sequence-structure relation of rna we also used a parameter set leading to an unrealistically large number of predicted pseudo-knots in the test sequences (k = 3,ν = 10, e ps = 2.0 kcal mol −1 ). the average number of base pairs and related statistical properties of the predicted structures depend very little on the inclusion of pseudo-knots and the choice of the pseudo-knot parameters. this is not surprising as the relative stability of base pairs and unpaired regions remains essentially unchanged. the average loop size decreases with the 'unrealistic' pseudo-knot potential because loop regions may take part in pseudo-knots at very little entropic cost. 
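the greedy stem-addition loop described above can be sketched as follows. this is a deliberately simplified illustration and not tacker's algorithm: each candidate stem carries a fixed, hypothetical energy, whereas the actual procedure re-evaluates loop and stack energies in the context of the growing structure. compatibility is reduced to the requirement that no base takes part in two stems.

```python
def greedy_fold(stems):
    """Greedy stem selection.

    stems: list of (energy, positions) pairs, where energy is in arbitrary
           units (lower is more stable) and positions is the set of bases
           the stem occupies.
    Returns the list of chosen stems and their summed energy.
    """
    chosen, used, total = [], set(), 0.0
    remaining = list(stems)
    while True:
        # keep only stems that do not reuse a base of the growing structure
        remaining = [s for s in remaining if not (s[1] & used)]
        # a stem is only worth adding if it lowers the energy
        candidates = [s for s in remaining if s[0] < 0]
        if not candidates:
            return chosen, total
        best = min(candidates, key=lambda s: s[0])
        chosen.append(best)
        used |= best[1]
        total += best[0]


if __name__ == "__main__":
    # hypothetical stems: (energy, occupied positions)
    stems = [
        (-5.0, {1, 2, 9, 10}),
        (-3.0, {2, 3, 7, 8}),      # clashes with the first stem at base 2
        (-2.0, {4, 5, 12, 13}),
        (1.0, {20, 21, 25, 26}),   # destabilizing, never added
    ]
    print(greedy_fold(stems))
```

because only base overlap is tested, crossing stems are accepted, which is what allows pseudo-knots to appear in such a scheme.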
the frequency of pseudo-knots in random sequences is tabulated in table 3. for the realistic potential we find a pseudo-knot every ∼2500 bases, while with the exaggerated potential one would expect one pseudo-knot in every random sequence of length n = 148. as we have seen in the previous section, there are still many more sequences than structures. in order to obtain a better impression of the relationship between the numbers of sequences and structures that arise through folding, we determine the rank order statistics of folded structures. to this end we compute the structures of a large number of randomly chosen sequences and rank them according to their frequency f of occurrence in the sample. a plot of log f versus the logarithm of the rank reveals a generalized zipf's law (zipf, 1949), fig. 8. while the inclusion of pseudo-knots somewhat increases the fraction and the diversity of rare structures (large ranks), it does not change the general shape of the distribution. as for 'pure' secondary structures there is only a small number of common structures into which almost all sequences fold. naturally, we ask how sequences folding into the same (common) secondary structure are distributed in sequence space. we call the set s(ψ) of all sequences (genotypes) folding into phenotype (contact structure) ψ the neutral set of ψ. more precisely, s(ψ) is the pre-image of ψ with respect to the folding map defined by the folding algorithm. as for 'pure' secondary structures, a large fraction λ of point mutations is neutral, i.e., does not change the structure. on the other hand, rna sequences folding into a particular structure are not significantly clustered: they form a percolating network spanning the entire sequence space. the fraction λ of neutral point mutations was estimated from 6000 independently generated random sequences, see table 4. as observed in grüner et al.
(1996a, b), we find that λ decreases somewhat with chain length (the large values for n = 30 being caused in part by the large number of short sequences that 'fold' into the open structure). the fraction of neutral neighbors approaches an asymptotic value slightly above 0.5. surprisingly, this value is almost independent of the potential function: even a potential leading to a large fraction of pseudo-knotted structures decreases λ only by a few percent. a random graph theory (reidys et al., 1997; reidys, 1997) shows that there is a threshold value of about λ* = 0.307 (for a 4-letter alphabet). if the fraction of neutral neighbors exceeds this threshold, then the set of all sequences folding into a given structure s forms a single connected network, which has been termed the neutral network of s. these neutral networks can be conveniently detected by means of a simple computer experiment. a neutral path starts at a randomly chosen sequence. then we construct a series of subsequent mutants such that each sequence along the path folds into the same structure as the initial sequence, and such that each step increases the hamming distance from the starting point. the strict logic of base pairing in rna makes it necessary to consider two types of mutations: (i) point mutations in the unpaired regions of the molecules, and (ii) the substitution of one possible base pair (gc, cg, gu, ug, au, ua) by another. all other mutations in paired regions necessarily change the structure, for instance by changing a gu pair into a gg mismatch. if there are neutral networks in sequence space the neutral path will reach a length l close to n before there is no neutral mutant further away from the starting point (n is the maximal hamming distance between sequences of length n). on the other hand, if the neutral sets s(ψ) form isolated clusters we will find l ≪ n.
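the neutral path experiment can be sketched as follows, assuming a folding routine is supplied by the caller. the sketch is simplified in two labeled ways: it only tries single point mutations (the base-pair exchanges of type (ii) are omitted), and it accepts the first neutral mutant it finds instead of choosing one at random. the toy folding map in the example is purely hypothetical and stands in for a real folding algorithm.

```python
def hamming(a, b):
    """Number of positions at which two equally long sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def neutral_path(start, fold, alphabet="ACGU"):
    """Walk away from `start` through sequences with the same fold.

    fold: callable mapping a sequence to a hashable structure (assumed given).
    Every accepted step changes a position that still equals the start
    sequence, so the Hamming distance to `start` grows by one per step.
    Returns the list of sequences visited; the path length is len(path) - 1.
    """
    target = fold(start)
    current, path = start, [start]
    while True:
        step = None
        for pos in range(len(current)):
            if current[pos] != start[pos]:
                continue  # already mutated; changing it cannot add distance
            for base in alphabet:
                if base == start[pos]:
                    continue
                mutant = current[:pos] + base + current[pos + 1:]
                if fold(mutant) == target:
                    step = mutant
                    break
            if step is not None:
                break
        if step is None:
            return path  # no neutral mutant further away from the start
        current = step
        path.append(step)


if __name__ == "__main__":
    # hypothetical toy map: the "structure" is just the first base
    toy_fold = lambda seq: seq[0]
    path = neutral_path("GAAAA", toy_fold)
    print(path, hamming(path[0], path[-1]))
```

with a real folding routine the path length divided by the chain length gives the quantity that is compared with n in the text.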
when interpreting the lengths of neutral paths we have to keep in mind that (i) the search procedure only produced lower bounds on the diameter of neutral networks, and (ii) that a pair of random sequences has an expected distance of 0.75n for a 4-letter alphabet. the data in fig. 9 are therefore a clear indication for the existence of percolating neutral networks in the presence of pseudo-knots. secondary structures form a particular class of contact structures. in this contribution we have considered a natural generalization of this class. indeed, most known rna structures with pseudo-knots are bi-secondary structures (which do not involve nested pseudo-knots). bi-secondary structures correspond to planar graphs while secondary structures form the sub-class of outer-planar graphs. the inconsistency graph introduced in section 2.4 is a useful construction capturing most of the geometrical features of nucleic acid structure. its chromatic number may serve as a measure of structural complexity. it seems possible that an analogous construction will be useful for classifying and comparing protein structures as well. the analysis of graph-theoretical properties of classes of contact structures might also be useful for designing energy models that are more realistic and/or algorithmically easier than pair potentials. the standard folding potential for rna and dna secondary structures, for instance, is based on loops, that is, induced subgraphs of the diagram graph that are circles. the total energy of a secondary structure is defined as the sum of the sequence-dependent energy contributions of all loops [see, e.g., freier et al. (1986) ]. it is by no means obvious how this energy function should be generalized to include non-secondary structure features such as pseudo-knots, g-quartets, or knots, because in general there is no unique decomposition of a graph into loops. 
in order to understand the sequence-structure mapping of a class of biopolymers it is necessary to have bounds on the number of structures that can possibly be formed for a given set of sequences. we can expect the existence of neutral networks and shape space covering only if the number of sequences by far exceeds the number of structures. while the number of possible contact structures grows faster than exponentially with the length of the molecules we find exponential upper bounds when the structural complexity is limited. in particular, there are not more than some 4.7 n possible bi-secondary structures. if we enforce in addition the sterical (looplength at least 3) and thermodynamic (no isolated base pairs) constraints of natural rna sequences, then this bound drops to 3.42 n . exhaustive enumeration indicates that the actual number of bi-secondary structures with biophysical constraints grows roughly as 2.35 n . therefore the number of rna sequences, 4 n , exceeds by far the number of possible bi-secondary structures. we have then devised a simple energy function extending the standard model to incorporate pseudo-knots. our ansatz assumes that steric hindrance is the major contribution to pseudo-knot energies counteracting the stabilizing effect of the additional base pairings. based on this approach we used a kinetic folding procedure to show that the inclusion of pseudo-knots does not significantly change the global features of the sequence structure map of rna: there are many more sequences than structures, and almost all sequences fold into one of a small number of common structures. common structures are uniformly distributed over sequence space. neutral networks in sequence space can therefore be modeled as random graphs (reidys et al., 1997) . this ansatz generalizes from secondary structures to 1-diagrams without modifications. the only input parameter in this model, namely the fraction λ of neutral neighbors, has been determined computationally. 
computer simulations agree with the prediction of random graph theory: the fraction of neutral mutations, λ > 0.5, is well above the threshold value of λ* ≈ 0.306, hence all sequences folding into a given common structure form a single percolating network that spans the entire sequence space. this is verified by the detection of neutral paths that extend through the entire sequence space. the intersection theorem is valid for bi-secondary structures, hence the random graph approach (reidys et al., 1997) can be used to predict the relative locations of the neutral networks of two different common structures. in particular, we have to expect shape space covering, i.e., the neutral networks of any two common structures come very close to each other at least in some parts of the sequence space. this sets the stage for the evolutionary transitions between different structures described in detail in weber (1997) and fontana and schuster (1998). in summary, the mathematical results and the computer simulations presented in this contribution indicate that pseudo-knots do not change the qualitative picture of the rna sequence-structure map as it was obtained from studying secondary structures.

prediction of rna secondary structure, including pseudoknotting, by computer simulation the book thickness of a graph rna multi-structure landscapes. a study based on temperature dependent partition functions mutational analysis of the rna pseudoknot component of a coronavirus ribosomal frameshifting signal structure and evolution of ribonuclease p rna structure and topology of 16s ribosomal rna.
an analysis of the pattern of psoralen crosslinking an rna pseudoknot and an optimal heptameric shift site are required for highly efficient ribosomal frameshifting on a retroviral messenger rna interchain loops in polymers: effects of excluded volume planar permutation graphs statistical thermodynamics of double-stranded polymer molecules a-1 ribosomal frameshifting in a doublestranded rna virus of yeast forms a gag-pol fusion protein a property of 4-chromatic graphs and some remarks on critical graphs queues, stacks, and graphs probing the structure of the escherichia coli 10sa rna (tmrna) statistics of landscapes based on free energies, replication and degradation rate constants of rna secondary structures statistics of rna secondary structures continuity in evolution: on the nature of transitions rna folding and combinatory landscapes similar cage-shaped structures for the rna component of all ribonuclease p and ribonuclease mrp enzymes parallel-stranded duplex dna containing da·du base pairs improved free-energy parameters for predictions of rna duplex stability optimal book embeddings of fft, benes, and barrel shifter networks thermodynamics of folding a pseudoknotted mrna fragment analysis of rna sequence structure maps by exhaustive enumeration. i. neutral networks analysis of rna sequence structure maps by exhaustive enumeration. ii. structures of neutral networks and shape space covering lessons from an evolving rrna: 16s and 23s rrna from a comparative perspective long-range structure in ribonuclease p rna rna secondary structures with pseudoknots. master's thesis, inst. f. 
theoretische chemie comparing queues and stacks as mechanisms for laying out graphs fast folding and comparison of rna secondary structures combinatorics of rna secondary structures energy directed folding of rna sequences the petersen graph proportions of irreducible diagrams exploring phenotype space through neutral evolution smoothness within ruggedness: the role of neutrality in adaptation intramolecular reaction in polycondensations proportions of irreducible diagrams a comparison of thermodynamic foldings with comparatively derived structures of 16s and 16s-like rrnas sur le problème des courbes gauches en topologie minimal cycle bases of outerplanar graphs domain structure of the ribozyme from eubacterial ribonuclease p the colin de verdière number of linklessly embeddable graphs graphs with e edges have pagenumber o( √ e) transfer rna-like structures: structure, function and evolutionary significance an rna folding rule modelling of the three-dimmensional architecture of group i catalytic introns based on comparative sequence anaysis embedding de bruijn graphs and shuffle-exchange graphs in five pages spaces of rna secondary structures a new principle of rna folding based on pseudoknotting random induced subgraphs of generalized n-cubes bio-molecular shapes and algebraic structures generic properties of combinatory maps: neural networks of rna secondary structures petersen family minors sachs' linkless embedding conjecture linear trees and rna secondary structure how to search for rna structures. theoretical concepts in evolutionary biotechnology from sequences to shapes and back: a case study in rna secondary structures comparing multiple rna secondary structures using tree comparisons delaunay tessellation of proteins: four body nearest neighbor propensities of amino acid residues modeling rna secondary structures i. mathematical structural model of predicting rna secondary structures on a class of linked diagrams, i. 
enumeration on a class of linked diagrams on some new sequences generalizing the catalan and motzkin numbers characterizations of outerplanar graphs algorithm independent properties of rna structure prediction an unusual mrna pseudoknot structure is recognized by a protein translation repressor evidence for allosteric coupling between the ribosome and repressor binding sites of a translationally regulated mrna identification and analysis od the pseudoknot-containing gag-pro ribosomal frameshift signal of simian retrovirus-1 rna pseudoknots and translational frameshifting on retroviral, coronaviral and luteoviral rnas sur une problème de configurations et sur les fractions continues ribosomal frameshifting requires a pseudoknot in the saccharomyces cerevisiae double-stranded rna virus sur un novel invariant des graphes et un critère de planarité cleavage of trna with imidazole and spermine imidazole constructs: a new approach for probing rna structures co-axial stacking of helixes enhances binding of oligoribonucleotides and improves predicions of rna folding secondary structure of single-stranded nucleic acids introduction to computational biology: maps, sequences, and genomes combinatorics of rna hairpins and cloverleaves rna secondary structure: a complete mathematical analysis dynamics on neutral evolution rna pseudoknots. current opinion struct evidence that a downstream pseudoknot is required for translational readthrough of the moloney murine leukemia virus gag stop codon rna pseudoknots stability and loop size requirements embedding planar graphs in four pages human behaviour and the principle of least effort rna secondary structures and their prediction stimulating discussions with ivo l. hofacker and christoph flamm are gratefully acknowledged. 
key: cord-353290-1wi1dhv6 authors: kustin, talia; stern, adi title: biased mutation and selection in rna viruses date: 2020-09-28 journal: mol biol evol doi: 10.1093/molbev/msaa247 sha: doc_id: 353290 cord_uid: 1wi1dhv6 rna viruses are responsible for some of the worst pandemics known to mankind, including outbreaks of influenza, ebola, and the recent covid-19. one major challenge in tackling rna viruses is the fact they are extremely genetically diverse. nevertheless, they share common features that include their dependence on host cells for replication, and high mutation rates. we set out to search for shared evolutionary characteristics that may aid in gaining a broader understanding of rna virus evolution, and constructed a phylogeny-based dataset spanning thousands of sequences from diverse single-stranded rna viruses of animals. strikingly, we found that the vast majority of these viruses have a skewed nucleotide composition, manifested as adenine rich (a-rich) coding sequences. in order to test whether a-richness is driven by selection or by biased mutation processes, we harnessed the effects of incomplete purifying selection at the tips of virus phylogenies. our results revealed consistent mutational biases towards u rather than a in genomes of all viruses. in +ssrna viruses we found that this bias is compensated by selection against u and selection for a, which leads to a-rich genomes. in -ssrna viruses the genomic mutational bias towards u on the negative strand manifests as a-rich coding sequences, on the positive strand. we investigated possible reasons for the advantage of a-rich sequences including weakened rna secondary structures, codon usage bias, and selection for a particular amino-acid composition, and conclude that host immune pressures may have led to similar biases in coding sequence composition across very divergent rna viruses. 
genomes of all replicating entities, including viruses and cellular hosts, have been shaped by millions of years of evolution. the rapid progress of genomics in the past few decades has brought about enormous amounts of genomic information, and today there are thousands of genomes of viruses available, which allow studying the processes that govern the evolution of these genomes (belshaw et al. 2008; duffy et al. 2008; pybus and rambaut 2009) . rna viruses are an extremely diverse collection of entities, spanning a diverse range of hosts, morphologies, genome organizations, and genetic composition. nevertheless, rna viruses do share several common features that drive their evolution: (a) their ultimate dependence on the cell, (b) their high mutation rates, (c) strong purifying selection derived from constraints operating on a small and densely coding genome, and (d) sporadic but powerful positive selection driven by an evolutionary arms race with the host they infect. we hence reasoned that we may find common genomic signatures shared by rna viruses, which in turn may allow us to learn more about the drivers of virus evolution. one example of a process that may affect viral genomes is host editing by cellular enzymes. two notable examples are adar, which promotes a>g mutations, and apobec3 (a3), which promotes c>u mutations in single-stranded dna, manifested as g>a mutations on the coding rna strand of hiv (bishop et al. 2004; samuel 2012) . in principle, a3 promotes hyper-mutated viral genomes, which are unlikely to be capable of replicating, and hence undergo purifying selection. however, there has been extensive debate whether a3 may sometimes operate in a suboptimal manner, leading to genomes that are "viable", i.e., replication competent (jern et al. 2009; sadler et al. 2010; cuevas et al. 2015; delviks-frankenberry et al. 2016 ). 
if indeed a3 or adar enzymes lead to viable replicating genomes, we would expect to see footprints of their activity in contemporary virus genomes. another notable example of a shared common signature across rna viruses is the depletion of cg and ua dinucleotides across almost all known rna viruses (greenbaum et al. 2009; cheng et al. 2013; tulloch et al. 2014). this under-representation is shared by viral hosts as well; ta dinucleotides (ua in rna) are under-represented in most organisms, likely due to rna-degrading enzymes located in the cytoplasm, and cg dinucleotides are under-represented in plants and vertebrates, likely due to deamination processes (burge et al. 1992). it appears that cells have evolved mechanisms to detect foreign genetic material bearing high levels of cg dinucleotides: recently it has been shown that the cellular enzyme zap restricts hiv genomes bearing rna with multiple cgs (takata et al. 2017), and we and others have shown strong selection against introduction of cg in hiv and rna viruses in general (burns et al. 2009; atkinson et al. 2014; stern et al. 2017; theys et al. 2018; caudill et al. 2020). we thus set out to test if there are additional shared genomic and evolutionary features in rna viruses and compiled a large dataset of sequences from pathogenic single strand rna viruses from baltimore classes iv and v (+ssrna and -ssrna viruses, respectively). we focus on these classes in order to (a) avoid the confounding effects of double stranded viruses such as the stability of double stranded dna/rna, and (b) avoid reverse-transcribing viruses, whose replication cycle is unique compared to other rna viruses, and includes a dna stage. one of the key challenges in our study was to disentangle the roles of mutation and selection. any extant sequence is a product of evolution from an ancestral sequence, and this process includes the action of both mutation and natural selection, occurring repetitively. 
indeed, in the examples above we see that either increased introductions of mutations (via a3 enzymes or adar) or selection (mediated by zap restriction of cg rich sequences) may both lead to unique genomic signatures. to disentangle the effects of mutation versus selection, we harness the notion of incomplete purifying selection operating on viral genomes, whereby selection is relaxed at the tips of phylogeny (fitch et al. 1997; pybus et al. 2007; strelkowa and lassig 2012; gire et al. 2014) . by contrasting between rates of substitutions at internal versus external branches of phylogenies we were able to test for the presence of mutational biases (i.e., mutations that are biased towards specific nucleotides) or for selection for specific types of mutations. overall, our results suggest a consistent selective advantage for the abundance of the a nucleotide across almost all vertebrate rna viral genomes. we first generated an extensive dataset of ~65,900 full coding sequences from pathogenic single strand rna viruses, which we name phyvirus (fig. 1, table s1 ). hosts include a wide array of animals, spanning from arachnids, to birds, to fish, to mammals. as expected, the dataset contained a disproportionate number of human viruses; yet reassuringly, hundreds to thousands of sequences were available from other phylogenetic clades. we implemented an automated process to generate multiple sequence alignments and their corresponding phylogenetic trees (materials and methods). we focused only on alignments of coding sequences (rather than longer genomic alignments) so as to mitigate as much as possible the effects of recombination. we further iteratively ensured that phylogenies are limited to sequences with a high degree of homology, by focusing only on phylogenies where any given branch length is smaller than 0.5 substitutions/site (materials and methods). finally we also ensured that phylogenies were not dramatically affected by mutational saturation (fig. s1 ). 
we have made the phyvirus dataset available online at https://www.sternadi.com/phyvirus, where all alignments, phylogenies, as well as metadata files are accessible to the wide public. we first calculated the fraction of a, c, g, and u in the coding sequences of all viral families. to our surprise we found an over-representation of a across literally all viral families, accompanied by a strong diminution of c ( fig. 2a) . the fraction of a ranged from ~28% to ~40% in most sequences, reaching a high of 49% in vpg sequences of rhinovirus. the exceptions were the positive single strand rna (+ssrna) families togaviruses, where more c was observed, and caliciviruses, which were relatively homogenous in nucleotide content. as well, some abundance of u was noted in coronaviruses and some -ssrna families and of g in flaviviruses. we next examined the nucleotide composition after breaking down by host classification, to test whether composition was dependent on the host. in general, we did not notice nucleotide composition dependence on the host, with some minor exceptions, mainly in the picornaviridae family (fig. 2b ). finally, when analyzing coding sequences of double strand dna and rna viruses, rna bacteriophages, or coding sequences of hosts, we did not find any consistent preference for a ( fig. s2a -c), and we note that mixed evidence exists regarding a-richness in retroviruses (van hemert and berkhout 1995). we next went on to examine the nucleotide composition across the three codon positions. this analysis revealed an interesting and consistent pattern: first codon positions were found to be enriched for a and g, second codon positions were found to be enriched for a and u, whereas the third codon positions were enriched for a and u in -ssrna viruses, and for u only in +ssrna viruses (fig. 2d ). once again this was in stark contrast to high gc content at third codon positions of host coding sequences (kudla et al. 2006) . 
the pattern at the codon level also led to a particular pattern of amino-acid frequencies, with some differences between the host and viral amino acid frequencies observed (fig. s3). we went on to test if a similarly consistent pattern is found in non-coding regions. in general, rna viruses are largely devoid of non-coding sequences, and the non-coding sequences they do contain are quite short. our analysis revealed a base composition that was quite different from that in the coding regions, with no consistent enrichment for a or any other nucleotide (fig. s2d). it seemed that a different set of "rules" applies to the non-coding regions, likely driven by the regulatory roles of non-coding rna in rna viruses. most often these sequences are under strong purifying selection to maintain particular rna structures (robertson 1979; desselberger et al. 1980; le et al. 1992; thurner et al. 2004), and this most likely leads to a base composition that is specific to every virus and its non-coding region. finally, we examined whether we find longer patterns of biased composition, and focused on the frequencies of dinucleotides. as has been noted previously (greenbaum et al. 2009; cheng et al. 2013; tulloch et al. 2014), we observed a strong and consistent depletion of cg and to a lesser extent a depletion of ua dinucleotides across all viruses except for togaviruses (fig. 2c). conversely, we saw an enrichment for ca and ug. this may be explained by adar editing (ua>ug or reverse complement of ua>ca), although other adar editing products (ag and cu) were observed less frequently. alternatively, ca and ug may compensate for the lack of cg and ua, since they are both one transition mutation away (the most frequent mutation that occurs naturally in viruses) from either cg or ua (but see di giallonardo et al. 2017). returning to our observations of a-richness, we observed no enrichment for longer patterns that include a, suggesting that a in itself is the unique factor in the virus coding sequences. 
moreover, a phylogenetic analysis of substitution patterns revealed that all three types of to-a substitution (c>a, g>a, and u>a) are high (fig. s4 ). the hallmark of incomplete purifying selection is higher dn/ds at external nodes (tips) as compared to internal nodes (zhang et al. 2005 ). we thus first tested for the presence of incomplete purifying selection across all our datasets, by contrasting the rate of nonsynonymous to synonymous (dn/ds) substitutions between the internal branches and the external branches (materials and methods). notably, external branch lengths may dramatically differ, and depend heavily on density of sampling. we thus tested various branch length cutoffs to define inclusion or exclusion of a branch as internal or external. we found a higher dn/ds ratio at the tips in 64% of the alignments in our dataset. however only 42% of the datasets displayed significant support using a likelihood ratio test (p< 0.05, after false discovery correction (benjamini and hochberg 1995) ) for the two-rate model that allows for different dn/ds ratios at different branches (here, internal versus external branches). in general, datasets that did not display significant support often contained fewer or less divergent sequences, or had longer branch lengths (less dense sampling). we conclude that incomplete purifying selection is probably pervasive, but significant support requires more data and denser sampling. importantly, the alignments that did pass significance testing were a faithful taxonomical representation of our full dataset (fig. s5 ) and were not significantly different in terms of estimated age (fig. s1 ). we continued our analysis only with the alignments where we observed significant support for incomplete purifying selection. we next set out to contrast between the rate of to-a substitution at the internal branches and the external branches. modeling of directional selection for a resulted in incorrect inference ( fig. 
s6), and to overcome this we performed mutational mapping along the phylogeny (materials and methods). we first inferred the extent to which mutation rates are biased in our datasets by focusing only on substitutions at the external branches, where selection is relaxed. we found that +ssrna viruses display a strong mutational bias towards u, in strong contrast to the overall a-rich genome they present, but in line with their content at third codon positions (fig. 3c). -ssrna viruses maintain a mutational bias towards a in their coding sequences (with the exception of hantaviruses), which is effectively a bias towards u on their genomic strand. interestingly, our data contains two families with an ambisense coding strategy, arenaviridae and phenuiviridae, in which proteins are encoded from both the positive and negative strands. by default, these families are characterized as -ssrna viruses, since only one strand (the negative one) is packed. we tested whether the differences in mutational biases between -ssrna and +ssrna viruses hold when separating the coding sequences of the ambisense viruses based on the strand they reside on. we observed that regardless of the strand, all coding sequences of ambisense viruses are a-rich (fig. s8a). however, the mutational bias was different based on the strand of the coding sequence: coding sequences on positive strands displayed a mutational bias towards u, whereas coding sequences on negative strands displayed a mutational bias towards a (fig. s8b). 
fig. 3 (caption fragment): rates to each nucleotide are normalized so as to sum up to one (materials and methods). see fig. s7 for comparison between internal and external biases. (d) families that displayed a significant association between branch location and type of substitution based on a mantel-haenszel test (materials and methods). (e) violin plots of inferred odds ratios of to-x/to-y (where x stands for a given nucleotide and y stands for any other nucleotide apart from x) for each nucleotide at internal versus external branches across all viral families. panels (c) to (e) show only virus phylogenies where incomplete purifying selection was observed to be significantly supported. 
we were intrigued by the finding of the mutational bias towards u and sought to test if this phenomenon presents itself in the recent covid-19 epidemic caused by the sars-cov-2 virus, a +ssrna virus from the coronaviridae family. uniquely, sars-cov-2 evolution should reflect short term evolution since the virus has been spreading for merely a few months, and hence observed diversity reflects for the most part mutational biases rather than selection. extensive sequencing of the virus around the globe allowed us to analyze mutational patterns that show a strong abundance of substitutions towards u (fig. 4) (see also simmonds 2020), further supported by within-host diversity analyses (fig. s9). we note that sequencing errors (and in particular deamination/oxidation) may lead to an increase in c>u and g>u; however, this should rarely affect the consensus sequence of a virus, which is typically based on dozens to hundreds of sequencing reads (see also fig. s9; materials and methods). all in all, the sars-cov-2 sequences support our observation of to-u mutational bias in viral genomes. we discuss the finding of mutations towards u in rna viruses more in depth below. we next created contingency tables of inferred to-x/to-y (where x stands for a given nucleotide and y stands for any other nucleotide apart from x) substitutions at internal branches/external branches, allowing us to test for an association between branch location and direction of substitution (fig. 3b). our results showed a highly consistent pattern across almost all +ssrna viruses, supporting selection for a and selection against u (fig. 3d-e). 
in -ssrna viruses the pattern was mixed, and we did not see any consistent signs of selection for or against any nucleotide. to conclude, our analysis shows (a) a consistent mutational bias towards u in genomes of all viruses, which leads to a mutational bias towards a in coding strands of -ssrna viruses, and (b) selection for a in most +ssrna viruses, which presumably compensates for the bias towards u caused by the mutational process. we put forth three possible explanations why a may be selected for in viruses. first, since a is a weak rna binder, selection for a may lead to less rna secondary structure and thereby help viruses avoid host defense mechanisms. second, it is possible that codon usage bias and translational optimization have led to the particular sequence composition observed herein. accordingly, if codons with more a are associated with more abundant trnas, viral genes should be translated more efficiently. varying codon usage has been reported for many different viruses, yet the underlying reasons for this variation remain obscure (jenkins and holmes 2003; gu et al. 2004; kryazhimskiy et al. 2008; wong et al. 2010; belalov and lukashev 2013; cardinale et al. 2013; tian et al. 2018; chen et al. 2020). the breadth of the phyvirus dataset allows probing codon bias in depth, and we calculated the relative synonymous codon usage (rscu), the effective number of codons (enc) (fig. s10a) and the trna adaptation index (tai) (fig. s10b) (materials and methods), focusing on human viruses for the latter. our correspondence analysis (ca) showed that rscu differences between viral sequences are not attributed to viral host classification, type of protein, or viral family, as reflected in the silhouette values (fig. 5a-c). silhouette scores below zero reflect poor clustering, where clusters are embedded within each other, whereas values around zero reflect an almost complete overlap of clusters, suggesting that the clustering variables do not explain differences in rscu values. 
when probing which factors are responsible for the first and second components of the ca (23% and 10% of the variability in the data, respectively), we observed a strong correlation of the first axis with the synonymous nucleotide content of the third codon position, $GC_{3s}$, but very weak to no correlations of both axes with tai (fig. 5d), and of tai with enc (fig. 5f). if codon usage had been driven by selection for enhanced translation of proteins, we would have expected one or more of the following: higher correlation with tai, low enc (fig. s10a), unique rscu profile of genes known to be highly expressed in viruses, such as capsid products (fig. 5b), or unique rscu profile based on host/type of virus (fig. 5a,c). we do not observe any of these phenomena, whereas the correlation with $GC_{3s}$ suggests that other forces drive the codon composition. we conclude that translational optimization is not likely the driver of the sequence composition observed herein, and that the particular codon composition of a viral sequence is likely a by-product of other factors. we continue to a third possible explanation for selection towards a: selection for amino-acids encoded by a-rich codons. while there are many possible reasons leading to selection for specific amino acids, we speculated that the major histocompatibility complex (mhc) class i system may play a role in selection for specific peptides, as it has been shown to drive the evolution of many vertebrate viruses (kuntzen et al. 2007; foll et al. 2014; carlson et al. 2015; kløverpris et al. 2015). to test the effect of mhc on composition of viral genomes, we predicted which peptides derived from virus genomes would be weakly or strongly detected by the mhc system. remarkably, peptides preferentially displayed by mhc systems were found to be encoded by a/g-poor and c/u-rich sequences (fig. 6). 
in other words, there should be a selective advantage for a/g-rich and c/u-poor sequences that would allow escape from the mhc system. while clearly this subject merits further in-depth investigation, selection due to the mhc class i system would explain our results, in particular the selection against u and for a. mhc epitope prediction was run on all translated phyvirus coding sequences across a variety of mhc alleles, including human, chimp, gorilla, rhesus macaque, bovine, porcine and mouse alleles (table s2). peptides were classified into peptides that would be strongly detected by the mhc system ('strong'), weakly detected ('weak'), or not at all ('none') (see also fig. s11). we have found that the vast majority of rna viruses examined herein have skewed nucleotide composition in their coding sequences, with most viruses bearing a-rich and c-poor sequences. this pattern appears to be quite consistent across hosts ranging from fish, to insects, to mammals, with the caveat that the largest number of sequences in our data was from mammalian viruses (including human viruses). the a-rich pattern disappeared when analyzing viruses of bacteria, viral non-coding regions, or coding sequences of hosts (fig. s2b-d). one of the original goals of this analysis was to test whether we observe the presence of signatures of rna editing by a3 enzymes or adar enzymes, or restriction by cellular enzymes such as zap in long-term evolution of viruses. interestingly, we did not find any such evidence for a3, at least not in a widespread manner, and found partial evidence for possible adar editing. although we found a decreased cg frequency in all viral families excluding togaviridae, we cannot prove that zap is the underlying reason for this phenomenon. in any case, it is not likely that restriction by zap would explain the a-richness observed in all single strand viruses. 
two non-mutually exclusive hypotheses may be put forth to explain the consistent pattern of a-richness that we observe: there is selection for more a in viral sequences, and/or there is a mutational bias that leads to more a in genomes of viruses. in order to tease apart the roles of selection and mutation, we used the notion of incomplete purifying selection, which allows us to separate between recent and non-recent evolution. our results revealed that both mutational biases and selection operate in viral genomes. in both +ssrna and -ssrna, we observed a mutational bias towards u on the genomic strand of these viruses, which is counteracted by selection against u and towards a in the +ssrna viruses. we begin by discussing the mutational biases we observed. in the absence of selection, which is what we measure at external branches of trees, any biased introduction of one nucleotide should lead to its pair being introduced at an equal proportion. for example, if we denote $\mu_A$ as the probability of erroneously incorporating an a, when this a is reverse-complemented this will lead to $\mu_A = \mu_U$ (fig. s12). while for some -ssrna viruses we see a similar mutational bias towards both a and u, this equality in mutational biases does not hold for any of the +ssrna viruses (fig. 3c). even more intriguingly, we see a mutational bias that differs between negative and positive strands of the ambisense viruses (fig. s8b). yet if we consider the genomic strand only, this bias collapses to a mutational bias that introduces more u on the genomic strand. this suggests that the mutational process is biased towards one of the strands and acts as a non-symmetrical process (fig. s12). one possible explanation for this directional mutation bias may be genomic damage in the form of spontaneous deamination. interestingly, this is consistent with a model suggesting that dna damage is a major source of replication errors in humans (gao et al. 2019). 
for single strand rna viruses, it is possible that the genomic damage will affect packaged genomes more than strands replicated within the cells, leading to this non-symmetrical bias. another explanation for this mutational bias is host mediated enzymatic deamination of virus genomes that are packed as virions. we note that the genomic mutational bias towards u we find in viruses falls in line with previously published work that shows that mutation is universally biased towards at in several species, including human and bacteria (hershberg and petrov 2010; lynch 2010) . this suggests that there may be a commonality in the unknown mechanism that creates this bias. furthermore, in double strand genomes it is nearly impossible to determine the specific mutation that causes the bias observed (to-a, to-t or both) due to the complementary nature of the genome. if the mechanism generating biases is the same in viruses and in cells, based on our analysis we speculate that the bias is towards u mutations rather than towards a. finally, we turn to examine the underlying reasons for selection towards a. we have proposed three explanations: first, a is a weak rna binder thus selection to a will promote less rna secondary structures and will aid viruses in avoiding the host defense mechanisms. second, translational selection promotes specific codons and causes bias in the nucleotide content and third, there may be selection for amino-acids encoded by codons with a. at this stage, our analyses suggest that avoidance of secondary structure or translational selection are most probably not the sole underlying cause for selection towards a, and we show tentative evidence suggesting that the mhc class i system may drive selection for codons with elevated a. 
in line with the above, we noted throughout our analysis that flaviviruses were outliers; their sequences were a and g rich, and they are the only +ssrna viral family where we did not infer selection for a and selection against u (fig. 3d-e). when probing the sequences in this family, we noted that the majority of sequences were of vector-borne viruses (e.g., dengue virus, zika virus, and west nile virus). in general, vector-borne viruses were rare in other virus families, suggesting that our observations regarding selection for a do not hold for vector-borne viruses. carefully tying this together with our hypotheses in the previous section, we note that insects lack both an mhc system and an interferon response (flajnik and kasahara 2001; secombes and zou 2017) which augments the immune response to dsrna, and this might be the underlying reason for the lack of selection against u and towards a observed in flaviviruses. alternatively, other characteristics of the life cycle and replication of flaviviruses may be responsible for the absence of selection we observed in these viruses. to conclude, we have found similar patterns of coding sequence composition across a wide variety of rna viruses. we found that both mutation towards u and selection for a drive these patterns. in general, we show here that probing viral sequences and phylogenies allows a better understanding of mechanisms that shape the evolution of viruses, and in particular, allows insights into possible footprints of host activity, potentially illuminating the interaction between hosts and viruses. sequences for the phyvirus dataset were primarily obtained from the niaid virus pathogen database and analysis resource (vipr) (pickett et al. 2012), and were augmented by sequences of influenza from the niaid influenza research database (ird) (zhang et al. 2017). the sequences were retrieved as single gene/protein (as opposed to genome segments) during january 2019. 
host information was retrieved from vipr and ird. notably, around 9,700 sequences lacked host assignment. we manually sampled a few dozen sequences and checked their host assignment in the associated publication. most were human viruses but other hosts were present as well. we note that this does not affect any of the analyses in this study, which were almost always agnostic with respect to host. our data contains multiple non-duplicated sequences from the same viral species in order to build as comprehensive an evolutionary and phylogenetic history as possible. these features of the dataset are summarized in table s1. we would like to acknowledge the viralzone resource (https://viralzone.expasy.org/) (hulo et al. 2011) for providing comprehensive and accessible information about viral families and genomes. an in-house computational pipeline was used for clustering the phyvirus dataset into multiple sequence alignments and associated phylogenetic trees. we first used megablast (morgulis et al. 2008) to create clusters of homologous sequences, by using each sequence as a query against all sequences of the phyvirus dataset as a database, using an e-value of 10 %&! . we then aligned sequences using prank (löytynoja 2014) with default settings, and reconstructed phylogenies using the maximum-likelihood based program phyml (guindon and gascuel 2003), with default settings. we next sought to ensure that phylogenies do not contain sequences that are too remote from each other. to this end we implemented an iterative scheme where we "cut" phylogenies into two or more at branches whose length was larger than 0.5. the 0.5 cutoff was chosen based on manual curation and inspection, and allowed us to avoid grouping together very remotely related sequences. the phylogenies were then rooted using midpoint rooting for analyses that required a rooted tree. finally, clusters with fewer than ten sequences were omitted from the analysis. 
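The "cutting" step described above can be illustrated with a minimal sketch. It is not the authors' in-house pipeline: the function name `cut_tree`, the edge-list representation of the tree, and the parameter names are assumptions made for illustration. The sketch covers only the splitting of a single tree at branches longer than the cutoff and the size filter; it does not re-align sequences or rebuild phylogenies.

```python
from collections import defaultdict


def cut_tree(edges, leaves, max_len=0.5, min_size=10):
    """Split a tree into leaf clusters by removing long branches.

    edges    : iterable of (node_a, node_b, branch_length) tuples (hypothetical format)
    leaves   : set of node names that are sequences (tips)
    max_len  : branches longer than this are cut (0.5 substitutions/site in the text)
    min_size : clusters with fewer leaves than this are discarded (ten in the text)
    """
    adjacency = defaultdict(list)
    nodes = set()
    for a, b, length in edges:
        nodes.update((a, b))
        if length <= max_len:  # keep only branches that are not "too long"
            adjacency[a].append(b)
            adjacency[b].append(a)

    seen = set()
    clusters = []
    for start in sorted(nodes):
        if start in seen:
            continue
        # Depth-first walk over the connected component that contains `start`.
        stack, component = [start], []
        seen.add(start)
        while stack:
            node = stack.pop()
            component.append(node)
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        cluster = sorted(n for n in component if n in leaves)
        if len(cluster) >= min_size:
            clusters.append(cluster)
    return clusters


# Toy example: the branch root-y (0.9) exceeds the cutoff and separates {a, b} from {c, d}.
toy_edges = [
    ("root", "x", 0.1), ("x", "a", 0.05), ("x", "b", 0.02),
    ("root", "y", 0.9), ("y", "c", 0.01), ("y", "d", 0.03),
]
toy_clusters = cut_tree(toy_edges, {"a", "b", "c", "d"}, max_len=0.5, min_size=2)
```

Treating the tree as an undirected graph keeps the sketch independent of rooting, which matches the order in the text, where midpoint rooting is applied only after the cut.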
this pipeline resulted in 465 alignments from thirteen viral families that contain overall 65,951 sequences (fig. 1). for analyses that required codon-based alignments, we performed codon alignment using prank. we manually obtained non-coding sequences for a select number of virus families: picornavirus alignments were obtained from http://www.virology.wisc.edu/acp/aligns/seq_align.html (palmenberg and sgro 2002; palmenberg et al. 2009); full-genome dengue and ebolavirus sequence identifiers were downloaded from vipr (pickett et al. 2012), allowing us to obtain complete records and features from ncbi. 
dsdna, dsrna and rna phages sequences retrieval and processing 
dsdna and dsrna sequences were obtained from vipr. bacteriophage sequences were downloaded from ncbi (https://www.ncbi.nlm.nih.gov/) using the cystoviridae and leviviridae taxonomy codes. coding sequences were extracted using genomic coordinates from ncbi. codon usage data was downloaded from the codon usage database https://www.kazusa.or.jp/codon/ (nakamura et al. 2000) and filtered for the host classes in our data, focusing on hosts that have more than 50 coding sequences in the codon usage database. overall, we analyzed 41 mammalian species, including mouse, cow, monkey and human. using the codon usage table, we counted the number of nucleotides and amino acids in coding sequences and their frequencies. nucleotide frequencies were calculated by viral family, by host and by codon position. we first averaged over each alignment, and then averaged by viral family and codon position. this was done to avoid biasing the calculation when a very large number of sequences were available for a particular gene. dinucleotide odds ratios $\rho_{xy}$ ($x, y \in \{A, C, G, U\}$) were calculated as described previously (cheng et al. 2013): $\rho_{xy} = f_{xy} / (f_x f_y)$, where $f_x$ and $f_y$ denote the nucleotide frequencies and $f_{xy}$ denotes the frequency of the dinucleotide in the sequence. 
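As an illustration of the dinucleotide odds ratio (the frequency of a dinucleotide divided by the product of the frequencies of its two nucleotides), the following is a minimal sketch for a single sequence. The function name and the handling of ambiguous characters (they are skipped) are assumptions, not details taken from the source; averaging over alignments and families is not shown.

```python
from collections import Counter
from itertools import product

BASES = "ACGU"


def dinucleotide_odds_ratios(seq):
    """Return {dinucleotide: f_xy / (f_x * f_y)} for one RNA or DNA sequence."""
    seq = seq.upper().replace("T", "U")  # treat DNA input as RNA
    mono = Counter(base for base in seq if base in BASES)
    pairs = Counter(
        seq[i:i + 2]
        for i in range(len(seq) - 1)
        if seq[i] in BASES and seq[i + 1] in BASES  # assumption: skip ambiguous positions
    )
    n_mono, n_pairs = sum(mono.values()), sum(pairs.values())
    if n_mono == 0 or n_pairs == 0:
        raise ValueError("sequence contains no usable dinucleotides")

    ratios = {}
    for x, y in product(BASES, repeat=2):
        f_x, f_y = mono[x] / n_mono, mono[y] / n_mono
        f_xy = pairs[x + y] / n_pairs
        # The ratio is undefined when either nucleotide is absent from the sequence.
        ratios[x + y] = f_xy / (f_x * f_y) if f_x > 0 and f_y > 0 else float("nan")
    return ratios


example_ratios = dinucleotide_odds_ratios("AUAUAUAU")
```

Values below one indicate under-representation relative to what the single-nucleotide frequencies would predict, which is the sense in which cg and ua are described as depleted.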
the baseml program from the paml package (yang 2007) was used to run the unrestricted non-reversible (unr) and general time reversible (gtr) models in order to infer the frequencies of substitution between all pairs of nucleotides. since the unr model requires a rooted tree, we implemented mid-point rooting on all of the phylogenies. about half of the datasets displayed slight but significant support for the unr over the gtr model, yet results of unr and gtr were very consistent with respect to the frequencies of substitution (data not shown). we applied a mutation-selection model of directional selection that we have previously developed (stern et al. 2017) to test for selection for a specific nucleotide. briefly, the model allows an increase in the substitution rate at a proportion ($p$) of sites by rescaling rows and columns of the substitution matrix going to and from the selected nucleotide. the model further accounts for incomplete purifying selection at the external branches of the tree by rescaling the among-site variation distribution (see stern et al. 2017 for more details). notably, the original directional model was run on a phylogeny where the root sequence was known. the model was here modified, and assumed a stationary distribution at the root. as commonly practiced, this distribution was inferred based on the distribution of nucleotides at the leaves. moreover, the original model was agnostic regarding which nucleotide is under selection, and hence assumed that all four nucleotides may be under selection with a probability of $p/4$. we here changed the model to allow for selection for only one selected nucleotide (a, c, g, or u) with a probability of $p$. the null model ($p = 0$) allowed the use of a likelihood ratio test to assess for significant selection towards a specific nucleotide. 
results of this analysis revealed supposedly pervasive selection towards all four nucleotides, and we concluded that this is an erroneous inference due to a problematic assumption of the model, namely that the same set of substitution rate parameters applies across all sites (fig. s6). in order to assess for incomplete purifying selection we used the branch-site model of the codeml program from the paml package (yang et al. 2000). as described previously (pybus et al. 2007), we compared between a 1-ratio model, which assumes one $\omega$ (dn/ds) value for all branches of the phylogeny, and a 2-ratio model that assumes one value $\omega_{ext}$ for the external branches and another value $\omega_{int}$ for internal branches. the underlying assumption is that relaxed selection at external branches will lead to an increase in $\omega$ at external branches, i.e., $\omega_{ext} > \omega_{int}$. since tip lengths vary across the datasets and some external branches are very long (suggesting purifying selection may exert its effect on them), we used different branch length cutoffs to determine which external branches are treated as external branches and which external branches are categorized as internal branches when attempting to detect incomplete purifying selection. the cutoffs that we used were 0.05, 0.04, 0.03, 0.02, 0.01, 0.005, 0.001, and no cutoff (i.e., the classic definition of external branch: a branch that had no progeny branches). to determine which datasets display significant support for incomplete purifying selection we first performed a likelihood ratio test between the 1-ratio null model and the 2-ratio alternative model. all p-values were corrected for multiple testing using false discovery rate (fdr) (benjamini and hochberg 1995). only datasets showing significant support under one of the cutoffs described above, and where $\omega_{int} < \omega_{ext}$, were considered as showing evidence of incomplete purifying selection. 
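The decision rule above (likelihood ratio test between the 1-ratio and 2-ratio models, Benjamini-Hochberg correction, and the requirement that dn/ds be higher at external branches) can be sketched as follows. This is an illustrative sketch, not the authors' code: the function names and the input format are hypothetical, the log-likelihoods and dn/ds estimates are assumed to have been obtained beforehand (for example from codeml output), and the test assumes one degree of freedom because the 2-ratio model adds a single parameter.

```python
import math


def lrt_pvalue(lnl_null, lnl_alt):
    """P-value of a likelihood ratio test with one degree of freedom.

    For a chi-square variable with 1 d.f., P(X > x) = erfc(sqrt(x / 2)).
    """
    statistic = max(0.0, 2.0 * (lnl_alt - lnl_null))
    return math.erfc(math.sqrt(statistic / 2.0))


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def incomplete_purifying_selection(fits, alpha=0.05):
    """Flag datasets with significant support and omega_int < omega_ext.

    fits: list of (lnl_one_ratio, lnl_two_ratio, omega_int, omega_ext) tuples
          (hypothetical format, one tuple per dataset).
    """
    raw = [lrt_pvalue(lnl0, lnl1) for lnl0, lnl1, _, _ in fits]
    adjusted = benjamini_hochberg(raw)
    return [
        q < alpha and omega_int < omega_ext
        for q, (_, _, omega_int, omega_ext) in zip(adjusted, fits)
    ]


bh_example = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
flags = incomplete_purifying_selection([
    (-1000.0, -990.0, 0.05, 0.20),   # strong support, higher dn/ds at tips
    (-1000.0, -999.9, 0.05, 0.06),   # negligible improvement in likelihood
    (-1000.0, -990.0, 0.20, 0.05),   # significant, but dn/ds is lower at tips
])
```

The sketch handles a single branch-length cutoff; in the text the test is repeated over several cutoffs, and a dataset is retained if it shows significant support under any one of them.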
ancestral sequence reconstruction was performed using the fastml program under a jukes and cantor model for nucleotides and applying joint reconstruction of characters across the phylogenies (ashkenazy et al. 2012). this enabled us to map the different substitutions across each phylogenetic tree. we then classified substitutions as external or internal, based on the maximal cutoff value that allowed for detection of incomplete purifying selection. the mutational bias of each nucleotide was calculated based on external substitutions only, by calculating the fraction of substitutions towards x (x ∈ {a, c, g, u}) divided by the fraction of ancestral nucleotides that are not x. for convenience, the biases were then normalized so they sum up to one. to determine if there is selection for or against a specific nucleotide we constructed a 2x2 contingency table for each viral family, with the type of mutation (e.g., to-a and to-c/g/u) at the columns, and type of branch (external/internal) at the rows. cells then included the number of substitutions for each intersection of categories. we then used the mantel-haenszel (mh) test to test for an association between branch type and substitution type. we performed the mh test for each viral family and for each of the four nucleotides, and corrected for multiple testing using fdr (benjamini and hochberg 1995). for the between-host analysis of sars-cov-2, a multiple sequence alignment containing 53,997 sequences was downloaded from gisaid (https://www.gisaid.org/) on july 7th 2020. for each sequence we counted the numbers of each mutation type relative to the epi_isl_402125 sequence (ncbi accession mn908947) from wuhan, china. we have also reconstructed the most recent common ancestor (mrca) of the sars-cov-2 human clade using the bat coronavirus ratg13 sequence (accession number: mn996532.1) as an outgroup, and the mrca we obtained was identical to the epi_isl_402125 sequence. 
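the mutational bias calculation described above can be illustrated with a short sketch. the input representation is an assumption made here for illustration: external-branch substitutions are given as (ancestral, derived) pairs and ancestral nucleotide composition as a dictionary of counts. the text does not specify how these were stored or tallied, and the function name `mutational_bias` is hypothetical.

```python
NUCLEOTIDES = "ACGU"


def mutational_bias(substitutions, ancestral_counts):
    """Normalized mutational bias towards each nucleotide.

    substitutions: list of (from_nt, to_nt) pairs on external branches.
    ancestral_counts: dict mapping nucleotide -> count in the ancestral sequence.
    """
    total_subs = len(substitutions)
    total_sites = sum(ancestral_counts.values())
    raw = {}
    for x in NUCLEOTIDES:
        # Fraction of all substitutions whose derived state is x.
        frac_towards = sum(1 for _, to_nt in substitutions if to_nt == x) / total_subs
        # Fraction of ancestral sites that could mutate towards x (sites that are not x).
        frac_not_x = (total_sites - ancestral_counts.get(x, 0)) / total_sites
        raw[x] = frac_towards / frac_not_x
    # Normalize so that the four biases sum to one.
    norm = sum(raw.values())
    return {x: raw[x] / norm for x in NUCLEOTIDES}
```

dividing by the fraction of non-x ancestral sites corrects for the fact that a nucleotide that is already common offers fewer opportunities for mutations towards it.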
each mutation was counted only once, under the assumption that shared mutations were due to shared ancestry. we further discarded positions that have been documented as prone to errors based on the following resource: https://github.com/w-l/problematicsites_sars-cov2, although we note that retaining or discarding these positions had almost no effect on the results. for the within-host analysis we analyzed deep sequencing data of 212 sars-cov-2 samples that we have recently sequenced (miller et al. 2020). to mitigate sequencing errors, we considered only positions with coverage above 1000 and mutation frequencies above 5%. we have calculated several measures of codon usage bias. first, relative synonymous codon usage (rscu) was calculated for each gene as previously described (sharp and li 1986), where each sequence is represented as a 59-dimensional vector. we then performed correspondence analysis (ca) to reduce dimensionality and to detect major trends in codon usage variation among the sequences. in order to assess separation of the sequences on the first two ca axes we calculated the silhouette score (rousseeuw 1987) based on different clustering categories (host classification, protein type for the six main protein types shared among all viral families depicted in fig. 5, and viral family). next, we calculated the effective number of codons (enc) as previously described (wright 1990), where enc values range from 20 (when only one codon is used per amino acid) to 61 (when all synonymous codons are equally used for each amino acid). we also calculated gc3s, which is the gc content at synonymous third codon positions. rscu, ca, enc and gc3s were calculated using the codonw software (j.f.peden, unpublished, available at http://codonw.sourceforge.net/). 
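to make the 59-dimensional rscu representation concrete, the following sketch computes rscu for a single coding sequence under the standard genetic code. it is an independent illustration and not the codonw implementation: the function name `rscu` is hypothetical, and assigning 0.0 to codons of an amino acid that is absent from the sequence is a choice made here for simplicity.

```python
from collections import Counter
from itertools import product

# Standard genetic code (NCBI translation table 1), codons ordered TTT, TTC, TTA, ...
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def rscu(cds):
    """Return {codon: RSCU} for the 59 codons that have synonymous alternatives."""
    seq = cds.upper().replace("U", "T")
    # Count in-frame codons, ignoring any incomplete trailing codon.
    counts = Counter(seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3))
    # Group codons by amino acid, skipping stops (*), Met (M) and Trp (W).
    families = {}
    for codon, aa in CODON_TABLE.items():
        if aa not in "*MW":
            families.setdefault(aa, []).append(codon)
    values = {}
    for codons in families.values():
        total = sum(counts[c] for c in codons)
        for c in codons:
            # Observed count divided by the count expected under equal synonymous usage.
            values[c] = counts[c] * len(codons) / total if total else 0.0
    return values
```

a value of 1 indicates that a codon is used as often as expected if all synonymous codons were equally frequent; methionine, tryptophan and the stop codons are excluded, which leaves 59 codons.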
finally, we calculated the trna adaptation index (tai) using the tai package (https://github.com/mariodosreis/tai) with genomic trna information from homo sapiens that was obtained from gtrnadb (chan and lowe 2016) . to predict peptides that can serve as epitopes for mhc class i recognition we used the netmhcpan4 program (jurtz et al. 2017) . we ran the program over all phyvirus sequences using 249 mammalian alleles from the following organisms: human, chimpanzee, swine, mouse, gorilla, rhesus macaque and bovine (table s2 ). the prediction was performed for peptide lengths of nine with default parameters. we calculated the nucleotide content for strong and weak binding areas and for areas with no binding prediction. in our calculation we first considered only nucleotides that determine the amino acid type unequivocally (for example for valine we counted g and u only, since the wobble position can be either one of the four nucleotides). a similar analysis was performed considering all three nucleotides that code for these peptides, yielding essentially the same results (fig. s11) . a t-test was performed to determine if the nucleotide content was significantly different between the mhc binding strengths. multiple test correction was performed using fdr. the phyvirus dataset is available online at https://www.sternadi.com/phyvirus, and includes all alignments, phylogenies, as well as metadata files. raw sequencing data for the miller et al. 2020 sars-cov-2 samples are available in the ncbi sra database under bioproject id prjna647529. 
fastml: a web server for probabilistic reconstruction of ancestral sequences the influence of cpg and upa dinucleotide frequencies on rna virus replication and characterization of the innate cellular pathways underlying virus attenuation and enhanced replication causes and implications of codon usage bias in rna viruses pacing a small cage: mutation and rna viruses controlling the false discovery rate: a practical and powerful approach to multiple testing apobec-mediated editing of viral rna over-and under-representation of short oligonucleotides in dna sequences genetic inactivation of poliovirus infectivity by increasing the frequencies of cpg and upa dinucleotides within and across synonymous capsid region codons base composition and translational selection are insufficient to explain codon usage bias in plant viruses hiv-1 adaptation to hla: a window into virus-host immune interactions cpg-creating mutations are costly in many human viruses gtrnadb 2.0: an expanded database of transfer rna genes identified in complete and draft genomes dissimilation of synonymous codon usage bias in virus-host coevolution due to translational selection cpg usage in rna viruses: data and hypotheses extremely high mutation rate of hiv-1 in vivo dsrna sensing during viral infection: lessons from plants, worms, insects, and mammals minimal contribution of apobec3-induced g-to-a hypermutation to hiv-1 recombination and genetic variation the 3' and 5'-terminal sequences of influenza a, b and c virus rna segments are highly conserved and show partial inverted complementarity dinucleotide composition in animal rna viruses is shaped more by virus family than by host species rates of evolutionary change in viruses: patterns and determinants differentiating between selection and mutation bias long term trends in the evolution of h(3) ha1 human influenza type a comparative genomics of the mhc: glimpses into the evolution of the adaptive immune system influenza virus drug resistance: a 
time-sampled population genetics perspective overlooked roles of dna damage and maternal age in generating human germline mutations genomic surveillance elucidates ebola virus origin and transmission during the 2014 outbreak patterns of oligonucleotide sequences in viral and host cell rna identify mediators of the host innate immune system analysis of synonymous codon usage in sars coronavirus and other viruses in the nidovirales a simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood evidence that mutation is universally biased towards at in bacteria a test of neutral molecular evolution based on nucleotide data viralzone: a knowledge resource to understand virus diversity the extent of codon usage bias in human rna viruses and its evolutionary origin likely role of apobec3g-mediated g-to-a mutations in hiv-1 evolution and drug resistance netmhcpan-4.0: improved peptide-mhc class i interaction predictions integrating eluted ligand and peptide binding affinity data why is cpg suppressed in the genomes of virtually all small eukaryotic viruses but not in those of large eukaryotic viruses? heterogeneity of genomes: measures and values the a-rich rna sequences of hiv-1 pol are important for the synthesis of viral cdna role of hla adaptation in hiv evolution natural selection for nucleotide usage at synonymous and nonsynonymous sites in influenza a virus genes high guanine and cytosine content increases mrna levels in mammalian cells viral sequence evolution in acute hepatitis c virus infection conserved tertiary structure elements in the 5' untranslated region of human enteroviruses and rhinoviruses phylogeny-aware alignment with prank rate, molecular spectrum, and consequences of human mutation adaptive protein evolution at the adh locus in drosophila full genome viral sequences inform patterns of sars-cov-2 spread into and within israel. 
medrxiv:2020 database indexing for production megablast searches codon usage tabulated from international dna sequence databases: status for the year 2000 alignments and comparative profiles of picornavirus genera sequencing and analyses of all known human rhinovirus genomes reveal structure and evolution vipr: an open bioinformatics database and analysis resource for virology research evolutionary analysis of the dynamics of viral infectious disease phylogenetic evidence for deleterious mutation load in rna viruses and its contribution to viral evolution 5' and 3' terminal nucleotide sequences of the rna genome segments of influenza virus silhouettes: a graphical aid to the interpretation and validation of cluster analysis apobec3g contributes to hiv-1 variation through sublethal mutagenesis adars: viruses and innate immunity evolution of interferons and interferon receptors codon usage in regulatory genes in escherichia coli does not reflect selection for 'rare' codons rampant c→u hypermutation in the genomes of sars-cov-2 and other coronaviruses: causes and consequences for their short-and long-term evolutionary trajectories the evolutionary pathway to virulence of an rna virus clonal interference in the evolution of influenza cg dinucleotide suppression enables antiviral defence targeting non-self rna within-patient mutation frequencies reveal fitness costs of cpg dinucleotides and drastic amino acid changes in hiv conserved rna secondary structures in flaviviridae genomes the adaptation of codon usage of +ssrna viruses to their hosts rna virus attenuation by codon pair deoptimisation is an artefact of increases in cpg/upa dinucleotide frequencies. elife 3:e04531. van der kuyl ac, berkhout b. 2012. 
the biased nucleotide composition of the hiv genome: a constant factor in a highly variable virus the tendency of lentiviral open reading frames to become a-rich: constraints imposed by viral genome organization and cellular trna availability codon usage bias and the evolution of influenza a viruses. codon usage biases of influenza virus the 'effective number of codons' used in a gene paml 4: phylogenetic analysis by maximum likelihood codon-substitution models for heterogeneous selection pressure at amino acid sites evaluation of an improved branch-site likelihood method for detecting positive selection at the molecular level influenza research database: an integrated bioinformatics resource for influenza virus research we thank rasmus nielsen, eran bacharach, pleuni pennings and tzachi hagai for stimulating discussions and comments on the manuscript. this work was supported in part by a fellowship to tk from the edmond j. safra center for bioinformatics at tel-aviv university, and funding to as from the koret-uc berkeley-tel aviv university initiative in computational biology and bioinformatics and from the israeli science foundation (1333/16). key: cord-355075-ieb35upi authors: papenfuss, anthony t; baker, michelle l; feng, zhi-ping; tachedjian, mary; crameri, gary; cowled, chris; ng, justin; janardhana, vijaya; field, hume e; wang, lin-fa title: the immune gene repertoire of an important viral reservoir, the australian black flying fox date: 2012-06-20 journal: bmc genomics doi: 10.1186/1471-2164-13-261 sha: doc_id: 355075 cord_uid: ieb35upi background: bats are the natural reservoir host for a range of emerging and re-emerging viruses, including sars-like coronaviruses, ebola viruses, henipaviruses and rabies viruses. however, the mechanisms responsible for the control of viral replication in bats are not understood and there is little information available on any aspect of antiviral immunity in bats. 
massively parallel sequencing of the bat transcriptome provides the opportunity for rapid gene discovery. although the genomes of one megabat and one microbat have now been sequenced to low coverage, no transcriptomic datasets have been reported from any bat species. in this study, we describe the immune transcriptome of the australian flying fox, pteropus alecto, providing an important resource for identification of genes involved in a range of activities including antiviral immunity. results: towards understanding the adaptations that have allowed bats to coexist with viruses, we have de novo assembled transcriptome sequence from immune tissues and stimulated cells from p. alecto. we identified about 18,600 genes involved in a broad range of activities with the most highly expressed genes involved in cell growth and maintenance, enzyme activity, cellular components and metabolism and energy pathways. 3.5% of the bat transcribed genes corresponded to immune genes and a total of about 500 immune genes were identified, providing an overview of both innate and adaptive immunity. a small proportion of transcripts found no match with annotated sequences in any of the public databases and may represent bat-specific transcripts. conclusions: this study represents the first reported bat transcriptome dataset and provides a survey of expressed bat genes that complement existing bat genomic data. in addition, these data provide insight into genes relevant to the antiviral responses of bats, and form a basis for examining the roles of these molecules in immune response to viral infection. bats make up approximately 20% of the extant mammalian diversity and are the second most species rich mammalian lineage after rodents [1] . the order chiroptera is divided into two suborders: the megachiroptera and microchiroptera. these two lineages are estimated to have diverged approximately 58 million years ago [2] . 
megachiroptera consists of a single family, the old world fruit bats, while microchiroptera includes 17 families of echolocating bats. bats have a wide geographic distribution and exploit a variety of environmental niches, being absent only from the polar regions. bats are also hosts to numerous viruses, many of which are highly pathogenic to humans and other mammals yet appear to cause no clinical consequences in bats [3] [4] [5] [6] [7] [8] . this group of mammals also shares a variety of unique characteristics that likely facilitate the persistence and spread of the viruses they carry. highly social species, bats live at much higher densities than other mammals. they are the only mammals capable of powered flight and have long lifespans relative to their body size [9] . despite their diversity, unique characteristics and role as natural reservoirs for viruses, bats are also the least studied of all mammalian taxa and there is little information available on antiviral immunity in any bat species. bats are the natural reservoir hosts of more than 80 viruses, with new viruses or viral sequences of bat origin being discovered each year [9, 10] . rna viruses account for the overwhelming majority of known bat viruses, many of which are among the most deadly known to man, including ebola, hendra, nipah and sars-like coronaviruses [9] . many of these viruses, which cause severe morbidity and mortality in humans and other mammals, appear to cause no clinical diseases in bats under natural or experimental infection. the most studied example is the henipaviruses (hendra and nipah viruses) which are members of the family paramyxoviridae. nipah virus has a mortality rate of 40-90% in humans and close to 100% in experimental animal models (cats and hamsters). yet, infection of pteropus vampyrus (the natural reservoir host of nipah virus in malaysia) and p. 
poliocephalus (a related bat species native to australia) with a high dose of nipah virus failed to result in clinical signs of disease [7, 8, 11] . other examples of experimental infections of bats including ebola zaire, japanese encephalitis and st. louis encephalitis viruses have not resulted in any symptoms of disease despite the presence of viral rna in tissues [3] [4] [5] [6] . experimental infections of p. poliocephalus with nipah virus have demonstrated the presence of serum antibody and viral shedding in the absence of clinical symptoms of disease [11] . the only viruses that have been demonstrated to cause clinical symptoms of disease in bats are rabies virus and the closely related australian bat lyssavirus [12, 13] . however, results of experimental infections are inconsistent with only a small proportion of bats succumbing to infection, and rates of sero-conversion and virus recovery from tissues were reported to be very low [13] . the long co-evolutionary history of bats with viruses has likely resulted in the adaptation of the bat immune system to cope with viral infection. one hypothesis is that the innate immune system rapidly controls viral replication to very low levels that cause no clinical consequences to bats, but still results in viral shedding and subsequent spillover to other species. however, as little information currently exists on any aspect of bat immunology and few bat-specific reagents exist, this hypothesis remains untested. recent years have seen a surge in the availability of whole genome sequence data. bats were among the organisms sequenced as part of the us national institutes of health (nih)-funded mammalian genome project. these genomic resources are an important step forward in identifying the genes that are involved in antiviral immunity in bats and in providing insights into other unique life history characteristics. there are currently two publicly available bat genome sequences: one from the megabat p. 
vampyrus and a second from the microbat myotis lucifugus. both bat genomes were initially sequenced to low coverage (2.6x for p. vampyrus and 1.7x for m. lucifugus, though a draft quality assembly of the m. lucifugus genome based on 7x coverage sequencing is now available). additionally, the annotations were predominantly based upon comparative data. despite these shortcomings, these projects have an important role to play in revealing the mechanisms that have evolved to allow bats to remain asymptomatic to so many viral diseases. in order to understand bat-virus interactions, we are developing the australian black flying fox, p. alecto, as a model bat species. p. alecto belongs to the family pteropodidae and is closely related to p. vampyrus [14] . these two species are reservoirs for a variety of closely related viruses, the most important of which include the henipaviruses, hendra virus in p. alecto and nipah virus in p. vampyrus [10] . a number of important resources have now been developed for p. alecto, including cell lines from a variety of tissues [15] . we have also begun to identify some of the genes involved in immune responses in this species and carry out functional studies in bat cells [16] [17] [18] [19] [20] [21] . to begin to characterise the immune gene repertoire of p. alecto, we sequenced the transcriptome of bat immune tissues and mitogen-stimulated cells using the illumina platform. to our knowledge, this study represents the first analysis of the transcriptome of any species of bat. our analysis of the p. alecto transcriptome provides information on a variety of immune genes not previously identified in any bat species and represents an important starting point for examining the antiviral activity of these molecules. overview of the bat transcriptome two separate transcriptomic datasets were generated and raw sequences from each database were submitted to the sequence read archive [sra: srr350710.3 and srr351237.2]. 
the first was obtained using total rna extracted from a juvenile male flying fox thymus. due to its role in central tolerance, the thymus expresses a large proportion of the proteome and therefore allows for the identification of a broad range of genes, including those involved in the immune response. to enrich for sequences corresponding to cytokines and innate immune genes, the second dataset was derived from pooled total rna obtained from mitogen-stimulated spleen, white blood cells and lymph node and unstimulated thymus and bone marrow obtained from one pregnant female and one adult male flying fox. cells were stimulated with lipopolysaccharide (lps) and ionomycin, which stimulate the production of pro-inflammatory cytokines; polyic, a tlr3 ligand; pha, which triggers t cell activation; and pma, which activates t and b cells. about 12.5 million 65 bp long reads were obtained from the thymus dataset, while 23.9 million 76 bp long reads were generated from the stimulated pooled sample. prior to assembly, the raw reads were trimmed of low quality sequence and polya/t tails, and uninformative strings of 'n' and primer/adapter contaminants were removed. the filtered dataset consisted of 12,399,095 reads from the thymus (between 20-63 bp) and 22,577,294 reads from the stimulated pooled dataset (between 20-73 bp). the filtered reads were de novo assembled using the software packages velvet and oases. the resulting oases assemblies consisted of 247,909 contigs (n50 1244 bp) from the thymus and 313,641 contigs (n50 733 bp) from the pooled samples. the largest contigs in the thymus and pooled samples were 11.8 kb and 8.9 kb respectively, both of which correspond to the dna-dependent protein kinase catalytic subunit (dna-pkcs) which is represented by a 12.4 kb transcript in other species, including horse. for comparative purposes, an assembly using mira was also generated. summary statistics from the velvet, oases and mira assemblies are listed in additional file 1: table s1. 
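the n50 values quoted for the assemblies summarise contig length distributions. as a minimal sketch of the usual definition (the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the assembled bases), and not the code used by velvet, oases or mira, a hypothetical helper `n50` could be written as:

```python
def n50(contig_lengths):
    """Length L such that contigs of length >= L hold at least half of all assembled bases."""
    total = sum(contig_lengths)
    running = 0
    # Accumulate contigs from longest to shortest until half of the total is covered.
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("no contigs supplied")
```

a higher n50, such as that reported for the thymus assembly relative to the pooled assembly, indicates that more of the assembled sequence lies in long contigs.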
all subsequent analyses were performed using the oases assemblies. to identify orthologues of known mammalian protein coding genes, the bat contigs were used to search the kegg and ncbi non-redundant (nr) protein databases with blastx (e-value < 0.001). of the 247,801 contigs longer than 50 bp in the thymus sequence assembly, about 46% matched annotated proteins in the nr database. for the pooled dataset, about 51% of the 313,528 loci matched proteins in nr. similar results were obtained for both assembled libraries against the kegg database. of the assembled thymus transcripts annotated using kegg, 36% of all transcripts were more similar to horse sequences than to any other species, followed by dog (16%) and cow (12%) (figure 1 ). similar results were obtained for the pooled tissue dataset (not shown). this result is consistent with the now generally accepted view that bats belong within laurasiatheria, which includes carnivora, cetartiodactyla (whales and even toed ungulates), eulipotyphla (moles and shrews), pholidota (scaly anteater) and perissodactyla (odd toed ungulates) [22] [23] [24] [25] [26] [27] . however, until recently, the phylogenetic relationships within laurasiatheria have been controversial. conflicting results have been reported using complete mitochondrial genome sequences to infer phylogenetic relationships with support for a sister relationship between chiroptera and fereungulata (carnivora, pholidota, perissodactyla and cetartiodactyla) or a relationship between chiroptera and eulipotyphla [28] [29] [30] . analysis of the nuclear gene, protamine p1, as well as large genomic datasets, has provided evidence that bats are sister to a clade containing perissodactyla, carnivora, and cetartiodactyla [31, 32] . 
the volume of sequence data generated by transcriptome sequencing provides the opportunity for larger scale sequence comparisons than previously possible using the few full length bat genes available or by comparison with the limited whole genome sequence data. our results support the comparative analysis of retroposon loci which has also demonstrated that bats share a sister relationship with horses, forming a clade with carnivora [27] . alignment of contigs from the thymus and pooled datasets to the kegg database identified 178,554 and 285,268 contigs respectively with homology to 16,863 and 16,927 unique human proteins. to explore gene function, gene ontology (go) terms were used. of contigs that matched proteins in the kegg database, 86% were assigned go terms and 78% could be mapped to go slim terms using go term mapper (additional file 2: figure s1 ). genes with go slim terms were further classified into twelve selected classes ( figure 2 ). the most abundant go terms found in the thymus dataset were involved in cell growth and maintenance (16.8%), enzyme activity (14.8%), cellular components (14.3%) and metabolism and energy pathways (14.5%). similar results were obtained for the pooled tissue dataset (data not shown). the go classification demonstrates that a diverse range of genes were identified in each of our two datasets providing a broad survey of bat genes. a goal of the present study was to identify immune transcripts, particularly those that may play a role in antiviral immunity. only 3.5% of the bat transcribed genes from each of the datasets showed homology to genes associated with immune function. this represents about 500 different immune-related genes ( figure 2 ). the bat immune transcripts were further categorised using go terms to annotate the transcripts into 40 immune categories. 
represented in the datasets were genes involved in a broad range of immune activities with lymphocyte activation, cytokine production and t cell activation making up the largest proportions of immune transcripts ( figure 3 ). using kegg codes to identify immune genes, our data revealed 70 genes involved in toll-like receptor (tlr) cascades, 50 genes involved in b cell activation, 79 involved in t cell activation, 72 involved in natural killer cell cytotoxicity and 41 involved in antigen presentation. additional immune genes not identified in the kegg database were obtained by searching sequences from the nr database. the sequences of all genes described in the text are provided in the additional file 3. one hypothesis for the ability of bats to resist the pathological effects of viral infection is that they are able to rapidly control viral replication early in the immune response through innate antiviral mechanisms. the bat transcriptome contained representatives of a variety of immune genes including pattern recognition receptors, interferons, interferon stimulated genes and natural killer cell receptors. pattern recognition receptors (prr) including tlrs, rig-i like helicases (rlhs) and nucleotide oligomerisation domain (nod) like receptors (nlrs) recognise conserved molecular patterns associated with a broad range of pathogens. both tlrs and rlhs initiate signalling pathways that result in the induction of similar immune and inflammatory responses but are expressed in different locations within the cell and differ in the pathogens they recognise. tlrs are transmembrane proteins expressed by the plasma membrane or endosome and recognise a broad range of pathogens including viruses, bacteria and fungi. of eleven previously identified p. alecto tlr genes [18] , only tlr5 was absent from the oases assemblies, however it was present in the mira assembly, which used a lower coverage cut-off and is useful for identifying genes with low expression levels. 
rlhs are expressed in the cytoplasm where they recognise viral rna and dna [33, 34] . three bat rlh genes, retinoic-acid-inducible protein i (rig-i), melanoma-differentiation-associated gene 5 (mda5) and laboratory of genetics and physiology 2 (lgp2) were identified in our transcriptome datasets and have recently been described in p. alecto [17] . these results provide further evidence that bats are able to recognise a broad range of pathogens, similar to other species. nlrs are a diverse family of cytoplasmic prrs involved in the activation of a variety of signalling pathways. nlrs are primarily involved in bacterial recognition, although more recently, evidence for recognition of viral rna and dna by some members of the nlr family has been reported [35] [36] [37] . the only nlrs identified in the bat transcriptome datasets were nod-like receptor family card domain containing 5 (nlrc5) and nlr family, pyrin domain containing 3 (nlrp3). nlrc5 is a recently identified nlr proposed to function as a positive and negative regulator of antiviral immune responses [36] . nlrp3 (also known as nalp3) is activated by a variety of danger signals including viral and bacterial infections and environmental irritants. activation of nlrp3 in turn activates caspase-1 in the inflammasome which proteolytically cleaves the cytokines il-1β and il-18 into active mature peptides [37] . the identification of two nlrs associated with antiviral immunity in the bat transcriptome is remarkable and provides a starting point for understanding the role of nlrs in antiviral immunity in bats. the interferon (ifn) response is a key component of the innate immune system and the first cytokines induced against viral infection. since the ifn response is important in the control of viral replication in other mammals, we searched the bat transcriptome for ifns and ifn stimulated genes (isgs) that may be critical to the ability of bats to remain asymptomatic to viral infections. 
type i (including ifnα and ifnβ) and type iii (ifnλ) ifns are induced directly in response to viral infection and play a role in the earliest stages of the innate immune response. type i (α) ifn and its receptor (ifnar1 and ifnar2) were identified in the bat transcriptome datasets (additional file 3). although the type iii ifns, ifnλ1 and ifnλ2, are upregulated in stimulated bat cells [21] , neither of these genes was identified in our datasets, likely reflecting a low expression level in our samples. the il-10r2 chain of the type iii ifn receptor was present in the bat transcriptome, but its partner chain ifnlr1 was not found. both il-10r2 and ifnlr1 were recently described in p. alecto and ifnlr1 was demonstrated to act as a functional receptor for ifnλ [20] . the induction of type i and type iii ifns results in the transcription of hundreds of isgs including prrs that detect viral rna, transcription factors that result in the amplification of the ifn response and a small number of proteins that are directly responsible for inducing an antiviral state. the isgs, myxovirus resistance (mx) gtpases, protein kinase r (pkr), 2'-5' oligoadenylate synthetases (oas), ribonuclease l (rnasel) and isg15 are among the proteins with confirmed antiviral activity in other mammals [38] . the bat transcriptome datasets contained genes orthologous to mammalian mx1, mx2, oas1, oas2, oas3, oas-like (oasl), pkr, rnasel and isg15 consistent with the presence of an isg repertoire in bats that is similar to that of other species. these results provide the first evidence that the pathways activated by the ifn response are likely similar in bats to those described in other mammals. the mx gene family is among the best characterised isgs, first identified as antiviral proteins following the observation that the sensitivity of many inbred mouse strains to orthomyxovirus was solely due to mutations within the mx locus [39] . 
the mx family of gtpases trap essential viral components, and in so doing prevent viral replication at early time points. although the full spectrum of mx antiviral activity is unknown, representatives of both rna and dna viruses have been shown to be sensitive to the effects of mx [40] . a full length transcript, encoding a 667 amino acid protein was identified in our bat transcriptome datasets and found to be orthologous to mx1 based on comparison with known mammalian mx1 and mx2 family members (figure 4a and data not shown). bat mx1 contained the highly conserved tripartite gtp-binding domain found in all mammalian mx proteins. in addition, a dynamin family signature and putative leucine zipper motif were found near the c terminal end, represented by a stretch of evenly spaced leucine residues. the bat protein was also conserved in the region identified as the stalk of human mxa including loop 2 which is associated with antiviral activity. consistent with other species, loop 4 of the mxa stalk is the least conserved region of the bat mx protein [41] . loop 4 has been reported to be proteinase k sensitive and may play a role in lipid binding [42, 43] (figure 4b ). bat mx1 does not contain the stretch of basic amino acids (k/r) near the c terminal end associated with nuclear localisation of mouse mx1, consistent with the bat protein remaining localised within the cytoplasm [44] . the conservation of key residues important in antiviral activity is consistent with the bat mx1 playing a role in antiviral immunity similar to other species. the identification of the sequences of important isgs will now allow us to determine whether functional differences in the initiation and regulation of these proteins account for the differences in susceptibility of bats to viral infections compared to other mammals. natural killer (nk) cells are an important component of the innate immune response, providing a first line of defence against viruses and tumours. 
to our knowledge, no investigations of nk cell receptors from any species of bat have been reported previously. nk cells express cell surface receptors that recognise major histocompatibility complex (mhc) class i or class i like molecules on the surface of cells and lyse infected or abnormal cells by cytotoxicity. two families of nk receptors that bind classical mhc class i ligands have been identified: the killer immunoglobulin like receptors (kirs), which are encoded by genes in the leukocyte receptor complex (lrc), and the killer cell lectin like receptors (klrs), which are encoded by genes in the natural killer complex (nkc). different lineages of mammals make use of genes from the two different superfamilies to carry out analogous functions. kirs are used preferentially by primates, cattle, domestic cats, dogs and pigs [45, 46] . similarly, the kir-like receptors, marsupial immunoglobulin-like receptors (mairs) and chicken immunoglobulin-like receptors (chirs), have expanded in marsupials and chickens respectively [47, 48] . although chir-ab binds igy, the ligand for the majority of chirs is unknown and the presence of a charged transmembrane residue and a cytoplasmic immunoreceptor tyrosine-based inhibition motif (itim), are consistent with the possibility that they play a role in nk activity [49] . rodents, horses and platypus are the only species so far described that have expanded the klrs, represented by the ly49 family [50] [51] [52] . in the bat transcriptome dataset, no transcripts with homology to kirs or ly49 receptors were identified. in bony fish, novel immune type receptors (nitrs) which contain an n terminal variable domain and a c terminal ig domain have been identified as the primary activating and inhibitory receptors expressed by nk cells [53] . nitrs were also used to search the bat transcriptome but failed to identify any orthologous transcripts. 
the failure to find kir or ly49 like receptors in the bat transcriptome may reflect low expression levels of these genes resulting in their absence from our datasets. however, blast searches of the publicly available whole genome sequence of the closely related pteropid bat, p. vampyrus revealed no evidence of kirs or ly49 receptors. as this is a low coverage genome (2.63x), further work is required to determine whether pteropid bats have kir and/or ly49 receptors. overall, the absence of these important nk receptors from our datasets warrants further investigation into the nature of nk cells in bats. nk cells in a wide range of mammalian species additionally express cd94/nkg2 (also called klrd1/klrc) lectin-like receptor heterodimers. unlike the kir and ly49 receptors, which bind (classical) mhc class ia ligands, the cd94/nkg2 heterodimer binds the (non-classical) mhc class ib ligands hla-e and qa-1 in humans and mice respectively [54]. the cd94/nkg2a heterodimer generates inhibitory signals whereas the cd94/nkg2c heterodimer generates activating signals within nk cells. both cd94 and nkg2a were identified in the bat transcriptome; however, nkg2c transcripts were not identified, possibly reflecting the low abundance of transcripts of this gene in our datasets. two and 37 nkg2a transcripts were identified in the thymus and pooled datasets respectively and six transcripts corresponding to cd94 were identified in the pooled dataset. two of the longest nkg2a sequences were aligned with nkg2a and nkg2c sequences from human and mouse. as shown in figure 5a, the bat genes display highest conservation with other nkg2a genes including the presence of conserved itim motifs in their cytoplasmic domains, designated by i/v/l/sxyxxl/v, indicating that they are likely functional inhibitory receptors [55]. the more divergent nkg2d, which binds mhc class i chain-related genes, mica/b, and the ul16 binding proteins (ulbps) in human [46], was also detected.
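the itim consensus quoted above (i/v/l/s-x-y-x-x-l/v) can be written as a regular expression and scanned against a protein sequence. the following is a minimal sketch of such a scan, not the procedure used in this study; the function name `find_itims` and the example peptide are hypothetical and were made up purely for illustration.

```python
import re

# ITIM consensus as given in the text: I/V/L/S - x - Y - x - x - L/V.
# The lookahead lets overlapping candidate motifs be reported as well.
ITIM_PATTERN = re.compile(r"(?=([IVLS].Y..[LV]))")


def find_itims(protein):
    """Return (1-based start position, motif) for each ITIM-like match."""
    return [(m.start() + 1, m.group(1))
            for m in ITIM_PATTERN.finditer(protein.upper())]


# Hypothetical peptide invented for illustration (not a bat sequence).
example = "GGSVIYSDLGGGITYAELGG"
for position, motif in find_itims(example):
    print(position, motif)
```

a match to the consensus only flags a candidate motif; whether it is functional depends on its location in the cytoplasmic domain and on experimental confirmation.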
two distinct bat cd94 contigs were identified, one of which is missing two conserved cysteines in the stalk region, the first of which forms an interchain disulfide bond with nkg2 and the second which forms an intrachain disulphide bond. the second bat cd94 sequence is missing a conserved cysteine in the extracellular domain that forms an intrachain disulphide bond (figure 5b ). the absence of key cysteines in both of the bat cd94 sequences may have implications for the formation of heterodimers with nkg2 and for the unique folding of the cd94 chain. combined with our failure to detect kirs or ly49 receptors, our data may provide the first evidence for the presence of atypical nk cell responses in bats. however, confirmation of the nature of the nk response and the composition of receptors used by bat nk cells awaits further investigation. other nk receptors were also identified in our datasets including cd244 which acts as an activating or inhibitory receptor on human and mouse nk cells respectively [56] and the natural cytotoxicity receptors expressed by nk cells. co-receptors including cd16 and cd56 expressed by subsets of nk cells in other species were also identified in the bat transcriptome. identification of nk cell receptors and co-receptors provides information for the development of reagents to identify bat nk cells and paves the way for further studies of nk cell function during viral infection in bats. genes involved in the adaptive immune system, including mhc class i and ii genes and t and b cell receptors and co-receptors were highly represented in both the thymus and pooled datasets providing evidence that bats have all of the components necessary to mount an adaptive immune response. mhc class i molecules play an important role in the initiation of the adaptive immune response through recognition of endogenously-derived peptides from viruses and other pathogens. 
in the thymus dataset, 46 contigs had homology to mammalian mhc class i proteins, while 24 were homologous in the pooled data. other transcripts in the mhc class i antigen-loading and presentation pathway were also identified, including beta-2-microglobulin, transporter associated with antigen processing 1 (tap1), calnexin and tapasin. class i-related genes were also present in the bat transcriptome dataset including cd1a, cd1b, cd1d, mr1, hfe, fcrn and ulbps, which have a variety of immune and nonimmune functions. the presence of ulbps is consistent with the expression of nkg2d, but orthologues of mica/b or mill were not observed. the presence of nkg2d suggests bats should have a mic homologue, which may have gone undetected because of low or tissue-specific expression. to our knowledge, these sequences provide the first class i and class i-associated transcripts from any species of bat. of the 46 contigs with homology to mhc class i genes in the thymus dataset, 29 contained in-frame stops. these may be expressed pseudogenes, represent assembly or sequencing errors or result from reading frame shifts due to the presence of unprocessed transcript. as the sequences were obtained from multiple individuals, it is not possible to confidently distinguish between alternative isoforms, alleles and in some cases, loci. however, clustering the remaining contigs with open reading frames (orfs) indicates that at least 9 distinct mhc class i genes are expressed. the majority of class i contigs contained the α1 or α2 domain or partial sequence corresponding to both domains and were used for further sequence analysis. the deduced amino acid sequences of contigs with the most complete α1 or α2 domains were aligned with human hla-a (figure 6). all of the bat class i sequences contained a unique three amino acid insertion in the α1 domain that appears to be bat specific.
as shown in figure 6, the bat transcripts display amino acid variation in their α1 and α2 domains, corresponding to the peptide binding region. however, they appear to be remarkably conserved from residues 131 to 175 of the α2 domain. these results may indicate that bats contain a very closely related class i gene repertoire that has coevolved with the specific viruses they carry. some of the class i transcripts represented in the thymus and pooled datasets contained an 84 bp insertion at the end of the α1 domain. the longest of these transcripts corresponded to the leader peptide through to 71 amino acids of the α2 domain and is shown in figure 6. the insertion at the end of the α1 domain is not present in class i sequences from other mammals and includes two in frame stop codons that would prevent translation beyond the α1 domain (figure 6). this sequence was confirmed by race pcr and transcripts were detected in a variety of tissues including lymph node, spleen, liver, lung, heart, kidney, small intestine, brain and salivary glands, thus providing evidence that they are not an artefact of the transcriptome assembly (data not shown). comparison with the closely related p. vampyrus whole genome sequence available in ensembl revealed that the 84 bp insertion is identical to the beginning of intron 2 of a p. vampyrus class i gene. mhc class i splice variants that retain intron sequence and result in the translation of a truncated protein have been identified in other mammals, including soluble splice variants of human hla-g that play a role in immunoregulation at the foetal-maternal interface [57]. further investigation will be required to determine whether the bat gene encodes a soluble protein corresponding only to the α1 domain or whether it represents a transcribed pseudogene. however, given the abundance of this transcript in our datasets it is possible that it plays a role in immune regulation in p. alecto.

figure 5. a. alignment of deduced amino acid sequences of bat nkg2 with human and mouse nkg2a and nkg2c. sequences are divided into cytoplasmic, transmembrane, stalk, and lectin domains. the predicted itim motifs in the cytoplasmic domain are shaded. the conserved cysteine residue in the stalk predicted to be involved in interchain disulphide bond formation with cd94 is shaded and indicated with an asterisk. dashes indicate similarity and dots indicate gaps. b. alignment of the deduced amino acid sequences of bat cd94 with the human and mouse orthologues. sequences are divided into cytoplasmic, transmembrane, stalk, and lectin domains. conserved cysteines predicted to be involved in disulphide bond formation are shaded. cysteine pairs are indicated by identical numbers below the cysteine. the cysteine predicted to form a disulphide bond with nkg2 is indicated with an asterisk.

unlike class i molecules, which are ubiquitously expressed, class ii molecules are expressed only by antigen presenting and b cells and present exogenously derived peptides to t cells. the mhc class ii molecules are composed of an α and a β chain encoded by a and b genes respectively [58, 59]. eutherians have three main classical class ii gene clusters: dp, dr, and dq, as well as the nonclassical dm and dn/do gene clusters [60, 61]. sequences corresponding to exon 2 of mhc class ii drb genes have been described in four species of microbats [62-64]. however, prior to the present study no class ii genes have been reported from any species of megabat. sequences corresponding to genes involved in the class ii antigen processing and presentation pathway were also identified in our datasets including the class ii invariant (cd74) chain and cathepsin s (additional file 3). in the p.
alecto thymus and pooled datasets we identified 78 and 238 contigs respectively that were homologous to class ii sequences. phylogenetic analysis revealed that the alpha chain sequences were homologous to dma, doa, dqa and dra from other mammals (figure 7a) and the beta chain sequences were homologous to dmb, dob, dqb and drb (figure 7b). these results are consistent with orthologous relationships between the bat class ii genes and those from other mammals. t cell receptor (tcr) genes corresponding to all four chains of the t cell receptor were present in our datasets, consistent with bats having both αβ and γδ t cells. sequences corresponding to the constant and variable domains of the tcr were identified including many tcrα related contigs, tcrβ related contigs, a few tcrγ and tcrδ chain related contigs. in humans and mice approximately 95% of circulating t cells express the αβ t cell receptor. in contrast, γδ t cells account for up to 70% of circulating t cells in young ruminants, rabbits and chickens [65, 66]. the low abundance of tcrγ and tcrδ related transcripts in our datasets is consistent with the possibility that αβ t cells may be the predominant tcr present in bats. in addition, a variety of t cell co-receptors, including the accessory tcrζ chain, cd3, cd4, cd8 and cd28 were identified in our datasets. we previously described the immunoglobulin heavy chain diversity of p. alecto, revealing the presence of a highly diverse variable region gene repertoire [16]. sequences encoding the variable and constant domains of immunoglobulin heavy and light chains were represented in our datasets. these included heavy chain genes encoding iga, igg, igm and ige, which have previously been described in the megabat, cynopterus sphinx. no evidence for the transcription of igd was observed in the p. alecto transcriptome, a result which is consistent with c. sphinx [67].
the two light chain subtypes, kappa and lambda and a variety of b cell co-receptors including cd19, cd22, cd72, cd79a and cd79b were also identified in our datasets (additional file 3). many of the bat immune transcripts showed high levels of sequence similarity compared to homologues from other mammals. among the most conserved bat innate immune genes were the prrs; the tlrs, rig-i helicases and nlrs, which displayed >80% amino acid identity with homologues. this likely reflects their roles in the recognition of conserved pathogen motifs. members of the oas family were also highly conserved, in particular oas1 which shared 87% amino acid identity with the dog oas1 sequence. in addition, the nk co-receptor, cd56 shared 93% amino acid identity with mouse, hamster, guinea pig and human sequences. among the adaptive immune genes, mhc associated proteins, calnexin, tap1 and cathepsin s shared 89-95% identity with corresponding sequences from other mammals reflecting their conserved roles in the antigen processing and presentation pathway. several members of the mhc class i and ii families were also highly conserved, including cd1b and cd1d which shared 88 and 89% amino acid identity with horse and chimp sequences respectively. the bat mhc class ii doa and dra shared 91 and 89% amino acid identity with orthologous sequences in other mammals. the t cell co-receptor, cd28 shared 90% identity with the rhinoceros cd28 sequence and the constant domain of igm shared 92% identity with camel igm. there were >77,000 unannotated contigs in the thymus and pooled datasets. only about 3% of these contigs matched predicted cdnas from the p. vampyrus genome sequence, which are annotated using orthologous sequences from other species [68] . the unannotated contigs contained a total of 3266 open reading frames (orfs) longer than 300 bp. of these, 92.6% (e-value < 10 -3 ) aligned to the closely related p. 
vampyrus whole genome sequence and represent highly divergent homologues or bat specific genes. the remaining loci represent either misassembled contigs or bat-specific transcripts that are located in sequencing gaps in the low coverage p. vampyrus genome sequence. the 3266 long (>300nt) orfs were searched for conserved domains using profile hidden markov models with hmmscan (hmmer v3; http://hmmer.org/) obtained from the pfam database [69] . this identified 345 orfs containing 214 unique domains, including several defensins, antimicrobial peptides and dna-binding domains. searches using domain models from the pfam-b database, identified a further 437 unique, predicted-conserved domains in 733 orfs. a further 2188 orfs remained unannotated. a high proportion of these were rich in cysteine, tryptophan and proline, and prolines frequently appeared in low complexity regions (additional file 4: figure s2a and b) . further characterisation of these unannotated transcripts will provide insight into whether they are functionally significant and in particular whether any unique bat specific transcripts are involved in the antiviral immune response. bats are a highly diverse, species rich group of mammals that have evolved a variety of distinctive characteristics since their divergence from other mammals [70] . despite the central importance of bats in harbouring a variety of viruses with the potential to spillover to other species, very little is known about antiviral immunity in bats. next generation sequencing provides the opportunity to survey genes that are conserved between distantly related species as well as to provide insights into novel adaptations through the identification of previously unidentified transcripts. to identify genes involved in the immune response, we carried out a transcriptome analysis of thymus and immune cells and tissues of the australian black flying fox, p. alecto. 
this study represents the first survey of expressed bat immune genes and complements existing low coverage bat genome sequences. our analysis provides a broad overview of the bat transcriptome and contains representatives of all of the major classes of immune genes. the results are consistent with bats having all of the components of the immune system present in other mammals. the majority of these correspond to genes that have not previously been described in any species of bat and thus represent an important resource for future investigations into antiviral immunity in bats.

animals

p. alecto bats used in this study were wild caught from east brisbane, queensland, australia. bats were handled and euthanised as previously described [15]. all experiments were approved by the australian animal health laboratories animal ethics committee (protocol aec1281). the thymus was removed from a juvenile male bat and immediately stored in rnalater (ambion) for rna extraction. the spleen, lymph nodes (ln), thymus, bone marrow and peripheral blood were collected from one adult male and one pregnant female bat. single cells were extracted from the spleen, thymus and ln by tissue extrusion through a 70 µm sterile sieve (bd biosciences) in the presence of dmem supplemented with 15 mm l-glutamine, 100 units/ml penicillin and 100 units/ml streptomycin (invitrogen). splenocytes and peripheral blood lymphocytes (pbmls) were isolated by density centrifugation over lymphoprep (nycomed, oslo) as described previously [21]. cells were resuspended in dmem with 10% fcs, 15 mm l-glutamine, 100 units/ml penicillin and 100 units/ml streptomycin and cell numbers were determined using a haemocytometer with trypan blue exclusion. the thymus and bone marrow cells were stored in rnalater (ambion) for rna extraction and the spleen, thymus and ln were cultured with a variety of stimulants.
the isolated splenocytes, ln and pbmls from each bat were pooled and were then seeded at 1 x 10^7 cells per well in 24 well tissue culture plates (nunc) with pha (10 µg/ml; sigma) and lps (10 µg/ml; sigma); pma (50 µg/ml; sigma) and ionomycin (2nm/ml; sigma); or polyic (30 µg/ml; invivogen) and incubated in a humidified atmosphere of 5% co2 in air at 37°c. cells were harvested in rlt buffer (qiagen) at 0, 1, 4 and 18 hours and homogenised using a qiashredder (qiagen) following the manufacturer's instructions. the lysate was then stored at -80°c and total rna extracted the next day (0, 1, and 4 hours) or processed immediately (18 hours). rna extraction was carried out as previously described using the rneasy mini kit (qiagen) with removal of genomic dna with dnase i digestion [16]. total rna from the thymus of a juvenile male bat was used for illumina sequencing separately from all other samples. total rna obtained from the stimulated and unstimulated cells from the two adult bats was pooled as follows: 22% thymus total rna (11% from each bat) and 78% pooled total rna from the rest of the mitogen stimulated and unstimulated cells/tissues (~3.45% for each sample; total of 22 samples).

sequencing

mrna isolation from total rna, library preparation and single-end read sequencing were performed by geneworks pty ltd, thebarton, south australia using the illumina genome analyser iix sequencing platform. library preparation was performed as per illumina's mrna sequencing sample preparation guide (part # 1004898 rev. d) except 5 µg of total rna was used for mrna selection using poly-t oligo-attached magnetic beads. the thymus library was run on a single lane of a flow cell resulting in more than 12.5 million 65-base sequences for a total of about 0.82 gigabases (gb) of sequence. the pooled library consisted of 4 lanes resulting in 24 million 76 bp sequences for a total of about 1.8 gb of sequence.
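the sequencing yields quoted above follow from multiplying the number of reads by the read length. the short sketch below reproduces that arithmetic; the helper name `yield_gb` is hypothetical, and because the thymus read count is reported only as "more than 12.5 million", the thymus figure computed here is a lower bound.

```python
def yield_gb(n_reads, read_length):
    """Total sequence yield in gigabases for single-end reads."""
    return n_reads * read_length / 1e9


# Read counts and read lengths as reported in the text.
thymus_gb = yield_gb(12.5e6, 65)  # lower bound: "more than 12.5 million" reads
pooled_gb = yield_gb(24e6, 76)

print(f"thymus: at least {thymus_gb:.2f} gb")  # thymus: at least 0.81 gb
print(f"pooled: about {pooled_gb:.2f} gb")     # pooled: about 1.82 gb
```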
sequence pre-processing and de novo assembly

the quality of the sequences was evaluated using fastqc [71]. sequences were pre-processed in two stages. first, all bases at the 3' end of the reads with quality scores of 3 or lower were removed. second, poly a/t tails, uninformative sequences (ns) and primer/adaptor contaminants were trimmed using snowhite (version 1.1.3) [72], a cleaning pipeline for next-generation cdna sequences, which includes seqclean [73] and tagdust [74]. we ran snowhite with two runs of seqclean and one run of tagdust, and a final minimal length cutoff of 20 bp was used. the pre-processed sequences were de novo assembled using two different approaches. (1) the reads were assembled with velvet (version 1.0.12) [75] using individual kmers from 19 bp to 31 bp. next, the contigs produced by velvet were processed using oases (version 0.1.15) [76]. oases loci were then merged using cd-hit-est (version 4.0) [77] with a global sequence identity threshold of 1.0. finally, a length cutoff was set to 50 bp and the default coverage cutoff of 3 was used. we term the final result of this process a contig. (2) the reads were also assembled using mira 3 (v3.2.0rc3) [78] with default settings for est and illumina read assembly, i.e. maximum front and end gap clip of 2 bp, maximum length of the possible vector leftover allowed of 18 bp, minimum quality score, window length and read length all set to 20, clipping of poly a/t at ends allowed, and a minimum read coverage per contig of 2.

the bat contigs were first annotated using the best hits of blastx [79] searches against the nr protein database and the kegg pathway database with an e-value cutoff of 0.001, to annotate the protein coding contigs that were conserved with other species. the unannotated contigs were then further annotated using blastn searches against the refseq_rna database with an e-value cutoff of 10^-5, to capture contigs containing conserved utrs but without significant protein coding regions.
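the first pre-processing stage described above, removal of low-quality bases from the 3' end of each read, can be illustrated as follows. this is a minimal sketch under one stated assumption: trimming proceeds from the 3' end and stops at the first base whose quality exceeds the threshold. the function name `trim_3prime` is hypothetical and this is not the code used in the study.

```python
def trim_3prime(seq, quals, threshold=3):
    """Remove trailing bases whose quality score is <= threshold.

    seq   -- read sequence as a string
    quals -- list of integer quality scores, one per base
    Trimming stops at the first base (scanning from the 3' end)
    whose quality is above the threshold.
    """
    end = len(seq)
    while end > 0 and quals[end - 1] <= threshold:
        end -= 1
    return seq[:end], quals[:end]


# Toy read: the last two bases have quality <= 3 and are removed; the
# low-quality base in the middle is kept because trimming is 3'-only.
read, scores = trim_3prime("ACGTAC", [30, 30, 2, 30, 3, 1])
print(read, scores)
```

reads shortened below the minimal length cutoff (20 bp in the text) would then be discarded in a later filtering step.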
the contigs not annotated by the above two steps were further analysed using blastn against the cdnas from megabat (p. vampyrus) and microbat (m. lucifugus). we translated the unannotated transcripts into protein sequences in all 6 frames and extracted the orfs longer than 300 bp. this was performed separately for the 2 datasets. these orfs were searched against the pfam-a and pfam-b databases to identify conserved domains. the two sets of long orfs were pooled and clustered using cd-hit with a sequence identity of 50% [77]. the amino acid compositions of the nonredundant long orfs were further analysed with composition profiler [80].

all the kegg ids of the human proteins identified by the blastx searches were extracted from the annotation process and were mapped to uniprot ids. the go annotation for the uniprot proteins (uniprotkb-goa: gene_association.goa_human) was then used to assign the go terms for the transcripts. the number of genes in categories of the go slim database was counted using the go term classification counter, categoriser [81, 82] and the immune category of the bat genes was annotated using the generic gene ontology term mapper [83]. the go classifications were further grouped into twelve broad categories as follows: cell death and apoptosis go:0005783; extracellular matrix go:0005739; cell, go:0005623 and nucleoplasm, go:0005654). binding (binding, go:0005488; protein binding, go:0005515 go:0003677; nucleotide binding go:0004872; actin binding, go:000377; calcium ion binding, go:0005509; chromatin binding, go:0003682; carbohydrate binding, go:0030246 and rna binding, go:0003723). reproduction and development (development enzyme activity (catalytic activity go:0016787 ca), based on the protein alignment to retain codon positions.
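the six-frame translation and extraction of orfs longer than 300 bp described earlier in this section can be sketched as below. the sketch makes two assumptions that the text does not state: the standard genetic code is used, and an orf is taken to be any stop-free stretch between stop codons, with no start codon required. the function names are hypothetical and this is not the pipeline used in the study.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons; unknown codons (e.g. containing N) become X."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def long_orfs(dna, min_nt=300):
    """Return (strand, frame, protein) for stop-free stretches longer than min_nt."""
    dna = dna.upper()
    orfs = []
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for frame in range(3):
            # Stop codons are translated to "*", so splitting on "*"
            # yields the stop-free segments of this reading frame.
            for segment in translate(seq[frame:]).split("*"):
                if len(segment) * 3 > min_nt:
                    orfs.append((strand, frame, segment))
    return orfs


# Toy transcript: 101 alanine codons followed by a stop codon.
toy = "GCT" * 101 + "TAA"
print(len(long_orfs(toy)), "stop-free stretches longer than 300 nt")
```

requiring an initiating methionine, or reporting only the longest orf per transcript, would be straightforward variations on the same loop.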
based on the nucleotide and protein alignments, phylogenetic trees were constructed by the neighbour joining method [85], maximum parsimony and minimum evolution using the mega4 program the genbank accession numbers for sequences used in the sequence and phylogenetic analysis are as follows: mhc class i: (cap58485) hla-a; mhc class iia: human, homo sapiens (hosa) dma (nm_006120) dqa (m21931), dra (u13648) dra (nm_001113706), dma (nm_001004039) gallus gallus (gaga) b-la (ay357253) mhc class iib: hosa dqb (m20432), drb (nm_021983), dob (l29472), dpb (m57466), dmb (u15085) rattus norvegicus (rano) dqb (x56596) equus caballus (eqca) dqb (l33910) dmb (dq431246), drb (ay191776) ovis aries (ovar) dqb (l08792) mumu (nm_001136068) hosa (af135187_1) avian mx: gaga (np_989940) mammal species of the world: a taxonomic and geographic reference in the timetree of life human ebola outbreak resulting from direct exposure to fruit bats in luebo, democratic republic of congo swanepoel r: fruit bats as reservoirs of ebola virus studies of arthropod-borne virus infections in chiroptera. iv. the immune response of the big brown bat (eptesicus f. 
fuscus) maintained at various environmental temperatures to experimental japanese b encephalitis virus infection experimental inoculation of plants and animals with ebola virus transmission studies of hendra virus (equine morbillivirus) in fruit bats, horses and cats experimental hendra virus infection in pregnant guinea-pigs and fruit bats (pteropus poliocephalus) bats: important reservoir hosts of emerging viruses bats as a continuing source of emerging infections in humans experimental nipah virus infection in pteropid bats (pteropus poliocephalus) australian bat lyssavirus infection in a captive juvenile black flying fox pathogenesis studies with australian bat lyssavirus in grey-headed flying foxes (pteropus poliocephalus) a phylogenetic supertree of the bats (mammalia: chiroptera) establishment, immortalisation and characterisation of pteropid bat cell lines immunoglobulin heavy chain diversity in pteropid bats: evidence for a diverse and highly specific antigen binding repertoire molecular characterisation of rigi-like helicases in the black flying fox, pteropus alecto molecular characterisation of toll-like receptors in the black flying fox pteropus alecto interferon production and signaling pathways are antagonized during henipavirus infection of fruit bat cell lines type iii ifn receptor expression and functional characterisation in the pteropid bat, pteropus alecto type iii ifns in pteropid bats: differential expression patterns provide evidence for distinct roles in antiviral immunity complete mitochondrial genome of a neotropical fruit bat, artibeus jamaicensis; and a new hypothesis of the relationships of bats to other eutherian mammals parallel adaptive radiations in two major clades of placental mammals molecular phylogenetics and the origins of placental mammals resolution of the early placental mammal radiation using bayesian phylogenetics molecules consolidate the placental mammal tree pegasoferae, an unexpected mammalian clade revealed by 
tracking ancient retroposon insertions the phylogenetic position of the talpidae within eutheria based on analysis of complete mitochondrial sequences monophyletic origin of the order chiroptera and its phylogenetic position among mammalia, as inferred from the complete sequence of the mitochondrial dna of a japanese megabat, the ryukyu flying fox (pteropus dasymallus) maximum likelihood analysis of the complete mitochondrial genomes of eutherians and a reevaluation of the phylogeny of bats and insectivores confirming the phylogeny of mammals by use of large comparative sequence data sets characterization and phylogenetic utility of the mammalian protamine p1 gene differential roles of mda5 and rig-i helicases in the recognition of rna viruses shared and unique functions of the dexd/h-box helicases rig-i, mda5, and lgp2 in antiviral innate immunity function of nod-like receptors in microbial recognition and host defense regulation of immune pathways by the nod-like receptor nlrc5 nlrp3 inflammasome activation: the convergence of multiple signalling pathways on ros production? interferon-inducible antiviral effectors transgenic mice with intracellular immunity to influenza virus the mx gtpase family of interferon-induced antiviral proteins. 
microbes and infection the interferon-induced mx protein of chickens lacks antiviral activity stalk domain of the dynamin-like mxa gtpase protein mediates membrane binding and liposome tubulation via the unstructured l4 loop structural basis of oligomerization in the stalk region of dynamin-like mxa transport of the murine mx protein into the nucleus is dependent on a basic carboxy-terminal sequence natural killer cell receptors in cattle: a bovine killer cell immunoglobulin-like receptor multigene family contains members with divergent signaling motifs comparative genomics of natural killer cell receptor gene clusters characterization of the opossum immune genome provides insights into the evolution of the mammalian immune system the leukocyte receptor complex in chicken is characterized by massive expansion and diversification of immunoglobulin-like loci the chicken leukocyte receptor complex encodes a family of different affinity fcy receptors the ever-expanding ly49 gene family: repertoire and signaling natural killer cell receptors in the horse: evidence for the existence of multiple transcribed ly49 genes identification of natural killer cell receptor clusters in the platypus genome reveals an expansion of c-type lectin genes the phylogenetic origins of natural killer receptors and recognition: relationships, possibilities, and realities nk gene complex dynamics and selection for nk cell receptors signaling pathways engaged by nk cell receptors: double concerto for activating receptors, inhibitory receptors and nk cells of mice and men: different functions of the murine and human 2b4 (cd244) receptor on nk cells biology and functions of human leukocyte antigen-g in health and sickness* nomenclature for factors of the hla system three-dimensional structure of the human class ii histocompatibility antigen hla-dr1 sequence organisation of the class ii region of the human mhc evolutionary relationships of class ii majorhistocompatibility-complex genes in mammals 
class ii drb polymorphism and sequence diversity in two vesper bats in the genus myotis non-neutral evolution of the major histocompatibility complex class ii gene drb1 in the sac-winged bat saccopteryx bilineata mhc class ii drb diversity, selection pattern and population structure in a neotropical bat species, noctilio albiventris prominence of gamma delta t cells in the ruminant immune system characterization of avian t-cell receptor γ genes the two suborders of chiropterans have the canonical heavy-chain immunoglobulin (ig) gene repertoire of eutherian mammals the pfam protein families database a molecular phylogeny for bats illuminates biogeography and the fossil record tagdust-a program to eliminate artifacts from next generation sequencing data velvet: algorithms for de novo short read assembly using de bruijn graphs oases: de novo transcriptome assembler for very short reads cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences using the miraest assembler for reliable and automated mrna transcript assembly and snp detection in sequenced ests gapped blast and psi-blast: a new generation of protein database search programs composition profiler: a tool for discovery and visualization of amino acid composition differences categorizer: a web-based program to batch analyze gene ontology classification categories generic gene ontology (go) term mapper multiple sequence alignment using clustalw and clustalx the neighbor-joining method: a new method for reconstructing phylogenetic trees mega4: molecular evolutionary genetics analysis (mega) software version 4.0 statistics of local complexity in amino acid sequences and sequence databases accuracy of protein flexibility predictions the universal protein resource (uniprot)
we thank craig smith, carol de jong, deborah middleton
low complexity regions in protein sequences were detected with the seg program with default parameters [87]. the transcription of a bat mhc class i gene was examined using quantitative pcr (qpcr) as described previously [18]. briefly, total rna was prepared from lymph node, spleen, liver, lung, heart, kidney, small intestine, brain and salivary glands using the rneasy mini kit (qiagen) as described above. cdna was generated using a quantitect reverse transcription kit for rt-pcr (qiagen). qpcr primers were designed using primer express 3.0 (applied biosystems) with default parameter settings (5'-acgactcctattccccaggatag-f and 5'-gaaagccactggtacctgtgaga-r). reactions were carried out using express sybr greener qpcr supermix universal (invitrogen) and an applied biosystems 7500 fast real-time qpcr instrument.
additional file 1: table s1. summary of additive multiple-kmer velvet/oases/mira3 assembly.
additional file 2: figure s1. overview of the bat transcriptome. the distribution of 178,554 and 285,268 transcriptome sequences that have mapped to human orthologues from p. alecto thymus and pooled tissue datasets based on go slim terms. sequences within the three areas of gene ontology: molecular function, biological process and cellular component are further divided into subgroups at the go slim level.
additional file 3: sequences of all genes described in the manuscript.
additional file 4: figure 2. amino acid composition of large unannotated orfs. the horizontal axis shows amino acids sorted by flexibility index [88]. a. amino acid composition of 1656 large unannotated non-redundant orfs relative to proteins in the swissprot database [89]. the amino acids trp, cys and pro have twice the abundance in unannotated orfs compared to swissprot proteins. b.
amino acid composition of 1195 low complexity regions in unannotated orfs relative to 1656 unannotated non-redundant orfs. prolines are abundant in low complexity regions, but trp and cys are not.
the authors declare that they have no competing interests.
key: cord-014462-11ggaqf1 authors: nan title: abstracts of the papers presented in the xix national conference of indian virological society, “recent trends in viral disease problems and management”, on 18–20 march, 2010, at s.v. university, tirupati, andhra pradesh date: 2011-04-21 journal: indian j virol doi: 10.1007/s13337-011-0027-2 sha: doc_id: 14462 cord_uid: 11ggaqf1 nan
patients showed rashes on face, hand and foot. ev detection was carried out in vesicular fluid, stool, serum and throat swab specimens by rt-pcr of the 5' ncr. serotyping was carried out by rt-pcr of the vp1/2a junction region followed by sequencing and phylogenetic analysis using the neighbor-joining algorithm and the kimura 2-parameter model in mega-4 software. overall ev positivity detected in hfmd patients from kerala, tamil nadu, west bengal and orissa states was found to be 51.6%, 66.6%, 62.5% and 71.4% respectively. typing of vp1 gene sequences indicated the presence of ca-6, ev-71 and echo-9 strains in kerala and ca-16 in west bengal, orissa and tamil nadu. phylogenetic analysis indicated that ca-6, ev-71 and echo-9 strains showed 94.8-95.7% and 95-94.4% homology with japanese, australian and french strains. however, ca-16 strains were closer to malaysian strains with 91.2-95.6% nucleotide homology. the present study documents the association of multiple types of ev, i.e., ca-6, ev-71, echo-9 and ca-16 strains, as prime viral pathogens in hfmd epidemics in the reported regions, with the new emergence of a circulating ca-6 strain in kerala, india.
tasgaon september 2010.
sera were collected from 162 suspected hepatitis cases and their contacts and tested for anti-hev igm/igg antibodies (elisa) and liver enzymes such as alanine aminotransferase (alt). anti-hev igm antibodies were detected in 45.7% (74/162) of the suspected cases. the overall attack rate was 0.7%. the male to female ratio was 2:1. the majority (60.4%) of the cases were in the age group 20-40 years and recovered without any clinical complications. weekly distribution of cases showed that the majority (79.4%, 116/146) of cases occurred between the 2nd and 3rd week of june. dark urine (97.5%), jaundice (93.5%), fatigue (35.9%), abdominal pain (32.6%), anorexia (29.4%), vomiting (26.5%), fever (22.8%), giddiness (14.3%), diarrhoea (12.6%) and arthralgia (3.7%) were the prominent symptoms. sera collected from 73 antenatal cases (ancs) showed anti-hev igm antibody in 3. affected pregnant women had a normal outcome. one death was reported during the outbreak period, in a 32-year-old male hepatitis e case who had cirrhosis of the liver with oesophageal varices. a sanitary survey revealed that water pipelines were laid in close proximity to the sewerage system, and water posts were without taps. these are the likely sources of faecal contamination of water supplies. among 17 water samples collected from various places, 5 were found to be unfit for drinking based on the routine bacteriological tests conducted at the state public health laboratory, pune. no case occurred after the pipelines were repaired. this typical outbreak of hepatitis e re-emphasizes the need for proper water supply/sewage disposal pipelines and adequate maintenance measures.
jayanthi shastri, nilima vaidya, sandhya sawant, umesh aigal department of molecular biology, kasturba hospital for infectious diseases, mumbai, india
dengue and dengue haemorrhagic fever are amongst the most important challenges in tropical diseases due to their expanding geographic distribution, increasing outbreak frequency, hyperendemicity and evolution of virulence.
the global prevalence of dengue has grown dramatically in recent decades. who estimates 50-100 million cases of dengue virus infection worldwide every year, resulting in 250,000 to 500,000 cases of dhf and 24,000 deaths each year. public health laboratories require rapid diagnosis of dengue outbreaks for application of measures such as vector control. laboratory diagnosis of dengue virus infection can be made by the detection of specific virus, viral antigen, genomic sequence and/or antibodies. currently, the 3 basic methods used by laboratories for diagnosis of dengue virus infection are virus isolation and characterisation, detection of genomic sequence by a nucleic acid amplification technology assay, and detection of dengue virus specific antibodies/antigen. molecular diagnosis based on reverse transcription (rt)-pcr, such as one-step or nested pcr, nucleic acid sequence based amplification (nasba), or real-time rt-pcr, has gradually replaced the virus isolation method as the new standard for the detection of dengue virus in acute phase serum samples. several pcr protocols for detection have been described that vary in the extraction method, genomic location of primers, specificity, sensitivity and the methods to determine the products and the serotype. pcr-based dengue tests, due to the specificity of amplification, enable a definitive diagnosis and serotyping of the virus. in addition, dna sequencing of the amplification product enables the virus to be genotyped, providing important information on the sources of infection. more recently, tests have incorporated fluorogenic probes, the so-called taqman technology, for the specific real-time detection of dengue 1-4 amplicons. product is detected by a specific oligodeoxynucleotide probe that is labelled with 6-carboxyfluorescein (fam). this technology offers several advantages. first, it is both rapid and potentially quantitative. second, the detection of product by hybridisation of fluorochrome-labelled probes increases specificity.
third, as the product is detected without the need to open the reaction tube, the risk of contamination by product carry-over is minimised. the advantages of speed, contamination minimisation and reduced turnaround time justify application of this assay over the currently used nested pcr assay. during the period january 2007 to october 2009, the molecular laboratory received 900 samples from patients presenting with acute onset fever for dengue .6%) samples were tested positive by this method. the disease peaks in the monsoon season with a percentage of 17.5%. rapid tests and igm and igg capture elisa are popularly used tests for diagnosis of dengue infection. their utility is limited for diagnosing dengue in convalescence (8-14 days). specificity is also compromised by infections with other viruses such as japanese encephalitis virus (a flavivirus) and chikungunya virus. dengue ns1 ag elisa, with its cost effectiveness, specificity and sensitivity, should be considered the test of choice for diagnosing dengue in the acute phase of illness in developing countries. molecular diagnosis enables confirmatory diagnosis of dengue in the acute phase of the illness and is suitable for further typing methods.
a2 indian j. virol. (september 2010) 21(suppl. 1):a1-a58
assistant general manager and r&d coordinator, division of quality control and r&d, bharat immunologicals and biologicals corporation ltd., village chola, bulandshahr, up
vaccine development in india, though slow to start, has progressed by leaps and bounds in the past 60 years. the country was once dependent on imported vaccines, but it is now not only self-sufficient in the production of vaccines conforming to international standards but also a major supplier of them to unicef. the role of drug authorities is to enhance public health by assuring the availability of safe and effective vaccines, allergenic extracts, and other related products. vaccine development is tightly regulated by a hierarchy of regulatory bodies.
guidelines provided by the indian council of medical research (icmr) set the rules of conduct for clinical trials from phase i to iv studies as well as studies on combination vaccines. these guidelines address ethical issues that arise during a vaccine study. a network of adverse drug reaction (adr) monitoring centers, along with the adverse events following immunization (aefi) monitoring program, provides the machinery for vaccine pharmacovigilance. recombinant technology has been used to develop effective and cheaper genetically modified vaccines. to ensure the safety of consumers, producers, experimental animals and the environment, governments all over the world are following regulatory mechanisms and guidelines for genetically modified products. as with other industrializing countries undergoing rapid shifts, india clearly recognizes the need to restructure its regulatory system so that its biopharmaceutical industry can compete in international markets. the genetic engineering approval council (geac), recombinant dna advisory committee (rdac), review committee on genetic manipulation (rcgm) and institutional biosafety committees (ibsc) are responsible for development, commitment for parameters and commercialization of recombinant vaccines. to centralize and coordinate the whole system, the government has taken two initiatives to regulate the development of recombinant pharmaceutical products, including vaccines. the first is the creation of the national biotechnology regulatory authority (nbra), under the department of biotechnology (dbt), as part of india's long-term biotech sector development strategy. the second major initiative will affect the entire indian pharmaceutical industry. this is the replacement of most state, district, and central drug regulatory agencies with a single, central, fda-style agency, the central drug authority (cda).
the cda is expected to have separate, semi-autonomous departments for regulation, enforcement, legal, and consumer affairs; biotechnology products; pharmacovigilance and drug safety; medical devices and diagnostics; imports; quality control; and traditional indian medicines. it will set up offices throughout india and will be funded by inspection, registration, and license fees. its enforcement powers will be strengthened by a new law increasing the criminal penalties for illegal clinical trials. in the manufacturing area, though, the country has been tightening the rules and enforcement. an amendment to the regulations, "schedule m" of the drugs and cosmetics act, now specifies the good manufacturing practice (gmp) requirements for factory premises and materials. these requirements were modeled after us fda regulations, to improve regulatory coordination between indian and us regulators. india has realized the importance of regulation in the pharmaceutical field, especially for vaccines, but it will take several years to implement these measures. india has coordinated some of its regulatory functions with western organizations. the us pharmacopoeia established an office in hyderabad in 2007. a representative of the indian pharmaceutical lobby has also recently expressed openness to an expansion of the fda's oversight of indian manufacturing. as india expands its global drug and biologicals production, the us and europe, as the world's largest drug importers, will likely expand their regulatory support in the development of the country's regulatory systems.
rapid diagnosis of japanese encephalitis virus (jev) infections is important for timely clinical management and epidemiological control in areas where multiple flaviviruses are endemic. however, the speed and accuracy of diagnosis must be balanced against test cost and availability, especially in developing countries.
an antigen capture enzyme-linked immunosorbent assay (elisa) for detection of circulating jev specific nonstructural protein 1 (ns1) was developed by using monoclonal antibodies (mabs) specific to recombinant ns1. the applicability of this jev ns1 antigen capture elisa for early clinical diagnosis was evaluated with 200 acute phase serum/cerebrospinal fluid (csf) specimens collected from different epidemics during 2007-2009. jev ns1 antigen was detected in circulation from day 1 to 18. the sensitivity and specificity of jev ns1 detection in serum/csf specimens with reference to reverse transcriptase pcr were 82% and 98.9%, respectively. no cross-reactions with any of the other closely related members of the genus flavivirus (dengue, west nile, yellow fever and saint louis encephalitis (sle) viruses) were observed when tested with either clinical specimens or virus cultures. these findings suggested that the reported jev specific mab-based ns1 antigen capture elisa will be a rapid and reliable tool for early confirmatory diagnosis as well as surveillance of je infections in developing countries.
manmohan parida
the recent emergence of a novel human influenza a virus (h1n1) poses a serious global health threat. the h1n1 virus has caused a considerable number of deaths within a short duration since its emergence. a two-step, single-tube, accelerated, real-time and quantitative swine flu virus specific h1 rtlamp assay is reported, targeting the h1 gene of the novel h1n1 hybrid virus. the feasibility of swine flu h1 rtlamp for clinical diagnosis was validated with a panel of 239 suspected throat wash samples comprising 116 confirmed positive and 123 confirmed negative cases from the ongoing epidemic. the comparative evaluation of the h1 specific rtlamp assay with real-time rt-pcr demonstrated exceptionally higher sensitivity, picking up all 116 h1n1 positive cases and 36 additional positive cases amongst the negatives that were sequence confirmed as h1n1.
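the rtlamp counts quoted above can be turned into sensitivity and specificity figures. the following is a minimal sketch in python, not the authors' analysis: the class `TwoByTwo` and the variable names are hypothetical, and the two "views" are assumptions about which test is treated as the reference standard. in particular, the second view assumes that the 87 samples negative by both assays are true negatives, which the abstract does not state.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TwoByTwo:
    """Counts of an index test cross-tabulated against a reference standard."""
    tp: int  # index positive, reference positive
    fp: int  # index positive, reference negative
    fn: int  # index negative, reference positive
    tn: int  # index negative, reference negative

    def sensitivity(self) -> float:
        return self.tp / (self.tp + self.fn)

    def specificity(self) -> float:
        return self.tn / (self.tn + self.fp)


# Counts taken from the abstract: 239 samples, 116 real-time RT-PCR positive,
# and 36 further samples detected only by RT-LAMP (sequence confirmed).
TOTAL, PCR_POS, LAMP_EXTRA = 239, 116, 36

# View 1 (assumption): real-time RT-PCR is the reference standard,
# so the 36 extra RT-LAMP detections count as "false positives".
lamp_vs_pcr = TwoByTwo(tp=PCR_POS, fp=LAMP_EXTRA, fn=0,
                       tn=TOTAL - PCR_POS - LAMP_EXTRA)

# View 2 (assumption): the sequence-confirmed positives (116 + 36) are the
# reference, and the remaining samples are treated as true negatives.
TRUE_POS = PCR_POS + LAMP_EXTRA
pcr_vs_confirmed = TwoByTwo(tp=PCR_POS, fp=0, fn=LAMP_EXTRA, tn=TOTAL - TRUE_POS)
lamp_vs_confirmed = TwoByTwo(tp=TRUE_POS, fp=0, fn=0, tn=TOTAL - TRUE_POS)

print(f"rt-lamp vs rt-pcr:   sens {lamp_vs_pcr.sensitivity():.1%}, "
      f"spec {lamp_vs_pcr.specificity():.1%}")
print(f"rt-pcr vs confirmed: sens {pcr_vs_confirmed.sensitivity():.1%}")
print(f"rt-lamp vs confirmed: sens {lamp_vs_confirmed.sensitivity():.1%}, "
      f"spec {lamp_vs_confirmed.specificity():.1%}")
```

the contrast between the two views shows why the choice of reference matters: detections missed by a less sensitive reference assay look like false positives of the index test until an independent confirmation, here sequencing, is used.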
none of the real-time rtpcr positive samples were missed by the rtlamp system. the comparative study revealed that rtlamp was 100-fold more sensitive than rtpcr, with a detection limit of 1 copy. these findings suggested that the rtlamp assay is a valuable tool for rapid, real-time detection as well as quantification of h1n1 virus in acute phase throat swab samples without requiring any sophisticated equipment.
because of its recurrent nature. despite considerable progress in understanding of the virus at cellular and molecular levels, the proper management of the disease in its different stages is still a dilemma, particularly whether to use antivirals or steroids or both. the risk of using steroids, with their attendant complications, has to be weighed against the risk of progression of the disease if steroids are avoided. this dilemma can be reduced to a considerable extent if basic principles of virology and pathogenesis are kept in mind. this article reviews current concepts of virological and clinical aspects of hsv keratitis to enable a broad understanding of the disease process. several influential host factors are recognized, including the fact that hsk is more common in men than women. the ability of hsv to establish latent infection in sensory neurons, and possibly the cornea, has been observed, but this knowledge has not yet been used to prevent the disease. acknowledging these limitations may further stimulate application of laboratory knowledge in coping with hsk, which continues to present a major challenge in terms of management.
mvo-10 study on effect of human bhsp90 in immunity of hcv core protein and hbv hbsag
there are more than 500 million individuals with hepatitis b and c in the world. in spite of vaccination in different areas, several reports concern patients who had previously been vaccinated.
also, there is no efficient vaccine against hepatitis c, and one of the important problems in vaccine projects is the development of effective and suitable adjuvants for human vaccines. in the present research we applied human bhsp90 protein as adjuvant and chaperone. this protein was injected into balb/c mice as adjuvant together with recombinant hcv core and hbv hbsag proteins, and the humoral and cellular immune responses of the mice were then studied. core and hbsag genes were cloned into the petduet-1 vector, and the thermal vector pgp1-2 was used for expression of human heat shock protein 90. different combinations of these three proteins were injected into mice, and we evaluated total igg and igg2a in mouse sera after a week. two weeks after the booster injection, we studied the proliferation and cytokine secretion of lymphocytes from the spleen and the inguinal and popliteal lymph nodes under in vitro and ex vivo conditions. the core/hbsag + hsp and core + hbsag + hsp complexes induced total igg and igg2a secretion. spleen lymphocyte proliferation increased in line with the serum igg2a level, which remained constant at the second bleeding and differed significantly from that seen with complexes containing freund's adjuvant. il-4 and il-5 cytokines increased at first, and the subsequent decrease in il-4 indicated no hypersensitivity. the chaperone effect of hsp90 on the structure of core and hbsag proteins was studied by cd and fluorometry. it could fold the proteins after heating and unfolding.
hepatitis b virus (hbv) infection is a vaccine-preventable global public health problem. all commercially available vaccines contain one or more of the recombinant hepatitis b envelope proteins, or surface antigen (hbsag). measurement of the antigen responsible for the immunogenicity of the vaccine is central to quality assessment. the problems associated with the use of a polyclonal antibody in an assay, namely its poorly defined nature and batch-to-batch variation, have been mitigated by the use of mabs as described in this paper.
the initial capture of hbsag by the mab could orientate it such that the same antibody, after labeling, could bind to it as a detection antibody without steric hindrance. the development of an immuno-capture elisa (ic-elisa) to measure the hbsag content of experimental vaccine formulations, using a monoclonal antibody (mab) specific to determinant "a" of hbsag, is discussed. murine mabs developed against hbsag subtype adw2 were found to cross-react with the other subtypes, viz. ad and ay. after characterization of the mabs, one mab, hbs06, was chosen for developing an ic-elisa format for the quantification of hbsag in the final algel-adsorbed vaccines. the unadsorbed hbsag was used to establish the standard curve of hbsag/a. the elisa had a sensitivity of 10 ng/ml of hbsag. the recovery rate of hbsag/a was found to be around 70% in the vaccines treated to desorb the antigen from algel. twenty-seven experimental batches of monovalent hepatitis b vaccines were analyzed for hbsag content, both by ic-elisa and by a commercial kit (axsym kit, abbott laboratories, usa). the statistical analysis of ic-elisa results indicated that an experimental equation, f(x) = 0.0062x + 0.184, could precisely estimate the amount of hbsag in the adsorbed vaccines. the amounts of hbsag recovered from the adsorbed vaccines as estimated by the ic-elisa format correlated well with the estimates derived from the commercial kit, which is being used by several vaccine manufacturers in india for the quality control of vaccine antigen. the varying amounts of vaccine antigen that could be recovered seemed to depend on the quality of the hbsag and the methods of hbsag adsorption to the alum gel during vaccine manufacture.
epidemiology of the spread of h1n1 virus. children of school-going age have become victims of this deadly virus, as is evident from the reporting data generated in the past few weeks. the mortality rate has also increased slightly.
the disease spreads in a wave pattern, and presently the world is passing through the second wave of the pandemic, with greater severity in young and otherwise healthy people and a predilection for the lungs leading to viral pneumonia and respiratory failure. the pandemic has now gained hold in the developing world, which it affects more severely because millions of people live under deprived conditions, have multiple health problems and have little access to basic health care. current data about the pandemic from developed countries need to be watched very closely in relation to shift in virus subtype, shift of the highest death rate to younger populations, successive pandemic waves, higher transmissibility than seasonal influenza, demographic differences, etc. presently the world appears to be better prepared. vaccine is available on the market in many countries, and vaccine trials are actively going on in the indian population. effective antivirals are available. although until now h1n1 diagnostic centers have worked with cdc/who recommended h1n1 specific primers and probes with taqman chemistry by real-time pcr, efforts on the development of indigenous diagnostics, vaccines and chemoprophylaxis are going on to better combat this highly infectious virus.
were positive for rotavirus infection by either page or elisa methods. the available data highlight the importance of rotavirus as a cause of diarrhea in children that is severe enough to deserve specialized care. the observed proportion of 25.5% of all diarrhea cases being associated with rotavirus falls within the range of values reported by other workers. the reported positivity varies from 10.5 to 70.7%. in our study, complete concordance of elisa and page results was observed in 194 (97%) of the 200 tested specimens. this finding closely correlates with the findings of other authors, who found 96.7-97.14% concordance between elisa and page methods.
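raw percentage agreement such as the 97% above can be complemented with cohen's kappa, which corrects for agreement expected by chance. the sketch below is illustrative only and is not part of the reported study. it reconstructs a 2x2 table from figures given in this abstract: 200 specimens, 194 concordant, and 6 discordant (the abstract goes on to describe 1 elisa-positive/page-negative and 5 page-positive/elisa-negative specimens). the split of the 194 concordant results into 45 double positives and 149 double negatives is an assumption, derived by taking the 25.5% positivity "by either method" to refer to these same 200 specimens (51 positives). the function name `agreement` is hypothetical.

```python
def agreement(both_pos: int, a_only: int, b_only: int, both_neg: int):
    """Return (observed agreement, Cohen's kappa) for two binary tests A and B."""
    n = both_pos + a_only + b_only + both_neg
    observed = (both_pos + both_neg) / n
    a_pos = (both_pos + a_only) / n   # fraction positive by test A
    b_pos = (both_pos + b_only) / n   # fraction positive by test B
    # Agreement expected if A and B were statistically independent.
    expected = a_pos * b_pos + (1 - a_pos) * (1 - b_pos)
    kappa = (observed - expected) / (1 - expected)
    return observed, kappa


# A = ELISA, B = PAGE. Cell counts are reconstructed under the assumption
# stated above (51 of 200 positive by either method), not reported directly.
observed, kappa = agreement(both_pos=45, a_only=1, b_only=5, both_neg=149)
print(f"observed agreement = {observed:.3f}, kappa = {kappa:.3f}")
```

under these assumed counts the kappa value stays above 0.9, so the high concordance is not simply a consequence of most specimens being negative by both methods.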
some authors found the rna-page method to be as sensitive and rapid as elisa for detecting rotavirus in stool samples from cases of diarrhea, while others proposed that elisa is more sensitive than the page method, which was found to be 100% specific. the remaining 6 (3%) samples showed conflicting results. in a lone sample, the od value of the elisa test was 0.195; as this value was almost at the cutoff level, it is doubtful whether this sample was truly positive by elisa. the negative result of the same sample in the page method is difficult to explain; possible reasons are the presence of many empty virus particles, a low concentration of viral rna in the fecal specimen, or insufficient extraction of viral rna. on the other hand, 5 of the samples which gave positive results by the page method were negative by the elisa test. these 5 samples had a typical 4-2-3-2 rna pattern. the reason for their being elisa negative thus remains unexplained; however, blocking factors or the presence of inhibitory substances in stools might have been responsible. samples containing predominantly complete particles can also give false negative results, since the group antigen is not exposed. earlier studies have also reported page to be the most sensitive technique, although some are of the view that it is a laborious procedure. however, the page system used in this study was very simple to perform, and the results were available on the same day. the main requirements were trained personnel and proper standardization of the technique. most reports state that the greatest advantages of the page and silver stain method are its lack of ambiguity and the fact that it provides information about viral electropherotypes. the modified page system was thus found to be reliable, simple and rapid, and no expensive reagents were required. locally available reagents from hi media were used. the cost of the chemicals for page per specimen was approximately rs. 24, as compared to rs.
110 per test by confirmatory elisa. a locally produced slab gel electrophoresis system with power pack was the only equipment required. this method could be used for the routine diagnosis of rotavirus infection in the laboratory.
vaccine, rapid diagnosis plays an important role in early management of patients. in this study a qc-rt-pcr assay was developed to quantify chikungunya virus rna by targeting the conserved region of the e1 gene. a competitor molecule containing an internal insertion was generated, which provided a stringent control of the quantification process. the introduction of 10-fold serially diluted competitor in each reaction was further used to determine sensitivity. the applicability of this assay for quantification of chikungunya virus rna was evaluated with human clinical samples, and the results were compared with real-time quantitative rt-pcr. the sensitivity of this assay was estimated to be 100 rna copies per reaction, with a dynamic detection range of 10^2 to 10^10 copies. specificity was confirmed using closely related alphaviruses and flaviviruses. the comparison of qc-rt-pcr results with real-time rt-pcr revealed 100% concordance. these findings demonstrated that the reported assay is a convenient, sensitive and accurate method and is potentially useful for clinical diagnosis owing to simultaneous detection and quantification of chikungunya virus in acute-phase serum samples.
in india, measles vaccine was introduced as part of the expanded programme of immunization in 1985. measles, mumps and rubella (mmr) vaccine is still not part of the national immunization schedule of india. the indian academy of pediatrics (iap) recommends measles vaccine at 9 months of age and mmr vaccine at 15-18 months. however, in a recent policy update, the iap committee on immunisation opined that there is a need for a second dose of mmr vaccine to provide adequate immunity against mmr.
the aim of the present study was to assess the extent of sero-protection against mmr at 4-6 years of age in children who had received one dose of mmr between 12 and 24 months of age. an attempt was also made to assess the sero-response to the second dose of mmr vaccine in 4-6 year old children. a total of 106 consecutive children aged 4-6 years who had received mmr vaccine between 12 and 24 months of age and were attending the immunization clinic of gtb hospital, delhi were enrolled. the vaccination status, anthropometry and physical examination findings were recorded. three ml of venous sample was again withdrawn for estimation of post-vaccination antibody titre. it was observed that 20.39%, 87.38% and 75.73% of children were seroprotected against measles, mumps and rubella, respectively, 2.5-4.5 years after receiving the first dose of mmr vaccine. seroprotection rose to 72.62%, 100% and 100% for measles, mumps and rubella, respectively, 4-6 weeks after the second dose of mmr vaccine. geometric mean concentration of antibody also rose significantly for all three diseases. in view of the low seroprevalence of mmr, and hence the high susceptibility to infection, at 4-6 years of age in children who have already received mmr vaccine, there is a need to boost the immune responses against these three diseases by giving a second dose of mmr vaccine.
baseline information on the epidemiology of viral agents causing stis and the types of risk behaviour of affected persons is essential for any meaningful targeted intervention. the present study documents the pattern of viral stis in patients attending a tertiary care hospital, correlating the syndromic approach and the laboratory investigations to determine the aetiology. three hundred consecutive patients attending the sti clinic were diagnosed and categorized according to the syndromic approach of the who, along with detailed history and demographic data. the majority of the patients were men (53.12%), with a mean age of 24 years. men had received education up to middle school.
half of the female subjects were illiterate. sixty percent of the patients were married, and among these, 19% were regular condom users. first sexual contact at or before 18 years of age was more common in men (31% vs. 22.22% in women). promiscuity was more common among male patients who had contact with csw. genital herpes was the commonest viral sti (86/300), followed by genital warts (60/300). concomitant infection with more than one virus was seen in 35% of patients. hiv was prevalent in 10.3% of sti patients. hepatitis b, hepatitis c, herpes simplex type 1 and molluscum contagiosum were the other viral agents seen in sti clinic attendees at our centre.
this disease is currently prevalent in more than 100 countries worldwide, and annually 50-100 million people are infected with dengue virus, among which 2.5-5 lakh cases are dengue hemorrhagic fever (dhf) and dengue shock syndrome (dss), the serious forms of dengue virus infection. these conditions may cause 25,000 deaths annually worldwide, and approximately 3 million children have been hospitalized over the past 3 decades. the disease is characterized by sudden onset of high fever with severe headache, pain in the back and limbs, lymphadenopathy, maculopapular rash over the skin and retro-bulbar pain. early diagnosis can be established by simple and rapid detection of igg/igm antibodies in the blood samples of patients, based on the bi-directional immunoassay system, for management and control of the disease to reduce morbidity and mortality. details will be presented.
myocarditis and dilated cardiomyopathy (dcm) are common causes of morbidity and mortality in both children and adults. the most common viruses involved in myocarditis are coxsackievirus b and adenovirus. recently, the coxsackievirus and adenovirus receptor (car), a common receptor for coxsackieviruses b3 and b4 and adenoviruses 2 and 5, has been identified.
increased expression of car has been reported in patients with dcm, suggesting utilization of car by these viruses for cell entry. the present study was designed to study the expression of car in myocardial tissue of patients with dcm. formalin fixed myocardial tissues were obtained from autopsy cases. a total of 26 cases of dcm and 20 controls, which included non-cardiac disease (group a) and cardiac disease other than dcm (group b), were included in the study. expression of car was studied by immunohistochemical staining of myocardial tissue using car specific rabbit polyclonal antibody and biotin conjugated secondary antibody. the tissue sections were considered positive when >25% of the cells showed brown color staining by immunohistochemistry (ihc). the car positivity in dcm cases was found to be 96% (25/26) as compared to 30% in control group a and 40% in control group b. the car positivity was significantly higher in the test group as compared to both the control groups. further, car positivity in all the cellular types (myocytes, endothelial cells and interstitial cells) was found to be significantly higher in the test group as compared to both the control groups. the expression of car was significantly higher in myocytes as compared to both endothelial and interstitial cells in all the groups. however, no significant difference was observed in car positivity between endothelial and interstitial cells. the present study highlights the increased expression of car in dcm cases with further significance of car expression in myocytes and endothelial cells. this may help further in understanding the tropism of viruses or cellular susceptibility, which in turn will help in devising appropriate diagnostic and therapeutic approaches in the management of viral myocarditis and dcm cases. food security and safety vary widely around the world, and reaching these goals is one of the major challenges, raising public concern for the wellbeing of mankind, in particular.
industrialized production and processing as well as improper environmental protection have clearly shown severe limitations such as worldwide contamination of the food chain and water. water and food contaminated during production, processing and handling are essentially responsible for food and water borne viral infections/diseases. the cases of viral food borne outbreaks are on the rise, creating a threat to human health. recent research indicates that epidemiological studies focusing on frequently contaminated foods and food borne viral diseases are meagre. the current paper presents the etiology of select food borne viral diseases, probable reasons for the non-availability of appropriate methods to detect the viruses responsible for the diseases, routes of water and food borne transmission of enteric viral infections, currently available methods of detection of select viruses and biosafety measures to prevent food borne viral infections. dietary/nutritional management in food borne viral diseases is crucial to control weakness and gastroenteric intolerance due to the disease condition and antibiotic therapy. it will principally improve food intake, resulting in better nutritional status leading to optimum immune response. food borne viruses mainly belong to the rotaviruses, enteropathogenic viruses, astroviruses, adenoviruses and caliciviruses, and cause acute gastroenteritis (ag), which is an important health problem. the frequency of rotavirus as a cause of sporadic cases of ag ranges between 17.3% and 37.4%. astroviruses cause ag with a frequency ranging between 2% and 26%; outbreaks have been described in schools and kindergartens, but also in adults and the elderly. the frequency of identification of adenoviruses 40 and 41 as causes of sporadic ag in non-immunosuppressed children ranges between 0.7% and 31.5%, although there is probably underreporting because the sensitivity of conventional techniques is low.
caliciviruses are separated phylogenetically into two genera: norovirus and sapovirus. norovirus is frequently associated with food-and water-borne outbreaks of ag. it is estimated that 40% of cases of ag due to norovirus are food borne. in sweden and some regions of the united states, norovirus is the first cause of outbreaks of food borne diseases. sapovirus outbreaks due to person-to-person and food borne transmission affecting both children and adults have recently been reported in countries such as canada and japan. it has been predicted that the importance of diarrhoeal disease, mainly due to contaminated food and water, as a cause of death will decline worldwide. evidence for such a downward trend is limited. this prediction presumes that improvements in the production and retail of microbiologically safe food will be sustained in the developed world and, moreover, will be rolled out to those countries of the developing world increasingly producing food for a global market. sustaining food safety standards will depend on constant vigilance maintained by monitoring and surveillance but, with the rising importance of other food-related issues, such as food security, obesity and climate change, competition for resources in the future to enable this may be fierce. in addition the pathogen populations relevant to food safety are not static. food is an excellent vehicle by which many pathogens (bacteria, viruses/prions and parasites) can reach an appropriate colonization site in a new host. although food production practices change, the well-recognized food-borne pathogens, such as salmonella spp. and escherichia coli, seem able to evolve to exploit novel opportunities, for example fresh produce and even generate new public health challenges, for example antimicrobial resistance. in addition, previously unknown food-borne pathogens, many of which are zoonotic, are constantly emerging. 
awareness and surveillance of viral food-borne pathogens are generally poor but emphasis is placed on norovirus, hepatitis a, rotaviruses and newly emerging viruses such as sars. it is clear that one overall challenge is the generation and maintenance of constructive dialogue and collaboration between public health, veterinary and food safety experts, bringing together multidisciplinary skills and multi-pathogen expertise. such collaboration is essential to monitor changing trends in the well-recognized diseases and detect emerging pathogens. it is also necessary to understand the multiple interactions between these pathogens and their environments during transmission along the food chain in order to develop effective prevention and control strategies. to analyse the effectiveness of these sirnas targeting the rabies virus l gene, bhk-21 cells expressing the sirnas in shrna form were produced by transduction of cells with radv-l. the transduced bhk-21 cells expressing sirna were infected with rabies virus pv-11 strain. rabies virus multiplication was reduced, with the fluorescent foci forming unit (ffu) count falling to 51.85% of the control (70 ffu in bhk-21 cells expressing sirna-l compared to 135 ffu in bhk-21 cells expressing negative sirna). the expression of l gene mrna was reduced 16.11-fold in rabies virus infected radv-l transduced cells compared to radv-neg transduced cells (negative control), as detected using real-time pcr. after analyzing the effectiveness of radv-l in vitro, its effectiveness was also evaluated in vivo in mice after virulent rabies challenge. the mice were inoculated with 10^7 plaque forming units (pfu) of radv-l in the masseter muscle (i/m route) and challenged with 15 ld50 of rabies virus challenge virus standard (cvs) strain. the results indicated 50% protection, with median survival improved from 7 to 11 days compared with the group of mice treated with radv-neg.
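the two ffu counts quoted above can be turned into percentages directly. the following is a minimal sketch in python, added for illustration only and not part of the original study; the function names are hypothetical. with 70 ffu in the sirna-l cells and 135 ffu in the negative control, the treated count is about 51.85% of the control, which corresponds to a reduction of about 48.15%.

```python
def percent_of_control(treated: float, control: float) -> float:
    """Treated count expressed as a percentage of the control count."""
    if control <= 0:
        raise ValueError("control count must be positive")
    return 100.0 * treated / control


def percent_reduction(treated: float, control: float) -> float:
    """Percentage by which the treated count is lower than the control."""
    return 100.0 - percent_of_control(treated, control)


# counts taken from the abstract: sirna-l cells versus negative sirna cells
ffu_sirna_l = 70
ffu_negative = 135

remaining = percent_of_control(ffu_sirna_l, ffu_negative)
reduction = percent_reduction(ffu_sirna_l, ffu_negative)
print(f"remaining: {remaining:.2f}% of control, reduction: {reduction:.2f}%")
```

keeping the two quantities as separate functions makes it explicit whether a reported percentage is the fraction remaining or the fraction removed.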
the results of this study indicated that sirnas targeting rabies virus polymerase (l) gene delivered through adenoviral vector inhibited rabies virus multiplication in vitro and in vivo. and 4 were successfully produced and purified from the infected spodoptera frugiperda (sf-9) cells using these recombinant baculovirus. the morphology of the vlps was validated by electron microscopy in comparison to the authentic bt virions. the vlps produced here were stable and were highly immunogenic with intact outer layer which is rapidly lost during normal infection of btv. these btv-vlps elicited long lasting protective immunity in vaccinated sheep against virulent virus challenge. with the use of btv-vlps it was also possible to differentiate the infected and vaccinated animals (diva). vlp-based btv vaccine has potential advantages with regard to controlling the spread of btv with multiple serotypes. it is possible to produce milligram quantities of correctly folded and processed protein complexes using this baculovirus expression system and hence it is a more promising system for producing new generation vaccines like vlp subunit vaccine against any viral diseases in large scale. peste des petits ruminants (ppr), goatpox and orf are oie notifiable diseases of small ruminants especially goat and sheep. these diseases are economically important, in enzootic countries like india and cause significant loss and are major constraints in the productivity. considering the geographical distribution of ppr, goat pox and orf infections and prevalence of mixed infection, in the present study, safety and potency of the experimental triple vaccine comprising attenuated strains of thermostable-ppr virus (pprv jhansi, p-50) grown at 40°c, high passaged goat poxvirus (gtpv uttarkashi, p100) and attenuated orf virus (orfv mukteswar, p51) was evaluated in sub-himalayan local hill goats. 
goats simultaneously immunized with 1 ml of vaccine consisting of either 10^3 tcid50 or 10^5 tcid50 of each of pprv, gtpv and orfv were monitored for clinical and serological responses for a period of 3-4 weeks post-immunization (pi) and post challenge (pc). specific immune responses, i.e., antibodies directed to pprv, gtpv and orfv, could be demonstrated by ppr competitive elisa kit, capripox indirect elisa and snt, respectively, following immunization. all the immunized animals resisted infection when challenged with virulent strains of either gtpv, pprv or orfv at 28 dpi, while in-contact control animals developed characteristic signs of the respective disease. further, ppr viral antigen could be detected by ppr sandwich elisa kit in the excretions (nasal, ocular and oral swab materials) of unvaccinated control animals after challenge but not in those of any of the immunized goats. the triple vaccine was found safe at a dose as high as 10^5 tcid50 and induced a protective immune response even at a lower dose (10^2 tcid50) in goats, which was evident from sero-conversion as well as challenge studies. the study indicated that these viruses are compatible and did not interfere with each other in eliciting an immune response, supporting the feasibility of using this triple vaccine to combat these infections simultaneously. toll like receptors (tlrs), primary sensors of microbial origin, play a crucial role in innate immunity. till now 13 mammalian tlrs have been identified, while there is no information available on the tlrs of yak. this study is part of a world bank funded icar project. the yak, named bos grunniens for its distinctive vocalization and relationship with cattle, is a natural inhabitant of extremely cold environments. when these animals come to lower altitude grazing land adjacent to villages, they become susceptible to the diseases of cattle, buffalo etc. thus, the present study was undertaken for the genetic characterization and evolutionary lineage analysis of yak tlrs.
we worked on tlr7 gene, which plays an important role in recognition of ssrna viruses. total rna was extracted from mitogen stimulated pbmcs of yak. the rt-pcr conditions were standardized for full length amplification of tlr gene 7 using specific self designed primers. the expected amplicon of 3559bps was obtained. it was cloned in pgemt-easy vector followed by transformation in e. coli top10 strain. the recombinant clones were screened, picked up for plasmid isolation and release of tlr7 was confirmed by restriction digestion. the cloned tlr7 product was sequenced and analyzed for the nucleotide and deduced amino acid sequences, and 3d structure analysis. the results revealed that yak shows more than 98% sequence homology with other bos indicus breeds and bos taurus breeds. however, identity was less than 88% with other animal species (equine, murine, feline, canine etc.). the evolutionary lineage findings cluster yak more closely with bovine species. point mutations revealed changes at 25 nucleotide positions with corresponding amino acid change at 15 positions. smart analysis of yak protein domain architecture revealed toll-interleukin i receptor (tir), leucine rich repeats (lrr) and signal peptide region. the variations in yak mainly lie in the lrr region. homology modeling revealed horse shoe shaped structure with 5 alpha helix. the additional alpha helix present in bos indicus was not detected in yak. the present study shows existence of genetic variability in tlr7 gene of yak, in particular the lrr region, which plays an important role in the pathogen recognition and the evolutionary lineage analyses shows its closeness with other bovine species. a.p. aquaculture and fisheries, tirupati in this new millennium, aquatic animal health management strategies in asia expanded and adjusted to the current disease problems faced by the aquaculture sector. 
this presentation will briefly discuss some of the most serious trans-boundary pathogens affecting asian aquaculture including a newly emerging disease and highlight recent regional and national efforts on responsible health management for mitigating the risks associated with aquatic animal movement. a regional approach is fundamental since many countries share common social, economic, industrial, environmental, biological and geographical characteristics. capacity and awareness building on aquatic animal epidemiology, science-based risk analysis for aquatic animal transfers, surveillance and disease reporting, disease zoning and establishment of aquatic animal health information systems to support development of national disease control programs and emergency response to disease outbreaks are needed. molecular diagnostics with emphasis towards standardization and harmonization, inter-calibration exercises and quality assurance in laboratories, accreditation program and utilization of regional resource centres on aquatic animal health will also be needed. whilst most of these strategies are directed in support of government policies, implementation will require pro-active involvement, effective cooperation and strategic networking between governments, farmers, researchers, scientists, development and aid agencies, and relevant private sector stakeholders at all levels. their contributions are essential to the health management process. generally, aquaculture plays an important role in economy as harvests from natural waters have declined or, at best, remained static in most countries. fish and shrimp, the main aquaculture product sources, have gained the most attention. many factors can cause losses in yields of fish products and infectious disease in fish and shrimp is the biggest threat to the fishery industry. 
shrimp and fish aquaculture has grown rapidly over several decades to become a major global industry that serves the increasing consumer demand for seafood and has contributed significantly to socio-economic development in many poor coastal communities. however, the ecological disturbances and changes in patterns of trade associated with the development of shrimp and fish farming have presented many of the pre-conditions for the emergence and spread of disease. shrimp and fish are displaced from their natural environments, provided artificial or alternative feeds, stocked in high density, exposed to stress through changes in water quality and are transported nationally and internationally, either live or as frozen product. these practices have provided opportunities for increased pathogenicity of existing infections, exposure to new pathogens, and the rapid transmission and trans boundary spread of disease. not surprisingly, a succession of new viral diseases has devastated the production and livelihoods of farmers and their sustaining communities. this review examines the major viral pathogens of farmed shrimp and fish, the likely reasons for their emergence and spread, and the consequences for the structure and operation of the shrimp farming industry. in addition, this review discusses the health management strategies that have been introduced to combat the major pathogens and the reasons that disease continues to have an impact, particularly on poor, smallholder farmers in asia. btv isolates from the same geographic region have been termed as 'topotypes' and initial observation on segment 3 nucleotide sequences identified a correlation between topotypes and genetic information. later topotyping was proposed based on segment 10, on the premise that the encoding protein ns3, which is involved in virus egress from insect cells, would lead to evolutionary fitness in parallel with the geographic distribution of the different culicoides species. 
further studies attempted to extend this to nucleotide sequence homology in segments 7 and 10, but failed to identify clear cut correlations or any evidence for positive selection. for example, south african isolates were found not to cluster into separate african lineage. in this study, we carried out a more extensive analysis of segment 10 sequences. our analysis showed no segregation of isolates into topographically distinct groups. instead we observed topological clustering of the clades, and we attribute this to genetic bottleneck resulting in genetic drift and founder effect leading to homogenous gene pool in a geographical area. we hypothesize that when a new virus enters a geographical area where local btv strains are already circulating, the new genes/segments would enter into a bigger gene pool. consequently, the newer incursions into a heavily endemic area tend to get diluted and disappear from the population because the rate of drift is inversely proportional to the population size, unless they are positively selected. use of live attenuated vaccine in israel, europe, south africa and usa also led to more homogenous population similar to the vaccine strains due to continuous infusion of the vaccine type genes into the gene pool. we conclude that restriction of specific strains to certain geographical areas could generate uniquely imprinted genotypes which would not only indicate origin but also predict movement of viral strains to new areas. vvo-10 viral diseases of zoonotic importance: indian context k. prabhudas pd-admas, ivri, campus, hebbal, bangalore 24 zoonoses are generally defined as animal diseases that are transmissible to humans. they continue to represent an important health hazard in most parts of the world, where they cause considerable expenditure and losses for the health and agricultural sectors. 
the emergence of these zoonotic diseases is very distinct, hence their prevention and control will require unique strategies, apart from traditional approaches. such strategies require rebuilding a cadre of trained professionals in several medical and biologic sciences. the article discusses virus infections that have significant zoonotic implications for india. buffalopox is a contagious viral disease affecting milch buffaloes and, rarely, cows, with a morbidity rate of up to 80% in the affected herd. although the disease is not responsible for high mortality, it adversely affects the productivity of the animals, resulting in large economic losses. furthermore, the disease has zoonotic implications, as outbreaks are frequently associated with human infections, particularly in milkers. the causative agent, buffalopox virus (bpxv), is closely related to vaccinia virus. outbreaks of febrile rash illness among humans and buffaloes were investigated in villages of solapur and kolhapur districts of western maharashtra. clinico-epidemiological investigations of humans and buffaloes were carried out and representative clinical samples were collected from both. the samples included vesicular fluid, scab and blood. laboratory investigation for bpxv was done by pcr on blood samples, scabs and vesicular fluid. in vitro virus isolation was attempted using vero e-6 cells. negative staining electron microscopy was also employed for detection of virus particles. a total of 166 human cases with pox lesions on the hands and other body parts were reported from kasegaon village, solapur district, and 185 cases from 20 different villages of kolhapur district. besides pox lesions, patients had fever, malaise, pain at the site of the lesion and axillary and inguinal lymphadenopathy. in kasegaon village, the attack rate was 6.6% in humans and 41.9% (231/551) in buffaloes, whereas in the kolhapur area the attack rate in buffaloes was 11.75% (2633/22398).
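the buffalo attack rates quoted above follow directly from the case counts and the populations at risk. the python sketch below is an illustration added here, not part of the original investigation; the function name and dictionary keys are hypothetical. it recomputes both figures and agrees with the reported values to within rounding.

```python
def attack_rate(cases: int, population_at_risk: int) -> float:
    """Attack rate as a percentage: cases divided by population at risk."""
    if population_at_risk <= 0:
        raise ValueError("population at risk must be positive")
    if not 0 <= cases <= population_at_risk:
        raise ValueError("cases must lie between 0 and the population at risk")
    return 100.0 * cases / population_at_risk


# (cases, population at risk) as quoted in the abstract
buffalo_outbreaks = {
    "kasegaon": (231, 551),
    "kolhapur": (2633, 22398),
}

for area, (cases, at_risk) in buffalo_outbreaks.items():
    print(f"{area}: {attack_rate(cases, at_risk):.2f}% ({cases}/{at_risk})")
```

the input checks are a design choice for the sketch: an attack rate above 100% or a zero denominator usually signals a data entry error rather than a real result.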
bpxv was confirmed in blood, vesicular fluid and scab specimens from human cases and in scab specimens from buffaloes by the polymerase chain reaction (pcr) method. bpxv was also isolated from 3 different clinical specimens and further identified by pcr and electron microscopy. the clinical manifestation of the disease in buffaloes from solapur district was as reported earlier, with common pox lesions on teats and udders, whereas the buffaloes from kolhapur district had lesions on hairless parts of the ears and on the eyelids with purulent discharge. bpxv from human and buffalo cases showed similarity. vaccines have been made against several diseases and used for controlling them. however, a few of them were not effective in controlling the disease. the reasons for the failure are many, the major one being that either the pathogen is not completely cleared from the vaccinated animal or it re-emerges after changing its antigenic structure, thus making the vaccination programme less effective. in addition, with the emergence of newer diseases such as hiv, the development of suitable vaccines has become a challenging task. this is especially true in the case of viral diseases. these challenges have warned researchers that "protection by vaccination is not a simple and straightforward approach", and a lot needs to be understood in terms of host-virus interaction and the role of the environment in perpetuating the disease. so the immediate step that was considered was environmental safety, by way of using non-infectious materials as vaccines. with the understanding that has been developed in molecular immunology and molecular biology, and with the availability of molecular tools developed through recombinant dna technology, the field of vaccinology has changed dramatically to emerge as modern vaccinology.
this presentation deals with the modern approaches that are being used to produce effective vaccines in the case of foot and mouth disease of cloven footed animals. a similar approach may be worked out for other viral diseases also. despite the availability of an inactivated vaccine that is noted to provide solid immunity against the disease over a short period of time, the search is on for an ideal vaccine, the criteria for which are: it is safe for the environment, is easy to prepare, does not require a cold chain for its storage, provides longer lasting immunity, is economically viable and may be able to clear the virus in case of persistent infection. the advent of recombinant dna technology, together with the information available on the molecular biology of viruses, has enabled the design and development of newer vaccines that can induce strong cellular and humoral responses. the underlying principle in the present vaccine development strategy the world over is that the virus antigen gene has to be expressed in the tissue and the vaccine backbone has to trigger the immune system to elicit the desired immune response. the bangalore campus of ivri has been vigorously pursuing research to develop ideal vaccines for foot and mouth disease, keeping the above principle in mind, to achieve the previously mentioned criteria. the approaches selected are to see that the virus antigen/s replicate transiently in the host. the self replicating vaccines that have been developed are pox virus vectored vaccines, alpha virus replicase based vaccines and fmdv vectored vaccines. the approach and the results obtained so far will be discussed. the silkworm, bombyx mori, is affected by various diseases caused by viruses, viz., nuclear polyhedrosis (bmnpv), densonucleosis (bmdnv) and infectious flacherie (bmifv). silkworm viral diseases form major constraints for silk cocoon production in all the sericultural countries.
the losses due to silkworm diseases are estimated at about 20-40%, and among them viral diseases are most common. in sericulture, prophylactic measures play a vital role in the management of silkworm diseases. these include disinfection of the silkworm rearing house and appliances, rearing area, rearing surroundings, silkworm egg and body, and rearing bed, together with maintenance of general hygiene and personnel hygiene. all these activities are generally carried out as rituals using general disinfectants, often with partial success. recent trends in the complete management of silkworm diseases include development of silkworm hybrids evolved from disease resistant/tolerant breeds, effective eco- and user-friendly disinfectants, anti-microbial feed supplements and use of transgenic silkworms. a biotechnological breakthrough in this regard is the rna interference (rnai) approach involving dsrna mediated nuclear polyhedrosis management, which is presently pursued by apssrdi, hindupur in collaboration with the centre for dna fingerprinting and diagnostics (cdfd), hyderabad. nadu and karnataka. the disease appears to be more severe in rural flocks than in organized farms. our investigations revealed morbidity, mortality and case fatality rates of 9.34%, 2.69% and 28.84% in rural flocks and 6.22%, 0.47% and 7.63% in organised farms, respectively. higher morbidity and mortality in rural areas may be due to stress factors like poor nutrition, parasitic burden, fatigue due to long walks and non-availability of veterinary aid. kulkarni et al. (1992) also reported severe bt outbreaks in rural areas of maharashtra with overall morbidity, mortality and case fatality of 32%, 8% and 25% respectively. all the south indian sheep breeds were found to be susceptible and the clinical form of the disease is evident in all of them, though saravanabava (1992) reported variations in susceptibility among the indigenous sheep.
trichy black and ramnad white sheep were found to be more susceptible than the vambur and mecheri sheep of tamil nadu. prevalence of bluetongue in sheep, goats and cattle appears to be high in the region. serological surveys conducted in andhra pradesh during 1991 revealed the prevalence of btv antibodies in sheep (47.5%), goats (43.56%), cattle (33%) and buffaloes (20%). a similarly high prevalence of btv antibodies in sheep and goats was also reported from the other states in the region. clinical disease has not been recorded in kerala though btv antibodies were recorded in sheep (13.76%) and goats (7.10%) (ravi sankar 2003). culicoides are the known biological vectors of btv. not all culicoides species are capable of transmitting btv. the occurrence of the disease is related to the presence of competent vectors in the area. jain et al. (1988) established the involvement of culicoides in transmitting btv by isolating the virus from culicoides in haryana, a north indian state. c. imicola and c. oxystoma were found to be prevalent in andhra pradesh and tamil nadu. narladakar et al. (1993) reported the presence of c. schultzei, c. perigrinus and c. octoni in the marathwada region of maharashtra. culicoid vectors are significantly affected by the climate, and annual variations in the climate are reflected in the occurrence of the disease. the monsoon season (june to dec), with temperatures ranging from 21.2 to 35.6°c, appears to be a favourable period for the multiplication of culicoides. the maximum number of outbreaks was recorded during the north east monsoon period (oct-dec), followed by the south west monsoon period (june to sep) in the region. however, details on the distribution of the competent vectors, their feeding habits and their dynamics in the region are lacking. multiple btv serotypes were found to be circulating in the region (kulkarni and kulkarni 1984; janakiraman et al. 1991; mehrotra et al. 1996). a total of 10 serotypes viz.
1-4, 8, 9, 15, 16, 18 and 23 were identified based on the virus isolations. sreenivasulu et al. (1999) isolated btv serotype 2 from an outbreak of bt in native sheep of andhra pradesh. btv serotypes 9, 15 and 21 were also isolated from outbreaks that occurred in andhra pradesh. some of the isolates need to be serotyped. deshmukh and gujar (1999) isolated btv type 1 from maharashtra. following is the summary of the distribution of btv serotypes in this region. the clinical picture of bt in native sheep appears to be slightly different, the major difference being that swelling of the lips and face was less conspicuous. mucocutaneous borders appeared to be very sensitive to touch and bled easily upon handling. the classical signs of cyanosis of the tongue and reddening of the coronary band are not common features of the disease in native sheep. the disease was also confirmed by virus isolation and identification. clinical disease has not been reported in cattle, buffaloes and goats in spite of high seroprevalence. in conclusion, bt is established in native sheep and causes severe economic losses to the farmers. the disease is concentrated in the southern peninsula of the country. the disease is seasonal and is associated with rainfall. multiple serotypes appear to be circulating in this region. the btv serotypes were virulent in nature, as evident from the severe outbreaks. s. janardana reddy*, d. c. reddy department of fishery science and aquaculture, sri venkateswara university, tirupati 517 502. in less than three decades, the penaeid shrimp culture industries of the world developed from their experimental beginnings into major industries providing hundreds of thousands of jobs, billions of u.s. dollars in revenue, and augmentation of the world's food supply with a high value crop. concomitant with the growth of the shrimp culture industry has been the recognition of the ever increasing importance of disease, especially those caused by infectious agents.
in india viral diseases have become an important limiting factor for growth of shrimp aquaculture industry. although more than 30 different viral pathogens have been identified in different species of shrimp world wide, only a few viruses have identified which are causing disease problems in cultured tiger shrimps in india, east coast of andhra pradesh, in particular. diagnostic methods for these pathogens include the traditional methods of morphological pathology (direct light microscopy, histopathology, and transmission electron microscopy), enhancement and bioassay methods, traditional microbiology, and the application of serological methods. while tissue culture is considered to be a standard tool in medical and veterinary diagnostic labs, it has never been developed as a useable, routine diagnostic tool for shrimp pathogens. the need for rapid, sensitive diagnostic methods led to the application of modern biotechnology to penaeid shrimp disease. the industry now has modern diagnostic genomic probes with nonradioactive labels for viral pathogens like infectious hypodermal and hematopoietic necrosis (ihhnv), hepatopancreatic virus (hpv), taura syndrome virus (tsv), white spot syndrome virus (wssv), monodon baculo virus (mbv), and bp. highly sensitive detection methods for some pathogens that employ dna amplification methods based on the polymerase chain reaction (pcr) now exist, and more pcr methods are being developed for additional agents. these advanced molecular methods promise to provide badly needed diagnostic and research tools to an industry reeling from catastrophic epizootics and which must become poised to go on with the next phase of its development as an industry that must be better able to understand and manage disease. within this field, shrimp immunology is a key element in establishing strategies for the control of diseases in shrimp aquaculture. 
research needs to be directed towards the development of assays to evaluate and monitor the immune state of shrimp. the establishment of regular immune checkups will permit the detection of shrimp immunodeficiencies and will also help to monitor and improve environmental quality. for this, immune effectors must first be identified and characterised. in the end, however, the assumption may be made that the sustainability of aquaculture will depend on the selection of disease-resistant shrimp, i.e. on developing research in immunology and genetics at the same time. the development of strategies for prophylaxis and control of shrimp diseases could be aided by the establishment of a collaborative network to contribute to progress in basic knowledge of penaeid immunity. however, to improve efficiency, it appears essential also to open this network to complementary research areas related to shrimp pathology, physiology, genetics and environment. bluetongue is an important viral disease of sheep causing severe economic losses to the farmers. the lack of an effective vaccine is the major impediment in controlling the disease. multiple serotypes were found to be circulating in the state. attempts are being made to develop a vaccine employing the available serotypes to control the disease. hence, it is essential to determine the antigenic relationship among the serotypes in order to identify the candidate vaccine strains to be incorporated in the preparation of the vaccine. a reciprocal cross neutralization test was employed to find the r% values between btv-2, -9 and -15, which indicate the extent of antigenic relationship between the serotypes. the r% value between btv-2 and btv-9 was recorded as 2.8; r% values of 3.53 and 2.8 were observed between btv-2 and -15 and between btv-9 and -15, respectively. the r% values recorded in the present study revealed a weak antigenic relationship between the btv serotypes. 
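the abstract does not state how the r% values were derived from the neutralization titres. a minimal sketch follows, assuming the commonly used archetti and horsfall formulation, in which r is the geometric mean of the two heterologous-to-homologous titre ratios. the function name and the titres in the usage note are hypothetical and are not data from the study.

```python
import math


def r_percent(het_1_on_2, hom_2, het_2_on_1, hom_1):
    """Antigenic relatedness (r%) from a reciprocal cross neutralization test.

    Assumed formulation (Archetti and Horsfall): r = sqrt(r1 * r2), where
      r1 = titre of antiserum 2 against virus 1 / titre of antiserum 2 against virus 2
      r2 = titre of antiserum 1 against virus 2 / titre of antiserum 1 against virus 1
    All titres are reciprocal end-point dilutions and must be positive.
    """
    titres = (het_1_on_2, hom_2, het_2_on_1, hom_1)
    if any(t <= 0 for t in titres):
        raise ValueError("titres must be positive")
    r1 = het_1_on_2 / hom_2
    r2 = het_2_on_1 / hom_1
    return 100.0 * math.sqrt(r1 * r2)
```

for example, with hypothetical titres, `r_percent(10, 100, 10, 100)` gives about 10, while two antigenically identical viruses (heterologous titres equal to homologous titres) give 100. low single-digit values such as those reported above correspond to heterologous titres that are a small fraction of the homologous ones.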
the extent of antigenic relationship between the btv serotypes was also determined by multiple sequence alignment of the nucleotide and amino acid sequences of the reference btv serotypes 2, 9 and 15. the sequence analysis of the vp2 gene revealed a homology of 47-53% and 29-41% at the nucleotide and amino acid levels respectively. r% values obtained using reciprocal cross neutralization test with the btv-2, 9 and 15 serotypes isolated in native sheep of andhra pradesh and the genomic analysis of the reference serotypes of btv-2, 9 and 15 revealed very weak antigenic relationship and were highly divergent. diseases especially those by viral pathogens cause greater economic losses in most horticultural crop species throughout the world as compared to agricultural crops. non-genetic methods of management of these diseases include quarantine measures, eradication of infected plants and weed hosts, crop rotation, use of certified virus-free seed or planting stock and use of pesticides to control insect vector populations implicated in transmission of viruses. however, none of these measures is likely to provide an enduring solution against these diseases especially those caused by viruses due sometimes to the huge expenditure involved, but mostly to the questionable effectiveness and reliability of those methods. as key control pesticides are getting increasingly abandoned, development of alternative methods to control diseases has been a felt-need in the recent past. though breeding for disease resistance generally provides a reliable security in a long run, introgression of host plant resistance did not materialise in most important crops. non-availability of an appropriate source of resistance in inter-fertile relatives, linkage to undesirable traits, or often times polygenic nature of such sources of resistance are the stumbling blocks in breeding programs. 
the limitations of conventional breeding and routine cultural practices prompted the need for the development of other approaches of virus control that could be fully incorporated into traditional methods. in this perspective, the concept of pathogen-derived resistance offers an attractive strategy to evolve newer methods of virus management, by transforming crop plants with nucleotide sequences derived from the pathogen's genome. an increasing number of molecular characterisation of plant virus genomes and the stable transformation of a number of horticultural crop species have in fact opened an avenue for molecular breeding against virus pathogens. successful field-testing of genetically modified crop cultivars renders proof of their supremacy over existing cultivars. it also contributes to demonstrate their capability with regard to environmental safety with a view to winning over public concern and scepticism. in general, the eventual commercialisation transgenic lines expressing virus resistance will rely upon a host of factors including their field performance, genetic stability, public acceptance and the resolution of environmental concerns and patent related issues. as such, elaborate field trials and allied studies are now required to adapt genetically engineered horticultural crops expressing virus resistance for their implementation into practical agriculture. a few examples from current research at tnau, in india or elsewhere will be discussed in this presentation. virology unit, division of plant pathology, iari, new delhi 12 in recent times there has been greater emphasis on vegetatively propagated crops in india to help diversify the indian agriculture. fruit, flower, spice and plantation crops are important vegetatively propagated horticultural crops, which have become a driving force for economic development in several parts of india. 
however, most of the vegetatively propagated crops are threatened by biotic stress caused by plant pathogens in general and plant viruses in particular. plant viruses produce specific and non specific symptoms and in some cases no symptoms are produced. correct identification and diagnosis of viral diseases is first step in the management of any disease including viral diseases. there have been two major breakthroughs in virus diagnostics during last four decades. the first one was serological assay using monoclonal or polyclonal antibodies in enzyme linked immunosorbent assay (elisa) and the other one was the use of in vitro amplification of dna in polymerase chain reaction (pcr). a significant development in serological assays has been its simplification in form of user's friendly quick strip/dip stick method. the one-step lateral-flow (lf) tests have been developed for the on-site detection and identification of several plant viruses. rapid advancement in virus genome characterization has led to the development of novel approaches of nucleic acid based diagnostics which include conventional pcr, real time pcr, multiplex pcr, micro/macro arrays and biochips. pcr protocols already exist for many plant viruses of citrus, banana, apple, papaya, vegetables, ornamental and spice crops. a further advancement has led to development of realtime pcr assay which is relatively easy but requires training for diagnosticians. in real-time pcr assays, results can be available within 20 min. the nucleic acid template preparation in pcr has been simplified. membrane based dna template protocol and co-isolation of nucleic acid template preparation are novel approaches in pcr detection of virus and virus like pathogens. since many of the horticultural crops are often infected by more than one virus, their individual detection by pcr is not only expensive but also time consuming. 
therefore, multiplex pcr has been developed where in genome of more than one virus could be amplified and detected in the same reaction mixture. development of nucleic acid based chip is now one of the fastest and recent growing areas in the field of pathogen detection. these nucleic acid based chips have been named as dna/rna chips, biochips, genechips, biosensors or dna arrays. when it comes to applications of microarray technology for plant viruses, it is not too difficult to see the value of a method that could potentially detect a whole range of viruses using a single test. however, microarrays are unlikely to become the only method in use in a diagnostic laboratory. processing of germplasm including transgenic planting material imported for research purposes into the country. during the last two decades, a total of 49,923 samples of wheat including transgenics were imported from cimmyt (mexico), icarda (syria) and many other countries. these were grown in post-entry quarantine nursery each year at nbpgr, new delhi and the transgenic samples were grown in national containment facility of level-4 (cl-4) since its inception to ensure that no viable biological material/pollen/pathogen enters or leaves the facility during quarantine processing of transgenics. in addition, post-entry quarantine inspections of the transgenic wheat grown by indenters are also undertaken by nbpgr quarantine scientists. virus-induced gene silencing (vigs) is a technique in which viral genomes are used, usually after appropriate modifications, for transient gene silencing in plants. the mechanism behind vigs is the phenomenon called rna-interference (rnai), which is widespread in many organisms and is believed to be form of inherent defence system against intracellular pathogens, such as viruses and transposons. 
double-stranded rna or rna containing strong secondary structures, commonly produced during viral infections, are believed to cause triggering of rnai, which employs a battery of proteins and nucleoprotein complexes to identify and degrade specific viral transcripts. in vigs, viral genomes not causing severe symptoms, but which can accumulate and spread efficiently in the host plant are used as vectors in which a host gene is cloned and introduced into the plant. upon replication, the viral vector triggers rnai response in the host plant, which also targets the host gene, leading to its silencing and subsequently, the silenced phenotype revealing gene function in vivo. vigs has been used extensively to study gene functions in dicot plants, such as tobacco, tomato, pea, soybean, etc., using vectors derived from reference genes are commonly used as an/the endogenous normalisation measure for the relative quantification of target genes. the expression (characteristics) of seven potential reference genes was evaluated in tissues of 180 healthy, physiologically stressed and barley yellow dwarf virus (bydv) infected cereal plants. these genes were tested by rt-qpcr and ranked according to the stability of their expression (characteristics) using three different methods (two-way anova, genorm and normfinder tools). in most cases, the expression (characteristics) of all genes did not depend on the abiotic stress conditions or on the virus infections. all the genes showed significant differences in expression (characteristics) among plant species. glyceraldehyde-3-phosphate dehydrogenase (gapdh), beta-tubulin (tubb) and 18s ribosomal rna (18s rrna) always ranked as the three most stable genes. on the other hand, elongation factor-1 alpha (ef1a), eukaryotic initiation factor 4a (eif4a), and 28s ribosomal rna (28s rrna) for barley and oat samples; and beta-tubulin (tubb) for wheat samples were consistently ranked as the less reliable controls. 
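normalisation against several reference genes is often done by dividing the target quantity by the geometric mean of the reference quantities, which is the convention used by genorm. a minimal sketch under that assumption follows; the function names and the quantities in the usage note are hypothetical, and the inputs are assumed to be relative quantities already derived from the rt-qpcr cq values.

```python
import math


def geometric_mean(values):
    """Geometric mean of strictly positive numbers."""
    values = list(values)
    if not values or any(v <= 0 for v in values):
        raise ValueError("need at least one value, all strictly positive")
    return math.exp(sum(math.log(v) for v in values) / len(values))


def normalised_quantity(target_quantity, reference_quantities):
    """Target quantity divided by the geometric mean of the reference genes.

    reference_quantities: relative quantities of the chosen reference genes
    (for example gapdh, tubb and 18s rrna) measured in the same sample.
    """
    return target_quantity / geometric_mean(reference_quantities)
```

for instance, `normalised_quantity(8.0, [2.0, 8.0])` divides by a geometric mean of 4 and returns about 2. the geometric mean is preferred over the arithmetic mean here because it is less affected by one reference gene with a much larger abundance than the others.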
the bydv titre was determined in two oat varieties by rt-qpcr using three different quantification approaches. statistically, there were no significant differences between the absolute and the relative quantification, or between quantification using gapdh + tubb + tuba + 18s rrna and ef1a + eif4a + 28s rrna. the geometric average of gapdh, 18s rrna, tuba and tubb is suitable for normalisation of bydv quantification in barley and oat tissues. for wheat samples, a combination of gapdh, 18s rrna, tubb, eif4a and ef1a is recommended. department of microbiology, yogi vemana university, vemanapuram, kadapa 516 003 large scale production and import of propagative material pose a potential risk of introducing several destructive pathogens, particularly viruses and mycoplasma-like organisms, into our country. this demands adequate quarantine safeguards, such as growing the material in an approved post-entry quarantine facility for a specific period so as to facilitate virus detection, thereby curtailing the risk. such facilities, when coupled with propagation by tissue culture, will ensure virus-free propagative plant material. the requirement for a nationwide network of post-entry quarantine facilities working in close collaboration with crop institutions is strongly emphasized when considering the import of high-risk plant genera for agricultural development. the present paper discusses virus diseases of quarantine importance affecting ornamental and fruit plants such as chrysanthemum, dahlia, dianthus, rosa bengalensis, cattleya, cymbidium, dendrobium, lilium, citrus, vitis etc. the paper also discusses immunodiagnostic methods of detection and methods of obtaining virus-free propagative material. rice tungro occurs as epidemics in regular cycles and has been reported in the last 50 years from all the major rice growing regions of india; it is especially prevalent in the southern and eastern states. development of durable resistant varieties to tungro is crucial for the management of the disease. 
molecular breeding, involving the use of dna markers linked to the resistant gene(s) for selection, can overcome the difficulties encountered in conventional resistant breeding programs. for successful marker-assisted selection (mas), the identification of closely linked markers through the process of gene tagging and mapping is a prerequisite. attempts have been initiated for identification of tungro resistance genes through molecular mapping and their introgression into the target varieties using marker-assisted selection at drr, hyderabad. the inheritance of resistance to rice tungro virus disease was studied in seven resistant rice cultivars with field evaluation at hot spot locations. the microsatellite markers linked to rice tungro resistance in utri merah was studied and found that resistance genes were linked to rm 336 on chromosome 7. through molecular mapping two qtl were identified controlling rtv resistance on chromosomes 7 and 2 in 'utri rajapan' explaining 40.8% and 21.6% of the phenotypic variance. in variety 'vikramarya', another two qtl for rtv resistance were detected on chromosomes 7 and 1 explaining 18.7% and 16.4% of the phenotypic variance. the closely linked markers identified in this study flanking the gene of interest through mapping will improve the efficiency and precision of introgression programs in marker assisted breeding for rtv resistance. functional characterization of these qtl for rtv resistance is under progress. there is only a limited pool of natural virus resistance in cassava against cassava mosaic geminiviruses and cassava brown streak ipomovirus hence the development of transgenic resistance in this significant crop might present an option. 
rna-mediated resistance through the expression of inverted-repeat dsrna sequences derived from the virus genome and the modification of plant microrna to produce antiviral artificial microrna are strategies that have recently been proven very effective for induction of virus resistance (immunity) against a number of rna viruses. rna interference strategies against geminiviruses have never resulted in immunity of transgenic plants. however, the results suggest that viral mrnas are targets of rna silencing and that the success of the strategy depends on the relevance of the target gene in the systemic spread of the virus. we have generated a number of rna silencing constructs to induce resistance against cbsv and the indian cassava mosaic viruses icmv and slcmv. due to the serious problems inherent in the transformation of cassava and subsequent resistance screening, these constructs were tested for efficiency either by transient or by transgenic expression in n. benthamiana. complete immunity was reached in transgenic n. benthamiana against cbsv using inverted repeat or amirna constructs. using different species of cbsv for resistance screening, immunity was broken, to show the minimum context for broad spectrum resistance. similarly, highly specific resistance was reached with expression of amirna. in contrast, virus resistance against icmv/slcmv using single amirna constructs was not successful. results from the experiments to generate virus resistance against cbsv and icmv/slcmv will be shown; methods to evaluate the efficiency of rnai gene constructs by transient gene expression in n. benthamiana and strategies to develop efficient resistance against rna and dna viruses in cassava will be discussed. bitter gourd (momordica charantia l.), which is also called bitter melon, balsam apple and balsam pear, belongs to the family cucurbitaceae. 
it is an important traditional vegetable of nutritive and medicinal value that is cultivated in tropical and sub-tropical asia, but is considered as a weed host reservoir for viruses in jamaica. viral disease-like symptoms were observed occurring naturally on the crops of bitter gourd grown in the fields of northern india during 2007-2009. an incidence of 78.5% of diseased plants was recorded which showed chlorotic spots and mosaic ranging from mild mottling to green blisters along with leaf smalling, leaf and fruit deformations, bud necrosis and stunted growth whereas 20.2% plants exhibited leaf curling alone or in combination with mosaic-type disease. a reduction of 34.5% in fruit yield was recorded in mosaic-like disease which could be attributed to lesser fruit setting due to bud necrosis, smaller fruit size and stunted plant growth. such plants produced deformed, notched, irregularly shaped fruits wherein pre-mature yellowing and necrosis on the anterior and posteriors ends made 22.4% fruits unfit for marketability. the dwindling yield and production of unmarketable fruits posed a major constraint for profitable cultivation of this economically important crop, thus warranting for studies on etiology and management of these diseases. the mosaic-like disease was transmitted to healthy seedlings of bitter gourd at 2-leaves stage by sap inoculation as well as by aphid viz., myzus persicae sulz. and aphis gossypii glov. initially studies were carried out to optimize protocols for efficient plant regeneration and agrobacterium-mediated transformation for nagpur sweet orange, which is a popular and elite citrus cultivar in india. organogenesis was induced in etiolated epicotyl explants of one-month-old axenically raised polyembryonic seedlings by culturing them in mt medium supplemented with 30 g/l sucrose with varying concentrations of plant hormones. 
it was found that bap at 1 mg/l without auxin was best for efficient shoot regeneration in citrus using epicotyl explants. a 100% regeneration frequency was obtained and multiple shoot formation was obtained from both the cut ends of all the explants. an average of 8.24 well-differentiated shoots per explant were obtained, all of which rooted normally under the influence of 1 mg/l iba. this improved regeneration protocol was utilized in standardizing agrobacterium-mediated transformation of citrus using a. tumefaciens strain eha 105, containing binary plasmid pcambia 2301 that harbors gus reporter gene and npt-ii plant selection marker gene. one-month-old epicotyl explants infected with over-night grown agrobacterium (a 600 0.6-0.8) for 15 min and co-cultured for 3 days were found to be optimum for transformation as assessed on the basis of pcr analysis and gus activity displayed by the stem and leaf sections of putative transgenics. overall transformation frequency ranged from 38 to 48%. current study focuses on the generation of citrus transgenics for ctv resistance using a. tumefaciens strain eha 105 containing binary plasmid pbinar harboring portion of coat protein gene of ctv and npt-ii gene employing the standardized protocols. several putative transgenic shoots were recovered on selection medium and they are being utilized for molecular analyses and resistance against ctv. work is also in progress on the generation of citrus transformants using rnai construct harboring ctv cp and p23 genes, singly and in conjunction. our lab was also involved in developing rice transgenics for resistance against rice tungro disease, which is one of the most important and widespread virus diseases of rice in south and southeast asia, causing an annual estimated loss in crop yield of economic losses worth millions of rupees are caused due to these diseases annually. 
virus diseases are frequently less conspicuous than those caused by other plant pathogens and last for much longer. this is especially true for perennial crops and those that are vegetatively propagated. one further problem with attempting to assess losses due to various diseases on a global basis is that most of the data are from small comparative trials rather than wide scale comprehensive surveys, and even the small trials do not necessarily give data that can be used for more global estimates of losses. this is for several reasons, including: (1) variation in losses by a particular crop from year to year; (2) variation from region to region and climatic zone to climatic zone; (3) differences in loss assessment methodologies; (4) identification of the viral etiology of the disease; (5) variation in the definition of the term 'losses' and (6) chilli is the major vegetable and spice crop grown in thar desert areas of rajasthan. leaf curl disease (chlcd) is one of the major constraints in chilli cultivation faced by farmers and causes yield losses of up to 100%. a survey was conducted in the major chilli growing areas of the thar desert (bikaner, nagur, jodhpur and jalore districts of rajasthan) during november 2009 to understand the present status of leaf curl disease in chilli. among the four districts surveyed for chlcd, the maximum disease incidence (up to 98%) was recorded in jodhpur district, followed by jalore district (up to 88%). no relation was found between the disease incidence and varieties. the major varieties grown in these areas are mehsana, rch (mandoria), haripur raipur, mathania and local cultivars. the number of whiteflies was also counted on the top, middle and bottom leaves of chilli grown in these areas. the average number of whiteflies per plant ranged from 0.0 to 4.0. the highest number of whiteflies (4.0) was recorded in jodhpur district and the lowest (1.8) in jalore district. 
total dna was extracted from three leaf curl infected samples from each district and tested for the presence of begomovirus using coat protein (cp) and dna-b specific primers. all the samples were positive for cp and dna-b amplifications by pcr. the cloning and sequencing of selected cp gene and dna-b fragments are in progress. the preliminary investigations show that the leaf curl disease of chilli is widespread in the arid region of rajasthan and may be caused by a begomovirus associated with satellite dna-b. bittergourd (momordica charantia) is an important vegetable crop of kerala. the crop is affected by several diseases, of which mosaic is a prominent one. a field experiment was conducted in march 2008 to evaluate the efficacy of potentised resistance inducing substances (ris), viz., mosaic affected bittergourd plant tissue, ash of mosaic affected bittergourd plant tissue, plumbago and salicylic acid, for control of bittergourd mosaic. ris were applied as drench and foliar spray at three potency levels, twice before flowering of the crop. the experimental crop was grown as per the package of practice recommendations in a split plot design with five replications per treatment. the disease incidence, disease severity and yield of the crop were recorded. the results of the experiment show that spraying was more effective than drenching for reducing mosaic incidence and severity. among treatments, infected plant extract at 19 potency was the most effective for reducing mosaic incidence, and it showed the maximum incubation period and minimum disease severity. the spray application of treatments produced significantly higher yield than drenching. among the treatments, ash of infected plant at 19 and 309 potency and infected plant extract at 69 potency were on par and produced comparatively higher yield. 
elephant foot yam (amorphophallus paeoniifolius), colocasia (colocasia esculenta) and tannia (xanthosoma sagittifolium) are the major edible aroids cultivated in india. elephant foot yam cultivation is gaining importance due to its high production potential, nutritional and medicinal values and good economic returns. all these aroids are vegetatively propagated, and viral diseases are spreading through planting materials. ctcri has the mandate of producing healthy planting materials of these edible aroids. accurate diagnosis and identification of the virus is essential for production of healthy planting material and effective management of the disease. though occurrences of viral diseases on edible aroids in india were known in the 1960s, not much attention was given to detection and identification of the virus involved. in the case of elephant foot yam, 5-30% mosaic incidence was observed with varying symptoms of mosaic, puckering, filiformy etc. in colocasia and tannia, 5-10% incidence was noticed. rt-pcr amplification with potyvirus group specific primers and subsequent cloning and sequencing of the amplified product has confirmed the association of dasheen mosaic virus (dsmv) with all the three edible aroids cultivated in india. the complete coat protein gene of dsmv infecting elephant foot yam was cloned in pgem-t vector and sequenced. further sequence analysis revealed that the cp gene of dsmv consisted of 942 nucleotides and the 3' utr comprised 260 nucleotides. blast and phylogenetic analysis showed the highest similarity of 89% with that of dsmv isolate af048981, reported from the usa. the deduced amino acid sequence of cp had 92.0-98.0% identity with other dsmv isolates. blast analysis of the partial cp gene sequences of colocasia and tannia also confirmed that the virus involved is dsmv. 
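identity figures such as those quoted above are computed over an alignment of two sequences. a minimal sketch of one common definition follows (identical positions divided by the aligned positions where neither sequence has a gap). the abstract does not say which definition or tool settings were used, so both the definition and the toy sequences in the usage note are assumptions.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity between two already-aligned sequences.

    Columns in which either sequence has a gap are skipped, so the
    denominator is the number of gap-free aligned columns.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == gap or b == gap:
            continue  # ignore gapped columns
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        raise ValueError("no gap-free columns to compare")
    return 100.0 * matches / compared
```

for example, `percent_identity("ATGCATGCAT", "ATGCATGCAA")` compares ten columns with nine matches. other definitions divide by the full alignment length or by the shorter sequence length, which is one reason identity values from different tools may not agree exactly.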
rt-pcr analysis of large number of samples from all the three crops confirmed that the potyvirus group specific primers (mj1 and mj2) are good for rapid detection of dsmv in these crops. dsmv specific biotinylated cdna and digoxigenin labelled crna probes were also prepared and dsmv in elephant foot yam was detected through nucleic acid spot hybridization. yellow leaf disease (yld) caused by sugarcane yellow leaf virus (scylv) is a recently recorded disease in india and is found wide spread throughout country. in popular varieties, the disease incidence varied from 0 to 75.0% and attained epidemic levels under field conditions. detailed studies on the impact of yld on sugarcane revealed that the virus infection significantly reduces various cane growth parameters, cane yield and juice quality. sequence comparisons of the coat protein (cp) and movement protein (mp) of 22 scylv isolates from india and database sequences showed a significant variation between indian isolates and the database sequences both at nt and aa level in the cp/mp coding regions. the significant variation in our isolates with the database isolates, even in the least variable region of the scylv genome showed that the population existing in india is different from rest of the world. further, comparison of partial sequences encoding for orf 1 and 2 revealed that yld in sugarcane in india is caused at least by three genotypes viz., cub, ind and bra-per, of which a majority of the samples were found infected with cuban genotype (cub). the genotype ind was identified as a new genotype and this was found to have significant variation with the reported genotypes. we have identified specific primers from cp region of the virus and optimized rt-pcr conditions to diagnose the virus. this assay has been found efficient in detecting the virus in asymptomatic plants and tissue culture derived seedlings. 
elimination of the virus through meristem culture has been demonstrated to free infected planting materials of the virus, and this technique needs to be adopted to supply disease-free planting materials for effective management of the disease. studies are also in progress to identify yld-resistant sources in sugarcane germplasm to initiate breeding for yld resistance in sugarcane. mycoviruses are viruses that infect fungi. they have been identified in all major fungal families. in the present scenario, mycoviruses are an important means of biocontrol of plant fungal pathogens. most identified fungal viruses have double-stranded rna genomes, often with more than one dsrna present per virus particle, and are spherical in shape. these viruses are mostly vesicle bound, whereas other viruses have protein coats. to be a true mycovirus, a virus must demonstrate an ability to be transmitted, in other words to infect other healthy fungi through anastomosis and spores. mycoviruses lead 'secret lives' and reduce the ability of their fungal hosts to cause disease in plants. this property, known as hypovirulence (a term used to describe reduced virulence found in strains of pathogens), was first observed in cryphonectria (endothia) parasitica (chestnut blight fungus) on european castanea sativa in italy, where naturally occurring hypovirulent strains were able to reduce the effect of virulent ones. these slower growing hypovirulent strains of c. parasitica contain a single cytoplasmic element of double-stranded rna (dsrna), similar to that found in mycoviruses, that was transmitted by anastomosis in compatible strains through natural virulent populations of c. parasitica. hypovirulence has also been reported in many other fungal plant pathogens, including rhizoctonia solani, gaeumannomyces graminis var. tritici, ophiostoma ulmi, sclerotinia homoeocarpa, diaporthe ambigua, alternaria alternata, fusarium sp. etc. 
hypovirulence has attracted attention owing to the importance of fungal diseases in agriculture and the limited strategies that are available for the control of these diseases. it reduces the use of toxic fungicides which also affect the plant growth. the symptoms resulted by the mycoviruses are reduction in growth, reduction in pigmentation and sporulation, excessive sectoring and aerial mycelial collapse. these are the consequences of alteration in complex physiological and biochemical processes involving interaction between host and virus. cassava (manihot esculenta crantz.) is the major tuber crop in peninsular india, it is grown in an area of 2.4 lakh hectares with the annual production of 6.7 million tonnes both for direct consumption and the starch grain (sago) producing industries, mainly in the southern states of tamil nadu, kerala and andhra pradesh (fao 2005) . in tamil nadu, cassava primarily produced for sago producing industries where it is considered as an industrial crop rather than food crop, so the resource rich farmers are cultivating the cassava as irrigated crop in their fertile land and the poor farmers are raising the crop under rainfed conditions. in south india in addition to cassava there is a practice of intercropping important vegetable crops like, tomato, brinjal, legumes and gourds in cassava fields since all the above mentioned crops are short duration and are money spinners for the farmers. unfortunately, the major production constraint in these vegetable crops including cassava is the geminiviruses belonging to the family of in recent years there has been growing concern regarding the standard of scientific researches in india. the strengths, weaknesses, opportunities and threats (swot) analysis on indian scientific research reviewed the progress of science during the last six decades. 
although the 'strengths' were highlighted in good measure, it was the list of 'weaknesses' that called for attention to upgrade the standard of research and 'opportunities' that provide scope for overall scientific growth. a comparison between india and other countries in terms of research papers published revealed that india's contribution to science has come down enormously. what ails indian science? should we compare the growth of indian science with other developed countries? what criteria should be adopted to judge the quality and standard of scientific research? how can scientists be motivated to improve their scientific output? how do indian journals perform in maintaining quality? this paper critically analyses the scientific journals around the world, based on the scores allotted by the national academy of agriculture sciences (naas) in 2003 and 2007 for 1460 and 1608 journals respectively. in general, the indian journals performed poorly irrespective of the discipline, with only 25-30% in the high-standard category. the paper dealt with the reasons for low impact factor, the anomalies in the allotment of scores to a wide spectrum of journals and the disadvantages the scientists face with the scoring system. a case study was presented of an institute with over 50 scientists whose publications were analyzed to discuss the merits and demerits of the system. the performance of the journals published by prestigious academies, societies and councils was also projected. the paper concluded with the need for enhancing the image of the country through research publications in high-standard journals and the role of various scientific bodies with short and long term measures. poster session herpes simplex virus (hsv) keratitis is a leading cause of corneal blindness throughout the world. 
the infection can be diagnosed by clinical manifestations but in case of atypical ocular cases, laboratory diagnosis is more helpful in timely management of disease. collection of corneal scrapings in all cases of stromal and epithelial keratitis may not be possible, but collecting tear fluid is a convenient procedure causing less discomfort to the patients. therefore, the present study was intended to evaluate the suitability of tear specimens for detecting hsv by polymerase chain reaction (pcr) and immunofluorescence (ifa). tear fluid and corneal scrapings were collected from 134 patients of suspected herpetic keratitis. hsv-1 antigen was detected by ifa using rabbit anti-hsv antibodies. pcr was performed to amplify 111 bp region of thymidine kinase (tk) coding gene and 144 bp region from dna polymerase coding gene of hsv. out of 134 patients hsv antigen was detected in 25 (18.65%) of corneal scrapings and 15 (11.19%) of tear specimens and in 12 (8.95%) patients from both the specimens. hsv gene could be amplified in 44 (32.83%) of corneal scrapings and 16 (11.94%) of tear fluids and in 13 (9.71%) patients from both the specimens. although, corneal scraping seemed to be marginally superior material for detection of hsv, tear fluid may also serve as an appropriate alternative clinical specimen, due to ease of collection and least discomfort to the patients. in either cases pcr detected higher number of hsv cases than ifa. therefore if and when feasible, both ifa and pcr should be used simultaneously on each specimen to obtain best results. cytokines play a key role in the regulation of immune responses. in hepatitis c virus infection (hcv), the production of inappropriate cytokine levels appears to contribute to viral persistence and to affect response to therapy. il-6 is produced by a variety of cells including t cells, phagocytes and fibroblast. 
cytokine genes are polymorphic at specific sites, and certain mutations located within coding/regulatory regions have been shown to affect the overall expression and secretion of cytokines in patients with hcv infection. to correlate the serum levels and polymorphism of il-6 gene in chronic hepatitis c patients and healthy controls. forty patients positive for hcv rna attending the medicine out patient department and wards of lok nayak hospital, new delhi as well as forty healthy controls were enrolled for the study. the serum level of il-6 was detected by using elisa. genomic dna was extracted from whole blood of hcv infected patients and healthy controls by using accuprep genomic dna extraction kit according to manufacture's instruction. the genotyping of il-6 promoter (-174 variant) was carried out by pcr and direct sequencing using the method of patricia woo et al. 1998. the serum level of il-6 was significantly down regulated in hcv infected chronic patients as compared to the healthy controls. genotyping of -174 promoter variant of il-6 was performed by pcr and direct sequencing. il-6 polymorphism in the g/g, g/c and c/c allele was non significant when compared to hcv patients and healthy controls. the il-6 serum levels were significant among hcv infected patients when compared to healthy controls. the polymorphism in the promoter region of il-6 (-174) was found nonsignificantly associated in hcv patients compared to healthy controls. in conclusion, the present study suggests that the host il-6 polymorphism alone may not play a significant role in the outcome of hcv infection. acute gastroenteritis (age) is a global health problem and has been associated with multiple etiological agents, which include bacteria, protozoa and viruses. viral gastroenteritis is considered as the second most common illness in children after upper respiratory tract infection. 
among enteric viruses, rota, noro, enteric adeno, astro and enterovirus are found to be associated with gastroenteritis. although the association of enteric viruses has been established in children hospitalized for age, no such data is available from children hospitalized for other than enteric infections. to determine the prevalence of enteric viruses circulating in hospitalized children. fecal samples, n = 292 (177 symptomatic and 115 asymptomatic for age) were collected from children <5 years of age from three different hospitals across the city of pune from june 2008 to feb. 2009. detection of group a rotavirus was carried out by using antigen capture elisa. rt-pcr and pcr were carried out for the detection of norovirus, enterovirus, astrovirus and enteric adenovirus by using primers targeted to the rdrp gene, the 5′ ncr, the conserved gene for serine protease and the hexon gene respectively. out of 177 fecal samples tested for enteric viruses in age cases, the prevalence of rota, entero, noro, enteric adeno and astrovirus was 33.3% (59), 14.7% (26), 6.2% (11), 2.8% (5) and 1.1% (2) respectively. however, the presence of these viruses in the asymptomatic cases (n = 115) was detected at 7.8% (9), 5.2% (6), 7.8% (9), 0.86% (1) and 1.7% (2) levels respectively. mixed infections of enterovirus and rotavirus were found in both symptomatic 1.6% (3) and asymptomatic cases 0.8% (1). however, mixed infection of enterovirus with adenovirus was found only in asymptomatic cases 0.8% (1). no marked difference was observed in the seasonal pattern of all viruses in the patients with or without gastroenteritis. the findings of this study document highest circulation of rotaviruses in patients symptomatic and asymptomatic for age. the entero and noroviruses remain second most important enteric viruses in these patients.
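the prevalence figures reported above follow from dividing each positive count by the group size (177 symptomatic, 115 asymptomatic). a minimal python sketch of that arithmetic is given below; the counts are taken from the abstract, while the function and variable names are illustrative assumptions and not part of the original study.

```python
# Positive counts per virus, as reported in the abstract above.
# All names here (prevalence, prevalence_table, ...) are hypothetical helpers.
SYMPTOMATIC_N = 177
ASYMPTOMATIC_N = 115

symptomatic = {"rota": 59, "entero": 26, "noro": 11, "enteric adeno": 5, "astro": 2}
asymptomatic = {"rota": 9, "entero": 6, "noro": 9, "enteric adeno": 1, "astro": 2}


def prevalence(count, total):
    """Return the percentage of positive samples, rounded to one decimal."""
    return round(100.0 * count / total, 1)


def prevalence_table(counts, total):
    """Map each virus name to its prevalence (%) within a group of size total."""
    return {virus: prevalence(n, total) for virus, n in counts.items()}


if __name__ == "__main__":
    for label, counts, total in (
        ("symptomatic", symptomatic, SYMPTOMATIC_N),
        ("asymptomatic", asymptomatic, ASYMPTOMATIC_N),
    ):
        print(label, prevalence_table(counts, total))
```

rounding to one decimal reproduces the reported symptomatic values; for the single adenovirus-positive asymptomatic sample the sketch gives 0.9%, where the abstract reports the figure as 0.86% to two decimals.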
influenza in humans is a major public health concern and the understanding of its evolution in the light of its ''antigenic drift'' helps prediction of epidemics and update of yearly influenza vaccine. to antigenically characterize influenza a (h3n2) isolates and study antigenic drift during 1990 to 2009 in pune city. patients with influenza like illness were identified using a strict case definition from dispensaries located in different areas in pune and clinical samples (ns/ts) were collected after obtaining informed consent. these clinical samples were processed in vivo (in fertile eggs) and in vitro ( overall, an additional 35 (39.7%) positive cases of dengue could be detected when ns1 antigen assay was also used in the study. highest ns1 antigen positivity was encountered among the samples collected on the 3rd day of fever whereas mac elisa for anti igm antibody was positive after 4th day and gradually there was an increase in the positivity towards the convalescent phase of the disease. the results of this study indicate that ns1 antigen based elisa test can be an useful tool to detect the dengue virus infection in patients during the early acute phase of disease since appearance of igm antibodies usually occur after fifth day of the infection. concurrent use of both diagnostic assays namely ns1 antigen as well as mac elisa will improve the overall detection of dengue infection. early detection of acute dengue virus infection is crucial to provide timely information for the management of patients. human parvovirus b19, a member of the parvoviridae family, is a pathogen associated with a wide variety of diseases. most commonly, it causes childhood rash erythema infectiosum, but in some cases more serious symptoms such as persistent arthropathy, critical failures of red cell production causing transient aplastic crisis, this infection in pregnancy causes hydrops fetalis and myocarditis. 
traditional immunosuppressive therapy being unsuccessful, anti-viral therapy might be worthy of consideration. functional annotation would provide role of viral proteome in its survival and pathogenic mechanisms. svmprot functional family annotations of vp2 protein had deciphered its zincbinding, coat protein, outer membrane, chlorophyll biosynthesis, dna repair and calcium-binding nature. vp2 protein is having a key role in viral assembly of b19 virus and being non-homologous to human proteome, it was identified as an attractive molecular target for structure based drug discovery. the vp2 protein crystal structure was energy minimized using charmm. a structure based virtual screening method was applied using ligandfit to identify potential inhibitors of vp2 protein from chembank database and ten potential human parvovirus b19 vp2 inhibitors were proposed. prism 310 genetic analyzer. the drafting of the sequences was performed using bioedit software and were submitted in genbank. for phylogenetic interpretation denv representing the full extent of genetic diversity in denv-1, denv-2 and denv-3 were collected from genbank. neighbor joining algorithm was implemented with bootstrap value of 10,000 replicates for phylogenetic inference using mega 4.0.2. the genomic region 134 to 644 (c-prm gene junction) of denv were amplified directly from patient serum. twelve of 72 samples were positive for dengue viral rna. of these 4 were dengue type 1, 1 was dengue type 2 and 7 were dengue type 3. for molecular epidemiological survey and genotyping of the sequences more than 100 sequences from different geographical areas including sequences form previously reported north indian isolates were compared with our present data set. the critical analysis of the sequences revealed: 4 dengue type 1 sequences were clustered within sub-type 2 of genotype iii and all the 7 sequences of den-3 clustered along with genotype iii. 
thus, among the dengue types 1, 2 and 3 currently circulating in north india, dengue type 3, genotype iii, being the predominant one followed by, genotype iii of dengue type 1. although there is no specific treatment or vaccine available currently, the confirmative rapid diagnosis based on detection of viral nucleic acid or igm antibodies in serum, an indication of recent infection, helps in epidemiological monitoring, symptomatic treatment of patients and determining prognosis. serological detection of anti-cgv igm antibodies was performed using rapid immuno-chromatographic assay (rica) and igm-antibody capture enzyme linked immunosorbant assay (mac-elisa). eighty convalescent sera were tested by rica and 60 of them were found positive for anti-cgv igm antibodies. twenty-five anti-cgv igm antibody rica positive sera were further assayed using mac-elisa. more sera from the patients are currently being tested to compare the sensitivity of these two serological assays in anti-cgv igm antibody based early serological diagnosis of cgv infection and the findings will be presented. thus the present study was designed to evaluate the utility of multiplex pcr (mpcr) for simultaneous and rapid detection of dengue and chikungunya viral infections. seventy-two acute phase blood samples from clinically suspected dengue cases were subjected for dengue and chikungunya uniplex pcr using dengue genus specific primers and e gene specific primers for chikungunya virus as well as multiplex pcr was developed for simultaneous detection of dengue and chikungunya infection. standard strains of dengue and chikungunya virus were used as controls. 13 of the 72 clinically suspected dengue samples were found to be positive for dengue viral rna by dengue uniplex pcr as well as dengue chikungunya mpcr whereas none of the samples were positive for chikungunya virus infection by both uniplex chikungunya pcr and dengue chikungunya mpcr. 
the result of dengue and chikungunya uniplex pcr was found to be 100% concordant with dengue chikungunya multiplex pcr. dengue chikungunya multiplex pcr was found to be a potential rapid test to detect dengue and chikungunya viral infections simultaneously in clinical samples. sheetal malhotra, neelam marwaha, karan saluja, ratti ram sharma department of transfusion medicine, pgimer, chandigarh 160012 transmission through blood and blood products can be reduced to a great extent by efficient and reliable testing of the blood. the newer fourth generation elisa assays simultaneously detect antibodies against hiv-1 and 2 and the presence of p24 antigen and thus shorten the window period to about 14 days, as compared to 22 days with third generation elisa. to compare the hiv seroprevalance among blood donors using fourth generation elisa (antigen-antibody) versus third generation elisa (antibody) assay. this was a prospective study involving 5100 blood donors of which 3400 were voluntary donors (1700 being students and 1700 being non students) and 1700 were replacement donors. sex workers are one of the core group for transmission of sti/hiv and as a ''bridge group'' to the general population. accordingly, highest priority is given to this group in targeted intervention for prevention of hiv/aids. here we are describing one such female sex worker who was harbouring 5 concomitant sti including 4 viral sti. a 25 year old female sex worker was brought to the sti clinic of a tertiary care hospital by ngo with complaint of genital discharge for 3 days. on per speculum examination, cervix was slightly erythematous, tender with mucopurulent discharge. there was no vaginal discharge or ulcer in anogenital area. however, there was a wart at lateral wall of vagina. as per naco syndromic management guideline, treatment was given for n. gonorrhoeae, c. trachomatis and hpv. 
cervical swab was taken and subjected to various microbiological investigation for the detection of sti viz n. gonorrhoeae, c. trachomatis, t. pallidum, candida spp., t.vaginalis, hsv-1, hsv-2, hiv, hbv, hcv, hpv and m. contagiosum. saline wet mount showed pus cells, but no yeast cells or trophozoite of trichomonas vaginalis. gram stained smear showed more than four polymorphonuclear leucocytes in the absence of gramnegative intracellular diplococci and a presumptive diagnosis of non gonococcal urethritis was made. no organism was isolated on any culture media after appropriate incubation. cervical swab was negative for antigen of c. trachomatis. serum was tested positive for hbv, hcv, hsv-2 and t. pallidum though it was seronegative for hiv. in the present case, the female sex worker was harbouring four viral sti viz hsv-2, hbv, hcv and hpv alongwith t. pallidum. however clinically she was diagnosed and treated accurately only for genital wart while cervical discharge due to hsv-2 was misdiagnosed. it is necessary to try to test alternative approaches such as periodic presumptive therapy of viral sti, because this will not only boost up the efforts of sti control in the target group but also help in hiv control. alternatively, regular clinical and laboratory screening for viral sti may be tried. densonucleosis viruses (dnv) belong to parvoviridae family. they are the etiological agents of insect's disease known as densonucleosis, which leads to death or loss of vital functions of the infected insect. densonucleosis virus of mosquitoes has generated lot of scientific interests because of its tremendous potential in biological control and its application as a transducing vector. earlier, we have reported the isolation and characterization of a dnv from aedes aegypti mosquitoes and its prevalence among different ae. aegypti populations from india. 
there are reports suggesting that when aedes albopictus mosquitoes co-infected with dengue-2 and dnv, the multiplication of den-2 is suppressed. the present study focus on the effect of coinfection of ae. aegypti mosquitoes with dnv and chikungunya virus (chik). the first instar mosquito larvae were infected with dnv and the emerging dnv infected females were then infected with chikv by oral feeding. thus obtained chik infected female mosquitoes were analyzed by real time pcr for both dnv and chikv on alternate days post-infection, up to the 14th day. the data showed no significant difference in the multiplication of either of the viruses after co-infection. results suggest that chikv neither stimulates the replication of dnv nor is its own replication suppressed due to co-infection. this study forms an initial step in understanding the role played by such endogenous viruses on the vector dynamics. chandipura virus pathogenesis is manifested as encephalitis in young children with a very high mortality rate. this damage could be due to direct replication of the virus in brain parenchymal tissue or immune system mediated. this study aims at elucidating the role of brain infiltrating lymphocytes in pathogenesis using mice as the model system. mice were inoculated intracerebrally with the virus and the perfused brain tissue was used to isolate the lymphocytes. control mice were inoculated with an equal amount of media. in order to standardize the procedure for isolation of lymphocytes from brain tissue, splenocytes were processed to isolate the lymphocytes using histopaque density gradient method. methods to isolate lymphocytes from brain tissue as described by earlier workers were tested for the ease and efficiency of procedure using known suspension of lymphocytes from spleen. percoll density gradient method provided optimum yield of lymphocytes with an ease of handling. 
in this, brain cell suspension used to prepare 30% percoll is layered over 70% percoll prepared using media in 1:2 ratio. density gradient centrifugation is carried out at 900 × g for 20 min at 15°c to obtain the lymphocyte layer at the interface. leishman staining was performed to analyze the morphological characteristics of isolated lymphocytes. normal lymphocytes showed dark blue stained nucleus. some bigger sized cells with diffused nucleus characteristic of atypical lymphocytes were observed and some of the cells were surrounded by hair like structures. phenotypic characterization was carried out using flow cytometry. the presence of cd4+, cd8+ and cd19+ cells was observed. the percentages of cd8+, cd4+ and cd19+ cells were found to be 7.60%, 35.14% and 34.32% respectively in the lymphocytes isolated from the infected animal and 5.65%, 30.27% and 3.13% respectively from the control animal. hence, cd19+ cells showed maximum infiltration after infection. (santosh et al. 2008; pradeep et al. 2008). in the present study chikv suspected blood samples were collected and the acute phase samples were subjected to rt-pcr for the presence of virus specific rna by using the primer pair dvrchk-f/dvrchk-r as described by us earlier (naresh kumar et al. 2007). the convalescent phase samples were screened for chikv specific antibodies by using sd bioline chikungunya igm rapid test. six sets of primers were designed to amplify the complete nsp4 and complete structural genes of chikungunya virus. the products were further gel purified, cloned in ptz57r/t vector and the recombinant clones were sequenced and submitted to the genbank. the complete nsp4 gene and structural genes were compared with other available sequences in the genbank. sequence analysis results will be presented. the present study discusses these aspects in detail. some of these phages (viz. v953, v954) showed plaques at 42°c but not at 37°c. thus they seem to be lysogenic.
for propagating and increasing the titre of all the above isolates, various previously described methods were attempted, but none of these methods were satisfactory. but when siliconized glassware and plastic-ware were used, propagation was successful. we showed that siliconization of glassware and plastic-ware was essential for the propagation of our mycobacteriophage isolates v951, v952, v953, v954 and v955. also, phage dilution medium (pdm) as described by chaterjee et al. (2000) was found to be effective for picking out of the plaques made by the phages. in this way, the phage isolates were propagated up to p 3 . the various passages of the phage isolates v951, v952, v953, v954 and v955 (i.e. original, p 1 , p 2 and p 3 ) were stored at -80°c. the four major routes of transmission are unsafe sex, contaminated needles, transmission from an infected mother to her baby at birth (vertical transmission) and breast milk. screening of blood products for hiv has largely eliminated transmission through blood transfusions or infected blood products in the developed world. in 2008, globally, about 2 million people died of aids, 33.4 million were living with hiv and 2.7 million people were newly infected with the virus. hiv infections and aids deaths are unevenly distributed geographically and the nature of the epidemics vary by region. more than 90% of people with hiv are living in the developing world. there is growing recognition that the virus does not discriminate by age, race, gender, ethnicity, socioeconomic status-everyone is susceptible. however, certain groups are at particular risk of hiv, including men who have sex with men (msm), injecting drug users (idus), and commercial sex workers (csws). the present study indicates the prevalence of hiv infection among the people residing in the northern region of india predominantly among the foothills of the himalayas. 
the study was carried out on the patients visiting herbertpur christian hospital (a unit of emmanuel hospital association) under the integrated counselling and testing centre scheme at the respective hospital during 2009-2010. the study indicates the screening of people groups residing in the respective area through community health schemes. the diagnosis of the hiv infection is done by three types of assays, namely the tridot method, which is the rapid method of diagnosis, followed by the hiv coombs test, which involves the dot immunoassay principle. the third assay is the enzyme linked immunosorbent assay (elisa). the number of patients screened during the period of september 2009 to march 2010 is 635, which includes patients coming from four different states, namely haryana, uttarakhand, uttar pradesh and himachal pradesh. the number of people who tested positive is 8 and the number of people who tested negative is 627. the people who tested positive are sent to the higher centre for other confirmatory tests such as pcr and western blot analysis. these patients are sent for treatment and prophylaxis at a respective recognised centre in dehradun. the present study determines a consistent community hiv screening and treatment approach through diagnostics, counselling and awareness programmes. classical swine fever (csf), also known as hog cholera, is a highly contagious and fatal disease of swine. csf rapidly became a major issue of pig industries. it still causes important economic losses worldwide. it is considered a major health problem of swine in india. during the months of august to october 2009 there was an outbreak of classical swine fever in bihar. from three districts, darbhanga, patna and supol, a total of 36 different infected tissue samples such as kidney, spleen and lymph node were collected from dead/morbid pigs.
total rna was isolated from 20% homogenate of infected tissues in sterile pbs by tri-reagent (sigma, usa) according to the manufacturer's instructions and cdna was prepared by using a commercially available kit. the cdna was stored frozen at -20°c until used. for the molecular detection of classical swine fever virus, specific nested pcr amplification of e2 and 5′ ntr was done along with ns5b and erns amplification. primarily these samples were found positive with these primers. further confirmation by sequencing was done by cloning of these pcr products in pgem-t easy vector. e2 and 5′ ntr sequences were considered for phylogenetic analysis along with 20 complete available sequences of csfv. nucleotide sequence alignments were carried out using the clustalw program (dnastar) and phylogenetic tree analysis (dnastar) showed that 5′ ntr has close proximity with the taiwan strain (accession no. ay568569) and e2 shows close proximity with chinese isolate csfv-39 (accession no. af407339). peste des petits ruminants (ppr) and sheeppox are oie notifiable diseases of small ruminants, especially sheep and goat. both the diseases are economically important in enzootic countries like india and are major constraints in the productivity of animals. considering the geographical distribution of both ppr and sheeppox infections and the prevalence of mixed infection, in the present study, safety and potency of the experimental dual vaccine comprising attenuated strains of thermostable ppr virus (pprv-revati, p-50) grown at 40°c and attenuated sheeppox virus (sppv-srinagar, p40) were evaluated in local non-descript sheep. experimental animals were divided into four groups of six animals each, which received 100 doses (10^5 tcid50), 1 dose (10^3 tcid50) and 1/10th dose of vaccine and normal saline as control, respectively, in 1 ml volume subcutaneously. serum samples were collected on 0, 7, 14, 21 and 28th day post vaccination.
sheep simultaneously immunized with 1 ml of vaccine consisting of either 100 or 1 doses of each of pprv and sppv were monitored for clinical and serological responses for a period of 3-4 weeks post-immunization (pi) and post challenge (pc). specific immune responses, i.e. antibodies directed to both pprv and sppv, could be demonstrated by ppr competitive elisa kit and capripox indirect elisa, respectively, following immunization. all the immunized animals resisted infection when challenged with virulent strain of sppv (srinagar isolate at p-6) on 28 dpi, while in-contact control animals developed characteristic signs of sheeppox. the challenge of the sheep against ppr was not carried out; however, the antibody titre after immunization, determined by snt and elisa, indicated a protective titre as per an earlier report on goats. the dual vaccine was found safe at higher dose and induced protective immune response even at lower dose (10^2 tcid50) in sheep, which was evident from sero-conversion as well as challenge study with sppv. the study indicated that both the viruses are compatible and did not interfere with each other in eliciting immune response, indicating the feasibility of use of this dual vaccine in combating both infections simultaneously. goatpox is one of the highly contagious, oie notifiable and economically important viral diseases of goats. the disease is caused by goatpox virus (gtpv), which is classified in the genus capripoxvirus in the family poxviridae. the disease incurs severe economic losses in terms of high morbidity in adults and heavy mortality in young kids and is a major constraint in goat farming in india. considering the enzootic nature and economic impact of the disease, it is important to control the infection by developing an effective vaccine. recently, a vero cell based live attenuated goat pox vaccine using gtpv uttarkashi isolate (p60) has been developed in the authors' laboratory and evaluated in goats.
the vaccine was found safe, potent and immunogenic experimentally and even in field trials. the vaccine has been evaluated at large scale in different regions of the country and found suitable for mass vaccination. however, the longevity of potency was not evaluated. therefore, long term potency trials were conducted for a period of 4 years with annual challenge by using virulent goatpox virus and sero-monitoring. a sufficient number of hill goats were vaccinated with 1 dose of vaccine (10^2.5 tcid50/ml) and monitored for clinical and serological response. every year, vaccinated (n = 5) and control animals (n = 2) were used for challenge with virulent strain (2 × 10^7.0 srd50/ml, gtpv mukteswar). sera of pre- and post-challenged (14 dpc) animals including controls were collected and monitored for serological response in the form of specific antibody production by snt and indirect elisa. all the vaccinated animals were protected on challenge, whereas all unvaccinated controls developed infections. the same was reflected in sero-monitoring of collected sera. so the developed live attenuated goat pox vaccine was found safe, immunogenic and potent for a period of 4 years after immunization and suitable for mass scale vaccination in control and eradication of goat pox, along with suitable diagnostic tool(s), in a goatpox enzootic country like india. rotavirus infection in avian species varies from subclinical infections to outbreaks of diarrhea. the economic significance of rotaviral enteritis to the poultry industry has not yet been defined, but by analogy to the situation in mammals, it is likely to be significant. unlike the extensive studies performed on rotavirus infection in humans and animals, limited studies have been carried out to determine the extent of exposure of poultry birds to rotaviruses. to determine the prevalence of avian rotavirus antibodies in commercial broiler chickens.
a total of 120 chicken serum samples were collected from the lairage of a poultry slaughter house to which birds from four different broiler farms in and around pune city were supplied. the serum samples were tested by an igg antibody capture elisa wherein purified chicken rotavirus ch2 was used as coating antigen. sera from specific pathogen free (spf) chicks (n = 20) served as negative control in the test. cut off was calculated as mean negative control + 3 sd (standard deviation). s/co (mean sample od450/cut off) values above 1 (1.113-4.445) in 60% (72/120) of serum samples indicated positivity to rotavirus antibodies. the result of the study indicates exposure of the birds to avian rotavirus or a similar agent that is circulating in pune city. bluetongue has become established in south india causing regular outbreaks in sheep. btv serotypes 2, 9, 15 and 21 were isolated from native sheep of andhra pradesh. the other serotypes circulating in the state need to be identified. however the major constraint is the serotype identification. to overcome the difficulties of traditional serotyping methods (neutralization tests), nucleic acid based tests are being tried. rt-pcr for serotyping was standardized using primers specific to vp2 gene of btv-2, 9 and 15 serotypes. rt-pcr resulted in a 653 bp product of btv-2 and a 1241 bp product of btv-9, as defined by the specific primers. however non specific amplification at two different sites, i.e. 700 bp and 1500 bp, was noticed for btv-15. specificity of rt-pcr was evaluated. btv-2 and btv-9 specific primers could amplify only btv-2 and btv-9 respectively, whereas btv-15 type specific primers amplified not only btv-15 but also btv-2 and btv-9. nucleic acid sequence data obtained from btv-2 pcr product and btv-9 cloned products were specific to vp2 gene of btv-2 and btv-9 respectively.
however, 700 and 1500 bp products of btv-15 were identical to vp4 gene of btv-2, 8, 10, 11, 13 and 18 and vp1 gene of btv-2, 8 and 10 respectively, indicating the non specific amplification of btv-15. foot and mouth disease is the most contagious and economically important disease of cloven footed animals. the disease is controlled by regular vaccination using the vaccine produced from the virus grown in cell culture. the vaccine strain used for vaccine production is selected from the field isolates based on the adaptability and growth kinetics in bhk21 cells and antigen coverage. however the field viruses need to be passaged several times to adapt in tissue culture. passage of field viruses in tissue culture may result in development of mutants whose genetic makeup may differ from the field samples. also, some of the field strains may fail to adapt or may grow poorly in tissue culture, thus the efficiency of the vaccine gets affected. structural proteins of fmdv carry the sequences which determine the serotype specificity and immunogenicity. thus one may replace the gene coding for structural proteins in the full length cdna copy of the vaccine strain that has been adapted to tissue culture with the poly-structural protein gene (p1) of the field strain, so that the chimeric virus gets the serotype specificity of the field strain besides retaining the other characteristics that are needed for a vaccine virus. we have made replication competent fmdv asia 1 full length genome and cloned it under t7 and cmv promoter separately in plasmid vectors. bamhi sites were created for inserting p1-2a gene of other field strains. the p1-2a of type 'o' vaccine strain was amplified directly from the cattle tongue material and cloned in plasmid vector, and its specificity was studied by sequence analysis and gene expression. we have introduced 'o' p1-2a gene into the full length construct devoid of asia 1 structural protein gene, p1-2a.
the in vitro transcribed rna, in the case of the t7 promoter construct, and plasmid dna, in the case of the cmv promoter construct, were transfected into bhk21 cells. after passaging, the virus obtained was studied for its specificity. this approach may be used not only for rapid selection of a vaccine strain but also as a repository of the cdna copy of the virus. the p1 is composed of 1a, 1b, 1c and 1d (vp4, vp2, vp3 and vp1, respectively), of which vp1 is the most immunogenic; a subunit vaccine produced with vp1 alone was able to induce a high level of neutralising antibodies. to control the disease in india, a polyvalent vaccine consisting of inactivated virus of all three serotypes is in use. however, conventional vaccines have several drawbacks, including safety and temperature sensitivity. hence, subunit vaccines consisting of the vp1 protein have been tried as an alternative. these showed limited success because antigenic variation in field viruses allows escape from neutralization by antibodies generated against a single cloned protein. the present study was therefore undertaken with the objective of including all the neutralizing epitopes present in the three serotypes by linking their vp1 (1d) genes to produce a polyvalent protein for use as a poly subunit vaccine. we constructed a cassette by linking the genes of the three serotypes 'o' (622 bp), 'a' (640 bp) and 'asia 1' (622 bp). these genes were cloned individually in the commercial pbsk vector and confirmed by sequence analysis before linking in the pcdna vector. the linked gene construct was sub-cloned in the pet32 expression vector. expression of the protein from the pet vector was induced with iptg and analysed by sodium dodecyl sulphate polyacrylamide gel electrophoresis (sds-page). a fusion protein of 72 kda was observed in page gels.
since the protein carries 6 his residues from the vector at the n-terminal end, affinity purification was carried out using a nickel nitrilotriacetic acid (ni-nta) agarose matrix. the immunoreactivity of the purified protein was assayed by western blot with anti fmdv type 'o' and 'asia 1' specific sera. the protein may be used as a subunit vaccine. silkworm diseases caused by viruses, bacteria, fungi and protozoans are major constraints on silk cocoon production in all sericultural countries; among these, the viral diseases nuclear polyhedrosis and infectious flacherie, caused by bmnpv and bmifv, cause severe crop loss. traditional disease management strategies include prophylactic measures and the use of disease free silkworm eggs. the prophylactic measures include disinfection of the silkworm rearing house and appliances, the egg surface, the silkworm bed and the rearing surroundings. the disinfectants presently used in sericulture are either formaldehyde or chlorine based products, but these chemicals are neither eco- nor user-friendly. awareness of the health hazards of formaldehyde and the environmental pollution caused by cl2 has necessitated the development of eco- and user-friendly disinfectant products for sericulture. these include alternative disinfectant products developed from biodegradable chemicals and plant based ingredients by apssrdi, hindupur and the central silk board for the management of silkworm diseases in india. the ideal disinfectant for sericulture would be one that inactivates silkworm pathogens of diverse origin and is economical. the paper discusses the disadvantages of hcho and cl2 based disinfectants and the advantages of eco- and user-friendly disinfectants for the management of silkworm diseases, especially those caused by viruses.
the baculovirus expression vector system (bevs) is widely used for the production of high levels of properly post-translationally modified, biologically active and functional recombinant proteins, and has facilitated basic biomedical research on protein structure, function, drug discovery and the roles of various proteins in disease. bevs is based on the introduction of a foreign gene into a genome region that is nonessential for viral replication, via homologous recombination with a transfer vector containing the target gene. the resulting recombinant baculovirus lacks one of the nonessential genes (polh, v-cath, chia etc.), which is replaced with the foreign gene encoding a heterologous protein that can be expressed in cultured insect cells and insect larvae. the insect cell-bev system is widely used to produce recombinant proteins. bevs also eliminates concerns about pathogens that could potentially be transmitted to humans, as it is non-infectious to vertebrates. these features make the silkworm system an ideal expression and delivery package for producing proteins of medicinal importance. the efficiency, low cost and large-scale production of proteins using bevs represent a breakthrough technology that is facilitating high-throughput proteomic studies. the bevs has become a core technology for cloning and expression of genes for the study of protein structure, processing and function; production of biochemical reagents; study of the regulation of gene expression; commercial exploration, development and production of vaccines, therapeutics and diagnostics; drug discovery research; and exploration and development of safer, more selective and environmentally compatible biopesticides. utilization of silkworm larvae and pupae as bioreactors, with recombinant bmnpv producing foreign proteins, extends the uses of silkworms. owing to its large size and high protein synthesis ability, as well as its suitability for mass culture, the silkworm is considered a good candidate for producing recombinant proteins.
wssbv is the causative agent of a disease that has recently caused high shrimp mortalities and severe damage to shrimp culture. wssbv has been found across different penaeid shrimp species. in order to develop an effective diagnostic tool, a wssbv genomic library was constructed by cloning wssbv genomic dna extracted from purified virions. in the present study, wssv disease free samples (confirmed by pcr analysis) were collected from hatcheries in different areas of guntur and prakasam districts and analysed to study the effect of physical parameters such as temperature, ph, salinity and turbidity on the prevalence of the disease. the surface water temperature fluctuated between 19 and 30.2°c in diseased ponds and between 25.2 and 34.5°c in healthy ponds. these results show a definite influence of temperature on the prevalence of wssv. the present day strategy in vaccine development is to include a marker facility that helps in distinguishing the antibody response due to vaccination from that due to infection in vaccinated animals. such information is relevant for effective disease control programmes, especially when using inactivated virus vaccines such as those for foot and mouth disease (fmd). the antibody generated in the animals through vaccination alone is the measure of vaccine efficacy and safety. presently, inactivated fmd virus (fmdv) vaccines are used to control the disease in endemic countries like india. the quality assurance of the vaccine depends on its efficacy in generating protective antibody without causing subclinical disease due to improper inactivation. since the protective antibody response in vaccinated animals cannot be distinguished from that of infected animals, one needs to assay the antibody response against non structural proteins (nsps), and the vaccine must be free of contaminating nsps.
production of vaccine free of nsps requires a cumbersome method of virus purification, which adds to the cost of the vaccine. alternatively, one may develop a positive marker vaccine by including a foreign protein or epitope that is not otherwise expected to be present in the vaccine, so that the antibodies generated against it help in detecting the vaccine related response. here we report a molecular approach by which we introduced an immuno-dominant epitope of green fluorescent protein (gfp) into the structural protein gene of the foot and mouth disease virus vaccine strain asia 1 (63/72). our laboratory has produced a mini-genome of fmdv asia 1 that lacks the structural protein gene (p1-2a) coding for all the structural proteins (vp1-4) of fmdv asia 1, for use as a vector (pcfl dasia 1). the p1-2a of the asia 1 vaccine strain was cloned separately into a plasmid vector, and by successive pcr mutagenesis and cloning we introduced the nucleotide sequence corresponding to a 9 amino acid epitope of gfp into the p1-2a gene. the gfp epitope was inserted by replacement at the n-terminal region of vp2, which is not immunogenic. the modified p1-2a was expressed in e. coli and studied. the modified p1-2a gene with the gfp epitope was inserted into pcfl dasia 1 to obtain a full length replication competent cdna cloned under the cmv promoter in pcdna (pcflasiagfp). this can be used to produce synthetic virus with the gfp epitope that can generate antibodies not only against neutralizing epitopes but also against the gfp epitope. the presence of antibody against the gfp epitope in a vaccinated animal will reveal vaccine efficacy. an elisa against gfp can be used as a companion test not only for safety evaluation but also for quick evaluation of efficacy. further, the absence of nsp antibodies in the serum may reveal the quality of the vaccine with respect to safety. self replicating dna vaccines are developed to achieve a robust immune response through enhanced antigen production and gamma interferon expression in vaccinated animals.
since self replicating dna vaccines induce gamma interferon expression, which helps in viral clearance, such vaccines are expected to be useful in curing even carrier and persistently infected animals. understanding the events that help elicit both arms of the immune response in vaccinated animals is necessary to understand the effectiveness of the vaccine. the work presented here deals with the immunological evaluation, in vaccinated guinea pigs, of a sindbis virus replicase based dna vaccine carrying linked fmdv vp1 genes. we constructed a self replicating dna vaccine vector and, downstream of a subgenomic promoter, inserted a secretory signal followed by the linked vp1 genes of three fmdv serotypes (o-a-asia 1) with a glycine and proline bridge in between. guinea pigs were vaccinated with the construct, and at 28 days post vaccination the cellular response was evaluated by studying cd8 levels, by mtt assay and by cytokine profiles in real time assays. the humoral response was evaluated by studying cd4 levels in whole blood by facs analysis and serum antibody levels by snt and elisa. the animals were challenged with 100 gp infective doses of fmdv type 'o' virus and lesions were scored. further, the replicative efficiency of the challenge virus was studied by 3ab elisa. the results showed that all the assays except antibodies against 3ab protein correlated positively with protection. as expected, the titre of antibodies against 3ab protein was lower, indicating that challenge virus replication was inhibited in the vaccinated animals. the limited studies conducted by us showed that the self replicating vaccine has the potential to emerge as a potent vaccine for fmd. ganjam virus (ganjv) belongs to the genus nairovirus (family bunyaviridae). these viruses cause diseases in livestock. ganjv has been isolated from different animal hosts and tick vectors in india.
the genus nairovirus includes a total of 34 tick-borne viruses, classified into 7 serogroups. the important serogroups are crimean congo hemorrhagic fever (cchf) and nairobi sheep disease (nsd). the main members of the nsd group are nsd and dugbe viruses. their genome consists of three segments of single stranded rna, viz. s, m and l, which encode the viral nucleocapsid protein, the viral glycoproteins g1 and g2, and the viral polymerase, respectively. ganjv is very closely related to nsd virus (nsdv). nsdv is found in east and central africa and causes very high morbidity and mortality in livestock. the present study involves phylogenetic comparison of ganjv isolates from india with other nairoviruses based on the complete n gene. it will help in understanding the nucleotide (nt) and amino acid (aa) changes that have occurred in ganjv strains from different geographical areas. eight strains of ganjv isolated at niv during 1954-2002 from different parts of india were used in this study. virus stocks were prepared in the vero e6 cell line and used as the source of viral rna. the n gene was amplified either as a complete gene in one reaction or in fragments where necessary. the sequences obtained were analyzed, annotated to obtain a consensus sequence, and aligned against the sequence of the prototype strain of ganjv and other representative nairoviruses. the nt sequences were translated to aa sequences and analysis was done at both the nucleotide and amino acid levels. phylogenetic trees were constructed from the nt and aa sequences and compared with other nairoviruses (cchf, dugv, hazv, kupv and nsdv) for which complete s segment sequences were available in the genbank database (ncbi). the phylogenetic data at both the nt and aa levels showed that all the strains of ganjv form a monophyletic lineage with nsdv. cchfv and hazv together formed another clade, whereas dugv and kupv formed a separate branch in the tree.
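pairwise percent differences between aligned n gene sequences, of the kind compared in this study, can be computed as an uncorrected p-distance over the alignment. the following is a minimal python sketch, assuming the sequences are already aligned to equal length; the function name `percent_difference` and the toy sequences are hypothetical and are not taken from the study, which does not state how its percentages were computed.

```python
def percent_difference(seq_a: str, seq_b: str) -> float:
    """Uncorrected p-distance x 100 between two aligned sequences.

    Columns where either sequence has a gap ('-') are skipped.
    Works for nucleotide or amino acid strings alike.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    mismatches = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == "-" or b == "-":
            continue  # ignore gap columns
        compared += 1
        if a != b:
            mismatches += 1
    if compared == 0:
        raise ValueError("no ungapped positions to compare")
    return 100.0 * mismatches / compared


# hypothetical toy alignment (not real ganjv or nsdv data)
toy_a = "ATGGCTAACC"
toy_b = "ATGGCCAAC-"
print(round(percent_difference(toy_a, toy_b), 2))  # 1 mismatch in 9 columns -> 11.11
```

the same function can be applied to translated sequences to obtain an amino acid level figure; corrected distances (for example under a substitution model) would need a dedicated phylogenetics package.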
the different ganjv strains showed 9-10% difference from nsdv at the nucleotide level and 3-4% difference at the amino acid level. hazv showed 37-38% difference at the nt level and 37% difference at the aa level from ganjv as well as nsdv. the present data suggest that ganjv and nsdv are minor variants of the same virus. diarrhoeal syndrome is one of the major concerns of the livestock industry. most diarrhoeic cases in animals go unnoticed and limited attention is paid to viral etiology. the presence of large amounts of fecal matter in the animal shed acts as a source of infection for calves via drinking water, feed or contaminated soil. keeping this in view, an investigation was planned to detect the association of rotaviruses with diarrhoea in dairy calves and to observe the genomic diversity among the circulating viruses in the tarai area of uttarakhand. a total of 63 diarrhoeic fecal samples collected from the instructional dairy farm, nagla, pantnagar, uttarakhand were screened during the study. samples were collected from both cow calves and buffalo calves of 0-3 months of age. for the diagnosis of rotavirus, all the fecal samples were subjected to rna electrophoresis after nucleic acid extraction, and viral genome segments were visualized by silver staining. of the 63 samples tested, seven were positive in rna-page, showing the typical 11 genome segment migration pattern of bovine rotavirus. the prevalence of bovine rotavirus in these samples was 11.32% in cow calves and 10% in buffalo calves. on the basis of the migration pattern in rna-page, the viruses were identified as group a rotaviruses with the typical 4:2:3:2 pattern. variation in the mobility of individual genome segments was observed among the bovine rotavirus isolates, which may indicate the emergence of mutants among the circulating isolates. a vp6 gene based group a specific rt-pcr was standardized, and all the isolates from this area were confirmed to be of group a.
work is in progress to genotype the bovine rotaviruses of this region on the basis of the vp7 and vp4 genes. this study emphasizes the need to explore the prevalence of bovine group a rotaviruses in different parts of uttarakhand and to characterize them genetically, which could help in the selection of control strategies for rotavirus infections. foot-and-mouth disease (fmd) is endemic in india, causing enormous economic loss to animal keepers and a trade embargo by fmd free countries on livestock and animal products. rapid diagnosis of fmd is of immense importance in prevention and control of the disease. fmd is initially diagnosed clinically and confirmed by laboratory tests. virus isolation in cell culture and sandwich elisa for antigen detection are commonly practiced in laboratories. virus isolation is very sensitive but can be slow, whereas the analytical sensitivity of the elisa is lower and it cannot be used with certain sample types. the use of molecular techniques in the diagnostic laboratory has greatly increased the speed, specificity and sensitivity of fmd diagnostic tests. molecular techniques such as rt-pcr, pcr-elisa and dot hybridization can be used with more success for detecting carrier animals and animals harboring sub-clinical infection, and can be applied to a wide range of clinical sample types. these techniques can be used as genus and serotype specific tests, including detection of particular lineages/genotypes within a serotype. multiplex pcr has been used to differentiate serotypes of fmdv; the technique is sensitive, experimentally simple, cost effective and less time consuming. the assay can be used for serotyping elisa negative samples. the molecular techniques not only help in diagnosis but are also useful for epidemiological studies. lineage differentiating rt-pcr has been useful in identifying different lineages of serotype asia 1 (lineages b, c and d) before proceeding with sequencing of the 1d region.
similarly, genotype differentiating rt-pcr has been developed and used to differentiate two genotypes of serotype a (genotypes vi and vii). these assays have the potential to be applied directly to clinical samples, thereby saving much of the time needed for sample processing and nucleotide sequencing. the recent development of real time rt-pcr methodology has allowed the diagnostic potential of molecular assays to be realised. advances in real time pcr technology have made it possible to combine several assays within a single tube, and this work is in progress in our laboratory. integration of these assays onto automated high throughput platforms provides diagnostic laboratories with the capability to test large numbers of samples. microarray technology has provided greater screening capabilities for pathogen detection. a microarray allows the inclusion of a large number of oligonucleotide probes for identification of mutant pathogens and for subtype determination. the combined properties of high sensitivity and specificity, low contamination risk and speed have made real time pcr and microarray technology a highly attractive alternative to conventional methods for increasing the percentage of outbreaks confirmed and analyzed and for tracing the origin of the fmd virus responsible for outbreaks. dna vaccines are expected to elicit both humoral and cellular responses, the cellular response being long lasting. however, the approach has several limitations, such as poor stability of dna, poor expression and risk of integration. poor expression becomes the major limitation in the case of fmd, as fmdv proteins are poor immunogens. also, dna vaccine vectors carrying only eukaryotic promoters elicit a strong cmi response and a weak humoral response. achieving a humoral response requires expression and secretion of the expressed protein, so that antigen presenting cells are able to process the antigen and a humoral response is produced.
in the case of fmd, the humoral response is as important as the cellular response. the present project aims at addressing these issues, achieving higher expression and secretion of the protein, by constructing self replicating gene vaccines for fmd and studying their efficacy. the vector for humoral immune response contains the eef1 promoter, the sindbis virus polymerase gene and secretory and anchoring signals. the integrity of the vectors was confirmed by sequence analysis. the linked polyvalent protein genes of fmdv serotypes a, o and asia 1 were cloned into the vectors and the presence of the insert was confirmed by restriction enzyme digestion. the functionality of the constructed dna vaccine vector (pvac self rep 990) was assayed by transfecting the dna into a bhk 21 cell monolayer and studying the 35s labeled proteins in immuno-precipitation assays. the studies showed a high level of expression of the specific protein from the constructed vector as compared to virus infection. secretion of the expressed protein was assayed by immuno-fluorescence assay and found to be positive. encouraged by these results, preliminary vaccine efficacy studies were conducted in the guinea pig model. the immunized guinea pigs showed high antibody titres by snt and elisa as compared to conventional dna vaccines (pup3cd), even at 1/10th of the dose. this is the first report of a self replicating dna vaccine constructed for humoral response. genetically engineered microorganisms are important sources of industrial and medicinal proteins. over the past decade, plants have been investigated as a potential host system for expressing proteins of therapeutic and diagnostic use. however, concerns regarding stability and environmental safety need to be addressed. chloroplast engineering is expected to resolve some of these issues, since plastids/chloroplasts are inherited maternally and are not disseminated through pollen.
this makes plastid transformation a valuable tool for creating transgenics, besides offering biological containment. foot and mouth disease (fmd) is the most feared viral disease of cloven footed animals and a major concern the world over, causing heavy losses to the livestock industry. the disease is enzootic in many parts of the world, including asia. the conventional vaccines for fmd have several limitations, which include safety, temperature sensitivity and duration of immunity. attempts have been made to overcome these limitations using recombinant dna technology. amongst the newer vaccines, edible vaccines are cost effective and easy to administer. since the stability of the gene of interest is the major concern in plant transgenics, marker genes are used for regular selection. detection methods based on the available marker proteins, such as β-glucuronidase (gus), or on antibiotic selection are cumbersome and cost intensive, whereas selection based on herbicide resistance is much simpler. hence, in the present study, the 5-enolpyruvylshikimate-3-phosphate synthase (epsp) gene was used as a marker along with the immunogen gene of fmdv. epsp is the key enzyme in the shikimate biosynthesis pathway, which is necessary for the production of aromatic amino acids. in order to investigate the mechanism of long term immunity and the protective immunity induced by vaccination with dna coated on cationic plg microparticles, we constructed an expression plasmid containing the 1d gene of foot-and-mouth disease virus (fmdv) serotype a. intramuscular vaccination of guinea pigs with the microparticle coated plasmid dna induced a strong antibody response, neutralizing antibodies and a cell mediated immune response, which lasted 1 year. we further analyzed the persistence and expression of the 1d gene by polymerase chain reaction, reverse transcriptase polymerase chain reaction and quantitative pcr.
the results showed that the 1d gene was present and expressed in the muscle cells up to 1 year post vaccination. furthermore, guinea pigs vaccinated with microparticle coated plasmid dna were protected against challenge with fmdv. therefore, vaccination with microparticle coated plasmid dna induces protective immunity and long term humoral and cellular immune responses against fmdv, which could be maintained by persistent expression of the 1d gene in muscle cells. foot and mouth disease virus (fmdv) causes a highly contagious viral disease of cloven hoofed animals, which has a considerable socioeconomic impact on the countries affected. interleukin-18 (il-18) enhances the il-12 driven th1 immune response that is important in immunity against intracellular pathogens. the multiple roles of il-18 in many physiological and pathological processes have generated a great deal of interest in recent years, and antiviral effects of il-18 have been reported. we evaluated the effects of il-18 on the replication of fmdv in vitro in bhk-21 cells. the coding sequence of mature bovine il-18 was amplified from bovine pbmcs and cloned into the prokaryotic expression vector pet32a. the expressed protein was purified and its specificity was confirmed by immunoblotting. bhk-21 cells were treated with the purified il-18 protein (2 µg/ml) 4 h prior to fmdv infection. cell culture supernatants collected at 24 h post infection were subjected to elisa and virus titration assay. rna extracted from the cells was subjected to real time pcr for viral rna quantification. a 2 log reduction in fmd virus titre was observed in il-18 treated cells compared with untreated cells, whereas virus antigen quantified by elisa showed a 60-fold reduction. a 69-fold reduction in fmd viral rna copy number, measured by qpcr, was observed in the il-18 treated cells compared with the untreated cells.
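the il-18 results are reported on two scales: a log10 reduction for virus titre and fold reductions for elisa antigen and qpcr copy number. a minimal python sketch of the conversion between the two scales is given below so that the three figures can be read side by side; the helper names are hypothetical, and the sketch only restates the reported numbers rather than reanalysing any data.

```python
import math


def fold_to_log10(fold: float) -> float:
    """Convert a fold reduction (e.g. 69-fold) to a log10 reduction."""
    if fold <= 0:
        raise ValueError("fold reduction must be positive")
    return math.log10(fold)


def log10_to_fold(log_reduction: float) -> float:
    """Convert a log10 reduction (e.g. 2 log) to a fold reduction."""
    return 10.0 ** log_reduction


def percent_reduction(fold: float) -> float:
    """Percentage of signal removed by a given fold reduction."""
    if fold <= 0:
        raise ValueError("fold reduction must be positive")
    return 100.0 * (1.0 - 1.0 / fold)


# figures quoted in the abstract
print(log10_to_fold(2.0))                # titre: 2 log -> 100.0-fold
print(round(fold_to_log10(60.0), 2))     # elisa: 60-fold -> 1.78 log10
print(round(fold_to_log10(69.0), 2))     # qpcr: 69-fold -> 1.84 log10
print(round(percent_reduction(69.0), 2)) # qpcr: 98.55 percent reduction
```

on a common log10 scale the three readouts (2.0, about 1.78 and about 1.84) are of similar magnitude.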
the current study demonstrated the potent antiviral activity of il-18 against fmdv through inhibition of viral replication. these results further suggest a potential role for il-18 as a molecular adjuvant in fmd vaccine development and in the development of therapeutics for fmd. foot and mouth disease is the most contagious viral disease of farm animals. the disease is controlled by vaccination and slaughter of infected animals. the conventional oil adjuvant vaccine has its own limitations. as an alternative, genetic vaccines, in which dna encoding the viral antigen is delivered, may be a promising approach. naked dna vaccines have limitations such as poor uptake of dna by cells and, more importantly, by the nucleus. delivery of dna through calcium phosphate nanoparticles was therefore attempted. calcium phosphate nanoparticles are a potential delivery agent that has been shown to enhance the immune response. the fmdv p1-3cd 'o' vaccine gene construct in pcdna3.1+ was entrapped in nanoparticles prepared using different molarities of calcium chloride and disodium hydrogen orthophosphate. the nanoparticles entrapping fmdv p1-3cd 'o' and naked dna were administered to guinea pigs by intramuscular injection to study mrna expression of the antigen by rt-pcr. animals were sacrificed at defined times to collect different organs, and total rna was extracted. blood was collected each time to analyse fmdv specific serum antibodies. dna vaccine delivered through calcium phosphate produced transcripts in the injected muscle for up to 240 days, whereas naked dna did so for up to 120 days. animals given the naked dna vaccine showed serum antibody titres until 60 days, whereas nanoparticle injected animals showed serum antibody until 120 days. serum neutralization titres of 1.5 were observed with the calcium phosphate dna vaccine at about 28-150 days, whereas sn titres with naked dna were observed for a shorter period of 30-90 days.
the study clearly showed that calcium phosphate nanoparticles entrapping fmdv vaccine dna may be a better delivery system for dna vaccines, as they provide availability of the antigen and persistence of antibody for a longer duration than naked dna. capripox is a highly infectious, contagious and oie notifiable disease of small ruminants caused by sheeppox and goatpox viruses, which are members of the genus capripoxvirus of the family poxviridae. in the present study, we analyzed partial gene sequences of the p32 protein, an immunogenic envelope protein of capripox viruses (capv), to assess the genetic relationship among sheeppox and goatpox virus isolates from different geographical areas of the country. the product of this gene has been shown to be important in the attachment of capv to host cell surface receptors during viral entry and in the host immune response. the following virus isolates were used in the analysis: gtpv-uttarkashi, p60, vaccine virus; gtpv mukteswar, p10, challenge virus; gtpv (akola), gtpv bareilly/00, gtpv ladakh/01 and gtpv sambalpur/82, field isolates; sppv srinagar, p40, sppv ranipet, p50 and sppv-rf, p50, vaccine viruses; and sppv makdhoom/07, sppv cirg/08, sppv pune/08, sppv bareilly, sppv 183/03 and sppv 125/02, field isolates. all virus isolates were confirmed by pcr amplification and analysed by pcr-restriction fragment length polymorphism (pcr-rflp) using ecori enzyme to confirm their specificity. the amplicons were then cloned and sequenced commercially. nucleotide and deduced amino acid (aa) sequences were compared with published sequences of members of the genus capripoxvirus. analysis of the partial 172 bp sequence showed high sequence identity among all indian sppv and gtpv isolates at both the nt and aa levels. it revealed identities of 99.4-100% and 98.2% for gtpv field isolates, whereas sppv field isolates showed 100% identity at both the nt and aa levels.
in general, the capv isolates in this study showed 98.3-98.8% and 96.5% homology between gtpv and sppv at the nt and aa levels, respectively, as reported earlier. the analysis further revealed a unique g120a change in all gtpv isolates, resulting in the formation of a drai site in place of ecori and the possible development of a restriction enzyme specific pcr-rflp for differentiation of sppv and gtpv among field isolates. orf, or contagious ecthyma, is considered a non-contagious, proliferative disease and is caused by orf virus of the genus parapoxvirus of the family poxviridae. it is reported most commonly in sheep and goats, and the virus is also a zoonotic agent. camels are also infected by orf virus, and infection has been reported in camel rearing countries as a mixed infection with camelpox, the latter being caused by an orthopoxvirus. in india, there are few reports of orf virus infection in camels, identified by clinical signs and pcr. in this study, we identified orf virus in clinical samples from a suspected case of sporadic infection in camels by serological and molecular techniques. viral dna isolated from processed scabs was used initially in a nested polymerase chain reaction as a diagnostic pcr, which successfully amplified a 235 bp fragment; the product was also sequenced to check its fidelity. after confirming the infection by pcr, some of the structural and non-structural genes were amplified for sequence analysis. of the five genes characterized, the most important one, selected for sequence and phylogenetic analysis, is the b2l gene, which is homologous to the major envelope protein p37k of vaccinia virus. the full open reading frame of 1137 bp of orf b2l was amplified by pcr, cloned and sequenced commercially. nucleotide and deduced amino acid sequences of b2l were compared with other published sequences of members of the genus parapoxvirus.
sequence analysis showed maximum percent identities of 94.8 and 95 (indian orf virus isolates); 94.7 and 94.5 (other orf isolates); 98.8 and 98.7 (orf-camel/jodhpur/08); 85 and 82.8 (bovine papular stomatitis virus); and 97.4 and 97.6 (pseudocowpox virus) at the nt and aa levels, respectively. phylogenetic analysis of the isolate was also performed using the neighbour joining method in the mega 4 program to determine the phylogenetic relatedness of the virus; it revealed that the isolate groups with the jodhpur isolate and is closely related to pseudocowpox virus. further analysis of other potential genes is warranted to confirm the causative agent of contagious ecthyma in camels as pseudocowpox virus. chikungunya, an arboviral disease, is transmitted through the bite of an infected aedes mosquito. it causes a self limited febrile illness along with arthralgia and myalgia. in some cases, neurological and severe hemorrhagic manifestations have been observed. chikv epidemics have been reported in africa, india and south east asian countries, and during the current outbreak imported cases of chikv have been encountered in most european countries. the causative agent belongs to the genus alphavirus, family togaviridae. human beings serve as the chikungunya virus reservoir host during epidemic periods. outside these periods, the main reservoirs are monkeys, rodents, birds and other unidentified vertebrates. antibodies to chikv have been detected in domestic animals. in the present study we surveyed madanapalli, palamaner, b. kotta kota and tirupati and collected a total of 67 rodent samples, 75 bovine samples, 20 sheep samples and 15 canine samples. total rna was isolated from all these samples and subjected to rt-pcr using the primer pair dvrchk-f/dvrchk-r, which amplifies a 330 bp e1 gene product specific to chikungunya virus (naresh kumar et al. 2007).
all the serum samples were further screened for chikv-specific igm antibodies using commercially available ctk biotech strips. none of the samples were found positive for either chikv-specific rna or chikv-specific igm antibodies. more samples from domestic animals as well as rodents are being screened to study their possible role, if any, in the maintenance of chikv in nature and during the inter-epidemic periods. the present study discusses these aspects in detail.
petunia hybrida is widely used as an experimental host plant for begomovirus identification and characterization. hitherto, natural infection of petunia by begomovirus has not been reported in india. recently, petunia hybrida plants grown in and around ludhiana were found to show typical symptoms caused by begomovirus. the symptoms include severe reduction in leaf size, downward curling and distortion of leaves. severely infected plants became bushy and stunted and produced no flowers. total genomic dna was extracted by the ctab method from plants showing symptoms of begomovirus. the presence of virus was confirmed using degenerate primers designed to identify all begomoviruses prevailing in the world. to identify the strain associated with the disease, the positive samples along with a healthy control were tested against strain-specific primers of the tomato leaf curl viruses so far reported in india, i.e. tomato leaf curl new delhi virus, tomato leaf curl palampur virus, tomato leaf curl bangalore virus, tomato leaf curl karnataka virus and tomato leaf curl gujarat virus. among these, only the tomato leaf curl new delhi virus-specific primer gave the desired amplicon of ~1180 bp. hence, it is confirmed that the leaf curl disease of petunia hybrida is associated with tomato leaf curl new delhi virus. this disease of petunia can become a severe production constraint in coming years.
over the last 2 years (2008 and 2009) it was observed that some varieties of brinjal grown in the rainy season showed typical leaf curl symptoms. the symptoms include upward curling of the leaves, cupping, vein thickening, reduction in leaf size and distortion of leaves. severely infected plants remain stunted and bushy and become unproductive or produce only a few fruits. the disease was experimentally transmitted from naturally infected brinjal to healthy seedlings by whiteflies (bemisia tabaci) and grafting, but not by mechanical or aphid transmission. to detect the associated begomovirus, total genomic dna was extracted from plants showing disease symptoms. the presence of virus was confirmed using pcr with the begomovirus genus-specific primers designed by deng et al., wyatt and brown, and rojas et al. these degenerate primers gave the expected product sizes of ~530, ~575 and ~1280 bp, respectively. the core coat protein (cp) gene and dna-b were also amplified in the samples using specific primers. to identify the strain associated with the leaf curl disease, dna was tested by pcr against primers for the different indian tomato leaf curl virus strains, i.e. tomato leaf curl new delhi virus, tomato leaf curl palampur virus, tomato leaf curl bangalore virus, tomato leaf curl karnataka virus and tomato leaf curl gujarat virus. among these, only the tomato leaf curl new delhi virus primer gave the desired product size of ~1180 bp. therefore, it was confirmed that leaf curl disease of brinjal is caused by tomato leaf curl new delhi virus in association with satellite β-dna.
to identify the strain associated with the disease, all samples were further tested by pcr with specific primers designed to amplify all the tomato leaf curl virus strains so far reported from india, i.e. tomato leaf curl new delhi virus, tomato leaf curl palampur virus, tomato leaf curl bangalore virus, tomato leaf curl karnataka virus and tomato leaf curl gujarat virus.
among these, only the tomato leaf curl palampur virus-specific primer gave the expected product size of ~900 bp. this shows the association of tomato leaf curl palampur virus with leaf curl disease of calendula and marigold. thus, calendula and marigold can act as a reservoir for tomato leaf curl palampur virus, which may become a severe constraint in the production of these important ornamental plants.
groundnut bud necrosis virus (gbnv) belongs to serogroup iv of the genus tospovirus in the family bunyaviridae and infects several economically important crops all over india. the nucleocapsid protein (np) encoded by the small rna of gbnv encapsidates the viral rnas. apart from this structural role, the np has also been implicated in replication, transcription, maturation and cell-to-cell movement. with a view to studying its structure and function, the np of a gbnv tomato isolate from karnataka was overexpressed in e. coli and purified by ni-nta chromatography. the purified np was present as a ribonucleoprotein complex and as a heterogeneous mixture containing monomers, tetramers and higher-order multimers. in order to determine the regions involved in oligomerization and nucleic acid binding, a mutational approach was taken. n- and c-terminal deletion clones were generated (n20np, n40np, c15np and c37np), overexpressed in e. coli, and purified by a procedure identical to that used for the wild-type protein. initial studies on oligomeric status suggested that, in addition to the n- and c-terminal regions, there may be additional regions or residues that contribute to multimerization of np. the amount of rna bound to the truncated proteins was reduced in the case of n20np, n40np and c15np. interestingly, removal of 37 amino acid residues (a natively unfolded region) from the c terminus resulted in complete loss of nucleic acid binding, suggesting that the rna binding domain is located in the c-terminal region of np.
further, np was observed to be phosphorylated in in vitro kinase assays by a kinase present in the soluble fraction of tobacco plant sap. both atp and gtp were utilized as phosphoryl donors and mn2+ was the preferred metal ion, which suggests that np might be phosphorylated by a ck2-like protein kinase. phosphorylation studies with n- and c-terminal truncated proteins revealed that the site of phosphorylation lies within amino acid residues 40-239. by mass spectrometric analysis of the protein, threonine-84 and serine-202 were identified as possible phosphorylation sites.
a naturally occurring isolate of a virus infecting gherkin (cucumis anguria l.), causing symptoms of mosaic, leaf distortion and dark green islands in the lamina, was identified in export cultivars of gherkin grown in commercial fields of kuppam rural, chittoor district, andhra pradesh. the virus infection was highly prevalent in the fields and caused considerable economic damage to the crop, resulting in yield losses and reduced quality of fruits meant for export. symptoms on infected fruit included blistering and malformation. virus-infected leaf samples were collected, and initial host range tests conducted with different cucurbit species showed that the host range includes propagation hosts such as cucumis anguria (gherkin), cucumis sativus, cucurbita pepo, cucumis melo, lagenaria vulgaris and momordica charantia, and a local assay host, chenopodium amaranticolor. the virus host range was restricted to cucurbit species and chenopodium. the virus was maintained for further studies on cucurbita pepo by sap (mechanical) inoculation. the virus induced mosaic and vein clearing symptoms on pumpkin. electron microscopy of leaf dip preparations from symptomatic pumpkin leaves, stained with 2% uranyl acetate, revealed the presence of long flexuous filamentous particles measuring 750 × 12 nm.
the virus reacted positively with the polyclonal antisera of papaya ringspot virus-w, potato virus y and tobacco etch virus, and also reacted strongly with the polyclonal antiserum of zucchini yellow mosaic virus, in direct antigen coated-enzyme linked immunosorbent assay (dac-elisa). because of the very strong reaction with the polyclonal antiserum of zucchini yellow mosaic virus, we attempted to amplify the partial nib and cp genes of the virus along with the 3′ utr using two primers, zy2 (5′-gctccatacatagctgagacagc-3′) and zy3 (5′-taggctttttgcaaacggagtctaatc-3′). total rna from infected gherkin leaves was isolated using trizol ls reagent (sigma). rt-pcr was performed to obtain an amplicon of ~1.2 kbp, which was cloned into the fermentas ptz57r/t vector and sequenced at mwg biotech, bangalore. sequence analysis revealed that the virus was an isolate of zucchini yellow mosaic virus showing 98% homology to the zucchini yellow mosaic virus strain b genome (ay188994) and the zucchini yellow mosaic virus nat genome (ef062582), both strains reported from israel. the sequence from the present study was submitted to genbank (gq482976). the results raise the suspicion that the virus could have been introduced through an infected source brought into india by commercial israel-based companies owing to poor quarantine regulations, as gherkin cultivation in these regions is chiefly supported, purchased, exported and marketed by these private companies based in israel. this is the first report on the molecular characterisation of zucchini yellow mosaic virus infecting cucumis anguria (gherkin) from india.
they also exhibited synergism with other viruses, which was region specific. fifty percent of the total symptomatic plant population was found to be positive only for carla, while the remainder showed mixed infection of carla with tospo in some regions, whereas in others carla virus was found to be associated with cmv.
carlavirus alone, at an incidence of up to 10-20% and without association with tospo, cmv, poty or tobamo viruses, was also observed in some fields.
avijit tarafdar, raju ghosh, k. k. biswas plant virology unit, division of plant pathology, indian agricultural research institute, new delhi 110012
citrus tristeza virus (ctv), a closterovirus of the family closteroviridae transmitted by the brown citrus aphid (toxoptera citricidus), is one of the major limiting factors in the cultivation of citrus worldwide. ctv is the longest known plant virus, having flexuous particles of 2000 × 11 nm in size. the ctv genome is a positive-sense ssrna of about 20 kb containing 13 open reading frames (orfs) encoding 17 proteins. several biological as well as genetic variants of ctv are reported in all the citrus-growing countries of the world. ctv causes decline and death of millions of citrus trees worldwide. in india, ctv is a century-old problem and has killed an estimated one million citrus trees to date. at the molecular and genetic level, ctv isolates from india have not been fully characterized. genetic diversity and sequence divergence in ctv isolates of india are not fully established. further, evidence of recombination and the causes of evolution of ctv variants in india have not been studied to date. therefore, in the present study, an effort has been made to characterize several indian ctv isolates at the genetic level, examine their genetic diversity, identify recombination events and analyze the evolution of divergent ctv. a total of 73 ctv isolates from different regions of india (35 from darjeeling hills, five from bangalore, 15 from delhi and 18 from vidarbha) were taken up for genetic study. two genomic regions of ctv, i.e., the entire cp gene (cpg) (672 nt) and a gene fragment of the 5′ orf1a (orf1a) (404 nt), were amplified, cloned and sequenced, and the nucleotide sequences were analyzed.
based on cpg, indian isolates shared 88-99% nucleotide identity among themselves, and based on orf1a they shared 82-99% identity. incongruent phylogenetic relationships were observed: sequence analysis generated five phylogenetic clades based on cpg and eight clades based on orf1a, suggesting that recombination events have occurred between the sequences of indian ctv isolates. thus, to identify the potential recombination events and determine the parental sequences in ctv isolates, six recombination-detecting algorithms, namely rdp, geneconv, bootscan, maxchi, chimaera and siscan, were used. out of 73 indian ctv isolates, the cpg of 18 and the orf1a of 47 isolates showed recombination events, suggesting that orf1a is more prone to rna recombination than cpg. these findings indicate that the high degree of genetic diversity and the incongruent relationships of indian ctv isolates are due to genetic recombination, which may be an important factor driving the evolution of ctv variants in india; this was also supported by a splitstree decomposition analysis.
b. v. bhaskara reddy, y. sivaprasad, k. rekha rani, k. raja reddy department of plant pathology, regional agricultural research station, acharya n.g. ranga agricultural university, tirupati, andhra pradesh
sunflower (helianthus annuus l.) is one of the most important oilseed crops in the world, ranking third in area after soyabean and groundnut. the sunflower necrosis disease (snd) is characterized by necrosis of leaves, necrotic streaks on petioles, stem and floral parts, and stunted growth. the causal agent of the disease has been identified as tobacco streak virus (tsv), which belongs to the genus ilarvirus of the family bromoviridae. the suspected tsv-infected sunflower samples collected from chittoor district in andhra pradesh were found positive for tsv by dac-elisa. total rna was extracted from sunflower using the rneasy isolation kit (qiagen).
the tsv coat protein (cp) gene, movement protein (mp) gene and replicase (rep) gene were amplified by rt-pcr with specific primers, cloned in the ptz57r/t vector, sequenced and deposited in genbank (gu355899, gu355900 and gu371445). the cloned cp gene was 717 bp in size and codes for 239 amino acids. cp gene sequence analysis revealed that the tsv-tpt infecting sunflower has 98-100% homology at the nucleotide level with soybean, tagietus-tpt and okra-tn isolates and 93-99% homology at the amino acid level. the movement protein gene was 615 bp and codes for 205 amino acids. mp gene sequence analysis showed 94-97% homology at the nucleotide level and 92-95% at the amino acid level.
chilli (capsicum annuum), an important commercial vegetable and spice crop of himachal pradesh, is affected by several viral diseases; of these, cucumo, tospo, poty and gemini viruses are the most common genera. however, these viruses have not been clearly identified or fully characterized, which is needed first in order to formulate a management strategy. therefore, in the present study, an effort has been made to identify and characterize the important viruses causing diseases in chilli. several farms in the major chilli-growing areas of bilaspur and kangra districts in himachal pradesh were surveyed and infected plant samples were collected randomly. virus infection in these samples was detected by das-elisa using antisera to cucumber mosaic virus (cmv) and potyvirus (group specific) and through slot-blot hybridization (sbh) using cmv, iris severe mosaic poty virus (ismv), tomato spotted wilt tospo virus (tswv) and chilli leaf curl gemini virus (clcuv). based on das-elisa and sbh, the incidence of disease was estimated to range from 18.2 to 21.8% for cmv and from 3.5 to 5.4% for potyvirus. to detect tospo and geminivirus in the infected chilli, the sbh test was carried out. infected samples that showed maximum virus titer in both the das-elisa and sbh tests were further confirmed by pcr using specific primers.
amplicons of the desired sizes, ~540 bp, ~800 bp, ~570 bp and ~460 bp for cmv, poty, gemini and tospo viruses, respectively, were obtained. as the present study clearly indicated that cmv is the major virus infecting chilli in the hilly region of himachal pradesh, two isolates of cmv were characterized at the genetic level. the amplified products (~540 bp) of the cmv isolates palampur 1 and palampur 2 were cloned in the pgemt cloning vector and sequenced, and the sequences were submitted to the ncbi database (palampur 1: acc-fm209497 and palampur 2: acc-fm209498). the sequences were then analyzed and compared with other sequences available in the database. based on sequence analysis, the present cmv isolates shared 99% nucleotide identity with each other and are closely related to the australian cmv isolate cmv-ly (acc-af198103), with 98% nucleotide identity. in phylogenetic tree analysis, the indian cmv isolates formed a single cluster along with cmv-ly. as cmv subgroup ii is known to comprise cmv-ly, it is concluded that the cmv isolates of this hilly region of himachal pradesh belong to subgroup ii.
chilli is essentially a crop of the tropics and grows better in hotter regions. chilli (capsicum annuum), a member of the family solanaceae, is a vegetable and spice crop of immense commercial importance. the pungency in pepper is due to an alkaloid known as capsaicin, and peppers are characterized as sweet, hot or mild depending on capsaicin content. the present investigation was conducted to identify the cultivars of capsicum annuum most resistant to cmv and tylcv among ten cultivars of chilli under the agroclimatic conditions of aligarh. the highest percentages of infection (70 and 80%) were observed in hc-201 and kalyanpur type-1, which showed a positive reaction to cmv by elisa. no symptoms were recorded in bc-16, lca-235 and jca-154, which showed a negative reaction to cmv by elisa.
bc-16 and lca-235 also showed a negative reaction to tylcv by elisa and were symptomless. maximum infection (70 and 80%) was registered in the hc-201 and c 8 cultivars. thus bc-16, lca-235 and jca-154 proved to be highly resistant varieties against cmv and tylcv, and these may be used in breeding programmes against viruses.
cotton leaf curl virus belongs to the family geminiviridae, genus begomovirus. the members of this family contain circular single-stranded dna molecules as their genomes. there are two kinds of begomoviruses: bipartite viruses, with genomes consisting of two dna molecules designated dna-a and dna-b, and monopartite viruses, which contain only dna-a but not dna-b. in monopartite viruses, the dna-a is accompanied by a small circular dna molecule called dna-β, which is essential for the development of typical disease symptoms. cotton leaf curl virus is a monopartite virus and causes cotton leaf curl disease, which has emerged as a major disease of cotton in the indian subcontinent. the non-structural protein ac4 of cotton leaf curl kokhran virus-dabawali isolate (clcukv-dab) was cloned into the pgex5x2 vector and overexpressed in bl21(de3)plyss e. coli cells. the overexpressed gst-ac4 protein was purified by glutathione sepharose chromatography. the purified gst-ac4 protein was found to possess atpase activity. the optimum temperature and ph for the activity were 37°c and 7.4, respectively. the atpase activity was inhibited in the presence of edta, showing that it is dependent on divalent metal ions. the activity was supported by magnesium, manganese and zinc ions but inhibited in the presence of calcium ions. it was also inhibited by the non-hydrolyzable atp analogue adenosine-β,γ-imido triphosphate and in the presence of other nucleotides such as ctp and gtp. the km and vmax of the reaction with atp as the substrate are 1.54 mm and 95.2 nmol/min/mg of protein, respectively. the enzyme could also utilize gtp as the substrate.
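To make the reported kinetic constants concrete, the short sketch below evaluates the Michaelis-Menten rate law, v = vmax · [S] / (km + [S]), at a given atp concentration. It rests on two assumptions not stated in the abstract: that the atpase follows simple Michaelis-Menten kinetics, and that "1.54 mm" denotes millimolar. The function name `mm_rate` is hypothetical.

```python
# Constants quoted in the text for ATP as substrate.
KM_MILLIMOLAR = 1.54   # assumed to be millimolar
VMAX = 95.2            # nmol/min/mg of protein


def mm_rate(substrate_mm, km=KM_MILLIMOLAR, vmax=VMAX):
    """Michaelis-Menten rate (nmol/min/mg) at a substrate concentration in mM."""
    if substrate_mm < 0:
        raise ValueError("substrate concentration cannot be negative")
    return vmax * substrate_mm / (km + substrate_mm)
```

By construction the rate at a substrate concentration equal to km is half of vmax, and the rate approaches vmax as the substrate concentration grows large compared with km.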
that ac4 is specifically an ntpase and not a general phosphatase is shown by the finding that it does not hydrolyze p-nitrophenyl phosphate to yield a yellow colour, while a similar reaction carried out in parallel with alkaline phosphatase readily yields the colour. it has been suggested earlier that ac4 may be involved in cell-to-cell movement of the virus (rojas et al. 2001). it is possible that, by its ability to hydrolyze atp, ac4 serves to power viral movement in the plant.
thirteen sugarcane yellow leaf virus isolates causing yellow midrib and an irregular yellow spot pattern, from six states of india, were characterized by rt-pcr assays. scylv-615f and scylv-615r were used as the forward and reverse primers, and the amplified products were cloned and sequenced. comparative coat protein sequence analysis showed that all the indian scylv isolates clustered into two major groups, confirming the existence of two strains of scylv affecting sugarcane crops in india. in a separate experiment, members of both phylogenetic groups were found to be transmitted by the sugarcane aphid, melanaphis sacchari, from infected to healthy sugarcane, suggesting secondary spread in nature.
the symptoms produced by the virus causing cotton mosaic disease differed slightly between sap inoculation and natural field conditions. under natural field conditions it produced clear chlorosis on the major leaves of plants, whereas in sap-inoculated plants veinal chlorosis and mosaic symptoms were common. under field conditions infected plants grow erect and have less boll formation. no effect was found on seed shape or seed size. the initial symptoms produced on cotton leaves after inoculation were distinct. local lesions were observed in the second week after inoculation; these then changed to chlorotic symptoms, and some to necrotic symptoms.
plants affected at an early stage have less lateral branch development and hence reduced yield. naturally infected field plants showing clear symptoms are also difficult to identify at a later stage of plant growth, because the symptoms disappear with time. the virus is very easily sap transmissible. the virus was found to be transmitted by thrips palmi and thrips tabaci in a persistent manner. no seed transmission was observed. the virus showed the same physical properties as in stem necrosis of peanut or sunflower necrosis disease. the physical properties were found to be a thermal inactivation point (tip) of 55-60°c, a dilution end point (dep) of 10^-2 to 10^-3 and a longevity in vitro (liv) of 5 h. nineteen host plants infected by the virus were identified, belonging to five families, viz. malvaceae, chenopodiaceae, compositae, leguminosae and solanaceae. however, the virus was found to produce the same types of symptoms as in most of the hosts that have been tested before. in elisa, the virus reacted positively only with antiserum to tsv of cowpea and cotton and negatively with antiserum to pbnv of cowpea and cotton, which clearly rules out the presence of pbnv in cotton producing these kinds of symptoms; the elisa results clearly show that tsv antiserum of cowpea gives positive results with plants showing clear chlorotic symptoms.
a powerful approach to functional genomics, and an alternative to the massive generation of transgenic plants, is the use of the recently described virus induced gene silencing (vigs) process, which allows viral vectors to knock out the function of a gene of interest. vigs is based on a silencing mechanism that regulates gene expression by the specific degradation of rna. as a tool for reverse genetics, vigs has many advantages over other common ways to study gene function because of the ability of viruses to replicate and move systemically within a plant.
vigs can generate a phenocopy of a mutant without the difficulties of traditional methods of mutagenesis. geminiviruses, with their small dna genomes and ease of inoculation through agrobacterium, are excellent candidates for vigs vector development. as a first step, the geminivirus bhendi yellow vein mosaic virus, characterized in our lab (jose and usha, virology 305: 310-317, 2003), has been chosen. the satellite β dna associated with this virus has a single open reading frame (βc1). βc1 is essential for symptom development but not for replication. therefore, βc1 has been replaced by a multiple cloning site harbouring sali, xbai, bamhi, bsrgi and xhoi, initially in a cloning vector and then in the binary vector containing the partial tandem repeat of the β dna. in place of the βc1 orf, the plant phytoene desaturase gene has been cloned and the resulting construct was used for agroinfiltration along with the partial tandem repeat clone of the begomovirus (dna a component
chilli (capsicum annuum l.) plants exhibiting prominent symptoms of begomovirus infection, such as leaf curl, vein swelling, shortening of petioles, crowding of leaves and stunting of plants, were collected from rorkee, uttarakhand and dhaulpur, rajasthan, india. total genomic dna was isolated from naturally infected chilli samples and pcr was carried out with coat protein (located in dna-a) gene-specific primers. as expected for the primers, ~800 bp dna fragments were amplified from the infected chilli samples. to determine whether the virus isolates are bipartite, nuclear shuttle protein (located in dna-b) gene-specific primers were employed, which also resulted in positive amplification of ~850 bp dna bands from all the samples that tested positive for coat protein.
to ascertain the association of a dna-β component with the virus isolates, a set of dna-β-specific primers was used, which resulted in positive amplification of full-length (~1.3 kb) dna bands in the chilli samples collected from rorkee, uttarakhand; however, bands of multiple sizes were obtained with the samples collected from dhaulpur, rajasthan. these findings confirmed that both the virus isolates under study are bipartite begomoviruses associated with a dna-β satellite. sequencing of the pcr products is in progress, and the analysis will be discussed.
groundnut bud necrosis virus (gbnv), belonging to the genus tospovirus, which is a unique member of the family bunyaviridae, infects several economically important crops. the virus has three genomic ssrna segments, namely s (ambisense), m (ambisense) and l (negative sense). the s rna codes for the nucleoprotein (np) and the non-structural protein (nss) from the viral complementary and viral strands, respectively. many viral non-structural proteins, such as ns3 of hepatitis c virus, yellow fever virus and dengue virus, sv40 large t antigen and the cytoplasmic inclusion protein of tamarillo mosaic potyvirus, are known to exhibit rna/dna-stimulated ntpase, dntpase and helicase activity. nss of gbnv does not have sequence similarity with any of the above-mentioned viral rna/dna helicases but has an ntp binding domain. however, it has been implicated as a suppressor of gene silencing in vivo. with a view to elucidating the mechanism by which nss could act as a suppressor of gene silencing and examining the other potential roles of nss in the life cycle of the virus, the gbnv (to) nss was overexpressed in e. coli and purified by ni-nta chromatography. in vitro studies with the purified rnss suggest that it exhibits an rna-stimulated ntpase activity. many of the proteins that possess rna/dna-stimulated ntpase and datpase activity have also been shown to have atp-dependent nucleic acid unwinding activity.
it was therefore of interest to examine whether nss has nucleic acid unwinding activity. the helicase assays revealed that nss has dna/rna helicase activity. the helicase activity of nss was absolutely dependent on atp and mg2+ ions. nss could unwind dsdna substrates with either a 5′ overhang or a 3′ overhang. mutation of the crucial lysine in walker motif a (k189) severely affected the unwinding activity, whereas mutation of the aspartate residue in walker motif b (d159) resulted in only a 20% loss of activity. in this regard, rnss is a unique enzyme which does not have the canonical helicase motifs but can catalyze dsdna/dsrna unwinding in an atp- and mg2+-dependent manner. the rnss might act as a suppressor of ptgs by unwinding the dsrna that is the substrate for dicer. in addition to being a suppressor of ptgs, nss may also regulate viral replication and transcription by modulating the secondary structure of the viral genome. this new finding on nss might pave the way for further studies on its role in viral replication and transcription.
yellow vein mosaic disease of pumpkin (cucurbita moschata) poses a serious threat to the cultivation of this crop in india. the disease was found to be associated with whitefly-transmitted bipartite begomoviruses, which were detected in fields in varanasi using polymerase chain reaction (pcr) with primers designed from the conserved coat protein region of begomoviruses in the ncbi database. all plant samples showing symptoms were infected with begomovirus. the virus species were provisionally identified by sequencing ~750 bp of the viral coat protein gene (av1
ageratum conyzoides is commonly known as billygoat-weed, chick weed, goatweed and whiteweed. in india it is popularly known as bill goat weed. it is an annual herbaceous plant with a long history of traditional medicinal uses in several countries of the world and is reputed to possess varied medicinal properties, including the treatment of wounds and burns.
in cameroon and congo, it is used traditionally to treat fever, rheumatism, headache, and colic. during a survey in and around gorakhpur in 2009, ageratum plants were found affected with symptoms of leaf curling, mosaic mottling and leaf yellowing. the infected leaf samples were processed for virus identification by pcr assays. total dna was extracted and pcr was performed with begomovirus-specific primers (tlcv-cp). an ~800 bp band was consistently amplified and resolved on a 1% agarose gel. the pcr products were directly sequenced and the sequence was submitted to genbank with the accession no. gq412352. blast search analysis showed the highest similarity, 98%, with ageratum enation virus.
vernonia cinerea leaves with yellow vein symptoms were collected around crop fields in madurai. a 550 bp product, amplified from total dna extracted from symptomatic leaves with degenerate primers designed to amplify a part of the av1 gene of the begomoviral dna a component, was cloned and sequenced. based on these sequences, specific primers were designed and the full-length dna a of 2745 nucleotides, with the typical genome organization of begomoviral dna a, was obtained and submitted to the embl database (acc no: am182232). sequence comparison with other begomoviruses revealed the closest identity (83%) with emilia yellow vein virus from china and less than 80% with all other known begomoviruses. the international committee on taxonomy of viruses (ictv) has therefore recognized vernonia yellow vein virus (vyvv) as a distinct begomovirus species. conventional pcr could not amplify dna b or dna β from the infected tissue. however, the β dna (1364 bp) associated with the disease was obtained (acc no: fn435836) by the rolling circle amplification-restriction fragment length polymorphism method (rca-rflp) using phi29 dna polymerase.
sequence analysis shows that dna β of vyvv has the highest identity (81%) with dna β of ageratum leaf curl disease and 58-77% with the β dna associated with other begomoviruses. infectious clones of vyvv dna a and dna β as dimers were made using the products of rca-rflp. these infectious clones will be used for agroinfection of vernonia and the results will be discussed. this is the first report of the molecular characterization of vernonia yellow vein virus (vyvv) from vernonia cinerea in india.
production of bulb and seed crops of onion (allium cepa l.) is hampered by onion yellow dwarf virus (oydv) and iris yellow spot virus (iysv), with incidences of 83.22% and 89.97% in the bulb crop and 90.65% and 89.58% in the seed crop, respectively, in the popularly grown cv. hisar-2. four symptom-based variants of oydv, designated grades a, b, c and d, produced varied types of symptoms in the onion crop, incurring heavy losses in bulb and seed production. iysv caused tiny hay-coloured spots of different shapes and sizes on leaves and scapes, which later coalesced and led to drying and lodging of scapes. the plant height, bulb weight and bulb size were 37.7 cm, 75.5 g and 24.2 cm² in plants infected with oydv, 39.6 cm, 79.7 g and 25.5 cm² in iysv infection, and 35.1 cm, 68.4 g and 22.1 cm² under their combined infection, as compared to 40.6 cm, 88.4 g and 27.6 cm², respectively, in healthy plants of the bulb crop. in plants infected with oydv grade a the plant height was minimum (90.33 cm) whereas the number of umbels was maximum (9.20 umbels/plant), but the other yield parameters, viz. weight/umbel (2.32 g), number of seeds/umbel (209), seed weight/umbel (0.64 g) and seed yield/plant (5.88 g), were the lowest recorded. the minimum reduction in plant height (100.26 cm), weight/umbel (6.72 g), number of seeds/umbel (633), seed weight/umbel (2.36 g) and seed yield/plant (11.90 g) was recorded in oydv grade d.
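The bulb-crop figures quoted above are easier to compare as percent reductions relative to healthy plants. The sketch below shows that arithmetic for bulb weight only; the helper name `percent_reduction` and the dictionary layout are illustrative choices, not part of the original study, and the values are copied from the text.

```python
def percent_reduction(healthy, infected):
    """Percent loss of a trait in infected plants relative to healthy plants."""
    if healthy <= 0:
        raise ValueError("healthy reference value must be positive")
    return 100.0 * (healthy - infected) / healthy


# Bulb weight (g) in the bulb crop, as quoted in the text.
HEALTHY_BULB_WEIGHT = 88.4
BULB_WEIGHT = {"oydv": 75.5, "iysv": 79.7, "oydv + iysv": 68.4}

# Percent reduction in bulb weight for each infection type.
bulb_weight_loss = {
    infection: percent_reduction(HEALTHY_BULB_WEIGHT, weight)
    for infection, weight in BULB_WEIGHT.items()
}
```

The same helper applies unchanged to plant height, bulb size or the seed-crop parameters, provided the healthy value is used as the reference.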
the plant height was 98.84 cm with 5.10 umbels per plant, 4.24 g weight/umbel, 428 seeds/umbel, 1.25 g seed weight/umbel and 6.37 g seed yield/plant in iysv-infected plants. the plant height (96.26 cm), umbels/plant (5.97), weight/umbel (4.60 g), number of seeds/umbel (432), seed weight/umbel (1.42 g) and seed yield/plant (7.82 g) were found to be the lowest in combined infection of oydv and iysv, in comparison to higher values in healthy controls (104.50 cm, 4.90, 7.84 g, 677, 2.60 g and 12.74 g, respectively). the smallest reduction in test weight, germination and seed vigour index (3.06 g, 75.68% and 926) was found with oydv grade a infection, whereas these values were 2.92 g, 70.42% and 788 in iysv-infected plants and 2.62 g, 70.4% and 776 in combined infection of oydv and iysv, in comparison to 3.84 g, 88.67% and 1276 in healthy plants. the maximum hampering of seed vigour parameters was recorded due to iysv infection. lodging of scapes caused by this disease was responsible for heavy losses in seed production and seed quality. cotton leaf curl disease is one of the major threats to cotton cultivation in northern india. a survey conducted during 2009 found disease incidence ranging from 70 to 90% in bhatinda, abohar, fazilka, sri ganganagar and hanumangarh. in order to study genetic variability in the virus, twelve clcuv isolates were partially characterized (700 bp common region, full-length av2 gene and partial sequences of the ac1 and av1 genes). full-length characterization of representative isolates from bhatinda, abohar, fazilka, sri ganganagar and hanumangarh is in progress. partial sequence analysis revealed that the clcuv isolates collected during the 2009 cropping season are closely related to cotton leaf curl burewala virus from pakistan; the results are discussed. pratibha singh, h. s. 
savithri department of biochemistry, indian institute of science, bangalore tospoviruses, belonging to the family bunyaviridae, infect economically important plants such as groundnut, tomato and watermelon. they have a tripartite genome, with l, m and s segments of rna, in pseudo-circular (panhandle) form. the viral genomes encode four structural proteins (l, n, g1 and g2) in the antisense orientation, and two non-structural proteins, nss and nsm, in the sense orientation. within the family bunyaviridae, nsm is the only protein unique to the plant-infecting tospoviruses and is hence proposed to be important for cell-to-cell movement. groundnut bud necrosis virus (gbnv), a member of the tospovirus genus, is the most prevalent virus infecting several species of leguminosae and solanaceae plants in india. total rna was isolated from gbnv-infected tomato leaves and rt-pcr was performed using appropriate primers to amplify the nsm gene. the pcr product was cloned in the pgex5x2 vector. the recombinant nsm clone was transformed into bl21 (de3) e. coli cells and over-expressed by induction with 0.3 mm iptg. sds-page analysis of induced and uninduced fractions revealed the presence of an overexpressed protein of the expected size. the soluble gst-nsm was purified by gsh sepharose affinity chromatography. purified gst-nsm was shown to interact with an in vitro transcribed rna transcript by electrophoretic mobility shift assay. further, nsm was shown to interact with the virus-encoded proteins np and nss using elisa and the yeast two-hybrid system. nsm was also shown to be phosphorylated in vitro by the pellet fraction of plant sap. thus the recombinant gbnv nsm possesses the characteristic features of a movement protein, such as nucleic acid binding, interaction with nucleocapsid protein, and the ability to undergo post-translational modification. solanum melongena, commonly called eggplant, is one of the most important vegetable crops in the world. 
it is cultivated widely in the tropical and subtropical regions. several viruses such as cucumber mosaic cucumo virus (cmv), potato virus-y (pvy), potato virus-x (pvx) and tobacco ring spot virus (trsv) infect eggplant under natural conditions. in india, crop losses due to cmv infection in brinjal are 57% (fao stat-2008). in the present study, infected leaf samples were collected from local fields of ramapuram, chandamama palli, chandragiri, madanapalli, yadhamari and durgasamudram villages in and around tirupati and were tested for cmv infection by dac-elisa with cmv antisera. the resulting positive samples were further inoculated onto raised brinjal seedlings of selected varieties through mechanical sap inoculation. different varieties of brinjal, namely mullabadhine, ankhur, ravya, mattigulla, casper and easter egg, were used for monitoring the susceptibility to cmv infection. mosaic symptoms were observed 2 weeks after inoculation in all varieties of brinjal except mullabadhina. among the susceptible varieties, ankhur was selected to study induced biochemical changes in chlorophylls, carbohydrates, proteins, nucleic acids and polyphenol oxidases in cmv-infected brinjal leaves. in the infected leaves, a considerable reduction in chlorophyll and starch and an increase in total proteins, sugars, rna and polyphenol oxidases were observed when compared to healthy leaves. the amount of total starch, protein and dna decreased to about 25, 136 and 645 µg/g respectively in infected leaves, whereas sugars (75 µg/g), rna content (754 µg/g) and polyphenol oxidase activity increased as compared to healthy leaves. the above results suggest that the concentrations of chlorophyll, proteins, nucleic acids and carbohydrates and the polyphenol oxidase activity in brinjal leaves are altered by cucumber mosaic cucumo virus infection. 
leaf analysis is a widely accepted diagnostic tool for assessing the nutritional status of vegetables. the present study deals with these aspects in detail. total rna and dna were isolated from infected leaf samples. rt-pcr assays were performed using sugarcane yellow leaf virus (scylv) specific primers (scylv-615f and scylv-615r). infection by scylv was detected in all the collected samples, which showed the expected size (~610 bp) amplicon in rt-pcr. in another experiment with nested pcr analysis, a phytoplasma-characteristic 1.2 kb rdna pcr product was amplified from the dna of all infected samples, but not from healthy sugarcane plants, using the phytoplasma universal primer pairs p1/p7 and fu3/ru5. dna extracts from plants with yellow mid rib and leaf yellows produced products of 1250 bp, which gave typical phytoplasma profiles when digested with hae iii and hha i. no pcr amplification was obtained using dna from symptomless plants. our results suggest that the yellow mid rib and leaf yellows symptoms on sugarcane varieties in the uttar pradesh and uttarakhand states of india are mainly caused by mixed infection of scylv and scylp. the affected clumps showed a reduction in stalk height as compared to healthy fields. thirty-one sugarcane mosaic isolates belonging to sugarcane mosaic virus (scmv) and sugarcane streak mosaic virus (scsmv) were collected from china and india and confirmed by indirect elisa and rt-pcr amplification with scmv- and scsmv-specific primers. the amplicons (0.8 kb) from the coding region of the coat protein (cp) were cloned, sequenced and compared to each other as well as to the sequences of 15 scmv isolates from sugarcane (australia, usa, china, brazil, mexico and south africa) and maize (australia, china, iran) and one scsmv isolate from sugarcane (india) in genbank. 
maximum likelihood and maximum parsimony analyses robustly supported two major monophyletic groups that were correlated with the host of origin: the scmv subgroup that included 18 isolates from china and only 13 isolates from india, and the scsmv subgroup that contained all isolates from india. maize dwarf mosaic virus (mdmv) and johnsongrass mosaic virus (jgmv) were not detected in any of the samples tested. a strong correlation was observed between the sugarcane groups and the geographical origin of the scmv isolates. the 11 millable sugarcane samples from china contained a virus tentatively described as sorghum mosaic virus (srmv). three isolates from nine chewing canes in the fujian, yunnan and guizhou provinces of china also contained srmv, and the other 12 samples, including five isolates from india, were found infected with scmv. no srmv infection has been detected in sugarcane mosaic samples from india. sequence comparisons and phylogenetic analysis indicated that srmv can be considered the most common and prevalent potyvirus infecting sugarcane in china, whereas in india sugarcane streak mosaic virus is dominant in causing mosaic symptoms on sugarcane. a dig-labeled dna probe complementary to the coat protein (cp) region of a tobacco streak virus (tsv) sunflower isolate was designed for the sensitive and broad-spectrum detection of isolates of tsv, the most devastating virus in india. dot-blot and tissue-print hybridizations with the digoxigenin-labeled probe were performed for tsv detection at field level. here, dot-blot hybridization was used to check a wide range of tsv isolates with a single probe and to assess sensitivity with different sample extraction methods. the probe covering the conserved cp region, prepared from a sunflower pcr amplicon, hybridized with tsv field isolates from gherkin, pumpkin, sunflower, marigold and globe amaranth samples because the cp region is highly conserved with little variability. 
the sensitivity limits decreased from total nucleic acid to partially purified and crude extract preparations. in particular, tissue-blot hybridization offers a procedure as simple and reliable as dot-blot, but requires no sample processing. because there is minimal sample preparation, tissue-print hybridization could be an important component of tsv management programs. thus, the above non-radioactive labeled probe techniques can facilitate the screening of samples during tsv outbreaks and in quarantine services. savita patil, rupali sawant*, k. banerjee virology group, agharkar research institute, macs, g.g. agarkar road, pune 411 004 two mycobacterium smegmatis strains (ari lab nos. v842 and v946) were employed for the isolation of mycobacteriophages from soil and sewage samples. mycobacteriophages were isolated against m. smegmatis strain v842 from soil samples collected from an area surrounding the tuberculosis (tb) ward, naidu hospital, pune. these were numbered v942, v943 and v944 and were isolated using the washed-cell preparation method. the bacteriophages against the other m. smegmatis strain, i.e. v946, were isolated from soil samples collected from around the tb ward, sassoon hospital, pune. some of these phages (viz. v953, v954) showed plaques at 42°c but not at 37°c. thus they seem to be lysogenic. for propagating and increasing the titre of all the above isolates, various previously described methods were attempted, but none of them was satisfactory. when siliconized glassware and plastic-ware were used, however, propagation was successful. we showed that siliconization of glassware and plastic-ware was essential for the propagation of our mycobacteriophage isolates v951, v952, v953, v954 and v955. also, phage dilution medium (pdm) as described by chaterjee et al. (2000) was found to be effective for picking out the plaques made by the phages. in this way, the phage isolates were propagated up to p3. 
the various passages of the phage isolates v951, v952, v953, v954 and v955 (i.e. original, p1, p2 and p3) were stored at -80°c. pvp-29 effect on pigments due to geminivirus infection on cowpea (vigna unguiculata) shail pande*, naveen pandey, k. shukla mahatma gandhi p. g. college gorakhpur, d.d.u. gorakhpur university, gorakhpur geminiviruses are one of the most important groups of viruses causing economic losses in the tropics. the symptom produced is yellowing of leaves, which directly affects the pigments of diseased plants and in turn affects their productivity and yield. cowpea (vigna unguiculata) is one of the important crops cultivated throughout india for its green pods, which are used as vegetables, and its seeds, which are used as pulse. cowpea is affected by many viruses, among which geminiviruses are one of the most important. in the present study, total chlorophyll content was studied in leaves of diseased and healthy cowpea plants using arnon's method. carotenoids were also studied using ikan's method. it was found that the chlorophyll content of diseased plants was lower than that of healthy plants, and similar results were found for carotenoids. geminivirus infection thus lowers the chlorophyll and carotenoid content of diseased plants, which reduces the yield of diseased cowpea plants. shweta sharma 1 , amrita banerjee 2 , j. tarafdar 2 , r. rabindran 3 , indranil dasgupta 1 * 1 department of plant molecular biology, university of delhi, south campus, new delhi; 2 bidhan chandra krishi vishwavidayalaya, kalyani, nadia, west bengal 741235; 3 tamil nadu agricultural university, coimbatore, tamil nadu 641003 rice tungro disease is an important disease of rice in south and southeast asia, caused by a joint infection by two viruses: rice tungro spherical virus (rtsv) and rice tungro bacilliform virus (rtbv). the complex of rtbv and rtsv is transmitted by an insect vector, the green leaf hopper (glh). 
previously we reported complete genomic sequences of two geographically distinct isolates of rtbv, rtbv-wb (west bengal) and rtbv-ap (andhra pradesh), collected from the field in the mid-1990s. both sequences showed high homology all along the genome but diverged from the previously reported southeast asian isolate, rtbv-phil (philippines). to check whether a period of a decade has resulted in variability in the genomic sequence of different isolates of rtbv in india, we cloned and sequenced the complete genome of rtbv from two geographically distinct regions of india, i.e. west bengal and kanyakumari, collected from the field in 2008. the complete nucleotide sequence of the dna fragments covering the whole genome of rtbv was determined using the universal primers m13f and m13r and by primer walking, without any ambiguities remaining. the nucleotide sequences of overlapping clones were assembled and analyzed using the dna analysis software generunner and the blastn program of ncbi. homology searches at the nucleotide and amino acid levels were performed using the blastn and blastp programs of ncbi, respectively. multiple sequence alignments were performed using clustal-w software. the sequence analysis showed that both of the recently obtained complete genomic sequences of rtbv, from west bengal and kanyakumari, have very high homology (at both the nucleotide and amino acid levels) all along the genome with the two previously reported rtbv isolates from india, i.e. rtbv-wb (west bengal) and rtbv-ap (andhra pradesh). as observed earlier, both sequences diverged significantly from the southeast asian isolates. this suggests that even after the spatial and temporal difference (a time gap of approx. 10 years) between the two previously reported rtbv isolates and the recently reported ones, there is very little sequence variability between them. 
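the "percent identity" figures quoted in these comparisons can be illustrated with a short calculation over two already-aligned sequences. the python sketch below is only an illustration under stated assumptions: the function name and the toy sequences are hypothetical, and it uses one common convention (identical columns divided by all aligned columns, with columns gapped in only one sequence counted as mismatches). alignment tools differ in how they treat gaps, so their reported values may not match this definition exactly.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent of aligned columns holding the same residue in both sequences.

    Columns that are gaps in both sequences are skipped; a column with a gap
    in only one sequence counts toward the length but never as a match.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    matches = 0
    compared = 0
    for x, y in zip(aligned_a.lower(), aligned_b.lower()):
        if x == gap and y == gap:
            continue  # column carries no information for this pair
        compared += 1
        if x == y:
            matches += 1
    if compared == 0:
        raise ValueError("no comparable columns")
    return 100.0 * matches / compared


if __name__ == "__main__":
    # toy alignment (made up for illustration): 9 columns, 7 identical
    print(round(percent_identity("atgcc-gta", "atgacagta"), 2))
```

in the toy alignment, one column is a mismatch and one is a gap, so 7 of 9 columns are identical.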
this further strengthens the earlier reports that the rtbv genomes in india are highly conserved. a homology search at the nucleotide level using the blastn program against the previously existing rtbv isolates revealed a very high percentage identity of 99% with the rtbv west bengal isolate and 95% with the rtbv andhra pradesh isolate. this further strengthens the earlier reports that there is not much genetic variability in the rtbv genomes in the indian subcontinent. complete genomic rna sequences of two geographically distinct isolates of rice tungro spherical virus (rtsv), a member of the genus waikavirus, family sequiviridae, were determined from india. of the two previously reported sequences, the indian isolates were closer to the resistance-breaking strain rtsv-[vt6] than to rtsv[phila]. between them, the indian sequences showed nucleotide as well as amino acid identities of 96%. a moderate homology was observed between the leader peptide and a putative helper component protein involved in insect transmission of maize chlorotic dwarf virus, a closely related waikavirus, indicating its possible transmission-related function. unlike rice tungro bacilliform virus, which causes rice tungro disease jointly with rtsv and is significantly different between isolates from india and the philippines, rtsv genomes were observed to be much more conserved between isolates from the two countries. rice tungro spherical virus (rtsv) and rice tungro bacilliform virus (rtbv) are believed to be the joint causative agents of the devastating tungro disease of rice prevalent in south and southeast asia [11]. rice tungro disease has become the major cause of production losses in rice during the last three decades in several rice-growing states of india. here, we report for the first time the complete sequence analysis of two geographically distinct indian isolates of rtsv. 
we analyze the deduced protein sequences and their phylogenetic relationship with the two complete rtsv sequences from philippines as well as with other members of sequiviridae family. we provide molecular evidence that the indian isolates of rtsv are closely related to those from the philippines. we had earlier reported that rtbv isolates between india and philippines differ significantly from each other [18] . this study was undertaken in order to see whether rtsv isolates from india also show similar difference from those reported from the philippines. frequent outbreaks of tungro were reported near kanyakumari in the last 2-3 years. the present work was undertaken to clone and sequence the full-length rtbv and rtsv genomes from the infected rice plants collected from above region and to analyze the similarity of its genetic material with the existing indian isolates of rtbv and rtsv. a 1.1 kb dna fragment encoding the reverse transcriptase gene of rtbv genome was amplified and cloned in t/a vector and was sequenced commercially. homology search at the nucleotide level using blastn program with the previously existing rtbv isolates revealed a very high percentage identity of 99% with the rtbv west bengal isolate and 95% with the rtbv andhra pradesh isolate. this further strengthens the earlier reports that there is not much genetic variability in the rtbv genomes in indian subcontinent. similarly, the cp3 region of rtsv was amplified by rt-pcr and was cloned in t/a vector. recently, rice tungro disease has been reported from kanyakumari district of tamil nadu. it is important to determine the genetic nature of this isolate in order to develop resistance strategies. it is thus necessary to clone and characterize the viruses from kanyakumari and to determine the mechanism of virus resistance in transgenic lines. rice tungro disease is an important viral disease of rice. 
rice tungro is caused by infection by two viruses: rice tungro bacilliform virus (rtbv) and rice tungro spherical virus (rtsv). rtsv is a plant picornavirus with a 12 kb single-stranded rna genome. it belongs to the genus waikavirus in the family sequiviridae and is necessary for transmission of the two viruses by the leafhopper vector nephotettix virescens. rtsv rna is translated to form a large polyprotein, which is then self-cleaved to form the viral proteins, including the three coat proteins, the replicase and the protease. studies have been conducted on rtsv from the philippines. information on rtsv from india is essential, both to check whether different geographical conditions, like those present in india, select for genotypically variable strains and to design a transgenic resistance strategy. the objective of this study was to clone rtsv isolates from india, to compare the indian isolates with other southeast asian isolates and with each other, and to develop a strategy to impair the attack of the virus complex on rice. to achieve this, the complete genomes of two isolates from india were cloned by amplifying different genes by rt-pcr and cloning them in ta vectors, followed by sequencing. subsequently, constructs containing cp1-3, antisense replicase, sense replicase and double-stranded replicase were cloned in a plant transformation vector. these constructs were used to transform the aromatic indian rice variety pusa basmati (pb1). pcr analysis of the resulting plants was done to check the stable integration of the insert in the transgenics. jatropha (jatropha curcas) of the family euphorbiaceae is being grown in india as a major commercial fuel (bio-diesel) crop. jatropha is cultivated in 200 districts of 19 potential states of india. unfortunately, the cultivation of jatropha is limited by a severe mosaic disease. 
recently, a severe mosaic disease with significant incidence was observed in 2006-2009 on j. curcas grown in experimental plots of nbri and on j. gossypifolia, a weed growing along roadsides around lucknow and kathaupahadi, madhya pradesh. the disease consisted of symptoms of severe mosaic, blistering, leaf distortion and stunting of the whole plant, with no fruit/seed production in severely affected plants. the symptomatology and the whitefly population observed on the plants suggested the occurrence of begomovirus infection. to detect begomovirus infection, total dna was extracted from leaf samples of infected jatropha plants and polymerase chain reaction (pcr) was performed using three sets of begomovirus genus-specific primers (cpit-i/cpit-t, paliv 1978/paric 496 and paliv 722/palic 1960); amplicons of the expected sizes (~800 bp, 1.2 kb and 1.2 kb) were obtained, which confirmed the begomovirus infection. further, to identify the begomovirus(es) and investigate any genetic diversity among them, the ~1.2 kb amplicons were cloned and sequenced. the sequence data were deposited in the genbank database under accession nos. gq847545 and fj346232 (from j. curcas) and eu727086 and fj177030 (from j. gossypifolia). in blast analysis, gq847545 and fj346232 shared the highest sequence identity (95%) with each other and 84-88% with sri lankan cassava mosaic virus (aj579307, aj607394, aj890225, aj890229 and aj890224) and indian cassava mosaic virus from india (ay738105), and were therefore designated as two strains of jatropha mosaic india virus-lucknow. blast analysis of eu727086 showed a maximum similarity of 93% with croton yellow vein mosaic virus (aj507777), 82% with tomato leaf curl new delhi virus (dq629102) and 80-79% with papaya leaf curl virus (aj436992 and y15934); it was therefore identified as a strain of croton yellow vein mosaic virus. 
blast analysis of the virus isolate (fj177030) showed the highest identity (83%) with tomato leaf curl virus-bangalore ii (tolcv-b ii-u38239) and 82-81% with tomato leaf curl karnataka virus (tolckv, ay754812, fj514798); it was therefore considered a new begomovirus species, "jatropha yellow mosaic india virus". the phylogenetic analysis of gq847545 and fj346232 (from j. curcas) and eu727086 and fj177030 (from j. gossypifolia) was performed along with some selected begomovirus isolates which showed more than 90% sequence identity during blast analysis. the isolate eu727086 showed the closest relationship with croton yellow vein mosaic virus, while fj177030 clustered separately from the other three begomovirus isolates from jatropha species. during phylogenetic analysis these isolates formed three separate clusters; therefore, they were considered three distinct begomoviruses. the above data clearly show that some genetic diversity exists among the begomoviruses infecting jatropha species in india. bitter gourd (momordica charantia l.) of the family cucurbitaceae, also known as bitter melon, is extensively cultivated in the north eastern region of uttar pradesh, india. it is regarded as one of the world's major vegetable crops and has great economic importance. a severe yellow mosaic disease on bitter gourd (momordica charantia) with a significant incidence was observed during a survey of different locations of eastern up, india in the year 2007. a whitefly (bemisia tabaci) population was also observed in the vicinity. the characteristic disease symptoms and the whitefly population indicated the possibility of begomovirus infection. total dna was isolated from infected as well as healthy leaf samples. two primer pairs (tlcv-cp and roja's primers) were used in the study, which yielded ~800 bp amplicons with tlcv-cp in 3/3 samples and ~1.3 kb amplicons with roja's primers in 3/4 samples. for further identification of the begomovirus, the pcr amplicons were cloned and sequenced (genbank accession no. 
eu439260 and eu888908, respectively). the blastn search analysis of eu439260 indicated 99-95% identity with several isolates of tomato leaf curl new delhi virus (tolcndv). the phylogenetic analysis also showed the closest relationship of the isolate (eu439260) with tolcndv isolates. based on the highest sequence identity and close relationship with tolcndv, the virus isolated from bitter gourd was considered an isolate of tomato leaf curl new delhi virus. in contrast, in blastn search analysis the eu888908 isolate shared the highest identity (99-97%) with pepper leaf curl bangladesh virus (peplcbv) isolates. the phylogenetic analysis of the virus isolate with selected begomovirus isolates revealed the closest relationship with peplcbv. these results confirmed the association of peplcbv with bitter gourd. the study revealed the variability of viruses on bitter gourd in eastern up, india. a tobacco streak virus groundnut isolate was characterized biologically using six cultivars (jl24, tmv2, k6, k7, k9) and one pre-release culture (k1271), with seedlings 7-84 days old, under glasshouse conditions. clear differences were observed among the cultivars tested regarding incubation period, percent seedling wilt and time taken to death of seedlings. k-7 was the least susceptible among all the cultivars tested and it supported the lowest virus titer (a 405 nm: 0.11-1.23). both localized symptoms (necrotic lesions on leaf, veinal necrosis, leaf yellowing, wilting) and systemic symptoms (petiole necrosis, necrotic lesions on young leaves, death of top growing buds not only on the main stem but also on all primaries (side shoots), followed by stem necrosis, stunted growth, axillary shoot proliferation with small leaves having general chlorosis, peg necrosis, pod necrosis, pod size reduction, wilt of plants) were observed in all cultivars tested. biological differentiation of tsv and gbnv was made by sap inoculation of both viruses separately using the susceptible groundnut cultivar jl24 under glasshouse conditions. 
certain similarities and differences were observed between these viruses infecting groundnut. seed infection by tsv ranged from 18.9 to 28.9% in seeds collected from naturally infected and sap-inoculated groundnut cultivars/pre-releases (jl24, tmv2, k-6, k-7, k-9 and k-1271) belonging to spanish and virginia types. tsv was detected in both pod shell and seed testa from pod samples produced by sap inoculation under glasshouse conditions. however, seed transmission of tsv was not observed in groundnut. the coat protein (cp) genes of three groundnut tsv isolates (gn-ap-1-00; gn-ap2-04; gn-ap3-07) were sequenced; all three isolates contained a single open reading frame (orf) of 717 nucleotides that could potentially code for 238 amino acids (aa). the cp genes of tsv isolates originating from different hosts shared a high degree of sequence identity at both the nucleotide (97.6-100%) and amino acid (95.7-100%) levels. tonnes grown in an area of 3.83.430 ha (fao stat2007). in india papaya is grown in nearly 80,000 ha with an annual production of 7,00,000 tonnes (fao stat 2007) and occupies fourth place in the world. the crop is severely affected by a number of viruses. papaya ring spot virus (prsv-p) is the most important virus. the detection of virus infection in plants has traditionally involved either bioassay on indexing plants and/or immunological methods (hill 1981, torrence and jones 1981). the use of nucleic acid probes has improved the sensitivity of virus detection. the most common non-radioactive probes are biotinylated probes, which are very specific and sensitive. papaya ring spot virus (prsv-p) is a positive-sense ssrna virus belonging to the genus potyvirus, family potyviridae, and is transmitted by aphids. the prsv-p coat protein gene region was used as template cdna for probe preparation. dot-blot hybridization with the biotin-labeled probe was performed for prsv-p detection. 
the clarified sap of healthy and infected plants was serially diluted, spotted onto a nitrocellulose membrane and hybridized to the biotin-labeled probe. biotin-labeled rnas are employed as probes, with subsequent detection based on streptavidin-alkaline phosphatase conjugates. the biotin-labeled probe was found to be more sensitive for viral detection than enzyme-linked immunosorbent assay (elisa). in recent years tospovirus has been causing devastating damage to the yield of vegetables in india. it infects economically important crops, viz. tomato, chilli, peppers, groundnut, watermelon and various legumes. it is now emerging as a severe disease in brinjal also. in order to monitor the natural occurrence and distribution of tospovirus in vegetables, surveys were conducted in the predominant brinjal-growing areas of gujarat, karnataka, maharashtra and andhra pradesh during 2008-2010; incidence ranged from 5 to 10%, 0 to 80%, 1 to 40% and 0 to 55.78%, respectively. samples collected from different places in india were found positive for pbnv in direct antigen coating-enzyme linked immunosorbent assay (dac-elisa). pbnv-infected brinjal plants showed mosaic mottling of leaves with leaf distortion, longitudinal streaks on the stem and necrotic rings on leaves and fruits. early infection led to severe stunting and abnormal fruiting. the biological and molecular characteristics of pbnv-brinjal isolates were compared with those of other isolates and the results are discussed. for identification of the virus causing mosaic symptoms on soybean, various host plants were tested. plant species belonging to different families, viz. caricaceae, gramineae, leguminosae, malvaceae and solanaceae, were tested. the virus produced symptoms on diagnostic plant species like chenopodium album, c. quinoa, helianthus annuus, phaseolus vulgaris and vigna unguiculata. 
among the tested families, leguminosae hosts of the virus included arachis hypogaea. the virus causing mosaic symptoms in soybean is inactivated between 50 and 55°c and between dilutions of 10⁻⁴ and 10⁻⁵. all the inoculated plants of the assay host showed symptoms at 50°c but not at 55°c. similarly, local lesions were produced at 10⁻⁴ but not at 10⁻⁵. the virus in crude sap was infectious up to 72 h but not at 96 h at room temperature. however, the percentage infectivity decreased progressively with aging of the sap at room temperature. on the basis of reactions on diagnostic hosts pvp-38 identification and characterization of potyvirus infected chilli (capsicum annuum l the virus under study caused mild mosaic and severe mottling symptoms in leaves of infected plants. the dilution end point (dep) of the virus was found to be 10⁻³ to 10⁻⁴, the longevity in vitro (liv) 1-3 days at room temperature (25°c) and the thermal inactivation point (tip) 50-55°c. electron microscopy of a purified virus preparation revealed the presence of flexuous particles 780 nm long and 14 nm wide, with characteristic cytoplasmic inclusions: pinwheels and scrolls. the virus was transmitted by sap and by the aphid myzus persicae. the host range study revealed that the host species were restricted to the families chenopodiaceae and solanaceae. on the basis of the above characteristics, the virus under study was identified as a potyvirus associated with mild mosaic and severe mottling symptoms in capsicum. the phytoplasma causing grassy shoot disease and sugarcane yellow leaf virus are important pathogens of sugarcane. these pathogens are causing severe losses in sugarcane productivity. with a view to producing virus- and phytoplasma-free planting material of sugarcane, experiments were undertaken using infected varieties of sugarcane growing at the farms of the sugarcane research institute. 
apical meristems measuring about 2 mm in length were dissected out, surface sterilized and cultured on agar-gelled murashige and skoog's (ms) medium containing growth regulators for shoot induction. the established shoot cultures were multiplied through repeated subcultures on fresh media at 10-12 day intervals. elimination of gsd and scylv was confirmed through molecular analysis of regenerated plants using primers specific for scylv and gsd. the results revealed that the apical meristem culture technique is effective in eliminating pathogens such as scylv and phytoplasma (gsd) from the infected clones. this is probably the first report on elimination of grassy shoot disease in sugarcane through meristem culture. papaya ringspot virus (prsv) causes the most widespread and devastating disease in papaya. isolates originating from different geographical regions in south india were collected and maintained on the natural host, papaya. the entire coat protein (cp) gene of papaya ringspot virus-p biotype (prsv-p) was amplified by reverse transcription-polymerase chain reaction (rt-pcr). the amplicon was inserted into the pgem-t vector by the t-a cloning method, sequenced and subcloned into the bacterial expression vector prset-a using a directional cloning strategy. the prsv coat protein was overexpressed as a fusion protein in e. coli. sds-page revealed that the cp was expressed as a ~40 kda protein. the recombinant coat protein (rcp) fused with a 6x his-tag was purified from e. coli using ni-nta resin. the antigenicity of the fusion protein was determined by western blot analysis using antibodies raised against purified prsv. the purified rcp was used as an antigen to produce high-titer prsv-specific polyclonal antiserum. the resulting antiserum was used to develop an immunocapture reverse transcription-polymerase chain reaction (ic-rt-pcr) assay, and its sensitivity was compared with that of elisa-based assays for detection of prsv isolates. 
ic-rt-pcr was shown to be the most sensitive test, followed by dot-blot immunobinding assay (dbia) and plate-trapped elisa. key: cord-023208-w99gc5nx authors: nan title: poster presentation abstracts date: 2006-09-01 journal: j pept sci doi: 10.1002/psc.797 sha: doc_id: 23208 cord_uid: w99gc5nx nan background and aims: homodimerization of the myd88 adapter protein is essential for nf-kb activation in the inflammatory pathway triggered by il-1 and tlr [1]. we designed a peptidomimetic of the myd88 tir domain consensus peptide arg-asp-val-leu-pro-gly-thr [2], named st2825. here, we report its synthesis and biological activity. we also report the synthesis and biological activity of its enantiomer, st3511, and its diastereoisomers, st3489 and st3558. methods: the structure of the myd88 tir domain consensus peptide is subdivided into three distinct portions, the most important of which is a β-turn. in the peptidomimetic design we replaced the β-turn with a previously described tricyclic spirolactam [3, 4]. we synthesized this building block, its enantiomer and two of 8 possible diastereoisomers by "in solution" synthesis. based on semiempirical calculation of the heat of formation [5], we could predict the stereochemistry of the 4 products selectively obtained in the last cyclization step. results: these four compounds were tested for their biological activity by reporter gene assay (rga). some coimmunoprecipitation experiments were also carried out and we report their results. conclusions: the results show the activity of st2825 and its isomers on our target, with limited specificity towards their stereostructure. 
introduction of a methylene bridge between the cα(i+1) and the n(i+2) atoms in an open peptide (i) to mimic simultaneously the cαh(i+1) and hn(i+2) protons (β-lactam scaffold-assisted design, β-lsad) has proven to be a practical tool for the preparation of monotopic β-turn peptidomimetics (ii, r2 = r3 = h), according to the principle of separation of constraint and recognition elements [1]. in this work we report a short, general and stereocontrolled synthesis of multitopic β-lactam scaffolds of type vi. α-alkyl serinates iii are converted into the corresponding enantiopure n-nosyl-aziridines iv, which undergo "in situ" ring-opening with amino acids v. subsequent base-promoted cyclization affords the n-protected α-alkyl-α-amino-β-lactams vii. incorporation of the novel scaffolds into linear and cyclic peptides and their conformational features are also presented, most of them showing stabilized β- and γ-turn conformations. poly(amino acids) are emerging as promising therapeutic carriers, finding widespread application in the field of drug delivery. in this context, polyproline polymers have been used to solubilize poorly water-soluble proteins, in affinity chromatography for the purification of platelet profilin and, more recently, in the design of dendrimers. poly(amino acids) are most conveniently synthesized by polymerization of the corresponding amino acid n-carboxyanhydride (nca). in spite of the interest in polyproline, the preparation of proline n-carboxyanhydride (pro-nca) gives poor synthetic yields. in this work a new method for the preparation of pro-nca in high yield and purity is described. amino acid n-carboxyanhydrides are obtained by the method described by fuchs. in the case of proline, however, the n-carbamoyl chloride does not cyclise spontaneously as it does with other amino acids, and a non-nucleophilic base is required for the cyclisation. 
a tertiary amine, such as triethylamine, is commonly used, but it gives a low conversion of the n-carbamoyl chloride to the expected pro-nca, together with the pro-pro diketopiperazine byproduct. in the present work, polymer-supported bases have been used instead of triethylamine. higher yields of pro-nca and very low percentages of diketopiperazine have been obtained. in addition, no tertiary amine contamination was observed. the polymer-supported bases could also be recycled, and pro-nca yields were reproducible. in conclusion, we have developed an efficient method for pro-nca preparation with polymer-supported bases. the introduction of novel nonproteinaceous heterocyclic amino acids into peptides results in new compounds with interesting structural, physicochemical and biological properties. the transformation of amino acid side chains after the peptide assembly is a convenient method of generating such modified peptides. taking into account the biological activity and complexing abilities of nitrogen-containing heterocycles, we investigated the formation of imidazole, benzimidazole and quinoxaline moieties using condensation with various aldehydes and α-dicarbonyl compounds after classical peptide synthesis on solid support. the imidazole synthesis utilizes the n-terminal or side-chain amino group of amino acids, whereas a derivative of phenylalanine, β-(4-amino-3-nitrophenyl)alanine, was developed for benzimidazole and quinoxaline synthesis. the modified peptides were purified by preparative hplc and characterized by esi-ms, uv and nmr. in conclusion, we developed a straightforward method for the synthesis of peptides with specific ion affinity and spectral characteristics. the broad range of commercially available aromatic aldehydes and dicarbonyl compounds makes possible the synthesis of combinatorial libraries of modified amino acids and peptides. part of this work was supported by grant no. 3t09a 110 28 from the ministry of education and science. 
nmda receptors belong to the ionotropic group of glutamate receptors. the activity of the receptor can be altered by compounds acting at binding sites. (r,s)-(tetrazol-5-yl)glycine (tg) has been shown to be a highly potent nmda (n-methyl-d-aspartic acid) receptor agonist with excitotoxic effects [1]. the aim of our studies was to investigate the chelating ability of tg towards copper(ii) ions. copper is widely distributed throughout the body, with a distinct concentration in the brain. copper enters cells as a complex and seeks out targets that require it to function. for these reasons it was interesting to evaluate the stability and structure of tg-copper(ii) complexes. the equilibrium and structural properties of the complex species were characterized by ph-metric and spectroscopic (uv-vis and epr) methods. in the system, polymeric species are dominant in the acidic ph range, having {nh2, coo-} coordination with possible ntetr bridging elements. monomeric complexes were found at physiological ph. the two tg molecules are bound to the copper ion via four nitrogen donors. the formation of two {nh2, ntetr} donor sets results in very strong metal-ligand interactions, and the complex species are very stable over a wide ph region. we have also performed an investigation on similar tetrazole compounds in order to compare the chelating ability of the tetrazole moiety. the targets of our studies were 1,5-diamino-1h-1,2,3,4-tetrazole [2] and tetrazole aspartic acid. continuing work in that field, we synthesized oxytocins containing tetrazole analogues of amino acids. the 5-tetrazolyl group is widely used in medicinal chemistry as an isostere of the carboxyl group. compounds containing a tetrazole ring appear to be metabolically more stable than their carboxylic analogues and have comparable acidity. we synthesized derivatives of aspartic, glutamic and alpha-aminoadipic acids containing a 1h-tetrazole ring in the side chain. 
these derivatives were then used for syntheses of oxytocin analogues substituted in position 4. in addition, we obtained two analogues with a tetrazole analogue of glycine in position 9. the first one contains a 1h-tetrazole ring; the second one has a tetrazole ring substituted with a methyl group in position 1. oxytocin analogues possessing amino acids with a tetrazole ring in the side chain were synthesized on amide resin using fmoc methodology. in the case of analogues with a c-terminal tetrazole ring, fragments 1-7 were synthesized on resin and then coupled with suitable dipeptides in solution. all the peptides obtained show no pressor and rather low uterotonic activity. however, for some analogues the uterotonic activity measured in the presence of magnesium ions was several times higher. in humans, two classes of defensins, α-defensin and β-defensin, have been identified on the basis of tissue specificities and structural features, including their modes of disulfide pairing. in general, particular combinations of disulfide bonds in cysteine-containing peptides are critical for expressing their intrinsic biological activities. in the case of human α- and β-defensins, however, disulfide isomers without the native pairing were demonstrated to exhibit antimicrobial activity similar to that of the native defensins. therefore, to assess the biological activities of defensins as well as defensin-based therapeutics, extreme care is required in the chemical synthesis of these peptides to avoid ambiguity in quality. in the present study, we synthesized human α-defensin-1, -2 and -4, and human β-defensin-1, -2, -3 and -4 by employing boc chemistry, and determined the optimal conditions for folding the respective reduced peptides preferentially into a native conformation. among the factors affecting the oxidative folding in the presence of reduced and oxidized glutathione, the buffer concentration and reaction temperature were essential. 
all the synthetic human α- and β-defensins were confirmed to have the respective native disulfide pairing by sequence analyses and mass measurements of cystine segments obtained by enzymatic digestion. all the human α- and β-defensins could be efficiently oxidized to the α- and β-defensin-type disulfide structure, respectively, under several conditions determined in the present study. these synthetic peptides of high homogeneity were used to accurately assess the antimicrobial activity. native chemical ligation is based on the reaction of a peptide bearing a c-terminal thioester group with an n-terminal cysteinyl peptide, leading to the formation of an amide bond at the aa-cys junction. the key starting materials for native chemical ligation are unprotected c-terminal thioester peptides. thioester peptides are often prepared using boc/benzyl solid-phase peptide chemistry. however, the widespread use of fmoc/tert-butyl chemistry for peptide synthesis, in preference to the boc/benzyl method, has stimulated the development of methods for preparing thioester peptides that are compatible with the basic treatments used to remove the fmoc alpha-amino protecting group. we report here a novel method for thioester peptide synthesis that is based on the use of the sulfonamide safety-catch linker. once the peptidyl chain is assembled by fmoc/tert-butyl chemistry, the thioester function is generated on the solid phase through an intramolecular n,s-acyl shift. the procedure seems to be insensitive to the bulkiness of the amino acid directly attached to the sulfonamide linker. the thioesters were successfully used for native chemical ligations in solution or on the solid support. we optimized the recognition sequence of the substrate and the reaction conditions with respect to the yield. 
the sortase-mediated ligation was successfully applied to the synthesis of cell-penetrating peptide-pna conjugates, which showed enhanced activity in antisense experiments compared to pna alone. this ligation strategy was also employed for the coupling of a chemically synthesized construct of the extracellular loops of the crf receptor with the corresponding n-terminal receptor domain, which was expressed in e. coli. this 23 kda protein behaves like an artificial receptor, specifically binding natural ligands. linear gramicidins represent the most investigated family of antibiotic peptides forming ionic channels. gramicidins produced by bacillus brevis are hydrophobic peptides composed of 15 amino acids with strictly alternating d and l configurations. the presence of d-amino acids in the sequence of gramicidin a (hco-val-gly-ala-dleu-ala-dval-val-dval-trp-dleu-trp-dleu-trp-dleu-trp-nhch2ch2oh) should make the peptide highly resistant to proteolysis [1]. striking features such as the ethanolamine group at the c-terminus, the n-terminal n-formylated valine and the high hydrophobicity of the peptide sequence make the solid-phase synthesis of gramicidin a very tricky. therefore, we followed a new synthetic strategy for peptide chain elongation assisted by microwave energy. microwave energy has been demonstrated to produce high-purity compounds with shorter reaction times, enhancing coupling rates and efficiency in difficult syntheses [2]. however, microwave-assisted solid-phase peptide synthesis (mw-spps) has not yet been extensively investigated. in this context, we synthesized gramicidin a by mw-spps in high yield and purity, with an enhanced reaction rate compared to traditional spps. thermal disruption of peptide aggregation, induced by microwaves, is possibly favorable for obtaining this particularly difficult sequence. 
gramicidin a was incorporated in synthetic lipid bilayers, self-assembled on mercury electrodes and characterized by hydrophilic spacers interposed between the metal and the lipid bilayer. we tested the behaviour of gramicidin a in biomimetic membranes using electrochemical impedance spectroscopy (eis), ac voltammetry and other electrochemical techniques [3]. csf114(glc) is an n-glucosylated peptide to be produced on a large scale by peptlab because it is the active molecule of the first specific diagnostic/prognostic test for monitoring disease activity and guiding therapeutic treatment of multiple sclerosis patients [1]. in order to develop a synthetic protocol for automated instrumentation that improves yield and crude purity and shortens reaction time, a microwave-assisted solid-phase peptide synthesis was validated by comparing the new generation of triazine-based coupling reagents (tbcrs) with a series of commonly used ones. activation of carboxylic acids by tbcrs is particularly effective because of the formation of triazine "superactive esters". the usefulness of tbcrs as coupling reagents has recently been confirmed in the synthesis of z-, boc- and fmoc-protected dipeptides, sterically hindered amino acids, in the synthesis of esters, in manual and automated spps of difficult peptide sequences, and of head-to-tail constrained cyclopeptide analogues [2]. moreover, we also demonstrated that tbcrs are efficient in a microwave-assisted solution synthesis of the n-glucosylated building block fmoc-asn(glcoac4)-oh using a manual monomode microwave instrument [3]. this building block was used to obtain csf114(glc), comparing the efficacy of a monomode microwave automatic instrument with traditional solid-phase peptide synthesizers such as the manual and automatic batch systems, as well as the continuous-flow one. 
it is known that enzymatic peptide synthesis is more advantageous than chemical synthesis in many respects: it is highly stereoselective, racemization-free and requires minimal side-chain protection. the method is, however, limited to amino acid derivatives that meet the specificity of the enzyme as a coupling component. this problem may be solved using enzymes with wide substrate specificity, but in this case secondary hydrolysis of the resulting peptide may arise from the inherent nature of the protease. in this context, ficin and ficin-like enzymes were used as cysteine proteases to examine the reduction in substrate specificity. the cysteine protease-catalyzed peptide coupling reaction was studied using fourteen synthetic boc-amino acid phenyl and naphthyl esters as acyl donors. the reaction conditions were optimized for organic solvent, ph and concentration of acceptor. the coupling reaction was carried out by incubating an acyl donor (1 mm) with an acyl acceptor (ala p-nitroanilide, 35 mm) and enzyme (0.1 u) in a mixture of gta buffer (50 mm, ph 9.0) and dmso (3:2) at 37°c. the progress of the coupling reaction was monitored by rp-hplc. the products were obtained in satisfactory yields. non-enzymatic glycosylation, also called glycation, is a common modification in living organisms formed by the reaction of carbohydrates with free amino groups of peptides and proteins. it is a slow chemical reaction yielding amadori products, which undergo further oxidation and degradation reactions, finally leading to advanced glycation end-products (age). amadori products are early markers for ageing, diabetes mellitus and alzheimer's disease. despite the clinical importance of these amadori products, universal protocols to synthesize amadori-modified peptides are still missing. here we describe a solid-phase strategy for the glycation of specific amino groups on partially protected resin-bound peptides using a global post-synthetic approach. 
the peptides were synthesized by standard fmoc/tbu chemistry using carbodiimide activation. the lysine to be modified was incorporated with a methyltrityl-protected ε-amino group, which can be selectively cleaved after completion of the peptide synthesis with 1% tfa in dichloromethane. the partly deprotected peptide was glycated in methanol using a ten-fold molar excess of 2,3:4,5-di-o-isopropylidene-aldehydo-β-d-arabino-hexos-2-ulo-2,6-pyranose and nabh3cn for 18 h at 70°c. after cleavage the overall yields were in the range of 50-70% for the tested octapeptides. all byproducts were well separated by rp-hplc, allowing a simple purification strategy even for medium-sized peptides. thus the general strategy presented here allows routine synthesis of amadori peptides in reasonable yields and purities using standard protocols established in most laboratories synthesizing peptides. 2-chlorotrityl chloride resins are recommended for the synthesis of c-terminal proline peptide acids to overcome diketopiperazine formation during chain assembly. however, we have found these (and similar) resins to be unsuitable for the synthesis of peptides longer than 20 residues. for example, the chemokine guinea pig eotaxin (73 residues, c-terminal proline) assembles poorly, if at all, on a 2-chlorotrityl resin. we sought to circumvent these problems in the chemical synthesis of peptides and proteins through the development of a resin-swap procedure, whereby the initial c-terminal protected tripeptide is assembled on a 2-chlorotrityl resin, liberated from the solid support, then reattached to a resin that is suited to long-chain peptide/protein synthesis. using this approach, the synthesis of guinea pig eotaxin is reported. the tripeptide fmoc-thr(but)-lys(boc)-pro-oh was assembled on 2-chlorotrityl resin, cleaved with 20% tfe in dcm and attached to wang resin using standard protocols. 
peptide assembly gave the gp eotaxin in 53% overall yield (as determined by uv monitoring). fmoc-on cleavage, purification and tag removal followed by folding gave the native chemokine in good yield. choice of resin is one of the most critical factors in ensuring a successful peptide synthesis. we have shown the superiority of wang resin over chlorotrityl resin in the synthesis of medium and long peptides and developed a method for the synthesis of c-terminal proline-containing peptides that overcomes the problem of diketopiperazine formation. the technique is being applied to the synthesis of other c-terminal proline peptides, e.g. human eotaxin and ip10. dimerization of cell receptors involved in antigen presentation is an essential step in several cellular signal transduction processes; therefore, substances that are able to modulate such a process are of potential therapeutic value. dimeric peptide ligands could represent useful tools to cause dimerization of such receptors. a similar strategy applies dimerization of ligands that interact with dimeric proteins or proteins with multiple binding sites to design molecules with enhanced affinity. dimeric analogs of the immunosuppressory hla class ii fragments were synthesized using suitably modified standard fmoc solid-phase protocols and mbha resin. the dimerization was achieved by crosslinking the n-terminal amino groups of the peptides with a commercially available mixture of poly(ethyleneglycol)biscarboxylic acid (average mw 600, length range 30-45 å), activated by esterification with pentafluorophenol. the same procedure was applied to synthesize a series of dimeric analogs of c-terminal fragments of plexin-b, consisting of two undecapeptides linked by the polyethyleneglycol spacers. other bi- and polyvalent linkers were also investigated. 
our results demonstrated that amino-terminal dimerization of the tested hla fragments resulted in enhanced immunosuppressive activities, whereas interaction of the pdz dimer with the plexin fragments led to an about 20-fold increase in affinity compared to their monomeric counterparts. [background and aims] elucidation of alzheimer's disease (ad)-related aβ1-42 dynamic events is a difficult issue due to uncontrolled polymerization. [methods] based on the "o-acyl isopeptide method" (chem. commun. 2004, 124; j. am. chem. soc. 2006, 128, 696), we have developed a novel photo-triggered "click peptide" of aβ1-42 (1), e.g., "26-n-nvoc-26-aiaβ42 (2)", in which a 6-nitroveratryloxycarbonyl (nvoc) group was introduced at ser26 in 26-o-acyl isoaβ1-42 (26-aiaβ42, 3). [results] i) the click peptide 2 did not exhibit a self-assembling nature under physiological conditions due to one single modified ester; ii) photo-irradiation of the click peptide 2 and subsequent o-n intramolecular acyl migration afforded the intact aβ1-42 (1) with a quick and one-way conversion (so-called "click"); and iii) no additional fibril inhibitory auxiliaries were released during conversion to aβ1-42 (1). [conclusions] this method provides a novel system useful for investigating the dynamic biological functions of aβ1-42, such as the self-assembly and aggregation processes in ad. several insulin analogues have recently been introduced clinically for improved treatment of diabetes. industrial production of such insulins is based on microbial expression systems, which are highly efficient but generally limited to the 20 proteinogenic amino acids. also, some sequences form inclusion bodies or fail to express. the total chemical synthesis of insulin on a research scale was a landmark achievement in peptide science. however, the most commonly used method relies on recombination of a- and b-chains under "random" folding and pairing of the three disulfide bridges. 
this folding/oxidation step is difficult and low yielding. a general approach using a removable auxiliary that can direct correct formation of disulfide bridges is highly desirable. in the pancreas as well as in microbial expression systems, insulins are prepared and folded as single-chain precursors, with a c-peptide connecting the a- and b-chains. the c-peptide helps direct the orientation of the a- and b-chains in obtaining the correct disulfide pairing and overall peptide folding. upon folding, the c-peptide is removed enzymatically. we report here a new method for total chemical synthesis of insulin by use of fmoc-based step-wise solid-phase synthesis of single-chain precursors followed by c-peptide-directed folding and cleavage of the c-peptide, thereby allowing total chemical synthesis of novel insulins with unnatural substitutions. 2-chloro-4-methoxy-1,3,5-triazines 1a-c anchored on cellulose, silica or wang resin were prepared by the treatment of 2,4-dichloro-6-methoxy-1,3,5-triazine with the appropriate solid support in the presence of a base. immobilized, environmentally friendly triazine coupling reagents 3a-c were obtained by treatment of 1a-c with n-methylmorpholinium p-toluenesulfonates 2 in the presence of an hcl acceptor. the loading of the solid carriers was calculated from the n and s contents determined by microanalysis. all the prepared immobilized n-triazinylammonium toluenesulfonates 3a-c have been found to be stable at room temperature. activation of carboxylic components afforded triazine activated esters 4a-c connected to the support. treatment of 4a-c with appropriate amino components gave amides or peptides. the final products, chromatographically homogeneous amides and peptides, were isolated by filtration or extraction from the solid support. mutter's pseudoproline dipeptides and sheppard's hmb derivatives are powerful tools for enhancing synthetic efficiency in fmoc spps. 
they work by exploiting the natural propensity of n-alkyl amino acids to disrupt the formation of secondary structures during peptide assembly. their use results in better and more predictable acylation and deprotection kinetics, enhanced reaction rates and improved yields of crude products. however, these approaches have certain limitations: pseudoproline dipeptides can only be used for sequences containing serine or threonine, and the coupling of the amino acid following the hmb residue can be extremely difficult. to alleviate some of these shortcomings, we have prepared fmoc-ala-(dmb)gly-oh and fmoc-gly-(dmb)gly-oh. these dmb-dipeptides can be incorporated into peptides in place of ala-gly and gly-gly, resulting in peptides containing structure-breaking (dmb)gly residues. by introducing the (dmb)gly residue as part of a dipeptide unit, the need to acylate the highly hindered secondary amino group of (dmb)gly is avoided. on treatment with tfa the dmb group is cleaved, regenerating gly. to test the efficacy of our new derivatives in expediting the synthesis of hydrophobic peptides, we undertook the preparation of the challenging neurotoxic prion peptide 106-126 1; this peptide reportedly cannot be made using fmoc spps methods. the dipeptides marked in bold were systematically substituted with the appropriate dmb peptides. the effects of the substitution were evaluated using conductivity monitoring and lc-ms analysis of the crude peptides. h-lys-thr-asn-met-lys-his-met-ala-gly-ala-ala-ala-ala-gly-ala-val-val-gly-gly-leu-gly-oh 1 th120 efficient dipeptide production from unprotected l-amino acids with the novel enzyme l-amino acid α-ligase. k. tabata, h. ikeda, m. yagasaki, s. hashimoto background and aims: application of α-dipeptides has been limited due to the lack of cost-effective manufacturing methods. the known methods require the protection of amino acid(s) to fix the order of the amino acids (fig. 1). 
furthermore, they are usually accompanied by the formation of longer peptides. to establish a cost-effective manufacturing method, a novel activity that synthesizes α-dipeptides from two unprotected l-amino acids was screened. methods and results: a gene was found in the genome of bacillus subtilis by in silico screening based on a putative reaction mechanism. the purified protein encoded by the gene i) catalyses α-dipeptide formation from unmodified l-amino acids with a specific order in an atp-dependent manner, ii) never forms tri- or longer peptides, and iii) accepts a wide variety of l-amino acids but no d-amino acids. the enzyme was tentatively named l-amino acid α-ligase (lal). the whole-cell reaction of a recombinant e. coli strain expressing lal and polyphosphate kinase (ppk) with two l-amino acids and polyphosphate (polyp) enables the efficient production of many dipeptides with a defined order of the constituent amino acids through the coupling reaction of lal and ppk (fig. 2). conclusion: a novel enzyme, lal, enables the cost-effective synthesis of dipeptides directly from unmodified l-amino acids. t. ye marine organisms continue to provide rich sources of structurally unique and pharmaceutically active compounds. due to the difficulties in isolating significant quantities of these natural products, synthetic chemistry serves an important role in their structural assignment and biological evaluation. antifungal agents have received considerable attention recently, since the spread of hiv has left many people open to fungal infections and a rapidly growing number of drug-resistant strains of fungus is emerging. ll-15g256gamma is a cyclodepsipeptide isolated from the marine fungus hypoxylon oceanicum and structurally assigned in 1998 by schlingmann. the structure of ll-15g256gamma was determined by a combination of chemical degradation, chiral chromatography and spectroscopic analysis. 
ll-15g256gamma uniquely combines a beta-ketotryptophan and a polyketide portion within a macrolactone ring. ll-15g256gamma has exhibited potent activity against fungal strains and, as such, is an attractive compound to develop as a future therapeutic agent. to date, there have been no reported studies towards the synthesis of ll-15g256gamma. we have completed the total synthesis of ll-15g256gamma by employing macrolactamization followed by c-h oxidation as the key steps. aspartimide (aminosuccinimide, asu) formation is the first step in the degradation of asp/asn-containing peptides and proteins. the reaction is especially prevalent at asx-gly sites and results in a variety of rearranged and racemized products. the bases used in fmoc-tert-based spps promote the formation of asu and related products. we recently found that dmb backbone protection efficiently prevents secondary structure formation at gg sites and is orthogonal with respect to standard fmoc spps. here we explore the use of dmb, tmb and nbzl groups (z) for the synthesis of "difficult"/asu-prone peptides in three different schemes: a) fmoc-asx-(z)gly-oh dipeptide building blocks; b) fmoc-(z)gly-oh monomer building blocks; and c) a two-step "submonomeric" approach for the synthesis of substituted n-benzyl glycines on the resin. we tested the new methods on two model peptides, vkd/ngyi and the ha2(1-20)-hiv-tat(48-57) fusion peptide (h-g1lfgaiagfi engwegmidg20grkkrrqrrr30-oh). the yield and purity of the products reach and even exceed the level in control experiments obtained with hmb protection, and the peptides were found to be free of asu/piperidides. the acid removal of the dmb protection is ~30% faster than that of hmb. the submonomeric route (strategy c) is especially simple, efficient and cost-effective, and it allows the use of different amines for halogen displacement. 
the backbone protecting groups used were in many respects superior to the commercial reagents and applicable to the synthesis of both peptide acids and peptide amides. the use of nbzl-nh2 for halogen displacement represents a new method for the preparation of backbone-caged peptides.
alkyl-bonded silica gels have historically been the standard in reversed phase (rp) purification of biomolecules such as synthetic peptides, small proteins, and oligonucleotides. silica gels provided the resolving power needed for challenging separations and the mechanical stability required for industrial operation under high pressure conditions. the chief disadvantage of silica gels is poor chemical stability under alkaline conditions, which limits their capability to withstand rigorous clean- and sanitization-in-place (cip/sip) protocols. as a result, polymeric media have gained recent market attention because of their excellent chemical stability, which enables full compatibility with modern cip/sip protocols. however, first generation polymeric gels lacked both the resolving power and the mechanical stability to be compatible with industrial high pressure dynamic axial compression (dac) hardware. rohm and haas' advanced biosciences division recently introduced a new, monospheric, 10 micron, high performance polymeric rp material. unlike existing softer polymeric gels, this product has higher mechanical stability, which enables it to be used effectively with industrial dac/hplc hardware. in addition, this material provides high resolving power for the most challenging industrial separations because of its unique and selective pore structure, as well as its small monospheric particle size. finally, because of its excellent chemical stability, the media is not limited in the range of ph that can be used.
the combination of mechanical stability for high throughput, chemical stability for long lifetime in use, and high resolution for high yield translates into an effective cost-in-use solution for industrial polishing processes.
we have developed new types of peptide nucleic acids with improved water solubility by introducing ether linkages and pyrrolidine rings in the main chain: pyrrolidine-based oxy-pnas (popnas). in this work, cellular uptake and endosomal release of the trans-l-popna oligomers, one of the stereoisomers of the popna, were investigated. cellular uptake was achieved by combining the popna oligomer with an n-terminal 23-mer peptide of an influenza virus hemagglutinin protein (ha2) that is labeled with a rhodamine fluorophore at the n-terminus and covalently linked with a hepta-arginine unit at the c-terminus (rho-ha2-r7). fluorescence images of cho cells after incubation with fam-po(13) [fam-o-cag tta ggg tta g-gly-nh2] in the absence and presence of rho-ha2-r7 were observed by confocal laser-scanning microscopy. upon incubation with fam-po(13) alone, no internalization of the oligomer was observed. in the presence of rho-ha2-r7, however, fam-po(13) was successfully internalized into cho cells and, more importantly, the fluorescence spread over the whole cell. the fluorescence image indicates that the popna oligomer in combination with the ha2-r7 peptide was transferred into the cytoplasm within 1 h. since both the red (rho) and green (fam) fluorescence spread over the cytoplasm, the popna oligomers that were taken up into endosomes together with rho-ha2-r7 were released into the cytoplasm through disruption of the endosomes by the ha2 peptide. in summary, the popna oligomers were readily taken up into the cytoplasm of cho cells when combined with an ha2-r7 peptide.
most functional rnas have post-transcriptional modifications, some of which are quite important for their structure and function.
thus, for studying such rnas, it is necessary to use purified native rnas obtained from living organisms. isolation of native rna is also necessary for analyzing the sequence and modifications of mature rna, which may differ from the simple transcript of its gene. therefore, a method for rna isolation is required. many previous reports have demonstrated the isolation of rnas, especially trnas. the most common and traditional purification methods are based on successive column chromatography steps. it seems difficult to apply such methods to every trna, because the effective combination of columns varies among individual trnas. to overcome this difficulty, a sequence-specific selection method using a solid-phase dna has been devised. in this method, a trna can be purified from an rna mixture in a single step. however, this method needs high temperature treatment, which might promote hydrolysis of the rna strand and might impair heat-labile modifications. pna-rna hybrids are known to be much more stable than dna-rna hybrids. thus, a pna-based rna purification method seems to be applicable to a wider variety of rnas at lower temperature than the dna-based method. in this study, we attempted to purify a single rna, such as a trna or a noncoding rna, from an rna mixture by using immobilized pna.
r. pipkorn 1 , w. waldeck 2 , h. spring 3 , j. jenne 4 , k. braun 5
background and aims: safe drug delivery technologies are pivotal for genetic interventions, but viral vectors bear the risk of inflammatory reaction. questions concerning the efficacy of delivery of the genetic substances, the desired topical gene activation and targeting must be answered. therefore, we attempted to develop a membrane non-perturbing delivery system for the transport of inactive functional genes into cells and tissues. genes can be subsequently activated at the target site.
our concept is based on the use of peptide-nucleic-acids (pnas), oligonucleotide derivatives resistant to proteases and nucleases, in which the phosphate backbone has been replaced with ethylene-amine connected alpha-amino-ethylglycine units. methods: peptide conjugates were composed and synthesized according to solid phase synthesis and protecting group chemistry strategies. pna sequences were conjugated covalently and non-cleavably, with a caproic acid spacer, to the nls pkkkrkv.
pnas have gained broad attention in antisense/antigene experiments and as diagnostic tools. in principle, they can be synthesised with several activating reagents known from peptide synthesis; hatu and pybop in particular are often used. synthesis with hatu is more laborious, because preactivation is needed in order to avoid guanidinylation of the n-terminus of the growing pna chain. we wanted to use pybop, because preactivation should not be needed in this case, which is especially useful in automated synthesis. surprisingly, in the pybop-mediated syntheses of 18-mer pnas we obtained products showing molecular masses approx. 67 da above the expected ones. detailed analysis revealed that the modification occurred at the only guanine residue in the sequence. in order to further characterise the side reaction, a short pna fragment was synthesised using hatu and pybop activation, respectively, and cleaved from the resin with and without the n-terminal fmoc group. while synthesis with hatu gave the desired products, pybop partly activates the aromatic carboxy group of the guanine residue, which is substituted by piperidine during subsequent fmoc cleavage. the modified sequences could be further characterised by ms/ms fragmentation. our results show that care must be taken when synthesising pnas with pybop activation. on the other hand, this reaction possibly opens an opportunity to synthesise guanine derivatives.
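the reported mass excess of approx. 67 da can be checked with simple arithmetic. the following is a minimal sketch, assuming (this is not stated in the abstract) that the net change at the guanine residue is the addition of piperidine (c5h11n) with loss of one water molecule (h2o); the names `MONOISOTOPIC`, `mass` and `mass_shift` are illustrative only, and the element masses are standard monoisotopic values.

```python
# Standard monoisotopic masses of the elements involved (in Da).
MONOISOTOPIC = {"C": 12.0, "H": 1.007825, "N": 14.003074, "O": 15.994915}


def mass(formula):
    """Monoisotopic mass of a composition given as {element: count}."""
    return sum(MONOISOTOPIC[element] * count for element, count in formula.items())


# Assumption (not from the abstract): piperidine replaces the activated
# guanine oxygen, so the net change is +piperidine -water.
piperidine = {"C": 5, "H": 11, "N": 1}
water = {"H": 2, "O": 1}

mass_shift = mass(piperidine) - mass(water)
print(f"expected mass shift: {mass_shift:.2f} Da")
```

under this assumption the calculated shift rounds to 67 da, which agrees with the mass excess reported for the pybop-derived products.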
the opioid receptor system in the central nervous system (cns) controls a number of physiological processes including pain, reward, gastrointestinal and cardiovascular functions. as a consequence, most pain modulating compounds currently available cause a variety of side-effects. the endogenous ligands for the opioid receptors are a series of peptides that includes endomorphin-1. endomorphin-1 has been shown to elicit potent anti-nociception through the highly selective activation of µ-opioid receptors. it is this receptor that mediates supraspinal analgesia and thus, selectivity for this receptor results in analgesia without affecting other processes. therefore, endomorphin-1 is considered a promising lead compound for the development of a new, safer pain medication. we have synthesized a large number of lipid- and carbohydrate-modified endomorphin-1 analogues and screened these compounds for their binding to and activation of µ- and δ-opioid receptors in sh-sy5y cells, as well as for caco-2 cell monolayer permeability and plasma stability. compounds conjugated with either a lipoamino acid or a sugar moiety at the c-terminus lost binding affinity by several orders of magnitude, whilst n-terminal conjugation resulted in minimal loss of binding affinity. a number of analogues showed pm binding affinity and high apparent permeability, and of these compounds, one has been selected for assessment in nociceptive and neuropathic pain models. in addition to these pre-clinical studies, internalization and tolerance formation of these compounds have also been measured in an effort to synthesise a non-tolerant opioid agonist. endomorphin-1 analogues with a high degree of amphiphilicity cause increased receptor internalization and subsequently less tolerance formation.
a. marcinkowska 1 , l. borovičkova 2 , j. slaninowá 2 , z. grzonka 1
carbohydrate moieties of glycopeptides and glycoproteins play different decisive roles in various biological phenomena.
conformation and solubility of proteins are influenced by the oligosaccharide chains, which can also inhibit proteolytic degradation. as a result, the synthesis of glycopeptides is an attractive field that contributes to the understanding of mutual interactions between both moieties and is of biological interest. the synthesis of glycopeptides requires a combination of synthetic methods from both carbohydrate and peptide chemistry. moreover, this synthesis needs stereoselective formation of the glycosyl bond between a carbohydrate and a peptide (amino acid) part, and also an appropriate protecting group methodology that allows selective deblocking of only one functional group in these polyfunctional molecules. in the present work we modified the oxytocin and vasopressin structures with glycoamino acids. transformations of fmoc-protected serine and threonine derivatives into appropriate o-glycosylated precursors suitable for solid phase peptide synthesis were worked out. the α- and β-o-glycosides were synthesized from fmoc-serine and fmoc-threonine allyl esters and the appropriate glycosyl bromide using hanessian's modification of the koenigs-knorr reaction. these n-α-fmoc-protected glycosides were used in the synthesis of glycopeptides. eight analogues of oxytocin modified in position 4 were obtained. we have also prepared two types of lysine-vasopressin analogues modified with a glycoamino acid, in which glucuronic acid was attached to the ω-amino group of lysine in position 8 through an amide bond. glycosylated analogues of oxytocin and vasopressin display increased stability towards enzymatic degradation and retain some hormonal activities. supported by grants: ds/8350-5-0131-6 (zg) and z40550506 (js)
according to many authors, the formation of amadori products is a key stage in the glycation process. glycated proteins may show allergenic properties and potentially initiate autoimmunological processes. they may also serve as markers of diabetes.
to the best of our knowledge, all procedures for the synthesis of peptide-derived amadori products reported in the literature are based on an "in solution" approach, which makes them tedious and time consuming. a modified method for the solid phase synthesis of peptide-derived amadori products is proposed, based on direct alkylation of the deprotected ε-amino groups with 2,3:4,5-di-o-isopropylidene-β-d-arabino-hexos-2-ulo-2,6-pyranose in the presence of sodium cyanoborohydride. the isopropylidene groups protecting the sugar moiety in the obtained conjugate were removed with trifluoroacetic acid containing 5% water. optimization studies performed on the model peptide attached to a wang resin, fmoc-lys-leu-leu-phe-(resin), showed that the best yield of the product is attained with a two-fold excess of 2,3:4,5-di-o-isopropylidene-β-d-arabino-hexos-2-ulo-2,6-pyranose and a five-fold excess of sodium cyanoborohydride. the identity of the product was confirmed by high resolution ms. several side products were isolated and their structures will be discussed. our results prove that the synthesis of glycated peptides on the solid phase is feasible.
the lack of homogeneous glycoproteins in sufficient quantities is an ongoing challenge in glycobiology. in order to solve this problem, researchers have turned to a variety of approaches ranging from mutant eukaryotic strains to the highly demanding total synthesis of glycoproteins [1]. using rnase b as a model n-glycoprotein [2], we have sought a route to assemble this enzyme employing a combination of chemical and recombinant methods. native chemical ligation [3] allows the coupling of protein segments of unrestricted size in a chemoselective manner. we have developed solid phase methods to produce the required thioester building blocks 1-25-sr (a) and glycopeptide thioester 26-39-sr (b) containing an n-glycan at asn34 on a dual linker pega resin [4]. the remaining segment 40-124 (c) was expressed in e.
coli as a fusion protein and released by intein-mediated protein cleavage [5]. sequential coupling of the three rnase segments requires the use of a protective group at the n-terminus of segment b that is compatible with the oligosaccharide part.
dysfunctional mutations of antitrypsin can result in a loss of elastase inhibitory activity or allow self-aggregation to occur, causing emphysema and cirrhosis, respectively. insights into the disease mechanism provide strategies to cope with aberrant protein aggregation and may yield potential therapeutic agents. in the present work, we describe our effort to identify effective anti-protein-polymerization ligands by employing combinatorial technology. antitrypsin from human plasma was purified by glutathione sepharose and mono q-sepharose column chromatography. both ala-scanning and peptide shortening were carried out systematically to explore the structural requirements necessary for binding. combinatorial chemistry was then employed to conduct the library screening experiments. assessment of peptide binding was achieved through a unique gel electrophoresis assay. the structural requirements and the minimal peptide length required for binding were revealed by our systematic approach. this information was critical for the design of the combinatorial library and the discovery of antitrypsin binding peptides with much improved affinity and specificity. there is currently no effective cure for z antitrypsin related cirrhosis and emphysema. the synthesis and screening of combinatorial libraries offer avenues to increase throughput and ultimately lead to the discovery of peptides that inhibit the polymerization of pathogenic antitrypsin.
with the rapidly increasing number of biopharmaceuticals in the industrial pipeline, the need for efficient and expedient purification procedures is growing ever greater.
affinity chromatography is one of the most promising technologies in this regard, as it offers very high selectivity and can often replace lengthy and expensive traditional chromatographic procedures. the use of combinatorial split-and-mix libraries is a powerful tool for discovering new affinity ligands, but the technique has been limited by the laborious spectroscopic and chemical analysis needed to identify the binding ligand. we have previously introduced a novel bead encoding technology based on 3-dimensional image recognition of patterns made by fluorescent particles randomly distributed inside larger beads [1]. the beads are read prior to each chemical transformation by an instrument featuring three fluorescence microscopes at a rate of 5,000 beads per hour. we here present the development of small peptidomimetic affinity ligands for the human growth hormone (hgh) by the use of this technology. the library was enriched prior to synthesis by in silico screening of a virtual combinatorial library using a large number of diverse building blocks. binding ligands were identified by incubation with fluorescence-tagged hgh.
the cinnamic acids and their derivatives have been found to possess a variety of biological effects, including antiviral, antimicrobial, antitumor and antioxidant activity. for example, several hydroxycinnamic acid conjugates with amino acids, isolated from plant sources, showed enhanced antioxidant activity. the synthesis of cinnamic acid amides and their opioid activity has also been reported in the literature. however, the synthesis and pharmacological properties of sinapoyl-peptide amides continue to be virtually unexplored. on the other hand, the synthesis and opioid activity of analogs of tyr-mif-1 have been well documented by us.
herein we present the synthesis of a series of sinapoyl-peptide amides in which sinapic acid was attached to either the c- or the n-terminus of the tyr-mif-1 peptide chain: sa-pro-xaa-gly-nh2; sa-tyr-pro-xaa-gly-nh2; pro-leu-gly-nh(ch2)nnh-sa (sa = sinapic acid; xaa = leu, unusual amino acid; n = 2, 3). to obtain the sinapoyl-peptide amides, both fmoc- and boc-based spps approaches were used. analgesic activity was determined by the randall-sellitto paw-pressure test. the antioxidant effects were examined by the dpph test as well. studies to establish the importance of introducing the sinapoyl moiety into the tyr-mif-1 molecule for the antioxidant and opioid activities are underway.
several proteins are involved in the transcription of dna to mrna, among them the basic leucine zipper (bzip) proteins. these transcription factors bind specific dna sequences by dimerizing and inserting short alpha-helices into the dna major groove. because the dimerization domain is only required to obtain the correct geometrical positioning of the alpha-helices, we will replace it by a dipodal steroid scaffold with defined stereochemistry. owing to its orthogonal protecting groups, a unique feature of this scaffold is the possibility to design not only homodimers, but also heterodimers. this strategy therefore allows for the construction of both major/major groove and major/minor groove binding peptides, either mimicking naturally occurring proteins or designing peptides with new binding properties. native chemical ligation and staudinger ligation are both suitable for the construction of these peptide dimers. moreover, a combination of solution- and solid-phase chemistry allows for the generation of combinatorial libraries.
the increasing number of antibiotic-resistant bacteria is a global health problem. therefore, the development of new highly efficient drugs is one of the major tasks of this century. as an example of peptides, which inhibit the growth of e.
coli, we demonstrate an easy and rapid method for finding peptides with optimized antimicrobial properties. as a first step we built a modular construct. this construct consists of a constant cationic module and a variable module. the cationic module was chosen to achieve cell-penetrating properties. the variable module was expected to act as the virtual active part of the peptide. to increase the proteolytic stability of the peptides we synthesized them in cyclic form. in the first step we used the combinatorial approach to screen approximately 64,000,000 peptide sequences in the variable region in order to find highly active peptides against e. coli. to optimize the identified sequence, we substituted all amino acids of the sequence with other amino acids and building blocks. additionally, in order to increase stability we modified the bridging. in this way we were able to uncover peptides with high antimicrobial activity as well as proteolytic stability and reasonable solubility.
a series of nonpeptide imitations of the melanocortin active core tetrapeptide hfrw has been prepared using a combination of solution and solid phase synthesis. most of them included a residue of 3-(1-imidazolyl)propylamine or histamine as a substitute for histidine. the phenylalanine residue, which is included in melanocortins, was replaced by residues of derivatives of 4,4'-disubstituted isopropylidenedicyclohexane, 4,4'-disubstituted bicyclohexane, 1,4-disubstituted cyclohexane, 1,5-disubstituted cyclooctane, and 1,2-, 1,3- or 1,4-disubstituted benzenes. instead of arginine, residues of oligomethylene diamines, 2-butyl-2-ethyl-1,5-pentanediamine, 4,4'-methylene-bis(cyclohexylamine), and 4,4'-diaminodiphenylmethane were introduced. 2-naphthyloxyacetyl-, 4-(1h-indol-3-yl)-butyryl-, 2-phenyl-ethanesulfonyl- and naphthalene-2-sulfonyl- groups served as replacements for the tryptophan residue. when tested in a binding assay on melanocortin receptors, the active core imitations exhibited micromolar affinity for them.
isopropylidenedicyclohexane and bicyclohexane derivatives showed about 10-fold higher affinity compared with the corresponding derivatives of cyclohexane, cyclooctane or disubstituted benzene.
obestatin is a novel endogenous ghrelin-associated peptide, which is involved in the regulation of food intake and weight gain. it was shown to be anorexigenic, able to decrease food intake, gastric emptying and jejunal motility. although obestatin and ghrelin originate from a common prepropeptide of 117 residues, they are reported to exert opposing physiological roles by binding distinct receptors belonging to the subgroup of type a gpcrs [1]. obestatin was found to be the natural ligand of the orphan gpr39 receptor, a gpcr expressed in jejunum, duodenum, stomach, pituitary, ileum, liver and hypothalamus. like many other peptides involved in the obesity process, it is a new and interesting drug target for the discovery of new anti-obesity molecules. in particular, the first step in the design of new molecules with potentially improved anti-obesity activity is the elucidation of the conformational features of obestatin. here, we present the synthesis and the conformational analysis by nmr and cd spectroscopies of obestatin and its related 13-mer c-terminal sub-fragment, in aqueous solution and in a membrane mimicking environment. the data outline the obestatin c-terminal portion as the region characterized by significant conformational features, potentially open to interesting future developments.
a total of 50 isolates of rhizobium were collected from root nodules of medicago sativa and melilotus officinalis plants in different regions of isfahan province. all of the isolates formed white, slimy colonies with smooth margins on ty medium, and their inoculation onto roots of young alfalfa plants produced spindly nodules.
the nodules developed with some of the isolates were big and pinkish, although the rest of the isolates produced small and white nodules. the speed of nodulation was almost similar for all the isolates, and the related nodules appeared within two weeks. the production of brown pigments in aged colonies of some isolates on ty, or on ty supplemented with l-tyrosine and copper sulfate, revealed that these isolates of s. meliloti are melanin-producing rhizobia. based on the motility and antibiotic sensitivity tests, all of the isolates formed a reasonably homogeneous group. however, a few of them were able to produce an antimicrobial compound which was found to inhibit a number of isolates of s. meliloti. the compound did not suppress the growth of other bacteria. partial purification and spectrophotometry of the compound suggested that it likely belongs to the antimicrobial polypeptides. considering their physiological and biochemical properties, none of the isolates was selected as a superior and competitive strain, although based on nodulation efficiency and the capability to produce melanin and antimicrobial compounds, the isolates s. meliloti sm2 and sa23 were nominated for detailed investigation.
cyclotides are a fascinating family of plant-derived peptides characterized by their head-to-tail cyclized backbone and knotted arrangement of three disulfide bonds. this conserved structural architecture, termed the cyclic cystine knot, is responsible for their exceptional resistance to thermal, chemical and enzymatic degradation. cyclotides have a variety of biological activities, but their insecticidal activities suggest that their primary function is in plant defense. in this study we determined the cyclotide content of the sweet violet viola odorata, a member of the violaceae family. we identified 30 cyclotides from the aerial parts and roots of this plant, 13 of which are novel sequences.
the new sequences provide information about the natural diversity of cyclotides and the role of particular residues in defining structure and function. as many of the biological activities of cyclotides appear to be associated with membrane interactions, we used hemolytic activity as a marker of bioactivity for a selection of the new cyclotides. the new cyclotides were tested for their ability to resist proteolysis by a range of enzymes and, in common with other cyclotides, were completely resistant to trypsin, pepsin and thermolysin. the results show that while biological activity varies with the sequence, the proteolytic stability does not, and appears to be an inherent feature of the cyclotide framework. the structure of one of the new cyclotides, cycloviolacin o14, was determined and shown to contain the cyclic cystine knot motif. this study confirms that cyclotides may be regarded as a natural combinatorial template that displays a variety of peptide epitopes, most likely targeted to a range of plant pests and pathogens. furthermore, the inherent stability of the framework makes it an excellent scaffold for protein engineering applications.
warfarin is the most widely prescribed anticoagulant drug for the prevention and treatment of arterial and venous thromboembolic disorders. because of large interpatient variability in the dose-anticoagulant effect relationship and a narrow therapeutic index, careful dosage adjustment based on inr is essential. warfarin is available as a racemic mixture of two enantiomers, (s)- and (r)-warfarin. in contrast to (r)-warfarin, which is metabolized by multiple cytochrome p450s (cyps), including cyp1a2 and cyp3a4, (s)-warfarin is predominantly metabolized to 7-hydroxywarfarin by the polymorphic cyp2c9. since the potency of (s)-warfarin is much higher than that of (r)-warfarin, about 3- to 5-fold, any change in the activity of the cyp2c9 gene is likely to have a significant influence on the anticoagulant response.
previous in vitro findings revealed that certain variants in the cyp2c9 gene are associated with large interindividual differences in the pharmacokinetic and pharmacodynamic outcomes of warfarin therapy. three major alleles have been found to date in humans: arg144/ile359, cys144/ile359 and arg144/leu359, which have been designated cyp2c9*1 (wild-type), cyp2c9*2, and cyp2c9*3, respectively. we have investigated this polymorphism in iranians, in whom it has not been described previously. genomic dna was isolated from whole blood. for detection of the cyp2c9*2 and cyp2c9*3 variants, a protocol based on pcr and endonuclease digestion with kpni and avaii was used. in this research work, we have studied a group of 56 patients in whom warfarin therapy was initiated.
recently, new 21-residue antimicrobial peptides, arenicins, were isolated from coelomocytes of the marine polychaete arenicola marina and their sequences were determined [1]. there are two isoforms of arenicin, which differ by only a single amino acid. these peptides have no structural similarity to any previously identified antimicrobial peptides. we have synthesized arenicin-1, rwcvyayvrvrgvlvryrrcw, and evaluated its antibacterial properties. the linear peptide was prepared by the solid phase method using boc chemistry without any problem. however, the cyclization caused appreciable difficulties. the following methods of oxidation were used: atmospheric oxygen, k3fe(cn)6 and hydrogen peroxide in aqueous or organic media. the best results were obtained by using hydrogen peroxide in methanol, but even in this case the yield of the target peptide did not exceed 5%. synthetic arenicin had the same hplc profile and maldi-tof spectra as the natural molecule. the peptide showed antimicrobial activity against gram-positive bacteria:
peptidergic hormones and neurotransmitters are known to be produced by the specific cleavage of their precursor proteins, which per se have no biological functions.
the neutrophil-activating peptides we recently identified, however, are peptides cleaved from mitochondrial proteins by proteolysis. we therefore named them "functional cryptic peptides", because they are hidden in protein sequences. some of these peptides activate gi type g proteins directly, and neutrophils are suggested to be stimulated by the direct (i.e., not via gpcrs) activation of g proteins. these peptides shared common features in their distributions of charged and hydrophobic amino acid residues, but homologies in their primary structures were not apparent. in the present study, we predicted functional cryptic peptides that activate g proteins based on the distribution of charged and hydrophobic residues. receptors for these peptides were also investigated by direct cross-linking experiments between the peptides and their target proteins. the finding of functional cryptic peptides is expected to lead to the identification of novel signaling mechanisms in which such peptides are involved in the regulation of bio-functions.
the fragment 81-88 of the precursor of human interleukin-1alpha (pil-1α) (gkvlkkrr) appeared to have more than 80% homology with corticotropin fragment 10-18 (gkpvgkkrr). we have previously synthesized the octapeptide gkvlkkrr (referred to as leucocorticotropin, lct) and found that it binds with high affinity to corticotropin receptors on various immunocompetent cells in human and mouse. in this study we investigated the interaction of lct with rat adrenal cortex membranes and the effects of lct on the level of 11-oxycorticosteroids (cs) in rat adrenal glands and plasma in vivo. lct was labeled with tritium by the high-temperature solid-state catalytic isotope exchange reaction to a specific activity of 22 ci/mmol. receptor binding studies revealed that tritium-labeled lct bound with high affinity and specificity to the corticotropin receptor on rat adrenal cortex membranes (kd = 2.2 nm).
lct at concentrations of 0.1-1000 nm was found to have no influence on the adenylate cyclase activity in adrenal cortex membranes, while intranasal injection of lct to rats at doses of 10-50 microg/kg was found to inhibit the secretion of cs from the adrenals into the bloodstream. thus, lct is an antagonist of the corticotropin receptor.
coumarin derivatives such as warfarin are prescribed widely for the treatment and prevention of thrombosis. warfarin is the most widely used oral anticoagulant drug, but its required dose is highly variable both inter-individually and inter-ethnically. it is therefore desirable to develop strategies to predict the warfarin dose response in patients before initiation of anticoagulation. the vitamin k-dependent γ-carboxylation system consists of the vitamin k-dependent γ-carboxylase, which requires the reduced hydroquinone form of vitamin k1 as a cofactor, and the warfarin-sensitive enzyme vitamin k1 2,3-epoxide reductase (vkor), which produces the cofactor. warfarin exerts its anticoagulant effect by inhibiting the vitamin k epoxide reductase enzyme complex (vkor) that recycles vitamin k1 2,3-epoxide to vitamin k1 hydroquinone. a component of vkor, termed vkorc1, has now been identified as a therapeutic target site of warfarin. point mutations were identified within the gene encoding vkorc1 in individuals who required large doses of warfarin to maintain therapeutic anticoagulation. however, the relationship between the primary structure of vkorc1 and the mechanism of action of warfarin is poorly understood.
in previous works we have shown that naturally occurring functional protein fragments affect cell proliferation [1]. their mechanisms of action involve receptors of "classical" regulatory peptides, or are non-receptoric [1]. among the protein kinases involved are pka, camkii and mapk [2]. in the organism, bioactive functional protein fragments could participate in the maintenance of tissue homeostasis.
in the present work, the homeostatic potential of functional protein fragments was studied in comparison with classical regulatory peptides. a panel of test substances was assembled, including signal transduction modulators (pka, pkc and ca2+-channel activators), classical regulatory peptides (bradykinin, somatostatin, met-enkephalin, endothelin, neurotensin) and fragments of functional proteins (β-actin fragments from the (75-90) and (69-77) segments; valorphin (β-globin (33-39)); neokyotorphin (α-globin (137-141)); short acidic peptides from multiple precursors). their activity was tested in cultures derived from similar sources but differing in degree of transformation (e.g., mouse embryonic vs. tumor fibroblasts) and/or culturing conditions. the factors most affecting cell sensitivity to the test substances were (in order of decreasing importance): (1) cell type; (2) transformed vs. normal phenotype; (3) cell density; (4) serum supply. the activity of fragments of functional proteins, while showing a general correlation with that of the other test substances, was more strongly influenced by the culturing conditions (i.e., cell population status). thus, fragments of functional proteins could be regarded as partners of "classical" homeostasis regulators, acting as finer tuners of tissue proliferative status. the study was supported by the ras presidium programme "molecular and cell biology". histone-like proteins (hlps) in bacteria are small basic proteins that contribute to the control of gene expression, recombination and dna replication. they are also an important factor in compacting the bacterial dna in the nucleoid. among the hlps, hu protein is attracted to dna containing structural aberrations such as four-way junctions or single-stranded lesions. this protein binds as a dimer and plays an important role in bending dna. it also contributes to the initiation of dna replication.
in this study we showed that a 10-kda protein, probably hu, exists in halobacillus karajiensis, a novel gram-positive moderately halophilic bacterium that was recently isolated from surface saline soil of the karaj region, iran. since hu has been purified and characterized in e. coli, we used this bacterium as the control in this study. the 10-kda protein was extracted using 5% pca, which is normally used for extracting histones from eukaryotic cells. running the protein extracts obtained with this protocol on sds-page revealed a band at around 10 kda. these results support the hypothesis that a 10-kda protein exists in halobacillus karajiensis. sufficient oxygen and nutrient delivery is a necessity for tumors. when oxygen supply decreases, tumors initiate growth of new blood vessels. low grade astrocytomas, a class of malignant brain tumors, grow along the existing vessels in a process called co-option. hypoxia is induced in the progression from grade iii to grade iv astrocytomas (glioblastoma multiforme, gbm), which in turn triggers the formation of a new and distinct tumor vasculature. the new vessels formed by tumor-triggered angiogenesis differ in molecular composition from their normal vascular counterparts. we are utilizing phage-displayed peptide libraries to identify peptides that specifically home to either co-opted or angiogenic brain tumor vessels. furthermore, we aim to characterize differentially expressed endothelial markers (receptor molecules) to get a better understanding of the molecular changes in the vasculature. several rounds of in vivo biopanning were performed in mouse models of astrocytomas to isolate a phage pool that shows up to a hundred-fold homing to low grade tumor lesions. from the selected pool we discovered peptides capable of homing to and accumulating in the tumor islets and co-opted vasculature.
the homing potential of our newly identified peptides has been shown to be highly specific for clusters formed by the tumor cells and co-opted "early" vessels within these palisades. these homing peptides represent promising candidates to selectively target co-opted vessels and tumor lesions in the brain and to act as lead compounds in the identification of surface molecules (receptors) differentially expressed by co-opted tumor vessels. the αvβ3 integrin receptors play an important role in human tumor metastasis and growth. the inhibition of these receptors by antibodies or by cyclic peptides containing the arg-gly-asp (rgd) sequence may be used as a selective treatment to suppress the disease. our research group has previously described that the formal introduction of a single carbon atom to bridge cα(i) and n(i+1) of contiguous residues of a linear or cyclic peptide leads to α-amino-β-lactam peptidomimetics containing predictably placed β-turn and γ-turn motifs, respectively. the combination of these results with the well-known capacity of the rgd tripeptide to inhibit the biological response of integrins led us to the design of the following cyclic peptide. the adhesion and cell-growth in vitro assays using human umbilical vein endothelial cells (huvec), as well as in vivo assays with xenograft mice, revealed that the rgd peptidomimetic was active at micromolar concentrations, slightly better than the reference compound in this field, cilengitide®. whole saliva is composed of secretions from the parotid, submandibular and sublingual glands, and smaller contributions from minor salivary glands (e.g. palatal and labial). saliva contains a variety of proteins and polypeptides. one of them is statherin, a multifunctional 43-amino acid residue phosphominiprotein. this peptide is present in human parotid and submandibular saliva. the aim of our study was to investigate the stability of statherin in extracts of the major salivary glands.
submandibular, sublingual and parotid gland tissues were obtained at autopsy 12 h after death. samples of the gland tissues were homogenized and centrifuged (30,000 g, 30 min, 4 °c), and the supernatants were frozen and stored at -70 °c prior to analysis. synthetic statherin was added to the supernatants before analysis (45 µg/ml). the samples were analysed for the presence of the peptide by matrix-assisted laser desorption/ionization time-of-flight mass spectrometry (maldi-tof ms) and sodium dodecyl sulfate-polyacrylamide gel electrophoresis (sds-page). statherin was found to be decomposed in extracts of parotid and submandibular glands and also in the extract of sublingual glands. the dramatic increase in research into new anthrax therapeutic approaches was prompted by the potential use of bacillus anthracis, the causative agent of anthrax, as a biological weapon. anthrax toxin consists of three proteins: the protective antigen (pa) and the two enzymes lethal factor (lf) and edema factor (ef), which are carried through the membrane of the target cell upon binding to a specific site on the membrane receptor-bound pa. lethal factor and edema factor were found to cooperate to promote immune evasion of the bacterium. here we describe the production of peptide inhibitors of pa-lf binding, obtained by selecting pa-binding peptides by competitive panning of a phage peptide library, using recombinant lf. we selected several 12-mer peptides, which were synthesized in tetra-branched multiple antigen peptide (map) form, conferring resistance to proteolytic degradation (1) and maintaining the biological activity of the phage peptides. lead tetra-branched peptides were systematically modified by progressive shortening and residue randomization, to increase peptide affinity and inhibitory efficiency.
affinity maturation of lead sequences enabled selection of a peptide which has an ic50 at least one log lower than any other lethal-toxin-inhibiting peptide described so far and is effective for in vivo neutralization of anthrax toxin activity (2). the same peptide can also efficiently inhibit the binding of ef to pa and the ef-induced camp increase in different cell lines. microtubules are dynamic polymers that have important roles in eukaryotic cellular processes such as signal transduction, cell polarity, vesicular transport and chromosomal movement. the dynamic behavior of microtubules has been studied both in vivo and in vitro. in vivo studies of the effect of arsenic trioxide on microtubule polymerization have shown that it inhibits the formation of mitotic bundles. we studied in vitro the mechanism of the effect of arsenic trioxide on the polymerization of microtubule protein purified from sheep brain. microtubule polymerization was induced by adding 1 mm gtp to purified tubulin in pem buffer at 37 °c for 30 minutes and was followed simultaneously by measuring turbidity (350 nm). the results showed that the lag time of polymerization (nucleation step) is affected by increasing concentrations of arsenic trioxide from 0 to 5 micromolar. moreover, the rate of the elongation step decreased exponentially with increasing arsenic trioxide concentration. electron micrographs also showed a decrease in microtubule length due to arsenic trioxide. the results show that arsenic trioxide inhibits microtubule polymerization through its effect on the nucleation step as well as on the elongation rate. background and aims: alzheimer's disease (ad) is the major cause of dementia among the elderly. the increase in life expectancy worldwide urgently demands new therapies for ad. self-association of the amyloid ß-protein (aß) into neurotoxic assemblies, a seminal event in the etiology of ad, is considered to follow interactions of the c-terminus of the 42-residue form of aß (aß42).
we hypothesized that molecules with high affinity for the c-terminus of aß42 will disrupt aß42 oligomerization. a series of c-terminal fragments (ctfs) of aß42, aß(x-42) with x = 28-39, has been prepared to study their potential to inhibit aß42 oligomerization and neurotoxicity. methods: attenuation of aß42 assembly by ctfs was studied by quantitative analysis of oligomer size distributions using a photo-cross-linking assay followed by sds-page. biological activity of ctfs themselves and as inhibitors of aß42-induced neurotoxicity was assessed by mtt reduction assay using differentiated pc-12 cells. the structure of the ctfs was studied by circular dichroism (cd) spectroscopy and ion mobility spectrometry-mass spectrometry (ims-ms) coupled with molecular dynamics (md) simulations. results: ctfs were found to inhibit aß42 oligomerization in a length dependent manner with minimal or no toxicity of the ctfs themselves. certain ctfs were found to inhibit aß42-induced neurotoxicity. cd spectra indicate that increasing peptide length results in growing ß-sheet content. structures based on experimentally determined cross-sections support the existence of a previously proposed turn around residues gly37-gly38. the data suggest that aß42 ctfs can serve as lead compounds for development of peptidomimetic drugs for treatment and prevention of ad. background and aims: human kallikrein hk2 is a prostate specific serine protease, which expression level is elevated in aggressive human prostate cancer suggesting a possible role in a tumour growth and spreading. since hk2 protease is highly prostate specific, inhibition of its activity is a possible method to prevent tumour growth without interfering the function of the other proteases. we have identified hk2 specific linear peptide inhibitor by using phage display techniques. in order to design peptide for in vivo studies we tested the protease stability of the linear and the cyclic forms of the peptide. 
methods: the prerequisites for binding were studied using the conventional ala-replacement method and the optimal sequence was selected for further studies. the stability of the original linear form, the acetylated form, the peptide with a cysteine bridge and the head-to-tail cyclic peptide was tested with modified trypsin (sequencing grade) and with human plasma. results: both linear versions and the peptide with a cysteine bridge were unstable and were degraded during the first 30 minutes in both stability tests. the head-to-tail form of the peptide was stable in both tests during the first 180 minutes. conclusions: since our peptide contains arginine, there was a possibility that it would be sensitive to trypsin and other serum proteases. indeed, both linear forms and one cyclic form were degraded in our tests. only the head-to-tail peptide was stable during the first 3 hours, suggesting protease-resistant folding. background and aims: a large number of anticancer agents have been developed in recent years. however, these agents have very little or no specificity, which leads to systemic toxicity. among them paclitaxel is considered to be one of the most important drugs in cancer chemotherapy; however, this agent also lacks selectivity for tumor tissue. therefore, development of a tumor-targeting prodrug is highly promising. methods: to activate the cytotoxic agent specifically at the tumor tissue, we developed a new prodrug strategy based on o-n intramolecular acyl migration, which is a well-known reaction in peptide chemistry, and photodynamic therapy. results: we synthesized a prodrug which has a coumarin derivative conjugated to the amino acid moiety of isotaxel (o-acyl isoform of paclitaxel). the prodrug was selectively activated by visible light irradiation (430 nm) leading to cleavage of coumarin. finally, paclitaxel was released by subsequent o-n intramolecular acyl migration. conclusion: we synthesized and evaluated a novel type of paclitaxel prodrug.
this prodrug showed promising kinetic data. therefore, we believe that photoactivation can be a promising novel strategy for the design of tumor-targeting prodrugs. the search for new immunosuppressants exhibiting the mechanism of action characteristic of cyclosporine a (csa) and fk-506 is an important challenge for medicinal chemistry. cyclolinopeptide a (cla), a natural cyclic nonapeptide [cyclo(leu-ile-ile-leu-val-pro-pro-phe-phe)], possesses a strong immunosuppressive activity comparable with that of csa in low doses. the possibility of practical application of cla as a therapeutic agent is limited due to its high hydrophobicity. it has been suggested that the tetrapeptide sequence pro(6)-pro(7)-phe(8)-phe(9) is responsible for the interaction of the cla molecule with the proper cellular receptor. in order to evaluate the role of this tetrapeptide unit in the biological activity of the native peptide, we decided to modify this fragment. in this communication we present linear and cyclic cla analogues in which the phenylalanine residues in position 8 and/or 9 have been replaced with amphiphilic α-hydroxymethylphenylalanine 1 or homophenylalanine 2. the synthetic strategy and biological activity will be evaluated. resistance to currently used small molecule antibiotics develops at an alarming rate. while resistance to β-lactams in clinical isolates is primarily due to hydrolysis of the ring by β-lactamases, when bacteria develop resistance to fluoroquinolones or aminoglycosides, the sequences of the target biopolymers are altered. earlier we developed a family of antibacterial peptide derivatives that kill bacteria by inhibiting protein folding and are active in animal models of infection. in the current study we examined the synergy between antibiotics acting by different modes of action.
inhibition of properly folded active resistance enzymes was completely efficacious to recover the activity of amoxicillin, a β-lactam antibiotic against strains that were originally resistant to this molecule. some activity of ciprofloxacin was also recovered by reducing the load of the induced self-defense dnak protein, but the synergy between the antibacterial peptide and the fluoroquinolone did not yield full bacterial killing. the mode of action of the synergy is indeed inhibition of protein folding because no such effect could be observed with kanamycin where resistance involves changes in the target protein sequence. as opposed to current β-lactamase inhibitors and combination therapies that work against only a limited number of strains, inhibition of all protein folding in bacteria is a universally applicable treatment option. elimination of resistance to β-lactams by proline-rich peptide derivatives may represent a viable avenue to give second life to these antibiotics for which large stockpiles are available for pharmaceutical companies in both patented and generic forms. the integrin αiibβ3 is the major integrin-adhesion receptor on platelets. in unstimulated platelets αiibβ3 is present in a resting conformation state. upon platelet activation by agonists, αiibβ3 receives intracellular signals (inside-out signaling) that allow its rapid conversion to a high-affinity state capable of binding soluble ligands, resulting in platelet aggregation. the intracellular signals include proteins that bind to the cytoplasmic tails of the two subunits α and β of the integrin, or integrin-associated membrane proteins. in vivo charge swapping mutation studies suggested that αiib and β3 tails have a direct site of interaction between αiib (r995) and β3 (d723). peptides derived from the cytoplasmic tail sequences can specifically induce or block αiibβ3 activation in platelets. 
the aim of this study is to develop peptide analogues based on the cytoplasmic tail sequences of both αiib and β3 subunits that could inhibit platelet thrombus formation by specifically disrupting the inside-out signaling pathway. peptide analogues of the αiib and β3 subunits spanning the sequences αiib-989-1008, αiib-997-1003, αiib-997-1008, αiib-1000-1008 and β3-743-750, β3-743-756, β3-749-756 were synthesized in their free state, palmitoylated and/or tagged with the tat fragment 48-60 and carboxyfluorescein-labeled, in order to investigate their membrane permeability as well as their inhibition of platelet aggregation. inflammatory pain begins when noxious stimuli (thermal, chemical or mechanical) excite sensory neurons called nociceptors. the activation of nociceptors leads to the opening of some ion channels and depolarization of the cell membrane. one of these channels is trpv1, which is directly implicated in the thermal hyperalgesia associated with inflammation. in previous work it was found that the peptoid h-arg-15-15c (fig. 1) inhibits the activation of trpv1 by blocking the pore entrance. however, this compound showed toxicity in vivo. the aim of our work is the design and synthesis of new compounds, based on the structure of h-arg-15-15c, with better therapeutic properties. we synthesized some new non-competitive antagonists of trpv1 that exhibit notable anti-inflammatory and analgesic activity in vivo. th. skarlas, e. panou-pomonis, d. krikorian, m. sakarellos-daitsiotis, c. sakarellos erf is a transcriptional repressor with tumor suppressor activity regulated by the ras/erk signaling pathway. it has been shown that erf interacts with, and is phosphorylated by, erks in vitro and in vivo. this phosphorylation determines its subcellular localization and biological function. erf exhibits a high degree of specificity and sensitivity for erks.
the major objective of our study is to provide proof of principle for a specific anti-cancer approach targeting the ras pathway, which is commonly activated in human tumors, via stimulation of the downstream effector erf. this will be attained by modeling specific peptide inhibitors that block the phosphorylation and inactivation of erf by the ras/erk signaling pathway. we present the design and synthesis of peptide inhibitors incorporating the fsf and fkf motifs, known to play a critical role in the erf/erk interaction, in their free forms or conjugated to a carrier. ubiquitination is a well-known mechanism of protein degradation in eukaryotic cells, in which many obsolete proteins and proteins with a corrupted three-dimensional structure become marked by covalent attachment of ubiquitin through a multi-step enzymatic pathway. ubiquitin is a small, 8.5 kda peptide of 76 amino acid residues that targets such substrates for proteolysis in the proteasome. recent studies showed that an extracellular ubiquitination process also takes place in the epididymis of humans and other animals and marks proteins on the surface of defective sperm. it appears that structurally and functionally defective sperm become surface ubiquitinated by epididymal epithelial cells. a certain portion of the ubiquitin-labeled sperm is phagocytosed and the remainder is ejaculated. hence ubiquitin on the sperm surface could be a good marker for semen quality control in men. the aims of the present study are to purify ubiquitin from packed blood cells, to produce and purify anti-ubiquitin antibodies, to design an immunofluorescence assay for detection of defective sperm, to compare the percentage of ubiquitinated sperm in oligoasthenoteratozoospermia and normozoospermia, and finally to determine correlations between sperm parameters and sperm ubiquitination. p. vakalopoulou, ch. anastasopoulos, g. stavropoulos, s.
yiannis c-terminal analogues of substance p (sp) have been studied for their ability to prevent tumor growth or the proliferation of several cancer cell lines. the incorporation of d-amino acids into the sequence of sp and n-methylation of peptide bonds have been shown to protect sp from the action of plasma and tissue peptidases. aiming to design and prepare more potent antagonists of cancer cell proliferation, and taking into account that all the metabolites of the c-terminal hexapeptide analogue [arg6, d-trp7,9, mephe8]sp6-11 (antagonist g) possess the n-me group and a d-trp residue, we proceeded to the synthesis of peptoid-peptide hybrids. they are oligomeric peptidomimetics containing the residue [-n(bzl)-ch2-co-] (nphe). the incorporation of n-substituted glycine into peptide chains has been shown to improve their stability against proteases and to give biologically active peptides. thus, the tetrapeptoid-peptide hybrids h-arg1-d-trp2-nphe3-d-trp4-oh and h-d-trp1-nphe2-d-trp3-leu4-oh, corresponding to metabolites of antagonist g, and also the hexapeptoid-peptide hybrids glp1-d-trp2-nphe3-d-trp4-leu5-glu(obzl)6-nh2 and glp1-d-trp2-nphe3-d-trp4-leu5-glu(obzl)6-oh have been synthesized. the latter incorporate the amino acid residues glp at the n-terminus and glu(obzl) at the c-terminus of the analogue, which have been shown to give the analogues increased resistance and biological activity. all the products were purified (hplc) and identified (esi-ms), and are being studied for their biological properties and activity against cancer cell proliferation. the chemokine receptor cxcr4 has multiple critical functions in normal and pathologic physiology. cxcr4 is a gpcr that transduces signals of its endogenous ligand, cxcl12 (stromal cell-derived factor-1, sdf-1). the cxcl12-cxcr4 axis plays an important role in the migration of progenitors during embryologic development of the cardiovascular, hemopoietic and central nervous systems, among others.
this axis has recently been proven to be involved in several problematic diseases, including hiv infection, cancer cell metastasis, leukemia cell progression, rheumatoid arthritis (ra) and pulmonary fibrosis. thus, cxcr4 is a promising therapeutic target for these diseases. fourteen-mer peptides, t140 and its analogs, were previously found to be specific cxcr4 antagonists and were identified as hiv-entry inhibitors, anti-cancer-metastatic agents, anti-chronic lymphocytic/acute lymphoblastic leukemia agents and anti-ra agents. cyclic pentapeptides, such as fc131 [cyclo(d-tyr-arg-arg-l-3-(2-naphthyl)alanine-gly)], were previously found as cxcr4 antagonist leads based on pharmacophores of t140. in this symposium, we would like to report the development of low molecular weight cxcr4 antagonists involving fc131 analogs and other compounds with different scaffolds, including linear-type structures. erythropoietin (epo) controls the proliferation and differentiation of red blood cells. it activates epor by inducing dimerization and reorientation of two receptor chains. peptides mimicking the action of epo, epo mimetic peptides (emp), have been discovered by phage display; they interact with the receptor at the active site and compete with the hormone [1]. another peptide, epor derived peptide (erp), was reported to activate the receptor through an alternative site distant from the hormone binding site, and to act synergistically with epo [2]. we report the design of new synthetic epo-r agonists by dimerization of active peptides. peg-based polyamide linkers of precise length were used to link the molecules, using oxime chemistry [3]. these peptides include emps that have been homo-dimerized through their n- or c-terminus. a hetero-dimer of one emp and one erp peptide was also created. biological characterization of the molecules is currently under investigation.
the envelope spike (s) glycoprotein of the severe acute respiratory syndrome associated coronavirus (sars-cov) mediates the entry of the virus into target cells. recent studies point to a cell entry mechanism of this virus similar to that of other enveloped viruses, such as hiv-1. as with peptidic fusion inhibitors of other viruses, sars-cov s protein hr2 derived peptides are potential therapeutic drugs against the virus. it is believed that hr2 peptides block the formation of the six-helix bundle, a key structure in viral fusion, by interacting with the hr1 region. it is a matter of discussion whether the hiv-1 gp41 hr2 derived peptide t20 (enfuvirtide) could be a sars-cov inhibitor, given the similarities between the two viruses. we used fluorescence spectroscopy techniques to test the possibility of interaction of both t20 (hiv-1 gp41 hr2 derived peptide) and t-1249 with s protein hr1 and hr2 derived peptides. our biophysical data show a significant interaction between a sars-cov hr1 derived peptide and t20. however, the interaction is only moderate (kb = (1.1±0.3) × 10^5 m^-1, i.e. a dissociation constant of about 9 µm). this finding shows that the reasoning behind the hypothesis that t20, already approved for clinical application in aids treatment, could inhibit the fusion of sars-cov with target cells is correct, but the effect may not be strong enough for application. [1] were used to investigate the structure, dynamics and thermodynamics of the known complex between erythropoietin mimetic peptides (emp) and erythropoietin receptor (epor). with the gromacs 3.2 software, we have obtained information from the known emps about the key functional amino acids required for effective epo mimetic action. then we systematically altered the amino acids in those peptides, and simulated the complex to observe the differences between the altered peptides and the original ones. based on these results, we designed new emps of potential significance.
in order to rapidly identify the mimetic action of these new peptides, we synthesized them and labeled the epor binding peptide (ebp) with quantum dots [2], to study the binding of these new emps to epor. our results illustrate a principle for rapidly identifying receptor-specific sites important for receptor internalization, and for enhancing sensitivity to hormones and other agonists. blood vessel formation largely contributes to the pathogenesis of numerous diseases, including ischemia and cancer [1] [2]. in this regard, therapeutic strategies aim to stimulate vascular growth in ischemic tissues and to suppress it in pathologies such as tumours and diabetic retinopathy. placental growth factor (plgf), a homolog of vascular endothelial growth factor (vegf) (42% amino acid sequence identity), stimulates angiogenesis and collateral growth in ischemic heart and limb. whereas vegf exerts its biological function through binding to both vegf receptor-1 (vegfr-1 or flt1) and vegfr-2 (or kdr), plgf binds specifically to flt1. the plgf/flt1 complex constitutes a potential candidate for therapeutic modulation of angiogenesis and inflammation [3]. the binding between plgf and flt-1 involves multiple contact points [4], and a potential antagonist must have sufficient molecular surface to reach spatially distant contact points. we have used an elisa-like screening assay to select antagonists of the plgf/flt-1 complex from a large random library of tetrameric unnatural peptides (complexity: 30^3 = 27,000 molecules), identifying two active molecules with an ic50 of about 10 µm. the relative stability of the identified peptides was assessed in human serum and their inhibitory properties were tested in a capillary-like tube formation assay performed with human umbilical vein endothelial cells (huvec). the αvβ3 integrin is a cell adhesion receptor involved in angiogenesis and tumor cell invasion.
the tripeptide motif rgd is the αvβ3 minimal recognition sequence and many rgd-containing peptides have been investigated as radiopharmaceuticals for targeting angiogenesis and tumor metastatic phenotype. since rgd sequence binds also to other integrins, the aim of the present study was to develop and characterize a selective αvβ3 ligand suitable for imaging. a novel peptide containing the rgd loop covalently linked to an echistatin domain (crgdechi) was designed, synthesized and then tested for selective binding to αvβ3 integrin. a panel of peptides were used for comparison. adhesion assays showed that the novel peptide was able to inhibit adhesion of αvβ3 overexpressing cells but not αiibβ3 and αvβ5 overexpressing clones. in conclusion the novel peptide showed a high affinity and specificity for αvβ3 integrin. the design of new molecules, based on the lead compound presented here, is currently ongoing with the aim at developing novel anticancer drugs and/or new class of diagnostic noninvasive tracers as suitable tools for αvβ3 -targeted therapy and imaging. background and aims: short peptides like leu-pro-phe-phe-asp (lpffd) and leu-pro-tyr-phe-asp-amide (lpyfda) can influence the structure and aggregation of ß-amyloid peptides. soto's pentapeptide lpffd has been published as a ß-sheet breaker (bsb). it is necessary to gain more information about the nature of the interaction of aß and the pentapeptides mentioned above for the understanding of their action and for the possible development of future therapeutic agents. methods: in this study radioligand binding assay, diffusion ordered nmr spectroscopy, dynamic light scattering, circular dichroism and ft-infrared spectroscopy was used. results: it was shown by radioligand binding assay and diffusion ordered nmr spectroscopy that both pentapeptides bind to aggregated aß. 
dynamic light scattering, circular dichroism and ft-infrared spectroscopy revealed that aß fibrils are still present after treatment of aß with the pentapeptides. conclusion: both peptides can bind to aß and can cause small conformational changes of aß; however, they cannot completely prevent the formation of aß fibrils at 50-100 micromolar concentration using a 1:1 molar ratio of aß and the bsb peptide. peptide arrays are convenient tools for the analysis of antibodies and protein binding domains and for addressing other biological questions. here we present a new method to produce identical copies of arrays on microscope slides. the peptides are synthesized on modified cellulose discs, using a variation of the spot-method introduced by ronald frank more than 10 years ago [1]. the new array format overcomes several limitations of the spot-method, e.g. the low throughput with only one copy of the library and the large sample volumes that are needed for membrane incubations. for the presented arrays, modified cellulose discs with covalently bound peptides are dissolved after synthesis. the resulting solutions can be spotted onto glass slides by conventional spotting techniques. three-dimensional layers of cellulose-peptide molecules are formed on the surface of the supports used for spotting. a virtually unlimited number of identical arrays can be printed and assays are performed with a sample volume of 100 µl or less. as an application example we show mapping experiments of the streptavidin recognition site with a peptide library containing histidine-proline motifs. because of the much higher peptide loading compared to conventional arrays, the three-dimensional structure formed might be superior for protein-interaction studies, even with low binding constants.
the interaction between the cap binding protein, eukaryotic initiation factor (eif) 4e, and the scaffolding protein eif4g is critical for the formation of the heterotrimeric eif4f translation initiation (ti) complex (eif4e/eif4g/eif4a). elevated levels of eif4e and eif4g found in several human solid tumor cancers, the induction of malignant transformation in animal models by overexpression of eif4e, and the reversal of this phenotype by treatment with anti-sense rna suggest the importance of the eif4e/eif4g interaction in the excessive translation of oncogenic proteins. eif4g binds to eif4e through a conserved eif4e-binding motif yx4lφ (non-specified (x) and hydrophobic (φ) amino acids) that interacts with a hydrophobic hot spot on eif4e. we report here the identification of a putative eif4e anionic exosite that is distinct from the hot spot and contributes to the binding of eif4g-derived ligands. our strategy focuses on in situ eif4e-templated click reaction-mediated assembly of hybrids composed of an anchoring minimal eif4g-derived peptide fragment, which binds to the hot spot, and a series of complementing positively charged fragments targeting the anionic exosite. we synthesized a training set of [1, 2, 3] triazole-containing hybrid peptides that are potent inhibitors of the eif4e/eif4g interaction. moreover, we achieved in situ eif4e-templated assembly of these hybrids from the corresponding fragments via click reaction in the absence of cu(i) catalysis. as such, we demonstrate a proof-of-concept for a new paradigm in the development of inhibitors of protein-protein interactions, merging click reaction with fragment-based and in situ target-templated approaches. goodpasture disease is an autoimmune pathology caused by the accumulation of reactive autoantibodies against the alpha-3 chain of collagen iv. 
goodpasture antigenbinding protein (gpbp) is a ser/thr protein kinase that phosphorylates the alpha-3 chain and might be important in human autoimmune pathogenesis [1] . we are carrying out in our laboratories the biophysical and functional characterization of gpbp protein. in the presence of some proteins and at specific experimental conditions, gpbp participates in structurally ordered intra-and inter-protein aggregation processes. structure prediction programs identify four different domains for gpbp: an n-terminal domain showing pleckstrin homology (ph domain); a central domain with high tendency to form coiled-coils; a domain with ww features; and a c-terminal start domain ('star-related lipid-transfer'). using the tango algorithm [2] , we have identified several aminoacid sequences in the gpbp start domain of vertebrates with high tendency to participate in protein aggregation. in this work we present the synthesis and structural characterization of a collection of peptides derived from the sequences described above. we recently developed a combinatorial library screening protocol to identify hpq-containing cyclopeptides that bind streptavidin more tightly than its linear analogues. the relative affinities in ic50 of these structurally constrained ligands and its linear counterparts were measured by a captured enzyme-linked immunosorbent assay. however, their intrinsic binding kinetics remained to be elucidated. in this work, surface plasmon resonance (spr) was employed to directly determine the kinetics and thermodynamics of the ligands binding to a streptavidin chip. solid-phase peptide synthesis was carried out using standard fmoc chemistry. spr experiments were carried out using biacore3000 optical biosensor. streptavidin was immobilized onto a cm5 sensor chip using the standard amine coupling procedure. 
the equilibrium dissociation constants and kinetic on/off rates of n-to-side chain and n-to-c cyclopeptides were deduced by scatchard analysis and computational simulation, respectively. it was found that both cyclopeptides exhibited similar binding kinetics and bound streptavidin far more avidly than their linear form (1000-fold). in addition, the reversed (qph) linear and cyclic peptidyl ligands were hardly recognized by streptavidin. not only was the binding specificity distinguished qualitatively, but the entropic advantage acquired by the pre-organized conformation over its linear analogues was also demonstrated quantitatively by spr in this study. the mutation of tumour suppressor genes in the progression of cancer is well characterized. for example, p53 is found to be mutated in approximately 50% of cancers and the loss of this protein's activity has been shown to lead to the deregulation of cell growth and apoptosis. the potential of peptide aptamers to inhibit protein/protein interactions in a highly specific manner makes them very attractive as research reagents or as target validation tools in anti-cancer drug discovery. more interestingly, these molecules have the potential to inhibit the activity of proteins which are key regulators of cancer cell growth and therefore could act as synthetic tumour suppressor proteins. we used peptides based on known protein/protein interactions, as well as peptides isolated using display technologies, for the design of protein aptamers that were used to analyze pathways critical in controlling cancer cell growth. a range of scaffolds was used to present these peptides in an effort to optimize the peptides' activities. data relating to the activity of these peptide aptamers in vitro as well as in cellular systems will be discussed. the cyclic undecapeptide urotensin ii (u-ii) is the endogenous agonist for the u-ii receptor (ut), a gq coupled gpcr. 
current views suggest that binding by agonist, but not antagonist, leads to induction of stabilization of an active receptor conformation. we have previously probed the interactions of urotensin ii with rat ut (rut) using a series of photolabile u-ii analogues containing p-benzoyl-l-phenylalanine (bpa). it was found that the c-jun n-terminal kinases (jnks) are important mitogen-activated protein kinases. these ser/thr protein kinases are activated by various growth factors, cytokines, and cellular stresses. jnks have been shown to play a key role in phosphorylation of proteins in signal transduction of different diseases including cancer, neurodegenerative, cardiovascular, and inflammatory diseases. therefore, these enzymes are considered important therapeutic target proteins. the interactions of jnk with peptides are of special interest for the development of novel specific atp-noncompetitive inhibitors. interactions of this kinase and its mutants with various substrates were demonstrated in vivo using the yeast ras-recruitment system. bioinformatic tools have been developed to predict optimized binding peptides as well as to correlate sequence position and amino acid with binding efficiency in order to extract binding determinants. biomolecular interaction analyses have been performed for selected peptide sequences using surface plasmon resonance (spr) technology. real-time measurements of the binding of peptides to the different isoforms jnk2 and jnk3 resulted in the determination of affinities as well as kinetic constants for association and dissociation. experimental results and their bioinformatic analysis are discussed with respect to critical features of potential atp-noncompetitive inhibitors. the b-domain is one of the five nearly homologous domains of staphylococcal protein a. this domain contains three alpha-helices which are assembled in an antiparallel three-helix bundle. 
the b-domain binds the fc region of mammalian immunoglobulins through the n-terminal fragment that contains two alpha-helices. the c-terminal helix does not interact with fc but it is necessary for the correct folding and immunoglobulin recognition of the b-domain. to search for new peptide analogues of the c-terminal helix that bind the n-terminal fragment, a "one-bead one-compound" library of 300 peptides was designed based on the sequence of the c-terminal helix. active peptides were obtained after incubation of the library with the n-terminal fragment and rabbit immunoglobulin g labelled with fluorescein. new peptides were found and their sequences identified by maldi tof-tof mass spectrometry. the synthesis of the two most active peptides was carried out and the binding with the n-terminal fragment was confirmed by cd spectroscopy. the n-terminal fragment peptide showed an increase in helicity when the c-terminal wild type peptide or some analogues were present in solution. the complete domains with the c-terminal fragment mutations were synthesized and structurally characterized by cd and nmr spectroscopy. the wild type and the new mutants adopt predominantly an alpha-helical structure. the interaction between rabbit immunoglobulin g and the wild type b-domain and the new analogues was investigated using surface plasmon resonance. although the mutants exhibited different kinetics compared with the wild type, they were able to bind the immunoglobulin with high affinity. a. jaśkiewicz, e. bulak, h. miecznikowska, k. rolka sfti-1, a strong trypsin inhibitor, was isolated in 1999 from seeds of sunflower. it is a homodetic 14-amino-acid-residue peptide containing a disulfide bridge. because of its small size and strong trypsin inhibitory activity, this inhibitor became an interesting model for studying enzyme-inhibitor interactions. 
sfti-1 possesses one reactive site located at the lys5-thr6 peptide bond and therefore is able to interact with the enzyme in a 1:1 stoichiometry. in this report we describe chemical synthesis and kinetic studies of a series of sfti-1 analogues containing double sequences of the wild inhibitor. their structures contain combinations of disulfide bridges and/or head-to-tail cyclization. each of these analogues contains two trypsin-specific reactive sites. we expect that kinetic studies should answer the question whether such dimeric analogues are able to interact simultaneously with more than one trypsin molecule and how this fact affects their inhibitory potency. in addition, we also present two analogues in which we substituted the disulfide bridge with a carbonyl one. since a carbonyl bridge has not previously been introduced into molecules of proteinase inhibitors, we decided to check its impact on the activity and proteolytic stability of such modified analogues. ubiquitination, the covalent attachment of one or multiple polymerized ubiquitins, is a post-translational modification of proteins, which has manifold functions. it mainly marks the protein for degradation, but also for activation, deactivation or substrate alteration. due to its ubiquitous distribution in all eukaryotes, no high-affinity antibodies could be raised. high-affinity ligand peptides are of interest to study ubiquitination. based on bioinformatic considerations and investigations of ubiquitin-interacting proteins, short peptide sequences were selected. by using a peptide array, specific ubiquitin binding was monitored and quantified with label-free detection based on reflectometric interference spectroscopy (rifs). the results from rifs were confirmed by detection of binding in solution with fluorescence correlation spectroscopy (fcs) using carboxyfluorescein and s0387-labelled peptide amides. binding constants were determined by isothermal calorimetry (itc) and rifs. 
finally 1h,15n-nmr chemical shift analyses of the peptides with the highest affinity were carried out, which allowed the localisation of the interaction site of ubiquitin with the peptide. the results from all four methods correlated very well. they showed fast equilibria within 30 s and binding constants down to the low micromolar range. nmr results revealed hints of possible discrimination between lys48 and lys63 polymerized ubiquitins. 101f is a potent neutralizing mab that binds the human respiratory syncytial virus (hrsv) f protein and is a promising candidate for clinical development. the majority of neutralizing antibodies to hrsv f protein map to two regions of the protein designated site ii and site iv,v,vi. to further characterize the 101f epitope, we employed a trypsin digestion of a hrsv f protein-101f mab complex, followed by mass spectrometry analysis of the resulting recovered mab bound peptide. one peptide at m/z 3330 was captured by the 101f mab. sequence assignment was based upon mass and matched with the database from a virtual digest. this peptide was assigned as residues 420-445 [tkctasnknrgiiktfsngcdyvsnk] of the hrsv f protein which spans antigenic site iv,v,vi. to further delineate the epitope, the binding of 101f mab to a series of peptides corresponding to antigenic site iv,v,vi in the hrsv f protein was determined. based on the peptide elisa data, the 101f-binding region could be reduced to the 422-436 sequence [ctasnknrgiiktfs] . as demonstrated by the substitution analysis, r429 and k433 significantly contribute to epitope binding, but another positively charged residue, k427, makes a minor contribution to the binding. both the peptide elisa and the proteolytic digestion of the mab-antigen complex identified the same region of hrsv f protein as being critical for the binding of 101f. 
furthermore, these data confirm the results obtained using complementing genetic approaches using a panel of mutations in recombinantly expressed f protein and selection of antibody escape virus mutants (data not presented). the recently identified uracil-dna specific nuclease (ude) is the first representative of a new family of nucleases. the protein sequence has no detectable homology to other proteins except a group of sequences present in genomes of other pupating insects (vertessy et al, submitted). to analyze the physiological function of this protein, peptide conjugates were prepared to serve as synthetic antigens for the generation of antibodies against isoforms of dutpase, an enzyme inherently involved in preventing the synthesis of uracil-dna [1, 2] . we used poly[lys(seri-dl-alam)] (sak) as a synthetic branched polypeptide [3] or bovine serum albumin (bsa) as a natural macromolecular carrier. peptides were prepared by the solid phase method utilizing a syro2000 (multisyntech gmbh, germany) peptide synthesizer, using fmoc chemistry and dipci/hobt-mediated coupling on rink amide mbha resin. a c-terminal cys(acm) was added to the native sequences for incorporation of an sh group into the peptide. in the case of sak, the chloroacetylated polypeptide was conjugated with the sh-peptide to form a thioether linkage. the maleimidobenzoyl-n-hydroxysuccinimide (mbs) derivative of bsa was used to introduce the peptide into the macromolecule. antibodies have been developed as diagnostic tools and therapeutics for many different diseases. however, the isolation and preparation of intact specific antibodies is often very tedious or even unfeasible. recent studies have shown that single paratope peptides might be well capable of mimicking corresponding antigen ligands [1, 2] , suggesting that paratope peptides from a native antibody might have many advantages, e.g. for molecular vaccine design and targeting. 
we have developed a new method for identification of paratope-containing peptides by proteolytic affinity-proteome analysis in combination with high resolution fticr-mass spectrometry (fticr-ms) [3] . in the present study we used hen eggwhite lysozyme (hel) and a polyclonal rabbit anti-lysozyme antibody (helpab) as a model system. the direct determination of paratope peptides was obtained by selective binding of a dtt-cleavage mixture of the anti-lysozyme antibody to immobilised hel, followed by proteolytic digestion of the antibody-antigen complex (paratope excision, parexprot). two specific paratope peptides were identified by maldi-fticr-ms, and the corresponding peptide sequences were identified by database search within a 1-2 ppm threshold. additionally, the identified paratope peptides were synthesised and characterised by affinity mass spectrometry, which ascertained their full binding specificity to lysozyme. the propeptide blocks the active site of inactive zymogen of cathepsin d and is cleaved off during maturation. we have designed a set of peptidic fragments derived from the propeptide structure and evaluated their inhibitory potency against mature cathepsin d using kinetic activity assay. the mapping localized two segments in the propeptide involved in the inhibitory interaction with the enzyme core: n-terminus of propeptide plays a major role and the active site anchor plays a minor role according to their respective ki values. in addition, a fragment derived from the mature n-terminus of cathepsin d displayed inhibition, which supports its proposed regulatory role. the mechanism of interaction of both propeptide segments was characterized by the mode of inhibition and by spatial modeling of propeptide in cathepsin d zymogen. using fluorescence polarization measurements, kd in nanomolar range was determined for the n-terminal propeptide segment. 
the inhibitory potency of the active site anchor segment was modulated by ala38val mutation that was reported to be associated with cathepsin d pathology. . by comparing the resulting low-energy conformations using different sets of atoms, specific conformational features common only to the high/medium affinity compounds were identified. they included the spatial arrangement of the three most important pharmacophoric side chains tyr2, arg4, and nal5 as well as the orientation of the xaa3-arg4 amide bond, which together represent a "minimalistic" 3d pharmacophore model for binding of the cyclopentapeptide antagonists to cxcr4. this model rationalizes the data for the cyclopentapeptides as well as for the peptidomimetic cxcr4 antagonist krh-1636. automated docking of the pharmacophore model to the 3d structure of the tm region of cxcr4 revealed that the pharmacophoric groups of the cyclopentapeptide ligands were involved in favorable interactions with their counterparts in cxcr4. for instance, the hydroxyl group of tyr2 formed a hydrogen bond with lys38, the guanidino group of arg4 formed a salt bridge with glu288, and the backbone carbonyl of xaa3-arg4 formed a hydrogen bond with lys282. this finding gives additional support for the suggested 3d pharmacophore model, and also provides opportunities for rational design of cxcr4 mutants to map potential contacts with peptide ligands. with the successful completion of the human genome project, the next challenge is to assimilate enormous amount of genetic information generated and to assign functions to a large number of proteins encoded. although the dna chip technology to detect the abundance of mrnas has been established, it is known that the abundance of mrnas and proteins does not correlate. thus, protein detection methods for reproducible and quantitative investigation of protein networks are strongly required. 
we attempted to establish a novel protein detection system based on a fluorescent measurement that does not require labeling of target molecules and preparation of secondary antibodies. we focused on a steric hindrance caused by the interaction between a target protein and a specific capture agent. when a target protein interacts with a specific capture agent immobilized on solid surface, we assumed that a steric hindrance in the vicinity of a capture agent increases. in order to detect the differences in the steric hindrance, we utilized a fluorescent system with the staudinger reaction. this reaction is a chemical ligation between a phosphine and an azide group. these two functionalities are unreactive with protein surfaces under biological conditions. we incorporated an azide group into an immobilized capture agent and investigated the efficiency in the staudinger reaction between the azide and an external triphenyl phosphine derivative. it was found that a target protein bound to the capture agent immobilized onto the solid support interferes with the efficiency in the staudinger reaction. the major histocompatibility complex (mhc) has a crucial role to initiate the immune response via the binding of the peptide fragments (epitopes) of foreign antigens and their presentation to the t-cell receptors (tcr). the co-receptor molecule cd4 enhances the binding between tcr and mhc ii. small molecules that mimic surfaces of mhc-ii may lead to blockage of the autoimmune response and the development of drugs for immunotherapy. hla-dqa1*0501/dqb1*0201 (dq2) and hla-dqa1*0501/dqb1*0301 (dq7) are highly correlated to autoimmune diseases as sjogren syndrome (ss) and systemic lupus erythematosus (sle). the non polymorphic β regions of the modelled hla-dq7, which are exposed to the solvent and may disrupt the interaction of dq7 with cd4+ t lymphocytes were determinated using the getarea program. 
it was found that the regions 133-140 (arg-asn-asp-gln-glu-glu-thr-thr) and 59-66 (glu-tyr-trp-asn-ser-gln-lys-glu) display the highest solvent accessibility. peptide analogs of these regions were synthesized, by the fmoc/otbu solid phase strategy, purified by rp-hplc and characterized by mass spectrometry esi-ms. the dimeric analogs of the peptides, designed to mimic the superdimeric nature of the immunosuppressory fragments of hla class ii molecules were also synthesized and investigated. conformational studies were performed with cd spectroscopy and biological experiments are in progress. background and aims: aggregates of β-amyloid peptide (aβ) play central role in the etiopathology of alzheimer's disease (ad). short peptides like c. soto's pentapeptide lpffd and lpyfd-amide synthesized in our laboratory are neuroprotective agents against aβ assemblies both in vitro and in vivo. however, the mechanism of their neuroprotective effect has not yet been fully understood. methods: transmission electron microscopy (tem), cd, ft-ir, diffusion ordered nmr spectroscopy, dynamic light scattering, and radioligand binding assays were used. results: all the methods applied showed that the pentapeptides mentioned above do not break the fibrillar structure of aβ, that is these molecules are not real β-sheet breakers (bsb). the pentapeptides bind to aβ fibrils and cause small structural changes by intercalating into the aβ assemblies. fibrils of aβ survive one week treatment with the pentapeptides using them in 2 to 5-time molar excess. conclusion: all the results in our laboratory show that the short peptides have long-term interaction on aβ-assemblies. in the first step they bind tightly to the aβ surface and prevent further interaction of aβ fibrils with the neuronal membranes. after this step the short peptides can be built into the structure of aβ-assemblies with intercalation causing a less ordered β-conformation. 
proteolytic enzymes (neprilisin, ide) could cleave and hydrolyze aβ peptides after this structural change, therefore the short peptides are good drug candidates for the treatment of ad. cellular processes in normal and pathogenic cell states are regulated by external stimuli via complex networks of catalytic and non-catalytic protein-protein interactions. we have developed methodology for the synthetic variation of peptides and peptidomimetics using polymer reagents including linker reagents enabling polymer-supported cacylations. [1] in combination with the virtual screening of protein subsites, we have demonstrated the application of the novel synthetic methods to inhibitor optimization for various proteases including plasmepsin ii, hiv protease, and sars coronavirus main protease. [2, 3] moreover, multivalent peptide polymers have been developed for the intracellular targeting of proteins. [4] this methodology was now extended to the inhibition of peptide-protein interactions by small molecules. for this purpose, we have composed a library of 20,000 small molecules by algorithmic searching of a database of bioactive molecules with virtually designed substructures (fragments). high throughput assays were developed on the basis of fluorescence and fluorescence polarization detection. despite the scepticism regarding the inhibition of protein-protein interactions with small molecules, efficient hit molecules have been developed for several intracellular targets and were subjected to synthetic variation and cellular follow-up assays. the essential event in platelet adhesion to the blood vessel wall after injury or in thrombosis is the binding to sub-endothelial collagen of plasma von willebrand factor (vwf), a protein which interacts transiently with platelet glycoprotein (gp) ibalpha , slowing circulating platelets to facilitate their firm adhesion through other collagen receptors, e.g. integrin alpha2beta1 and gpvi. 
to locate the vwf-binding site in collagen iii, we synthesized 57 overlapping triple-helical peptides which comprise the whole native sequence of collagen iii. peptide #23 (gpogpsgprgqogvmgfogpkgnd (o is hydroxyproline)) alone bound vwf, with an affinity comparable to that of native collagen iii. immobilized peptide #23 supported platelet adhesion under static and flow conditions, processes blocked by an antibody which prevents the vwf a3 domain from binding full-length collagen. truncated and alanine-substituted triple-helical peptides derived from #23 either strongly interacted with both vwf and platelets, or lacked both vwf and platelet binding. thus, we identified the sequence rgqogvmgf as the minimal vwf-binding sequence in collagen iii. the present work completes our understanding of the collagen-vwf interaction, providing information on crucial sequences in collagen that perfectly complements our existing knowledge of the collagen-binding site in vwf and may assist in targeting the collagen-vwf interaction for therapeutic purposes. solid phase assay systems such as enzyme-linked immunosorbent assay (elisa), surface plasmon resonance (spr), and overlay gels are used to study processes of protein-protein and protein-peptide interactions. the common principle of all these methods is that they monitor the binding between soluble and surface-immobilized molecules. following the use of bovine serum albumin (bsa)-peptide conjugates or isolated synthetic peptides and the above-mentioned solid phase assay systems, we were able to demonstrate that positively charged peptides, which would be expected to repel each other, can interact with each other. both the elisa and spr methods showed that the binding process reached saturation with kd values ranging between 1 and 14 nm. 
no interaction was observed between bsa conjugates bearing positively charged peptides and conjugates bearing negatively charged peptides or with pure bsa molecules, strengthening the view that interaction occurs only between positively charged peptides. however, interactions between the same peptides were not observed in solution when monitored by nuclear magnetic resonance (nmr) or by native gel electrophoresis. thus, it appears that for positively charged molecules to interact one of the binding partners must be immobilized to a surface, a process that may lead to the exposure of otherwise masked groups or atoms. the relevance of our findings for the use of solid phase assay systems to study interactions between biomolecules will be discussed. the hematopoietic progenitor kinase 1 (hpk1), a mammalian hematopoiesis-specific ste20 kinase, contains a cluster of four proline-rich sequences called p1, p2, p3 and p4 located after the kinase domain. these pro-rich regions play an important role in the interactions of this kinase with different adapter proteins. previous studies showed that p1, which contains the canonical pxxpxr motif, and p2 and p4 with the canonical pxxpxk motif interact with the c-terminal sh3 domain of hematopoietic lineage cell-specific protein 1 (hs1), albeit with different affinities. hs1 protein shares high amino acid sequence and structural similarity with cortactin although their functions differ considerably. here we report the results of our investigation on the interaction between the c-terminal sh3 domain of cortactin and the four proline-rich motifs of hpk1. these interactions were analyzed by non-immobilized ligand interaction assay by circular dichroism (nilia-cd). upon peptide addition, the binding was monitored by the cd changes of the trp side-chains of the conjugate gst-sh3cort. the dissociation constants kd were determined by analyzing the cd data at 294 nm using a nonlinear regression method. 
the results demonstrate that gst-sh3cort displays a higher binding affinity than that found with the corresponding hs1 domain and that the four hpk1 pro-rich regions are not equivalent. p2 appears to bind with the highest affinity (kd=0.5 µm), followed by p1 (kd=10 µm) and p4 (kd=33 µm), whilst p3 does not interact at all. the generation of a fibrin clot is mediated by the regulated activation of a series of serine proteases and their cofactors. factor viii in its activated form, fviiia, acts as a cofactor to the serine protease fixa, in the conversion of the zymogen fx to the active enzyme fxa. both fviii and fix are essential for normal coagulation; deficiencies of either are associated with the bleeding diatheses hemophilia a and b, respectively. the role of fviiia is to bind factor ixa, generating the phospholipid-dependent intrinsic factor xase complex. at least two interactive sites have been identified for the enzyme-cofactor interaction. the ser558-gln565 region within the a2 subunit has been shown to be crucial for the viiia-ixa interaction. in an attempt to study this interaction, we synthesized a series of peptides of the 558-565 loop of the a2 subunit. the syntheses of these peptides were carried out by using spps and fmoc/but methodology. the synthesized compounds were purified by rp-hplc and lyophilized to give fluffy solids, identified by ft-ir, nmr and es-ms spectra. these compounds were tested for inhibitory activity on human platelet aggregation in vitro, by adding common aggregation reagents to citrated platelet rich plasma (prp). the aggregation was determined using a dual channel electronic aggregometer by recording the light transmission. and 120 ci/mmol, respectively. both tritiated aβ peptides were used in cat brain (in vivo) experiments and it was found that the peptide aggregates enter the neurons within 30 min (electron microscopic autoradiography). this transport is most probably an endocytotic pathway. 
aβ aggregates could also interact with cytoplasmic proteins such as 3-phosphoglyceraldehyde dehydrogenase etc. we suppose that aβ assemblies can interact both with membrane receptors (nmda, ampa, ach) and with cytoplasmic proteins, triggering neuronal dysfunction and death. background and aims: ww domains are the shortest known protein domains and contain a stable three-stranded β-sheet, which presents the binding site for proline-rich ligands. this interaction is mediated by hydrophobic interactions between aromatic and hydrophobic residues of the domain, and the polyproline core of the ligand. as part of our ongoing efforts aimed at synthetically mimicking conformationally defined protein binding sites (1), we have designed and synthesized linear and cyclic peptides covering the binding site of the ww domain of human yes-associated protein (hyap-ww), whose structure in complex with a proline-rich ligand had been solved by nmr spectroscopy (2). methods: peptides were synthesized by spps, purified by hplc, and characterized by 2d-nmr spectroscopy, as well as by molecular dynamics calculations. affinities of the peptides to a hyap-ww ligand were determined in direct and competitive binding assays. results and conclusions: a cyclic peptide covering the sequence stretch of hyap-ww that contains its primary contact residues for proline-rich ligands was found to compete with the domain for binding to a hyap-ww ligand. long-range noes identified in the nmr spectra of this peptide indicate a conformation in which sequentially distant residues are brought into spatial proximity, likely through formation of a beta-sheet. these results demonstrate the feasibility of functional, as well as structural, mimicry of conformationally defined protein binding sites through synthetic peptides. the rockefeller university, new york, ny, usa integrins constitute a family of transmembrane cell surface receptors. they are involved in cell-cell and cell-extracellular matrix interactions. 
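the abstract does not state which binding model was fitted to the cd data at 294 nm. as a minimal sketch only, the following assumes a simple 1:1 binding isotherm with the peptide in excess (free peptide concentration taken as equal to total), so that the cd change is `delta_max * c / (kd + c)`. the function names, the search bounds and the synthetic data are hypothetical and are not taken from the study. for a fixed `kd` the amplitude `delta_max` has a closed-form least-squares solution, so the nonlinear regression reduces to a one-dimensional search over `kd`.

```python
import math


def one_site_signal(conc, kd, delta_max):
    """CD change predicted by an assumed 1:1 binding isotherm."""
    return delta_max * conc / (kd + conc)


def best_delta_max(concs, signals, kd):
    """Least-squares amplitude for a fixed kd (the model is linear in it)."""
    frac = [c / (kd + c) for c in concs]
    return sum(f * s for f, s in zip(frac, signals)) / sum(f * f for f in frac)


def residual_sum_of_squares(concs, signals, kd):
    """Sum of squared residuals at a given kd, using the best amplitude."""
    delta_max = best_delta_max(concs, signals, kd)
    return sum(
        (s - one_site_signal(c, kd, delta_max)) ** 2
        for c, s in zip(concs, signals)
    )


def fit_kd(concs, signals, kd_low=1e-3, kd_high=1e3, iterations=200):
    """Golden-section search over log10(kd); bounds are arbitrary assumptions."""
    ratio = (math.sqrt(5) - 1) / 2
    lo, hi = math.log10(kd_low), math.log10(kd_high)
    for _ in range(iterations):
        left = hi - ratio * (hi - lo)
        right = lo + ratio * (hi - lo)
        rss_left = residual_sum_of_squares(concs, signals, 10 ** left)
        rss_right = residual_sum_of_squares(concs, signals, 10 ** right)
        if rss_left < rss_right:
            hi = right
        else:
            lo = left
    kd = 10 ** ((lo + hi) / 2)
    return kd, best_delta_max(concs, signals, kd)


# synthetic, noise-free titration (concentrations in µm), for illustration only
concs = [0.1, 0.3, 1.0, 3.0, 10.0, 30.0]
signals = [one_site_signal(c, kd=0.5, delta_max=2.0) for c in concs]
kd_fit, delta_fit = fit_kd(concs, signals)
print(f"kd = {kd_fit:.3f}, delta_max = {delta_fit:.3f}")
```

with real titration data the peptide is not always in large excess over the protein, in which case a quadratic (ligand-depletion) binding equation would be the more appropriate model; the sketch above does not cover that case.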
thus, they participate in many physiological and pathophysiological processes and are of crucial importance for the living organism. integrins possess two non-covalently bound subunits, α and β, that jointly participate in ligand binding. these dimeric proteins show very high specificity in recognition of natural ligands. for example, α4β1 integrin recognizes vcam-1 (vascular cell adhesion molecule 1) and fibronectin through binding amino acid motifs tqidspln and ldv, respectively. on the other hand, fibronectin is a classical ligand for α5β1 integrin with the recognition motif rgd. as shown, identification of the integrin ligands occurs through small recognition amino acid sequences (mostly tripeptides). thus, small cyclic peptides possessing a recognition motif in the appropriate three-dimensional conformation are able to interfere with the integrin-ligand interactions and act as inhibitors. the aim of this investigation is the characterisation of small cyclic peptides containing the rgd motif and the determination of the selectivity and specificity of these inhibitors. two new pentapeptides with 3-amino-cyclopropane-1,2-dicarboxylic acid monomethyl ester ((+/-)acc) were synthesized and tested. peptides were characterized in biological assays with living cells (k562 and wm115) and in surface plasmon resonance binding studies. experiments have shown that cyclic peptide cyclo-(arg-gly-asp-(+)acc-val) is a very potent inhibitor (ic50-value in nm range) of interaction between vitronectin and αvβ3 or αvβ5 integrins. when preparing biotin-labelled peptides as ligands for avidin-based assays, it is chemically most expedient to locate the biotin label on the n-terminal group of the peptide. this is done without any regard to how this may affect peptide-target interactions, biotin-avidin binding, and the solubility properties of the resultant peptide. in many instances, the products are poorly soluble, have little biological activity, and poor affinity for avidin. 
problems can also arise during the synthesis of such n-terminally biotinylated peptides due to the poor solubility and reactivity of many of the reagents used for biotin introduction. to overcome these limitations, we have developed an extremely simple method for synthesising peptides c-terminally labelled with biotin. peptides can now be easily prepared by standard solid phase techniques, either n- or c-terminally labelled, and screened to determine the optimum presentation for the biotin. in case studies using protein-protein interaction and kinase assays, peptides c-terminally labelled with biotin gave better sensitivity. y. yang 1 , j. eble 2 , n. sewald 1 many bacterial pathogens bind and enter eukaryotic cells to establish infection. invasin is an outer membrane protein required for efficient uptake of yersinia into m cells. invasin mediates its entry into eukaryotic cells by binding to members of the β1 integrin family that lack insertion domains (i domains), such as α3β1, α4β1, α5β1, α6β1, and αvβ1. this type of peptide-protein interaction is an ideal subject for the rational design of inhibitors. the integrin binding motif consists of one loop region with a conserved asp911 residue and two synergistic regions. the aim of this project is to synthesize cyclic peptides based on the invasin binding epitope sdms. this sequence has to be positioned in a β-turn with asp in the i+1 position for optimal activity of the peptide. the arg883 and asp811 residues, which are about 27.29 å and 31.54 å, respectively, away from the asp911 residue of the sdms loop in invasin, should also be investigated. peptides that mimic these recognition sites have been synthesized and tested as ligands for the integrin peptide-dna cross-linking is a very powerful tool for studying peptide-dna complexes. it transforms non-covalent complexes into covalent complexes, which renders characterization of the adduct by classical techniques (mass spectroscopy, nmr, …) much easier.
the aim of our research is to develop a new method for peptide-dna cross-linking involving the incorporation of a furan moiety. the strategy is inspired by the naturally occurring process of oxidative furan ring opening by cytochrome p450. the resulting cis-butene-1,4-dial has been shown to react with amino and sulfhydryl groups of macromolecules such as proteins and dna. in our research, dna binding peptides are modified with a furan moiety and then chemically oxidized into a reactive enal. this enal can react with dna to form a covalent peptide-dna complex. previous attempts to selectively oxidize furan-modified minor groove binding peptides consisting of n-methylpyrrole building blocks failed. we are now applying the same strategy to major groove binding peptides consisting of natural amino acids. currently the oxidation conditions are being optimized so that the furan moiety undergoes selective oxidation. these optimized conditions will be applied to known dna binding peptides in order to obtain a peptide-dna cross-link. we coupled an octanoyl or palmitoyl group to the n-terminus of an analogue of the sv40 nuclear localization signal (nls) peptide, sv126-133(ser128), to investigate the effect of fatty acid chain length on the conformation of the lipopeptide-antisense oligodeoxynucleotide (odn) complexes and to establish the optimal peptide/odn molar ratio (rm) for the effective delivery of odn into the cells. the odns used in this study were targeted towards either the green fluorescent protein (gfp) mrna or the junction sequence between the ews and fli1 genes. the conformational change of odn at different rm values was followed by circular dichroism (cd), attenuated total reflection-fourier transform infrared (atr-ftir) spectroscopy and atomic force microscopy (afm). the sv40 peptide-mediated odn transfer into nih/3t3 cells was studied by epifluorescence microscopy.
the interaction between the hiv-1 regulatory protein rev and the rev responsive element (rre) of hiv-1 mrna has emerged in the last decade as an important target in antiviral therapy. the rev-rre interaction is essential for the replication of hiv. the rev protein binds to the rre site located in the env coding region of the full-length viral mrna and facilitates the export of the rna from the nucleus, while protecting it from the cell's splicing machinery. in the published nmr structure of the rre/rev-derived peptide complex, an α-helical segment of the rev binding domain recognises a specific region of rre. an approach is described to design a new class of β-hairpin peptidomimetic ligands for hiv-1 rev protein, which inhibit its binding to the rre rna. a model β-hairpin peptide served as a scaffold to pre-organise side chains into a geometry similar to that seen in a helical peptide. a library of peptidomimetics was prepared by grafting sequences related to the rna recognition element in rev onto a hairpin-inducing d-pro-l-pro template. the electrophoretic mobility shift assay (emsa) revealed that all of the designed peptidomimetics bind to rre and the best examples show affinities (kd) in the nanomolar range. these new ligands show a novel approach to designing rev peptidomimetics, represent interesting leads for the development of more potent hiv rre/rev inhibitors and permit more detailed studies of the mechanism of binding to rna. a. napiorkowska, a. sawula, m. olkowicz, p. mucha, p. rekowski tat (trans-activator of transcription) is the protein which controls the early phase of the hiv-1 replication cycle. it is a potent viral trans-activator containing from 86 to 101 amino acid residues which binds to tar rna. the fundamental role of tat is promoting effective elongation of viral mrna (vmrna).
binding of tat to tar is mediated by a 9-amino-acid, highly basic arg49-lys-lys-arg52-arg-gln-arg-arg-arg57 sequence of the arm (arginine rich motif); the key role in these interactions is played by arg52. the goal of our research was to investigate the interaction of 27-nucleotide tar rna with synthesized tat peptide analogues using capillary electrophoresis (ce), a powerful analytical technique of biochemical studies. changes in electrophoretic mobility of the tar peak are employed for monitoring tar-tat complex formation. ce experiments were performed using lpa-coated capillary and a physical gel containing buffer. native arm fragment tat(49-57)nh2, its analogues ac-tat(49-57)nh2 and ac-[lys52]tat(49-57)nh2 and analogues substituted in position 52 with alanine-, homoalanine and lysine-derived amino acids containing nucleobases (adenine, guanine, cytosine, uracil, thymine) and nucleosides (adenosine, guanosine, uridine and cytidine) in the side chain were studied. specific interactions and complex formation were observed for both the native arm peptide fragment and selected tat analogues. the research is aimed at improving our understanding of the molecular mechanism of peptide-nucleic acid interaction, as well as evaluating the usefulness of selected nucleobase-containing amino acids as point probes for investigating peptide-rna interactions. interactions between proteins and dna are important to all living organisms. the goal is to investigate the molecular recognition between dna and the transcription factor phob of e. coli on the single molecule level and to identify amino acids required for dna binding. phob is composed of a transactivation domain (amino acids 1-127) and a dna binding domain (amino acids 123-229) that binds to specific dna sequences (pho boxes) containing a tgtca sequence. [1] chemical synthesis of peptide epitopes present in the dna binding domain of phob and isolation of the whole dna binding domain of phob was performed. 
the protein was purified using intein mediated protein purification. an additional cysteine residue was ligated to the protein using intein mediated ligation reactions and will be used for immobilization and labeling. in single molecule force spectroscopy (afm) experiments it has been shown that both a peptide with a native phob-sequence and the recombinant protein bind to dna. competition experiments were performed to prove specific dna binding. [2] mutated peptides and proteins where strategic amino acids were replaced by alanine have also been examined to reveal the contributions of single residues to molecular recognition. the binding contribution of the proteins is determined by surface plasmon resonance, electrophoretic mobility shift assays and fluorescence correlation spectroscopy. we investigated the biophysical characteristics and the pore formation dynamics of naturally occurring and synthetic peptides forming membrane-spanning channels by using isolated rod outer segments (os) of reptilia and amphibia recorded in the whole-cell configuration. once the two os endogenous conductances were blocked (the cgmp channels by light and the retinal exchanger by removing one of the transported ion species, i.e. k+, na+ or ca2+, from both sides of the membrane), the os membrane resistance (rm) could be >5 gω. therefore, any exogenous current can be studied down to the single channel level. macroscopic currents with an amplitude of ~300 pa were recorded in symmetric k+ or na+ (>100 mm) and ca2+ (1 mm) from the commercially available alamethicin mixture, the synthetic alamethicin f50/5 (a major component of the natural mixture), and selected analogues applied at 1 µm concentration at -20 mv. upon application and removal of the peptide, the current activates and deactivates, respectively, with a time constant of about 160 ms.
the synthetic analogues [glu(ome)7,18,19] and [glu(ome)18,19] produce a current of about 100 pa at 1 µm concentration, and, like alamethicin f50/5 itself, they show strong activation by hyperpolarization. clear single channel events were observed when the concentration of any of the alamethicin peptides was reduced to <250 nm.
these results indicate that the three gln residues at positions 7, 18 and 19 of alamethicin f50/5 are not a key factor for pore formation and its conduction properties. in general, pore assembly and disassembly are very fast and cooperative events. the translocation mechanism of penetratin (rqikiwfqnrrmkwkk) is not clear, but the involvement of the cell membrane has been proposed. recent studies with phospholipid model membranes have shown that penetratin interacts only with negatively charged liposomes. we aimed to analyse the effect of penetratin on liposomes composed of different phospholipids (dppc/dppg 2:8-8:2) by fluorescence spectroscopy. in the first set of experiments, liposomes labelled with fluorescent markers (dph, ans and tma-dph) were incubated with penetratin and the fluorescence polarisation was determined as a function of temperature. in the range of 15-200 mol/mol phospholipid/penetratin ratio, no change in the transition temperature was observed, indicating that penetratin has no influence on the membrane structure. next, we analysed the interactions between phospholipids and penetratin through changes in the intrinsic fluorescence of the peptide, which arises from the two w residues in its sequence. comparing the emission spectra of penetratin in aqueous media and in the presence of vesicles, one can clearly see a blue shift. this indicates that the tryptophan residues are mainly exposed to a hydrophobic environment. analysis of the main band for penetratin in aqueous media shows low polarization values, suggesting free motion of the peptide chain. in contrast, the polarization measured for penetratin mixed with liposomes is higher. this indicates that hydrophobic residues, like trp, are inserted into the bilayer and their motion is restricted. these data suggest the presence of an interaction that is sensed strongly by the trp fluorescence properties.
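the polarization values compared in the penetratin abstract above follow from the parallel and perpendicular emission intensities. the sketch below is a minimal illustration of the standard definitions of polarization and anisotropy; it is not the authors' analysis, and the function names, the optional g-factor argument and the example intensities are assumptions introduced here for illustration only.

```python
def polarization(i_parallel, i_perpendicular, g_factor=1.0):
    """Fluorescence polarization P = (I_par - G*I_perp) / (I_par + G*I_perp).

    g_factor is the instrument correction factor (assumed 1.0 by default).
    """
    corrected = g_factor * i_perpendicular
    total = i_parallel + corrected
    if total <= 0:
        raise ValueError("total intensity must be positive")
    return (i_parallel - corrected) / total


def anisotropy(i_parallel, i_perpendicular, g_factor=1.0):
    """Fluorescence anisotropy r = (I_par - G*I_perp) / (I_par + 2*G*I_perp)."""
    corrected = g_factor * i_perpendicular
    total = i_parallel + 2.0 * corrected
    if total <= 0:
        raise ValueError("total intensity must be positive")
    return (i_parallel - corrected) / total


if __name__ == "__main__":
    # Hypothetical intensities (arbitrary units), not data from the abstract:
    # a freely moving peptide depolarizes the emission more than a
    # membrane-inserted one, so its polarization is lower.
    p_free = polarization(1.10, 1.00)
    p_bound = polarization(1.60, 1.00)
    print(f"free peptide:  P = {p_free:.3f}")
    print(f"with vesicles: P = {p_bound:.3f}")
```

a higher value for the vesicle-bound sample than for the free peptide is the pattern the abstract interprets as restricted motion of the trp residues.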
cyclopeptide antibiotic gramicidin s (gs) has antimicrobial activity against gram-positive and gram-negative bacteria and some fungi. however, the non-specific action of gs and its high lytic potential limit its therapeutic application. we attempted to elucidate how the gs molecule could be modified to lose its haemolytic side effects. the gs molecule interacts directly with membrane phospholipids through electrostatic and hydrophobic interactions. naturally, changes in the state of a lipid bilayer cause changes in the binding of the gs molecule to the bilayer. we studied the effect of gs on human blood platelets and the effect of the platelet membrane state on the gs-induced disaggregation of cells using turbidimetric and microscopy techniques. we modified the membrane state by temperature, osmotic stress, ionizing irradiation and lipid oxidation. depending on its concentration, gs causes platelet shape change and activation. when added to platelets previously aggregated in response to physiological agonists (thrombin, epinephrine, adp), gs causes the cell aggregates to break up. the rate and extent of platelet disaggregation under the effect of gs depend non-monotonically on temperature (range of 5-40°c) and irradiation dose (up to 200 gy). the parameters of the gs interaction with membranes are determined by the mobility of membrane lipids. factors modifying the lipid bilayer change the degree and the speed of the gs interaction with platelet membranes. the results obtained allow gs to be used for testing the state of membrane lipids and, on the other hand, suggest ways of modifying the gs molecule so that it is tolerated by blood cells. g. bai 1 , p. gomes 1 , r. seixas 1 , m. hicks 2 , m. prieto 3 , m.
bastos 1 eukaryotic antibiotic peptides (eaps) have been widely studied for the past years as an alternative to conventional antibiotics due to emergence of multi-resistant microbial strains, and significant efforts targeting increasingly potent and specific antimicrobial peptides are being made. one interesting approach in peptide antibiotics is based on hybrid sequences derived from natural eaps, with ca ( resistance to conventional antibiotics has stimulated a search for alternative therapeutics for microbial infections, a possible source that has gained much interest in recent years are antimicrobial peptides. antimicrobial peptides target the cell membrane directly, which is a key feature as evolution has shown bacteria have had difficulty in altering their membrane composition and organization to mount a suitable defence against these peptides. a common theory is that peptides that bind strongly exhibit high biological activity, but our real-time quantitative binding studies via surface plasmon resonance (spr) have shown that this correlation does not always hold. as more information on the molecular details of membrane disruption is required, we have used atomic force microscopy (afm) to visualise peptide insertion and changes in membrane morphology by a range of antimicrobial peptides in situ. interaction studies were performed with a series of phospholipid mixtures that mimic either mammalian cells (high in phosphatidylcholine and cholesterol) or microbial cells (high in phosphatidylethanolamine and phosphatidylglycerol). the present study may assist in the design of new specific antimicrobial peptides with high antimicrobial activity and low host toxicity. proportions of popc and popg as models. 
very high molar ratio partition constants ((18.9 ± 1.3) × 10^3 and (43.5 ± 8.7) × 10^3) were obtained for the bacterial models (popg:popc 4:1 and 2:1, respectively), these being about one order of magnitude greater than the partition constants obtained for the less anionic mammalian model systems ((3.7 ± 0.4) × 10^3 for the 100% popc system). at low lipid:peptide ratios there were significant deviations from the usual hyperbolic-like partition behavior of peptide vesicle titration curves, especially in the case of the most anionic systems. membrane saturation was shown to be related to such observations and mathematical models were derived to further characterize the peptide-lipid interaction under these conditions. the calculated peptide-to-lipid saturation proportions, together with the determined partition constants, suggest that the minimal inhibitory concentrations of omiganan pentahydrochloride could represent the conditions required for bacterial membrane saturation to occur. the hemolytic pore-forming toxin sticholysin ii (stii) produced by the sea anemone stichodactyla helianthus belongs to the actinoporin protein family. the n-terminal domain of these proteins is required for interaction with membranes. to investigate the role of stii's n-terminal domain in membrane binding and in the molecular mechanism of hemolysis, peptides corresponding to residues 1 to 35, or shorter fragments from this region, were synthesized. in some peptides leu was replaced by trp. all peptides exhibited hemolytic activity, albeit to a lesser extent than the whole protein. moreover, peptides lacking the 1-14 hydrophobic stretch were less active. the longer peptides were also able to permeabilize phospholipid vesicles. conformational studies were performed in aqueous solution and in membrane-mimicking environments.
cd spectra showed that, while the shorter, more hydrophilic peptides displayed a random conformation, the longer peptides underwent aggregation with increasing concentration, ph, and ionic strength. in the presence of trifluoroethanol and upon binding to detergent micelles and phospholipid bilayers, the peptides showed a propensity to acquire an α-helical conformation, as expected for the sequence comprising residues 14 to 26. fluorescence spectra demonstrated that the first residues of stii's n-terminus penetrate more deeply into the bilayer, whereas residues 14-26 are located more superficially. this is in agreement with the predicted amphipathic nature of the helix formed by these residues and corroborates the existing hypotheses for the role of the n-terminal domain in the process of membrane insertion and pore formation. among the great number of antibacterial peptides, the group of trp-rich peptides is of special interest. considering that tryptophan is not a frequently occurring amino acid in most proteins, the biological meaning of the high tryptophan content of these antimicrobial peptides is particularly interesting. in the present study we investigated the antimicrobial and hemolytic activities of selected trp-rich peptides and their action on the microbial membrane: ilpwkwpwwpwrr-nh2, indolicidin (i); pitwpwkwwkgg-nh2, 3b3 (ii); plswffprtwgkr-nh2, gsp-1a (iii); fpvtwrwwkwwkg-nh2, puroindoline (iv); vrrfpwwwpflrr-nh2, tritrypticin (v). sunflower trypsin inhibitor sfti-1 is the smallest and the most potent known peptidic trypsin inhibitor from the bowman-birk class of proteins [1]. this head-to-tail-cyclized 14-amino-acid peptide contains one disulfide bridge and a lysine residue (lys5) present in the p1 position, which is responsible for inhibitor specificity. as was reported by us and other groups, sfti-1 analogues with only one cycle retain trypsin inhibitory activity.
very recently we have shown [2] that introduction of n-substituted glycine residues mimicking lys and phe (denoted as nlys and nphe) in the p1 position of monocyclic sfti-1 with a disulfide bridge yielded potent trypsin and chymotrypsin inhibitors, respectively. this novel class of proteinase inhibitors contains a completely proteolytically resistant p1-p1' reactive site. in the present communication we report the chemical synthesis and determination of trypsin and chymotrypsin inhibitory activity of a series of ten sfti-1 analogues modified in the p1 position by these peptoid monomers (nlys and nphe). each of the synthesized peptomeric (peptide-peptoid hybrid polymer) sfti-1 analogues contains one of the following cycles: head-to-tail, or a disulfide bridge formed by cys, by pen or by cys/pen residues. the impact of the different cycles introduced into the sfti-1 structure on proteinase inhibitory activity will be discussed. s-protein contains a proteolytic processing site and two interacting heptad repeat regions denoted as hr-n and hr-c. following processing of s-protein mediated by host cellular protease/s, the c-terminal s2-fragment fuses with host cell membranes via its hr-n and hr-c domains, which form a coiled-coil 6-helix bundle (trimer of dimers) crucial for its receptor-mediated viral fusion. our objective in this work is to study the proteolytic site using model peptides and also to examine the interaction of hr-n and hr-c domains using fluorescence microscopy and other techniques. thus we synthesized an intramolecularly quenched fluorogenic peptide containing the proposed cleavage site [abz-eqdrntr761 evfatyx, abz=2-amino benzoic acid and tyx=3-nitro tyrosine] and showed by kinetic measurements that this cleavage is mediated most efficiently by furin, followed by pc5 and pc7. other potential substrates were also tested and compared. the above cleavage can be blocked by specific pc-inhibitors in a dose-dependent manner.
in addition using fluorescent-labeled peptides derived from hr-n and hr-c domains, circular dichroism spectra and surface assisted laser desorption mass spectral interest has grown to develop specific and potent inhibitors of this enzyme. our objectives in this study are to generate soluble recombinant human (h)ski-1 enzyme, design potent inhibitors and study its 3d-model structure. we have successfully expressed hski-1 enzyme lacking its transmembrane domain in hek-293 cells and purified the enzyme via chromatography. in addition we developed ski-1 inhibitors by using pseudo- and multi-branch peptide approaches. in the first approach we inserted the dipeptide isosteres amino oxy acetic acid (aoaa) or 8-amino 3, 6 dioxa octanoic acid (adoa) at the scissile p1-p1' position (r175 l) of hski-1. a typical example is 167gryssrrl(adoa)aip179. other dipeptide isosteres were also incorporated at the cleavage site of either the ski-1 prodomain or the lassa virus glycoprotein. in the second approach we prepared 2- and 4-branch peptides containing the hski-1(128-137) segment. these peptides inhibit ski-1 competitively, with ic50 values ranging from low µm to high nm. circular dichroism spectra indicated strong interactions of the inhibitors with ski-1, consistent with the observed inhibition profile. a 3d-model structure of catalytic domain of ski-1 indicated a broad catalytic pocket cysteine proteases are of great importance in biochemical processes and these enzymes are used in biotechnology, food industry and agriculture. in this connection the synthesis of highly selective and highly specific substrates for cysteine proteases is of importance. enzymatic synthesis of peptides is a good tool for obtaining different biologically active peptides. the serine proteases subtilisin carlsberg and α-chymotrypsin, immobilized on poly(vinyl alcohol) cryogel (pva-cryogel), proved to be convenient biocatalysts for this kind of synthesis.
the highly specific chromogenic substrate for the cysteine protease assay, glp-phe-ala-pna, was obtained in high yield (up to 88% in 24 h) using subtilisin and chymotrypsin immobilized on pva-cryogel. the reaction was carried out according to the following scheme: glp denotes the residue of pyroglutamic acid, pna denotes p-nitroanilide. the influence of the initial concentrations of components, the reaction mixture composition, the biocatalyst content and time on product yield was studied. it was shown that the optimal conditions are: dimethylformamide-acetonitrile mixture 20:80 (v/v), initial concentrations of 85 mm, and an enzyme-to-substrate ratio of 1:3900. this approach was used in order to synthesize analogous substrates containing different fluorogenic and chromogenic groups as well as other amino acids in the p1 position. the obtained substrates were tested in the papain assay. peptidyl-α-ketoaldehydes 3 represent attractive lead compounds and intermediates in the development of potent protease inhibitors due to their structural similarity with peptide aldehydes, previously known to be excellent inhibitors of serine and cysteine proteases. recently, we demonstrated the application of polymer cyano methylene- and carboxylato methylene phosphoranes in the assembly of α-hydroxy-β-amino esters (norstatines), α,β-diketoesters, and α,β-unsaturated ketones. [1, 2, 3] we now present a further development of our reagent linker 2 approach employing peptidyl-α-ketoaldehydes 3 and diamino propanols 4. carboxylato methylene phosphoranes 1 derived from bromo acetic esters, which are readily acylated without racemization, play the key role in our synthetic concept. herein we show the oxidative cleavage to peptidyl-α-ketoaldehydes 3 using dimethyldioxirane (dmd) in acetone as oxidant, after saponification and decarboxylation on the solid support. diamino propanols 4 were furnished via the reductive amination of resin-bound peptides.
over the past few years nuclear magnetic resonance has emerged as a powerful means for lead molecule identification and optimization. on the other hand, 19f nmr has been used successfully in several structural studies, protein folding studies and for the identification of active compounds, using a methodology very similar to the one used in the present work. the methodology requires the labeling of the substrate with a cf3 moiety. the enzymatic reaction is performed with the cf3 substrate and quenched using an enzyme inhibitor. 19f nmr is then used to monitor the evolution of both substrate and product. only two peaks are observed, corresponding to the starting substrate and the cleaved substrate. this nmr method has some advantages: fluorine nmr is very sensitive, 0.83 times as sensitive as proton nmr. there is no spectral interference from protonated solvents, buffers or detergents typically present in the enzymatic reactions. the 19f isotropic chemical shift is extremely sensitive to small structural perturbations, resulting in different chemical shifts for the signals of the substrate and product. isotopic labeling of the protein is not required. as models, caspase-8, which plays a critical role in the initiation of the apoptosis process (5), and hiv-1 protease were chosen. two different kinds of libraries were screened: one based on natural products from plant and animal extracts used in traditional chinese medicine, and a second, synthetic library with two sublibraries of 160 and 144 compounds. with this methodology it has been possible to identify some compounds with very promising inhibitory properties. background and aims: human kallikrein 2 (hk2), a prostate specific serine protease, regulates the activity of several factors that may participate in proteolytic cascades promoting tumor growth and metastasis. thus, inhibition of its enzymatic activity is a potential way of preventing growth and metastasis of prostate cancer.
moreover, specific ligands for hk2 have potential use for targeting and in vivo imaging of prostate cancer and for development of novel assays. methods: to find peptide ligands we panned several phage display peptide libraries against active recombinant hk2 captured by a monoclonal antibody exposing the active site of the enzyme. alanine scanning and amino acid deletion analyses were performed to elucidate the motifs required for hk2 inhibition. results: from libraries expressing 10 and 11 amino acid long linear peptides we isolated six different hk2-binding peptides. three of these peptides are specific inhibitors of the enzymatic activity of hk2. amino acid substitution and deletion studies indicated that motifs of 6 amino acids are necessary for the inhibitory activity. conclusions: we have developed specific hk2 inhibitors by phage display technology. these novel hk2 specific peptides are potentially useful for treatment and targeting of prostate cancer. peptidylarginine deiminase iv (padiv) catalyzes the citrullination of arg residues in various peptides and proteins, such as histone, resulting in the production of citrullinated proteins in granulocytes [1, 2] . the citrullination mechanism of histone subunits and its functional effects in cells are not well known yet in detail. recently, it has been reported that the protein deimination/citrullination by pad iv plays a role in rheumatoid arthritis [3] . this implicates that the citrullination of histone may be related to rheumatoid arthritis. in order to further study the citrullination mechanism of histone, we explored the citrullination sites of histone h2a and h3 by pad iv using a series of synthetic peptides. recently, hagiwara et. al. reported that pad iv only citrullinates the arg3 of histone h2a as well as the arg3 in histone h4 [4, 5] . 
in order to investigate the citrullination mechanism, the n-terminal peptides of histone h2a and h3 were chemically synthesized and their citrullination by pad iv was examined. the effect of n-terminal acetylation of the synthetic peptides on the citrullination by pad iv was also estimated. the citrullination velocity of each arg residue in the n-terminal peptides was estimated in vitro. the results indicated that pad iv recognizes specific arg residues in the synthetic peptides, and that the n-terminal acetylation of the histone peptides dramatically affects substrate recognition by pad iv. in addition, the cd spectra of the n-terminal peptides were measured to elucidate the structural specificity of recognition by pad iv. background and aims. prolyl oligopeptidase (pop) is a serine peptidase that cleaves oligopeptides after prolyl residues. it has been associated with cognitive disorders. pop inhibitors have been shown to enhance cognition in monkeys (1) and to improve performance in verbal memory tests in humans (2). in the present study, the p2 l-prolyl residue of pop inhibitors was replaced by two l-proline mimetics, the 5-t-butyl-l-prolyl group and the (r)-cyclopent-2-enecarbonyl group. the effects of the mimetics on in vitro potency, lipophilicity and binding kinetics were studied. methods. the l-proline mimetics were synthesized according to the published procedures (3, 4) with minor modifications. the ic50 and ki values and the binding kinetics were determined for porcine pop. the log p values were determined with the shake-flask method. results. the replacement of the p2 l-prolyl residue by the l-proline mimetics gave compounds which were equipotent with their parent structures. both l-proline mimetics increased lipophilicity but the effect of the 5-t-butyl-l-prolyl group was more pronounced. while the 5-t-butyl-l-prolyl group increased the dissociation half-life of the enzyme-inhibitor complex, the (r)-cyclopent-2-enecarbonyl group decreased it.
conclusions. both l-proline mimetics perfectly mimicked l-proline at the p2 position of pop inhibitors. these mimetics can be used to modify the lipophilicity and the binding kinetics of pop inhibitors. the proteasome is an essential multicatalytic protease of the ubiquitin proteasome pathway. as a prime executor of regulated proteolysis, the proteasome controls almost all aspects of cell metabolism from signal transduction to cell cycle and differentiation. pharmacological intervention into proteasome activity leads to cell apoptosis. this observation was applied to successfully treat multiple myeloma, since the cancer cells exhibit substantially higher sensitivity to competitive inhibition of the proteasome than normal cells. however, the complete shutdown of proteasome-catalyzed proteolysis leads to serious side effects resulting from the disruption of proteolytic homeostasis even in noncancerous cells. here, we show an alternative approach to control the proteasome activity using peptide-based noncompetitive regulators. cathelicidin-derived peptides rich in proline and arginine (pr) residues have been found to affect the activity of all the proteasome complexes both in vivo and in vitro, likely by binding to the face of the enzyme. the mechanism and structural constraints of the pr peptides dictating their influence on the proteasome remain elusive. our results indicate that there are three sequence-related properties of the pr peptides controlling their effectiveness as proteasome regulators: length of the peptide, distribution of a set of positive charges at the peptide n-terminus, and positioning of proline residues. far uv cd spectroscopy demonstrates that these properties also correlate with the structure of pr peptides. in particular, it seems that the structural propensity of the pr peptides to form beta-turns is required for them to bind to the proteasome as regulatory competent molecules. 
our work is focused on the search for selective, low-molecular cathepsin b peptide inhibitors acylated with the (e)-3-(benzylsulphonyl)acryloyl group (bsa). the double bond embedded in the bsa moiety is activated by two electron-withdrawing groups and may be a good target for the michael-type addition of the catalytically active -sh group. three series of peptide derivatives possessing the general structures bsa-phe-asn(r)-oh, bsa-ile-x(oh)-n(ch3)2 and bsa-x-pro-oh were synthesized in solution and characterized by enzyme kinetic studies against papain and cathepsins b and k. it should be noted that all the investigated compounds were competitive and reversible inhibitors of the enzymes examined. using 2d 1h nmr (tocsy, cosy, roesy) and 13c nmr spectroscopy along with theoretical calculations (analyse program) we determined the conformational properties of the two most potent and selective cathepsin b inhibitors. this work was supported by grant ds/8350-5-0131-6. background and aims: we have developed peptides inhibiting human kallikrein-2 (hk2) activity. as hk2 is overexpressed in prostate cancer tissue, these peptides are potentially useful for treatment and diagnosis of prostate cancer. two of the potential physiological substrates for hk2 are the proform of prostate specific antigen (propsa) and insulin-like growth factor-binding protein-3 (igfbp-3). both of these might participate in the regulation of prostate cancer growth: igfbp-3 by inhibiting igf-dependent tumor growth and psa by degrading extracellular matrix. we aimed to study whether our hk2-inhibiting peptides also inhibit hk2 activity towards natural protein substrates, i.e. activation of propsa and degradation of igfbp-3. methods: the effect of the peptides on the activation of propsa by hk2 was studied by preincubating the peptides with hk2, followed by addition of psa and specific psa substrate. 
igfbp-3 degradation was studied by two specific immunoassays, one detecting only native igfbp-3, while the other one also detected proteolytically cleaved forms of the protein. results: hk2-inhibiting peptides were found to inhibit propsa activation and igfbp-3 degradation by hk2 in a dose-dependent fashion. conclusions: we have developed new peptides inhibiting hk2 activity towards natural substrates, like propsa and igfbp-3. the peptides might be useful for treatment of prostate cancer and other diseases associated with increased hk2 activity. from the seeds of garden four-o'clock and spinach we isolated two serine proteinase inhibitors (mjti i, mirabilis jalapa trypsin inhibitor, and soti i, spinacia oleracea trypsin inhibitor), which are probably representatives of a new family of inhibitors. the purification procedures for these inhibitors included affinity chromatography on immobilized methylchymotrypsin in the presence of 5 m nacl, ion exchange chromatography and/or preparative electrophoresis and finally rp-hplc on a c18 column. their primary structures (fig. 1 ) differ from those of known trypsin inhibitors, but show significant similarity to one another, as well as to the antimicrobial peptides isolated from the seeds of mirabilis jalapa (mj-amp1, mj-amp2), mesembryanthemum crystallinum (amp1) and phytolacca americana (amp-2 and pafs-s) and from the hemolymph of acrocinus longimanus (alo-1, 2 and 3). the equilibrium association constants (ka) of mjti i and soti i with bovine β-trypsin were found to be about 10^7 to 10^9 m^-1. mjti i and soti i have been synthesized using the solid-phase method. the synthesized inhibitors and the inhibitors isolated from plants have similar properties. the disulfide bridge pattern in both inhibitors was established after digestion with thermolysin, followed by maldi-tof analysis: cys1-cys4, cys2-cys5 and cys3-cys6. s. cosgrove, l. rogers, c. hewage, j.p. 
malthouse aspartyl proteases are required for the multiplication of the aids virus and for producing the amyloid protein which causes alzheimer's disease. hiv protease inhibitors have been highly effective in treating aids patients and it is hoped that potent inhibitors of the beta secretases will also prove effective in treating alzheimer's disease. therefore inhibitors of the aspartyl proteases have great therapeutic potential. we have shown that the peptide glyoxals are potent inhibitors of the thiol protease papain and of the serine proteases subtilisin and chymotrypsin. using 13c-nmr we have been able to show that glyoxal inhibitors react reversibly with an active site nucleophile in these enzymes to form a tetrahedral adduct which is tightly bound by the enzyme. in the present work we synthesise 13c-enriched peptide glyoxals, we assess their inhibitor potency, and use 13c-nmr to examine how the inhibitors interact with the aspartyl protease pepsin. z-ala-ala-[2-13c]phe-glyoxal was synthesised from [1-13c]phenylalanine which was converted to its methyl ester. this was then coupled with z-ala-ala to give z-ala-ala-[2-13c]phe-ome which was hydrolysed to the free acid. this was converted to the diazoketone and transformed into z-ala-ala-[2-13c]phe-glyoxal using dimethyldioxirane. nmr spectra at 11.75 t were recorded with a bruker avance drx 500 standard-bore spectrometer. we show that peptide glyoxal inhibitors can be potent inhibitors of pepsin and that pepsin only binds one of the four glyoxal forms (one non-hydrated, one fully hydrated and two partially hydrated forms). alzheimer's disease (ad) is the most common cause of dementia in older people. a major factor in the pathogenesis of ad is the cerebral deposition of amyloid fibrils, consisting of amyloid β peptides (aβ), as senile plaque. the 40- to 42-amino-acid-long aβ is generated by the proteolysis of β-amyloid precursor protein (app) by β- and γ-secretases. 
since bace1, a unique member of the pepsin family of aspartyl proteases, initiates the pathogenic processing of app by cleaving at the n-terminus, it is a molecular target for therapeutic intervention in ad. proteolytic activity was found to occur, to a variable degree, in digestive organs of all studied organisms over the entire ph range. the common feature was the existence of two activity peaks, in the acid (ph 2.5 -3.5) and alkaline (ph 7.5 -8.5) zones, as well as a similar protease set containing cathepsins e and d, a trypsin-like enzyme, elastase, and collagenolytic proteases. proteolytic activity in the hepatopancreas of crab and sea star was found to be an order of magnitude higher than in the other organisms studied. high protease activity in crab hepatopancreas is an evolutionary mechanism compensating for a poor differentiation of the digestive system, low substrate specificity of enzymes, and cold environment. trypsin activity in digestive organs of invertebrates suggests that a trypsin-like enzyme is a genetically old one, an evolutionary origin of all serine proteases. a difference of kind between vertebrates and invertebrates is that the latter have cathepsin activity (absent in vertebrates) and no pepsin activity. it is of interest to develop enzyme inhibitors containing a light-activated switch that can be used to control binding and inactivation of an enzyme. several inhibitors containing the azobenzene photoswitch group have previously been developed and have shown changes in activity of around twofold on photoswitching. this study aimed to improve this switching by more extensive derivatisation of azobenzene to more closely resemble the peptide substrates of proteases. a series of peptidomimetics containing the azobenzene photoswitch group were synthesized and assayed against the protease alpha-chymotrypsin. 
these compounds contained azobenzene, linked to a known chymotrypsin inhibitory group (either a trifluoromethylketone or boronate ester), and were otherwise designed to be peptide-like. in some cases both ends of the azobenzene moiety were derivatized in order to increase the impact of photoswitching on the shape of the compound and thus its enzyme binding strength. assays showed that most compounds were reversible inhibitors of chymotrypsin, with low micromolar inhibition constants (ki or ic50). an increase in enzyme inhibition of up to fourfold was obtained on light-activated switching of the azobenzene group conformation. a number of peptidyl derivatives structurally based upon the inhibitory sites of cystatins have been synthesized. these compounds are prone to proteolytic degradation, are rapidly excreted and are poorly bioavailable. the majority of these problems might be overcome by the use of peptidomimetics with structures resembling those of previously synthesized peptidyl derivatives. among the peptidomimetics are azapeptides, in which the alpha-ch group of an amino acid residue is replaced by a nitrogen atom. the azapeptides have recently been demonstrated to be potent and selective inhibitors of cathepsins b and k. it was shown that azapeptide inhibitors bind along the active site cleft of cathepsin b in a bent conformation. this bent structure is likely to result from the mobility of the bonds in the vicinity of the inserted azaamino acid residue as well as from the interaction with the enzyme. in our present work we have studied a peptide with the sequence z-arg-leu-arg-gly-ile-val-ome, which is characterized by one major and three minor conformations. the replacement of the alpha-ch group in the gly residue of the peptide chain of z-arg-leu-arg-gly-ile-val-ome by a nitrogen atom likely results in a rigid conformation. our aim was to compare the structure of the parent peptide z-arg-leu-arg-gly-ile-val-ome with that of a selective cathepsin b inhibitor, z-arg-leu-arg-agly-ile-val-ome, by using 1h-nmr. 
severe acute respiratory syndrome corona virus associated main protease (sars cov mpro protease), alternatively known as chymotrypsin-like protease (3clpro), is a mediator of the viral infection cycle and therefore a therapeutic target. a peptide aldehyde library targeting the sars corona virus main protease (sars-cov mpro, alternatively known as 3clpro) was designed on the basis of three different reported binding modes and supported by virtual screening. a set of 25 peptide aldehydes was prepared by a newly developed methodology and investigated in an inhibition assay against sars-cov mpro [1] . protected amino acid aldehydes furnished by the racemization-free oxidation of amino alcohols with dess-martin periodinane were immobilized on threonyl resins as oxazolidines. following boc-protection of the ring nitrogen yielding the n-protected oxazolidine linker, peptide synthesis was performed efficiently on this resin, releasing deprotected products under mild hydrolysis conditions. the library was tested in a new fluorimetric enzyme assay for sars cov mpro. via immobilization of the fluorophore, 2-(7-amino-4-methyl-3-coumarinyl)-acetic acid, the substrate actsavlq-amca was prepared, surprisingly displaying a higher affinity than the native substrate. several potent inhibitors were found with ic50 values in the low micromolar range. interestingly, the most potent inhibitors seem to bind sars-cov mpro in a non-canonical binding mode. currently, the initial screen is being extended towards the discovery of small molecule inhibitors of sars corona virus main protease. a method of bromelain cleavage of the surface glycoprotein hemagglutinin (ha) from influenza a virions was initially employed for the crystallographic study of ha ectodomains [1] . the remaining spikeless subviral particles were used by us earlier for ha2 c-terminal fragment extraction and mass spectrometric (ms) investigation [2] . 
now sds-page analysis of the subviral particle preparations revealed several additional bands in the range of 9-23 kda together with the major viral proteins, compared with intact virions (figure: m1, matrix protein; f1-f5, m1 protein fragments; np, nucleoprotein). maldi-tof ms analysis of the in-gel trypsin hydrolyzates has shown that the additional bands are fragments of m1 protein. this was confirmed by n-terminal sequencing of the protein fragments electroblotted from the bands. the concentration of sh-reducing reagent in the bromelain digestion reaction influenced the intensity of the m1 fragment bands. we conclude that due to membrane destabilization during removal of the ha spikes, m1 protein localized under the viral membrane inside intact virions becomes accessible to limited proteolysis by bromelain. dipeptidyl peptidases (dpps) sequentially release dipeptides from polypeptides. among those enzymes, dppiv, fapα, dpp8, dpp9 and dppii cause the release of n-terminal dipeptides containing proline or alanine at the penultimate position. they are all members of clan sc, a group of serine proteases that contains proline-specific peptidases. dipeptidyl-peptidase iv (dppiv) is the best studied member of this group of enzymes and has become a validated target for the treatment of type 2 diabetes over the last years. the development of inhibitors for the related enzymes (i.a. dppii) has only started recently. this poster presents selected products synthesised to further elaborate the structure-activity relationship for dpp ii inhibitors with a 2,4-diaminobutyrylpiperidine basic structure. this class of compounds was described earlier by our group as the hitherto most potent and selective inhibitors of dpp ii. starting from n4-p-chlorobenzyl-substituted uamc00039, our lead compound, two types of modifications were proposed: • the synthesis of n4-(di)alkyl- and arylalkyl analogues; • the synthesis of 3-methyl analogues. 
in our previous study, we reported potent and small-sized bace1 inhibitors containing phenylnorstatine [(2r,3s)-3-amino-2-hydroxy-4-phenylbutyric acid; pns] at the p1 position as a transition-state mimic. in developing more active compounds, we focused our attempts on the p1 position, where we replaced the pns by its thioderivative. herein, we present the synthesis of a novel phenylthionorstatine [(2r,3r)-3-amino-2-hydroxy-4-(phenylthio)butyric acid; ptns] as a p1 moiety with a hydroxymethylcarbonyl (hmc) isostere, and its application to the design of bace1 inhibitors. we have synthesized ptns starting from readily available n-benzyloxycarbonyl-serine via a multistep reaction (including weinreb amide formation, thiophenyl group introduction, and transformation through a cyanohydrin derivative into the 2-hydroxy ester and then the acid). purification was done by column chromatography and rp-hplc. peptides were synthesized by the fmoc-based solid phase method and characterized by maldi-tof ms. the peptide inhibitors were subjected to an enzyme assay using recombinant human bace1 and a fluorescence-quenching substrate. bace1 inhibitory activity was determined based on the percentage decrease in substrate cleaved by the enzyme. we have synthesized ptns, and the (2r,3r)-enantiomer was then applied to spps (solid phase peptide synthesis). we synthesized octa- and pentapeptide-type inhibitors of bace1 containing pns or ptns at the p1 position. these compounds were enzymatically tested and showed high bace1 inhibitory activity. a novel derivative of pns, ptns, was synthesized and evaluated in comparison to the corresponding pns. the inhibitors with ptns exhibited a slightly higher inhibitory activity against bace1 compared with those with pns. this study suggests possibilities for the application of ptns to the design of other aspartyl protease inhibitors. 
the αvβ3 integrin receptor plays an important role in human metastasis and tumor-induced angiogenesis, mainly by interacting with matrix proteins through recognition of an arg-gly-asp (rgd) motif. inhibition of the αvβ3 integrins with a cyclic rgd peptide impairs angiogenesis, growth and metastasis of solid tumours in vivo. the aim of this study was to investigate the effects of replacement of amino acids by aza-β3-amino acid analogs in cyclic rgd peptides as αvβ3 integrin antagonists on angiogenesis, microcirculation, growth and metastasis formation of a solid tumour in vivo. the selectivity profile of these antiadhesive cyclopeptides is rationalized by a special presentation of the pharmacophoric groups. the rgd motif resides in positions i to i+2 of a regular γ-turn. we synthesized linear and cyclic aza-β3 rgd peptides with the purpose of examining the effect on the conformation and the activity. are aza-β3 amino acids γ-turn mimetics? the preferred conformations were determined by nmr. prostaglandins are involved in a large number of biological activities mediated by their g-protein coupled receptors (gpcrs). prostaglandin pgf2 alpha receptors are found specifically in uterine muscle, where they initiate parturition and labor. the pgf2 alpha receptor plays a key role in preterm labor, for which medical and social costs are estimated at $9 billion per year in the usa (the highest per patient cost of any disorder). peptide mimics that serve as allosteric antagonists of the pgf2 alpha receptor have been developed in our laboratory (1, 2) . the importance of the turn geometry of the central residue in these peptide mimics has been investigated using enantiomeric indolizidin-2-one beta-turn mimics which can respectively induce type ii and ii' geometry. our presentation will discuss the synthesis and biology of these novel allosteric modulators of prostaglandin pgf2 alpha receptor activity. 
it was shown that luteinising hormone-releasing hormone (lhrh) receptors are overexpressed in most adenocarcinoma cells, in contrast to their low content in normal tissues. these data create the basis for the application of lhrh analogues in the therapy of breast, ovary, prostate, lung, intestine, liver and kidney cancers. the utility of both agonists and antagonists for the targeting of a cytotoxic moiety to the tumor cells is well documented. however, the number of lhrh analogues possessing their own cytotoxic activity is still very limited. we nicotianamine (na), which was first isolated from the leaves of nicotiana tabacum l [1] , is known as a key biosynthetic precursor of phytosiderophores. various studies have proved that nicotianamine plays a significant role in plants as an iron, nickel, zinc ... transporter [2] . the aim of our study was to synthesize unnatural analogues of na via peptide intermediates, in order to investigate the mechanisms of metal transport and accumulation within the plant. we found that the strategy developed for na synthesis could not be applied when the azetidine ring was replaced by a pyrrolidine ring, and we investigated a new route to synthesize such an analogue. these synthetic pathways will be discussed. the primary physiological roles of arginine vasopressin (avp), [cycle1-6 (h-cys1-tyr2-phe3-gln4-asn5-cys6-pro7-arg8-gly9-nh2)], involve vasoconstriction of vascular smooth muscles, via the v1a receptor, and antidiuretic action in the kidney (blood osmolality regulation) via the v2 receptor. binding of avp to the v1a receptor subtype also stimulates glycogenolysis in the liver and promotes platelet aggregation. in addition, activation of the v1b (also known as v3) receptor causes adrenocorticotropic hormone release from the anterior pituitary. v1b receptors are also present in the brain where avp functions as a neurotransmitter. 
in recent years, many proteins and peptides of different molecular weights and with well-established anticoagulant activity have been isolated from the salivary glands of several bloodsucking animals such as ticks, leeches and vampire bats. many of the strongest anticoagulants isolated from bloodsucking animals are found in the extract of the salivary glands of different kinds of leeches. one such leech is haementeria officinalis, from which the most active inhibitor of factor xa, ats, was isolated. in order to study the role of some amino acids in the interaction between peptide mimetics and the active site of serine proteinases, some fragment analogues of the active site of ats were synthesized by replacing some amino acids with others of similar structure or with unnatural amino acids. in the present work the synthesis of the newly synthesized peptides, their anticoagulant activity according to the aptt and ic50, and the structure-activity relationship will be discussed. rational design of peptides is a challenge which would benefit from a better knowledge of the rules governing their sequence-structure-function relationships. peptide structures can be approached by spectroscopy and nmr techniques but data from these approaches too frequently diverge. structures can also be calculated in silico from primary sequence information using three algorithms: pepstr, robetta and peplook. the most recent algorithm, peplook, introduces indexes for evaluating structural polymorphism and stability. the method uses a de novo search of energy minima by an iterative boltzmann-stochastic procedure, using a combination of 64 phi/psi couples derived from the structural alphabet for protein structures proposed by etchebest et al. for peptides with converging experimental data, calculated structures from peplook and, to a lesser extent, from pepstr are close to nmr models. the peplook index for polymorphism is low and the index for stability points out possible binding sites. 
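the iterative boltzmann-stochastic search over 64 phi/psi couples mentioned above can be illustrated with a toy sketch. it is not the peplook implementation: the energy function, the probability update rule, the damping factor and all names (toy_energy, boltzmann_stochastic_search, N_STATES) are assumptions introduced here for illustration only. each residue picks one of 64 discrete states, conformations are sampled at random, and the per-residue probabilities are re-weighted by the boltzmann factor of the sampled energies.

```python
import math
import random

N_STATES = 64  # assumed: one index per phi/psi couple; real angle values are not modelled


def toy_energy(conf, target):
    """Hypothetical stand-in for a force field: number of residues off target."""
    return sum(1 for s, t in zip(conf, target) if s != t)


def boltzmann_stochastic_search(n_residues, energy, n_iter=30, n_samples=300,
                                kT=1.0, damping=0.5, seed=0):
    """Iteratively sample conformations and re-weight per-residue state probabilities."""
    rng = random.Random(seed)
    states = range(N_STATES)
    # Start from a uniform probability for every state at every residue.
    probs = [[1.0 / N_STATES] * N_STATES for _ in range(n_residues)]
    best_conf, best_e = None, float("inf")
    for _ in range(n_iter):
        samples = []
        for _ in range(n_samples):
            conf = [rng.choices(states, weights=probs[i])[0] for i in range(n_residues)]
            e = energy(conf)
            samples.append((conf, e))
            if e < best_e:
                best_conf, best_e = conf, e
        # Boltzmann weights relative to the best energy of this round.
        e_min = min(e for _, e in samples)
        acc = [[0.0] * N_STATES for _ in range(n_residues)]
        for conf, e in samples:
            w = math.exp(-(e - e_min) / kT)
            for i, s in enumerate(conf):
                acc[i][s] += w
        # Damped update keeps rarely sampled states from vanishing at once.
        for i in range(n_residues):
            total = sum(acc[i])
            probs[i] = [(1 - damping) * p + damping * a / total
                        for p, a in zip(probs[i], acc[i])]
    return best_conf, best_e, probs


target = [3, 10, 42, 7, 63, 0]  # arbitrary toy minimum, not real structural data
conf, e, probs = boltzmann_stochastic_search(len(target), lambda c: toy_energy(c, target))
print(conf, e)
```

in this sketch the spread of the final per-residue probabilities could loosely play the role of a polymorphism indicator, but that reading is an analogy and not a description of the published indexes.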
for peptides with divergent experimental data, calculated and nmr structures can be either similar or different. these differences are apparently due to polymorphism and to different conditions of structure assays and calculations. the peplook index for polymorphism maps the fragments encoding disorder and the mean force potential score indicates which residues will be most available for interactions with partners. this should provide new means for the rational design of peptides. several diseases like cancer metastasis, rheumatoid arthritis and chronic lymphocytic b-cell leukemia are linked to the interaction of the cxcr4 chemokine receptor with its natural ligand, the 68 amino acid protein stromal cell-derived factor-1α (sdf-1α). [1] one strategy for the treatment of these diseases could be to block the interaction between cxcr4 and sdf-1α with small cxcr4 antagonists. furthermore, radiolabeling of suitable compounds with appropriate radioisotopes could provide agents for imaging of cxcr4 expression in vivo via pet. previous studies by fujii et al. on cxcr4 antagonists led to a high affinity cyclic pentapeptide with the sequence cyclo[gly-d-tyr-arg-arg-nal]. [2] to further improve this structure, different approaches have been chosen with respect to metabolic stability, bioavailability, conformational rigidity and chemical versatility for radiolabeling. first, an n-methyl scan of the backbone amides was performed to influence conformational freedom and to increase metabolic stability and bioavailability. n-methylation of arginine residues yielded peptides with moderate affinity (ic50 values: 23 nm for (n-me)arg3 and 31 nm for (n-me)arg4, respectively) whereas n-methylation of other amino acids significantly decreased the affinity (ic50 > 100 nm). upon substitution of arg3 by ornithine, the affinity was mostly retained. [3] the amino group of orn can be alkylated or acylated with radiolabeled groups containing short-lived isotopes. 
moreover, the bioavailability should be improved as the high basicity of the two guanidino groups could be reduced. the first ornithine-acylated derivatives showed ic50 values between 11 and 35 nm, enabling for the first time 18f-radiolabeling of small cxcr4 antagonists for pet imaging in vivo. binding of ligands to integrins plays a major role in cell adhesion, migration, and signal transduction of cells. these interactions are important not only for normal cell functions, but also in pathogenic processes. the αvβ3 integrin, for example, is involved in tumor cell adhesion and osteoporosis. the association of ligands is specific and requires minimal recognition sequences. therefore, suppression of integrin activity using competitive inhibitors bears great pharmacological potential. the tripeptide sequence rgd is a prominent recognition sequence of integrin ligands. two new cyclic pentapeptides were synthesized containing the tripeptide sequence rgd as well as 3-amino-cyclopropane-1,2-dicarboxylic acid monomethyl ester (acc) and valine, varying only with respect to the stereochemistry of acc. both the (+) (all r) and (-) (all s) isomers of acc were incorporated. acc is a cyclic β-amino acid as well as a cyclopropyl analogue of aspartic acid. biological tests with cell lines expressing mainly αvβ3 and αvβ5 integrin show a higher inhibitory activity of cyclo-(-arg-gly-asp-(+)acc-val-). in order to derive a structure-activity relationship for these two isomers, solution structures in dmso-d6 were investigated by nmr spectroscopy. subsequently, structural information was obtained by applying distance restraints derived from the nmr spectra in distance geometry/simulated annealing and molecular dynamics calculations. due to the rigidity of the cyclopropyl unit in acc, the structure of the cyclopeptide is significantly influenced by the integrated cyclopropane ring, thus explaining the different biological properties. integrins are an important family of cell adhesion molecules. 
currently, 24 members are known. among other functions, integrin α4β1 is involved in inflammatory processes, leukocyte migration and tumor angiogenesis. the structure of its natural ligand vcam-1, including the binding loop sequence tqidspln, has been determined by x-ray crystallography. therefore, it is possible to apply the concept of spatial screening: using small cyclic peptides with structure-inducing building blocks, the binding motif is presented in different well-defined structural arrangements. for this study, a series of cyclic penta- and hexapeptides based on the tqidspln sequence has been synthesized. β-homoamino acids, i.e. β3-amino acids with proteinogenic side-chains, have been incorporated as structure inducers for spatial screening. although β3-amino acids are supposed to prefer the central position of ψγ-turns, fewer data exist than for e.g. d-amino acids. apart from the structural characterization of potential high affinity ligands for integrin α4β1, a major goal of this work is to provide a better understanding of the influence of β3-amino acids on the structure of cyclic peptides. the structures of the peptide library have been investigated by nmr spectroscopy, followed by dg/sa and md calculations. the results substantiate the γ-turn inducing capability of β-homoamino acids, but also prove the formation of different turn structures in certain cases. a comparison to the x-ray structure of vcam-1 shows that the structure of the binding sequence has been successfully approximated by some of the peptides. biological activity tests should lead to meaningful structure-affinity relationships. neuropeptide y (npy) is a 36-amino acid peptide amide and binds to the so-called y receptors. its most dominant element is the c-terminal alpha-helix spanning amino acid residues 12-36. residues 1-10 form a polyproline helix with highly conserved proline residues at positions 2, 5 and 8, followed by a loop structure. 
the importance of the polyproline helix strongly varies between different receptor subtypes. it apparently plays no role in ligand binding at the y2 receptor subtype, whereas the n-terminal segment is of importance for ligand binding at y1 and y5. in order to further study the importance of the polyproline helix we introduced a conformationally constrained pyridone dipeptide mimetic at different single positions by solid phase peptide synthesis using the fmoc/tbu strategy. the resulting peptides have been investigated in cell lines that selectively express the y1 and y5 receptors, respectively. different methods including a radioactive competitive binding assay, cd and nmr have been applied to investigate the conformation and interaction of receptor and ligand. the loss of affinity at the y1 receptor is independent of the position and is about 10-, 20- and 30-fold, respectively, when the mimetic is introduced once, twice and thrice. introduction of the building block in position 8/9 leads to the most reduced affinity at the y5 receptor subtype but, surprisingly, affinity can partially be regained by introduction of the dipeptide at two additional positions. the position of the dipeptide is of greater importance at y5. these novel peptides clearly indicate the importance of proline residues and the structure of the n-terminus for ligand binding. interactions of src homology 2 (sh2) domains with phosphotyrosine (py) containing ligands are critical for regulating cellular processes. the cytosolic protein tyrosine phosphatase shp-1 contains two sh2 domains. an intramolecular interaction of the n-terminal sh2 domain with the catalytic (ptp) domain renders the enzyme inactive in the native state. binding of a py-ligand to shp-1 n-sh2 leads to a conformational shift and the dissociation of the sh2-ptp complex [1] . 
in previous studies we investigated the topographical and conformational preferences of the n-sh2 domain of shp-1 using conformationally restricted linear and cyclic peptides derived from the natural interaction partner ros py2267 [2] . we identified peptides that showed an increased binding affinity for the n-sh2 domain and partially inhibited ros-mediated shp-1 activity. on the basis of these results we hypothesized that an imperfect fit of the py+1 and py+3 side chains might be responsible for the inhibitory effect. in order to confirm this hypothesis we synthesized a new series of peptides and evaluated their biological activity. to better understand the role of each individual sh2 domain in the activation process we also determined the binding affinity against the c-sh2 domain and the activation profile of different shp-1 mutants. pull-down assays of the interactions of the py-ligands with full length shp-1 confirmed the results obtained for the binding to the individual sh2 domains. proteins are targets for photo-destruction due to absorption of incident light by endogenous chromophores. mass spectroscopic data presented evidence that structural modification observed upon irradiation of goat alpha-lactalbumin at 290 nm results from tryptophan (trp) mediated cleavage of disulfide bonds [1] . the aim of the recent studies is to define structural elements that direct the destructive influence of near-uv light on the disulfide bridges of proteins. most of the proteins of the immunoglobulin superfamily contain a so called triad, consisting of two s atoms, forming a disulfide bridge, and a single trp in their close vicinity [2] . we have indications that this arrangement gives rise to a photolytic degradation similar to that described in our earlier studies for goat alpha-lactalbumin [3] . we therefore investigated the influence of uv light on the single chain variable fragment (scfv) of a monoclonal antibody (82d6a3) [4] which contains two triads. 
the results showed that after irradiation of the wild type scfv (i) new bands (degradation products) appeared in electrophoresis experiments and (ii) the affinity for its antigen, von willebrand factor, decreased. by site-directed mutagenesis, we modified the critical trp-residues to perform a parallel study on these mutants. background and aims: thrombomodulin is known to have an important function in preventing thrombus formation. in our recent studies we found that k(m)ylcvck(n) (m, n >= 2) peptides derived from thrombomodulin had strong anti-thrombus activity. because these peptides form two dimeric structures, parallel and anti-parallel, we examined the relation between structure and activity. methods: two peptides, kkkylc(acm)vckkk and kkkkylcvc(acm)kkkk, were synthesized by fmoc chemistry. dimer peptides were made by removing acm with iodine, after dissolving in 0.1 m tris-hcl buffer (ph 8.0) and oxidizing the mixture of these synthesized peptides spontaneously. the three peptides shown in the figure were then separated using rp-hplc. the peptide concentration in normal human pooled plasma was 10 µmol/l when measuring aptt (activated partial thromboplastin time). results: the anti-parallel peptide, peptide b, prolonged aptt approximately 2.7-fold, whereas with the two parallel peptides, peptides a and c, aptt was not significantly different from that of normal plasma. conclusions: these peptides show a structure-activity relationship; the anti-parallel peptide had strong anti-thrombus activity. insect kinins share a highly conserved c-terminal pentapeptide sequence phe-xaa-xbb-trp-gly-nh2, where xaa can be tyr, his, ser or asn and xbb can be ala but is generally ser or pro. they are potent diuretic peptides that stimulate the secretion of primary urine by malpighian tubules, organs involved in the regulation of salt and water balance (1). the insect kinins preferentially form a cis-pro, type vi β-turn. 
insect kinin analogs containing tetrazole (1) and 4-aminopyroglutamate (2), both cis-peptide bond, type vi β-turn motifs, demonstrate significant activity in a cricket diuretic assay. in this study, we compare the diuretic activity of insect kinin analogs incorporating the four stereochemical variants of the 4-aminopyroglutamate (apy) motif. three of the insect kinin analogs incorporating the stereochemical variants, ( the need for new, effective antifungal agents that are non-toxic to mammalian cells increases in parallel with the expanding number of immunocompromised patients at risk for invasive fungal infections. in our laboratory we have produced a series of low-molecular peptide derivatives of the general structure: x-arg-leu-nh-ch(ipr)-ch2-nh-y (where x and y were acyl groups with an aromatic carbocyclic system). we have found and earlier reported that some of these display high antimicrobial activities against several clinically important gram-positive pathogenic bacteria. in this study we have synthesized a group of low-molecular compounds by solution methods and investigated their antifungal activity. the study included both candida and aspergillus species. we have found that some of the compounds were highly fungicidal. we also made a conformational study in which the residues were separately replaced by selected hydrophobic amino acids and their equivalents. the conformational study showed that the desirable stable intramolecular structure could only be formed in the presence of some vital components. this work was supported by grant ds/8350-5-0131-6. increased resistance of bacterial pathogens to currently employed antibiotics has resulted in efforts to develop antimicrobial compounds with new mechanisms of action. previously, we have synthesized some highly potent antimicrobial compounds based upon the n-terminal binding fragment of human cystatin c. 
some derivatives of the general structure: x-arg-leu-nh-ch(ipr)-ch2-nh-y (1) (where x and y were acyl groups with an aromatic carbocyclic system) have displayed a broad antibacterial spectrum and high activity against several clinically important gram-positive pathogens, including multi-resistant staphylococci. herein, the synthesis and structure-antibacterial property relationships for two series of analogues of 1 are presented. the x and y groups in 1 were replaced by selected substituents with various geometry and distance between aromatic moieties and carbonyl. we have established the general structural features which the discussed class of peptide derivatives should possess in order to display the particular antimicrobial activity. this work was supported by grant ds/8350-5-0131-6. we have synthesized the beta-endorphin-like decapeptide immunorphin sltclvkgfy, which corresponds to the 364-373 sequence of the heavy chain of human igg. immunorphin was found to be a selective agonist of the non-opioid (naloxone-insensitive) beta-endorphin receptor. the purpose of this study was to prepare [3h]immunorphin and to use it to characterize the non-opioid beta-endorphin receptor on mouse peritoneal macrophages and membranes isolated from various rat organs. by use of tritium-labeled immunorphin ([3h]sltclvkgfy) with a specific activity of 24 ci/mmol, non-opioid beta-endorphin receptors were revealed and characterized on mouse peritoneal macrophages and rat myocardium, spleen, adrenal, and brain membranes. since dehydroamino acids are quite reactive and various thiol nucleophiles are known to add to their double bonds [1, 2] , we hoped that these compounds might act as alkylating inhibitors of cathepsin c (dipeptidyl-peptidase i). its main function is protein degradation in lysosomes, but it is also found to participate in the activation of neuraminidase and proenzymes of serine proteinases (leukocyte elastase, cathepsin g, granzyme a) [3, 4] . 
it is well known that phosphonodipeptides, structural analogues of synthetic substrates of cathepsin c, are model substances for designing new inhibitors of this enzyme. for that reason we have undertaken the synthesis and the theoretical and structural investigation of the following phosphonic analogues of dehydropeptides: gly-∆zphe-abupo(ome)2, gly-∆zphe-alapo(oet)2, gly-∆zphe-leupo(ome)2, gly-∆zphe-valpo(oet)2, gly-∆zphe-glypo(ome)2 and gly-∆zphe-nbupo(oet)2. the structure and conformational preferences in this group of peptides were investigated by means of nmr techniques. molecular modelling methods were used to characterize the compound-enzyme (cathepsin c) interactions and to interpret the results of the biological tests. the interaction of the αvβ3 integrin receptor with its ligands is selectively implicated in various processes, like angiogenesis, bone formation, tumorigenesis and tissue-genetic migration of embryonic cells. several cyclic rgd pentapeptides are known as selective ligands for the αvβ3 integrin receptor. the aim of this study was to prepare a new conjugate, composed of the cyclo[rgdfc] derivative and a branched chain polycationic polypeptide, poly[lys(dl-alam)] (ak). the cyclopeptide was prepared on 2-cl-trityl chloride resin by fmoc/tbu strategy. the "head-to-tail" cyclisation was achieved in a diluted solution of dmf in the presence of bop and hobt coupling reagents and diea base. coupling of the cyclopeptide to the ak polymer was carried out by thioether linkage. adhesion properties of soluble cyclic rgd peptides and their plate-immobilized forms were studied. free cyclopeptides evoke aggregation of cultured primary neural and cloned neural stem cells, while their plate-immobilized forms fail to support cell adhesion. on the contrary, in the case of the newly synthesized ak-c(rgdfc) conjugate such induction of cell aggregation was not observed. 
in contrast, immobilizing this derivative on either glass or plastic was found to support cell attachment for various cell types. in addition, all cell lines investigated, including the primary neural cells, attached to the ak-c(rgdfc) coated surface and survived, grew or differentiated even in the absence of serum. our data suggest that cyclic rgd-polypeptide conjugates represent a new tool to investigate selective cell adhesion and may provide a novel scaffold material for directed cell seeding. in the ph-induced channel closure in combination with the pip2 interactions. however, their detailed regulation still remains unclear. therefore, in the present study, we investigated these crucial residues with electrophysiological recordings and rationally designed mutagenesis based upon our structural analysis of the kir1.1 tetramer. lys-80 is located fairly close to the intracellular channel gate and its long, positively charged side chain protrudes into the pore. this may interfere with the potassium flow by providing a repulsive charge when the ph is lowered, which pushes the channel towards its closed state. mutation to met-80 therefore reduces such ph-sensitivity. on the other hand, arg-188 is supposed to be responsible for the maintenance of channel opening in the presence of pip2. loss of positive charge at this site may lead to enhanced ph-sensitivity due to an abolished or reduced pip2 interaction. more interestingly, the double mutant for both sites reveals a compensation scenario. in combination with the discussion of the role of the previously known r-k-r triad, our data provide a clear structural explanation for the functional roles of these basic residues in the regulation of ph-sensitive channel gating. mouse obese cart peptides are neurotransmitters involved in feeding, stress and endocrine regulation. leptin, a long-term adiposity signal, upregulates expression of cart in the hypothalamus. 
recent findings of co-localization of cart and cholecystokinin (cck)-a receptor (responsible for the satiety effect of cck) in brain and gastrointestinal tract suggest a neurochemical link between cart peptides and cck. in normal fasted mice, cart(61-102) peptide decreased food intake after intracerebroventricular (icv) administration in a dose-dependent manner. the anorectic effect of cart peptide was enhanced by peripherally administered cck-8, while the cck-a receptor antagonist devazepide blocked the effect of cart peptide on food intake. we used two mouse obesity models in this study: monosodium glutamate (msg) and diet-induced obese (dio) c57bl mice. both dio and msg mice had a substantially increased fat to body mass ratio compared to their controls and were hyperleptinemic. msg mice were hypophagic, and neither cart peptide nor cck-8 and devazepide had any effect on food intake of these mice. dio mice fed a high-fat diet showed slightly decreased sensitivity to central administration of cart peptide, whereas the effect of cck-8 on food intake was preserved. in conclusion, cart peptide and cck-8 showed a synergistic effect on feeding in control mice, which points to their probably integrated action in the central nervous system. analogously, devazepide suppressed the anorectic effect of cart. in msg obese mice, effects of both cart peptide and cck-8 on food intake were diminished due to disrupted signaling in the hypothalamus. in dio mice, additive effects of cart and cck-8 were partly preserved in spite of hyperleptinemia and increased adiposity. b. chini 1 , s. stoev 2 , l.l. cheng 2 , m. manning 2 , were subsequently shown not to be selective for the rat v1b receptor [2] . peptides a-d served as excellent leads to the design of selective agonists for the rat vp v1b receptor [3] . replacement of the arg8 residue in a-d by lys, orn, dap and dab led to the first potent and selective agonists for the rat v1b receptor [3] . 
we now report that three of these; d the aim of the de novo peptide synthesis and the incorporation of cofactors is the construction of artificial protein models. these model systems can be used for understanding the structure-function relationship of native proteins and might open a way for possible applications. protegrin-1 (pg-1) is an 18-amino acid peptide with an amidated c-terminus, which forms an antiparallel beta-sheet, constrained by two disulfide bridges. the native sequence of pg-1 is highly cationic, containing six positively charged arginine residues. it was found that structural features such as amphiphilicity, charge and shape are important for the cytolytic activity of pg-1. in this study we investigate the sar (structure-activity relationship) of two pg-1 analogues: rglcycrgrfcvcvg-nh2 (bm-1) and rglcyrprfvcvg-nh2 (bm-2). our antimicrobial activity studies of these peptides show that the bm-1 peptide is as active against microbe species as the native pg-1, whereas bm-2 is completely inactive. the bm-1 analogue is shorter than native pg-1 and contains only three arginine residues; it is therefore much cheaper to synthesize chemically, which could be an advantage of this antimicrobial peptide. the conformational studies of both analogues were performed by using the 2d 1h-nmr technique (in dmso-d6) and molecular dynamics studies. the 3d solution structure of both analogues was established using interproton distances and torsion angles. for simulated annealing calculations the xplor program was used. our conformational studies show that bm-1 forms a regular beta-hairpin structure, which is very similar to that of the native pg-1 peptide, whereas the bm-2 analogue is very flexible, which could be a reason for its antimicrobial inactivity. copper amine oxidases (ec 1.4.3.6) catalyze the oxidative deamination of primary amines to the corresponding aldehydes, ammonia and hydrogen peroxide. 
these enzymes are ubiquitous, occurring in micro-organisms, plants and animals. activity of this enzyme increases under various stress conditions including thermal and water stresses. although lsao is not a thermostable enzyme, it shows maximum stability and activity above physiological temperatures. in this study we have investigated the kinetics of thermal denaturation of lentil seedling amine oxidase (lsao) by measuring its denaturation constant (kden) at various temperatures from 37 to 67 °c in 100 mm phosphate buffer, ph 7.0. the results of thermal inactivation curves as well as measurements of a280 at various temperatures were used to calculate kden. moreover, the activation energy (ea) for the denaturation reaction was obtained from the corresponding arrhenius plot. our results showed that the unfolding process started to occur at 56 °c and that the ea of denaturation changed at 65 °c, indicating a dominant conformational change of the enzyme at this temperature. the results of the kinetic study agree with previously reported equilibrium studies, which denote that the optimum and melting temperatures of the enzyme are 56 and 65 °c, respectively. development and advancement of enzymatic processes used for production and modification of natural polysaccharides are now major challenges in biochemistry. the paper investigates enzymatic systems in invertebrates, in particular an enzymatic complex obtained from the hepatopancreas of the red king crab paralithodes camtschaticus, and clarifies its effect on the mechanism of chitin and chitosan hydrolysis. chitinolytic activity was estimated spectrophotometrically, using the 4-(dimethylamino)benzaldehyde method, from the concentration of n-acetyl-d-glucosamine released during chitinolysis. total glycolytic activity was defined by the sum of n-acetyl-d-glucosamine and d(+)-glucosamine in the reaction with potassium hexacyanoferrate(iii). 
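as an illustration of the arrhenius analysis mentioned above for the thermal denaturation of lsao, the activation energy can be estimated from the slope of ln(kden) plotted against 1/t, since ln k = ln a - ea/(r t). the following python sketch is not taken from the original work: the function name, the pre-exponential factor and the rate constants are hypothetical and serve only to show the calculation. a change in ea, such as the one reported at 65 °c, would be examined by fitting the temperature ranges below and above that point separately.

```python
import math

R = 8.314  # gas constant in J/(mol*K)


def activation_energy(temps_celsius, rate_constants):
    """Estimate Ea (J/mol) from a least-squares line through ln(k) versus 1/T.

    The Arrhenius equation ln k = ln A - Ea/(R*T) is linear in 1/T,
    so Ea equals -slope * R.
    """
    xs = [1.0 / (t + 273.15) for t in temps_celsius]  # 1/T in 1/K
    ys = [math.log(k) for k in rate_constants]        # ln k
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return -slope * R


if __name__ == "__main__":
    # Hypothetical kden values generated from an assumed Ea of 120 kJ/mol;
    # these are NOT measured data from the study.
    assumed_ea = 120_000.0
    assumed_a = 1.0e15
    temps = [37.0, 45.0, 50.0, 56.0]
    kden = [assumed_a * math.exp(-assumed_ea / (R * (t + 273.15))) for t in temps]
    print(f"estimated Ea = {activation_energy(temps, kden) / 1000:.1f} kJ/mol")
```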
content of d(+)-glucosamine in the hydrolysates of chitin and chitosan was estimated by reverse-phase high-performance liquid chromatography (hplc) of aminosaccharides with ortho-phthalaldehyde. the paper studies the process of chitin and chitosan glycolysis and the effects of different factors (ph, temperature and time of incubation, enzyme/substrate ratio) on the total glycolytic activity of the enzymatic complex from crab hepatopancreas, which is compared with previously studied proteolytic and exochitinase activities. a mechanism of enzymatic hydrolysis of chitin and chitosan is suggested. study results allowed the following conclusions concerning glycolytic and deacetylase activity of ep: 1) ep induces the formation of a monomer (n-acetyl-d-glucosamine) and oligomers (chitin and chitosan) with low deacetylation. thus, ep is characterised by a marked endochitinase (endochitosanase) activity; 2) n-acetylglucosamine deacetylase and, apparently, exochitosanase activity was not revealed; 3) it was found that chitinase and protease activities of ep are associated with different enzymes. [background] in opioids, the n-terminal amino acid 2',6'-dimethyl-l-tyrosine (dmt) enhances bioactivity by orders of magnitude. c-terminal modification of dmt by a methyl group, h-dmt-nh-ch3, exhibited µ-opioid receptor affinity (kiµ = 7.5 nm) equivalent to that of morphine; however, antinociception was only 0.64-0.85% [1] . dmt plays an important role in the message domain to anchor opioid ligands into the active site of opioid receptors, specifically to trigger biological activity by the µ-opioid receptor. 
[methods] dimerization of dmt through diaminoalkanes [2] or 3,6-bis-(aminoalkyl)-2(1h)-pyrazinone produced potent opioidmimetics with high affinity for µ-opioid receptors (kiµ = 0.02-0.115 nm), agonism (gpi, ic50 = 1.3-1.9 nm), and antinociception in mice after systemic and oral administration, which verified passage through the epithelial membranes of the gastrointestinal tract and blood-brain barrier [3] . (1-aminocycloalkane-1-carboxylic acid) . in this case the ring consists of five atoms. knowing that acylation of the n-terminus of several known b2 blockers with a variety of bulky groups has consistently improved their antagonistic potency in the rat blood pressure assay, the apc substituted analogues were also synthesized in n-acylated form (with 1-adamantaneacetic acid (aaa)). the activity of eight new analogues was assayed in isolated rat uterus using a modified holton method in munsick solution and in rat blood pressure tests. the results clearly demonstrated the importance of the position in the peptide chain into which the sterically restricted apc residue was inserted. apc at positions 7 led to preservation or reduction of antagonistic qualities, respectively. acc at position 8 enhanced antagonistic qualities in the blood pressure test and led to preservation of activity in the antiuterotonic test. in most cases acylation of the n-terminus led to enhancement of antagonistic potencies. our findings offer new possibilities for designing new potent and selective b2 blockers. background: during the course of developing opioidmimetic analgesics, data revealed that the n-terminal residue 2',6'-dimethyl-l-tyrosine (dmt) plays an important role in anchoring opioid ligands in the active site of opioid receptors. as a single residue c-terminally extended with an aminomethyl group, it exhibited µ-opioid receptor affinity (kiµ = 7.5 nm) similar to morphine; however, antinociception was only 0.64-0.85% [1] . 
in order to develop potent µ-opioid agonists, the dimerization of tyr or dmt through diaminoalkanes [2] or 3,6-bis-(aminoalkyl)-2(1h)-pyrazinones [3] resulted in production of unique opioidmimetics with high receptor affinities and potent biological activities [3] . methods: the synthesis of opioids and opioidmimetics and the determination of their receptor binding characteristics were performed as described previously [1] [2] [3] . results and conclusion: newly synthesized 3-(tyr-nh-butyl)-6-(tyr-nh-propyl)-2(1h)-pyrazinone and 3-(tyr-nh-propyl)-6-(tyr-nh-butyl)-2(1h)-pyrazinone (i and ii) exhibited fairly high binding affinity towards the µ-opioid receptor (kiµ = 7.6 and 27.4 nm, respectively). replacement of tyr with dmt in i and ii gave opioidmimetics iii and iv (kiµ = 0.021 and 0.051 nm, respectively); they exhibited 361- and 537-fold higher binding affinity than the tyr derivatives. while iii is a dual µ-/δ-opioid agonist, iv is only a µ-opioid agonist. these findings pave the way to design additional µ-opioid receptor agonists and antagonists for therapeutic application. divalent cations have been known for a long time to influence significantly the receptor binding and biological activity of the peptide oxytocin (ot). there is very low binding of 3h-ot to the receptors in the absence of these ions. the site at which the divalent cations act has been a matter of speculation. recently an article appeared showing formation of a divalent cation-ot complex and stressing the importance of the n-terminal amino group for binding and activity [1] . however, deamino analogues of ot are also very active and their binding is also influenced by divalent cations. we have studied ot, deaminooxytocin (dot) and an ot antagonist (antag) by means of electrospray ms and we have observed that all these compounds form molecular adducts with zn2+, mg2+, mn2+ and ca2+. 
in binding experiments using 125i-labelled antag, the quantity of tracer bound to membranes of hek cells stably expressing the human ot receptor strongly depends on the character and concentration of divalent ions. displacement curves using unlabelled antag do not change in the absence or presence of 10 mm of the tested divalent ions. on the other hand, displacement curves using unlabelled ot and dot are shifted to the left in the presence of mg2+ and mn2+, and to a much lesser extent by zn2+ and ca2+. all this points to the idea that the divalent ions act at the site of the membrane receptors. biologically active peptides exhibit multiple conformations in solution. thus, the synthesis of conformationally restricted analogues is a valuable approach for determining structure-activity relationships. restrictions can be imposed e.g. through the formation of cyclic structures within the peptide framework by disulfide bridges, or by substitution of chosen amino acid residues that limit conformational freedom, thus forcing the peptide backbone and/or side chains to adopt specific orientations. in recent years, the use of conformationally constrained analogues of bioactive peptides seems to be a feasible approach to providing useful information concerning the three-dimensional structure of such compounds which, in turn, could rationalize our knowledge about structure-biological activity relationships and thus help to design peptides with desired pharmacological properties. steric restrictions can be introduced by the formation of cyclic structures within the peptide backbone or by incorporation of amino acids with limited conformational freedom, which in turn results in specific orientations of the peptide backbone and its side chains. another approach to reduce the flexibility of the analogue is substitution of chosen amino acids with various types of pseudopeptides prepared through short-range cyclizations. 
the present work is a part of our studies aimed at clarifying the influence of steric constraints in the n-terminal part of arginine vasopressin (avp) and its analogues on the pharmacological activity of the resulting peptides. we describe the synthesis of four new analogues of avp substituted at positions 2 and 3 or 3 and 4 with two diastereomers of 4-amino-pyroglutamic acid and four peptides in which we combined the above modification with the placement of 3-mercaptopropionic acid (mpa) at position 1. all the peptides were tested for their in vitro uterotonic, pressor and antidiuretic activities in the rat. different strategies to modulate shp-1 activity. protein tyrosine phosphatase shp-1 consists of two sh2 domains n-terminal to the catalytic (ptp) domain and a short c-terminal tail. the binding of a py-ligand to the n-sh2 domain is required for an efficient activation of shp-1 phosphatase activity. the specificity of the shp-1 sh2 domains is determined by the py-residue (position 0) and residues at positions -2, +1 and +3. combinatorial peptide library methods revealed different classes of consensus sequences for both sh2 domains [1, 2] . in addition, the importance of residues c-terminal to py+3 (+4 to +6), in particular for binding to the n-sh2 domain, has been demonstrated [3] . together with investigations of the determinants for optimal sh2 domain binding and stimulation/inhibition of shp-1 activity [4] , this information was useful for the generation of different strategies for effectors of shp-1 activity. peptides cyclized between different positions of the general consensus py-2 to py+3 were synthesized and evaluated with respect to n-sh2 domain binding and stimulation of phosphatase activity. 
structure-activity studies have revealed that the specificity of an integrin towards its rgd-containing ligands can be evaluated through the distances between the cβ atoms and/or the distance between the charged centers of arginine and aspartic acid, as well as the pseudo-dihedral angle (pdo), formed by the r-cζ, r-cα, d-cα and d-cγ atoms, which defines the relative orientation of the arg and asp side chains. in a previous study [1] , the antiaggregatory activity of rgd peptide analogues, i.e. their ability to act as fibrinogen receptor αiibβ3 antagonists, was correlated with the above structural criteria. our results suggested that the fulfillment of the criterion -45° < pdo < +45° is a prerequisite for an analogue to exhibit activity. in the present study, we apply the above criteria to rgd-containing 15-peptides derived from the active sites of the ecm proteins fibrinogen, fibronectin and vitronectin, as well as from the cryptic rgd site of von willebrand factor. the correlation of the structural data with the biological activity of the compounds is in good agreement with the previously mentioned -45° < pdo < +45° criterion. furthermore, our results show that the differences in activity of compounds which display similar distances between the charged centers of arg and asp can be better evaluated by the pdo structural criterion. acknowledgments: this work was supported by grants from the eu and the hellenic ministry of education (heraklitos). the gpiib/iiia receptor, which is a member of the integrin family, is the most abundant receptor on the surface of platelets and can interact with a variety of adhesive proteins including fibrinogen, fibronectin and von willebrand factor. fibrinogen binding to gpiib/iiia is an event essential for platelet aggregation and thrombus formation. mapping of the fibrinogen binding domains on the gpiib subunit suggested the sequence 313-332 as a putative binding site [1] . 
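as an illustration of the pdo criterion described above for rgd analogues, the pseudo-dihedral angle can be computed from the cartesian coordinates of the four atoms r-cζ, r-cα, d-cα and d-cγ (for example, taken from an nmr-derived model). the following python sketch is not taken from the original work: the function names and the coordinates are hypothetical, and the sign convention is the common atan2-based one, which may differ from the one used by the authors.

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def pseudo_dihedral_deg(p0, p1, p2, p3):
    """Signed dihedral angle (degrees) defined by four points p0-p1-p2-p3.

    For the pdo criterion the points would be R-Czeta, R-Calpha,
    D-Calpha and D-Cgamma, in that order.
    """
    b1 = _sub(p1, p0)
    b2 = _sub(p2, p1)
    b3 = _sub(p3, p2)
    n1 = _cross(b1, b2)  # normal of the plane through p0, p1, p2
    n2 = _cross(b2, b3)  # normal of the plane through p1, p2, p3
    x = _dot(n1, n2)
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, n2)
    return math.degrees(math.atan2(y, x))


def meets_pdo_criterion(angle_deg, limit=45.0):
    """True if -limit < angle < +limit (the -45 to +45 degree criterion)."""
    return -limit < angle_deg < limit


if __name__ == "__main__":
    # Hypothetical coordinates (in angstrom), for illustration only.
    r_cz = (1.0, 0.0, 0.0)
    r_ca = (0.0, 0.0, 0.0)
    d_ca = (0.0, 0.0, 5.0)
    d_cg = (1.0, 0.4, 5.0)
    pdo = pseudo_dihedral_deg(r_cz, r_ca, d_ca, d_cg)
    print(f"pdo = {pdo:.1f} deg, criterion met: {meets_pdo_criterion(pdo)}")
```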
this region was restricted to sequence gpiib 313-320 (ymesradr) using synthetic octapeptides overlapping by six residues [2] . the ymesradr octapeptide inhibits adp-stimulated human platelet aggregation and binds to immobilized fibrinogen. in this study we present the conformational analysis of three synthetic analogues, yaesradr (a2), ymesaadr (a5) and ymesraar (a7), using nmr spectroscopy and distance geometry calculations. a common structural characteristic of peptides a2 and a7 is the interaction between the side chains of arg5 and glu3; however, in a2 the guanidino group of arg5 seems to form salt bridges with both glu3 and asp7. peptide a5 is stabilized only by a weak interaction between the arg8 and glu3 side chains. the interactions between the residue side chains result in a different overall shape of the three molecules. the most populated structural family of a2 exhibits a π backbone shape, a5 a turn around -s4a5-, while a7 has an almost extended shape. background and aims: endomorphin-2 (em-2: h-tyr-pro-phe-phe-nh2), an endogenous opioid peptide isolated from bovine and human brain, has high affinity and selectivity for the mu receptor and produces potent and prolonged analgesia in mice [1] . in this presentation, the incorporation of an ethylene-bridged phe-phe unit (eb[phe-phe]) or piperidine carboxylic acid (pic) in position 2 was carried out to obtain a more potent agonist or antagonist with stability against dipeptidyl peptidase iv (dpp iv). methods: the synthesis of the eb[phe-phe] unit was achieved according to the procedure of lammek b. et al. [2] protected peptides were synthesized by a solution method using boc-chemistry. the final products were identified by maldi-tof mass spectrometry and elemental analyses. the receptor binding affinity of peptides was assessed by radio-ligand receptor binding assay using mu and delta opioid receptors from rat brain membranes or cos-7 cell membranes expressing each opioid receptor. 
muc2 glycoprotein, produced by the epithelium of the colon and built up mainly of repeat units of 1ptttpitttttvtptptptgtqt23, can be underglycosylated in colon carcinoma. we have been studying the epitope structure of the muc2 repeat unit with the mucin peptide specific mab 996 monoclonal antibody. this antibody recognizes the 18ptgtq22 sequence as the minimal, and 16ptptgtq22 as the optimal epitope. our interest lies in the modification of this epitope with maintained or enhanced specificity, and we aim to clarify the effect of different epitope modifications on mab 996 antibody binding: a) amino acid changes in the flanking region, b) glycosylation in the epitope core and in the flank. for this we have prepared a) libraries of ax(1)ptgtqaa and atptgtqx(2)a peptides, and x(1)ptgtqx(2) heptapeptides based on the antibody binding properties of the libraries; and b) glycopeptides pt(galnac)ptgtq, ptpt(galnac)gtq and ptptgt(galnac)q. the peptides were prepared by solid phase synthesis; after purification and characterisation by esi-ms and amino acid analysis, their antibody binding properties were studied by competitive elisa. our results show that a) although all amino acids in positions x(1) and x(2) resulted in antibody binding, hydrophobic residues in position x(1) and aromatic residues in position x(2) provided stronger binding than that of the native peptide; b) glycosylation on thr(17) did not influence the binding of mab 996, but on thr(19) the presence of n-acetyl-galactosamine, interestingly, slightly increased the antibody recognition. these findings could be useful in designing synthetic peptide vaccines for tumour therapy. histidines play an essential role in the binding of biological metal ions, either in small or macromolecular chelating molecules, e.g. in metalloenzymes. therefore the low molecular weight polyhistidine type ligands are of potential importance as model substances. 
continuing our investigations on a novel branched oligopeptide type ligand, (his)4(lys)2lys-nh2, prepared by solid phase peptide synthesis, we investigated its metal ion binding properties with zinc(ii) and copper(ii). the eight primary metal-binding sites are the four imidazole and four amino groups on the ligand. ph-potentiometric titrations revealed that up to ph 8 all these donor atoms lose their protons with increasing ph. the competition between the protons and the metal ions results in a decrease of the pka values to about 1-3 in the case of copper(ii) and to about 4-6 in the case of the zinc(ii) ion. this reflects the higher stability of the complexes formed with copper(ii) in spite of the weak axial coordination that seems to occur in zinc(ii) complexes. combined potentiometric, spectrophotometric, cd and nmr spectroscopic methods were utilized to investigate the speciation and the structure of the complexes formed in aqueous solution. the prepared cu(ii) complexes cleaved dna, but it is not known whether in an oxidative or a hydrolytic manner. because of this ambiguity further studies with zn(ii) complexes will be undertaken. this work has received support through sapstclg97697 nato collaborative linkage grant and from the hungarian science foundation (otka t43232). lgr8. further studies have shown that in both male and female gonads, insl3 and lgr8 represent a paracrine system important for meiosis induction in the ovary and male germ cell survival in the testis. thus insl3 may have clinical applications in fertility management. we undertook to determine the key structural elements responsible for its unique actions. methods: alanine-scanned analogues of human insl3 and mimetics of the b-chain alone were prepared by solid phase peptide synthesis. each was subjected to cd spectroscopy for secondary structure analysis and assayed for in vitro lgr8 binding and activation activity. 
the tetrapeptide h-dmt-d-arg-phe-cbp was found to be a selective µ agonist [ic50 (gpi) = 78.8 ± nm] with 40-fold lower potency than the corresponding, highly potent tiq4-tetrapeptide, but still with 3-fold higher potency than leu-enkephalin. in conclusion, we developed selective, cbp-containing δ antagonists and µ agonists with significant potency. recently, we described the syntheses and biological activities of several opioid peptide analogues that contained the n-terminal sequence 1-4, common to dermorphin and deltorphin. some of them showed very high agonist potency both in the gpi assay and in the mvd assay [1, 2]. in this work, we designed new analogues in which the sequences were elongated at the c-terminus to obtain the full sequences of dermorphin (a) and deltorphin (b). the syntheses of the compounds and their biological activity profiles will be discussed. background and aims: endomorphin-2 (em-2: tyr-pro-phe-phe-nh2) is a very potent endogenous opioid peptide, which exhibits high affinity and selectivity for the mu-opioid receptor [1]. previously, we reported that [ac3c2]-em-2, containing 1-aminocyclopropane-1-carboxylic acid (ac3c), exhibited higher affinity than em-2 for the mu-opioid receptor [2]. in order to clarify whether the substitution of 1-aminocycloalkane-1-carboxylic acids (acnc: n indicates the number of carbon atoms in the ring) for pro in position 2 of em-2 is an efficient way to obtain higher affinity for the mu-receptor, we synthesized . therefore, the replacement of pro with ac3c and ac4c will be efficient in making these analogs adopt the bioactive conformation and exhibit high affinity for the mu-receptor. in the past few years, many attempts have been made to prepare a synthetic insulin. 
the biological activity of insulin is known to be closely related to the c-terminal octapeptide fragment of its b-chain. it was found that b23 gly and b24 phe were present in all insulins so far obtained from various animal species, indicating the significance of these two residues. it would therefore seem desirable to study the effect of each of these two amino acid residues, or both, on the biological activity of the octapeptide fragment of the b-chain. a heptapeptide arg-phe-tyr-thr-pro-lys-ala-och3, corresponding to (b22-b30) insulin des gly23-phe24, and an octapeptide arg-phe-phe-tyr-thr-pro-lys-ala-och3, des gly23, were synthesized using the solid phase method. the c-terminal ends of both peptides were converted to the methyl ester by transesterification cleavage from the resin. the side chain protecting groups were removed by hf. a manual countercurrent distribution method was used for purification of the free peptides. the way to solve the evaluation of tyrosine-containing peptides was studied. the free methyl ester peptides were tested for insulin-like activity in vitro by the glucose metabolism in rat fat cells technique. the aim of this study is to develop peptides as useful tools for the degradation of synthetic dyes, which are often pollutants. we focused our interest on peroxidases, a class of enzymes reported to efficiently degrade azo and anthraquinonic dyes. in particular, the fungal versatile peroxidase (vp) of pleurotus eryngii can perform this degradation. therefore, our goal is the synthesis of a peptide based on this peroxidase that is able to emulate its biological function. the linear and cyclic peptide sequences were derived from the theoretical model of vp [pdb: 1a20], which determined the amino acids fundamental for the desired function of the active sites. in particular, the residues instrumental for the coordination of the heme, the mn binding site, and the long range electron transfer pathway [1] were pin-pointed. 
moreover, we calculated the radius of the heme cavity. the next step was the synthesis of these peptides in order to verify the coordination of the heme and optimize their sequences. the syntheses were carried out by solid-phase following the fmoc/tbu strategy. because of purification difficulties of the fully-protected peptide, we undertook an alternative synthetic pathway, based on a solid phase head-to-tail cyclisation strategy, following the fmoc/tbu/allyl three-dimensional protection scheme [2] . next steps will be to test the coordination properties of the synthetic peptides, with respect to the heme, and further computational studies based on the new model of pleurotus the calcium plays an important role in biochemical pathways. it binds to enzymes and proteins in a different process. aspartic (asp, d) and glutamic (glu, e) acid side chains are the main ligands of calcium, but the contribution of the backbone carbonyl groups in the binding is also important. generally the binding places in the proteins are an unstructured loop between two helixes (310-or alpha-helix). the common sequence is the so-called ef-hand motif, which contains 12 amino acids [1] . it is already known that some proteins also bind calcium with a non-ef-hand loop. for example alpha-lactalbumins have a ten amino acid long sequence for binding [2] . it is an asp rich sequence where 5 asps are closer to each other than in ef-hand motif (-k79fldddltdd88-) but only 3 asps side chains take part in calcium coordination. we constructed a series of cyclopeptides to mimic the loop structure of alpha-lactalbumin [3] . in this study we focus on determining the importance of conservative amino acids within the ca2+ binding loop of this protein, using microcalorimetry (itc). the itc measurements were performed in different organic solvents and at different temperature. the synthesis of fatty acids in adipose tissue. 
in this article, we present the solution structure of gip in water and tfe/water determined by nmr spectroscopy. the calculated structures are characterised by the presence of an α-helical motif between residues ser11-gln29 and phe6-gln29, respectively. the helical conformation of gip is further supported by cd spectroscopic studies. six gip(1-42)ala1-7 analogues were synthesised by replacing individual n-terminal residues with alanine. alanine scan studies of these n-terminal residues showed that gip(1-42)ala6 was the only analogue to show insulin secreting activity similar to that of the native gip. however, when compared with glucose its insulinotropic ability was reduced. for the first time, these nmr and modelling results contribute to the understanding of the structural requirements for the biological activity of gip. a knowledge of the solution structure of gip and of the role of its individual residues will be essential in the understanding of how they interact with the gip receptor. efrapeptins are pentadecapeptides produced as a mixture of six closely related analogues (efrapeptin c-g) by the fungus tolypocladium niveum and other members of this species. they consist predominantly of the nonproteinogenic amino acids α-aminoisobutyric acid (aib), isovaline (iva), β-alanine (β-ala) and pipecolic acid (pip), have an acetylated n-terminus and bear an unusual cationic c-terminal headgroup derived from leucinol and 1,5-diazabicyclo[4.3.0]non-5-ene. efrapeptin c is a competitive inhibitor of the f1-atpase and is active against the malaria pathogen plasmodium falciparum. an anti-proliferative effect was also reported. conformational analysis of efrapeptin c in trifluoroethanol and dimethylsulfoxide was conducted to obtain structure-affinity relationships. the absence of amide and α-protons resulted in an imperfect assignment and an unsatisfying conformational study. specific deuteration of methyl groups in aib did not simplify the assignment. 
cd and ft/ir spectra hint at helical or beta-turn secondary structures as the main structure elements. residual dipolar couplings (rdc) were measured in a stretched cross-linked poly(dimethylsiloxane) gel in dichloromethane. the impact of the rdc on the conformational analysis led to an improved high resolution structure from simulated annealing protocols and consolidated the formation of a helical structure of efrapeptin c in nonpolar solution, which is comparable with the binding pocket of the f1-atpase. finally, the dynamics of the resulting structures was studied using the gromos96 force field in explicit solvent. serotonin selective reuptake inhibitors (ssris) are currently among the most frequently prescribed therapeutic agents for depression. their therapeutic use also includes obsessive-compulsive disorder, panic disorder and bulimia. the serotonin transporter (sert) is the target of the ssris. although the inhibition is the proximal event in antidepressant action, the clinical benefit of antidepressant medications requires weeks of continuous dosing, indicating that their mechanism of action involves events downstream from acute transporter blockade. long-term effects of ssri treatment may be due to changes in intrinsic properties of sert structure, function, or regulation. thus, understanding the mechanism of action of sert remains a primary goal in the search for novel treatments for diseases associated with serotonergic dysfunction. in the present study the experimentally determined ligand selectivity of buspirone analogues toward the serotonin transporter was theoretically investigated at the molecular level. the model of the serotonin transporter, based on the crystal structure of the bacterial homologue from aquifex aeolicus (leutaa), was constructed using the traditional homology modelling approach. 
a series of docking experiments with ssris were conducted, using interactive molecular graphics techniques combined with energy calculations and analysis of the transporter-ligand complexes. structural information about the serotonin transporter and its molecular interactions with ssris is important for understanding the mechanism of action of these drugs and for the development of drugs with improved potency and selectivity. the protein kinase c (prkc) is a member of a super-family of the eukaryotic receptor protein kinases. it forms dimers and is anchored in the membrane, with a cytoplasmic kinase domain and an external domain, presumably acting as a sensor. prkc enables the formation of biofilms of bacillus subtilis which show a high degree of spatial organization. they colonize various surfaces and produce complex antibiotic resistant communities. prkc acts as a ser/thr kinase with features of the receptor kinase family of eukaryotic hanks kinases. our current study involved theoretical modeling of the complexes of the protein kinase prkc with modified atp. the ligands were selected from a set of molecular probes developed by k. shah and coworkers [1]. each modified atp molecule was docked to the active site of the kinase molecule using the autodock genetic algorithm procedure. the optimized structures of the complexes were submitted to molecular dynamics simulations in the amber force field. we obtained four optimized structures of prkc complexes in water. the results suggest a great similarity of our complexes with human cyclin-dependent kinase 2 [1] complexes. background and aims. indolicidin is a 13-residue antimicrobial peptide, which was isolated from bovine neutrophils. this molecule possesses a wide spectrum of antibacterial, antifungal and antiviral activity; furthermore, it also has a haemolytic effect. 
data derived from structural investigations led to considerably diverse conclusions regarding the secondary structure of this peptide, therefore the aim of this study was to examine the effect of cis-trans isomerization on the conformational properties of this antimicrobial peptide. methods. the conformational analysis of indolicidin containing cis or trans xxx-pro peptide bonds was performed by simulated annealing calculations with the use of the amber force field. results. for the conformers of indolicidin with cis or trans xxx-pro peptide bonds, the evolving secondary structural elements were examined and the poly-proline ii helix and type vi beta-turn were identified. in the case of this peptide, various intramolecular interactions may play an important role in stabilizing the structure of conformers. therefore the presence of the h-bonds between backbone atoms, the aromatic-aromatic interactions between the side-chains of trp amino acids and the proline-aromatic interactions between the side-chains of trp and the pyrrolidine rings of pro amino acids was investigated. conclusions. the conformational comparison of the peptides possessing cis or trans xxx-pro peptide bonds resulted in different secondary structural elements for the two isomers, which are the poly-proline ii helix and the type vi beta-turn for the trans and cis isomers of indolicidin, respectively. the occurrences of various intramolecular interactions are in agreement with the observed secondary structures. we have shown that the monte carlo conformational search using macromodel is useful for the conformational study of oligopeptides prepared from alpha,alpha-disubstituted alpha-amino acids. moreover, we have studied the conformational analysis of oligopeptides containing chiral alpha,alpha-disubstituted alpha-amino acids to predict the helical screw sense of helical structures. 
here we report a computational study on the conformation of oligopeptides containing cyclic alpha,alpha-disubstituted alpha-amino acids with side-chain chiral centers. background and aims. the homopolymeric amino acids (hpaas) are polypeptides consisting of the same amino acids. some of them play a relevant role in the formation of several neurodegenerative diseases. most probably poly-(ala) and poly-(gln) are the best representatives of these peptides because of their important biological effects. our aim was to perform conformational analysis and structural investigation of these two hpaas. methods. to explore the conformational spaces of the peptides, simulated annealing (sa) and random search (rs) calculations were carried out using the amber force field. two different forms of the hpaas were modelled: either with a charged n-terminal amino group and c-terminal carboxyl group, or with the n- and c-termini blocked by acetyl and n-methyl amide groups, respectively. results. for the conformers obtained by sa and rs calculations, the occurrences of various secondary structural elements like different types of beta-turns, gamma- and inverse gamma-turns, alpha-helix, 310-helix, poly-proline ii helix and beta-strand were investigated. in the cases of the various helices and the beta-strand, segments with different lengths characterized by these secondary structures were determined along the entire sequence of the peptides. for the conformers of the hpaas, the intramolecular h-bonds formed between the backbone atoms as well as between the backbone and side-chain atoms were identified. the vasopressin and oxytocin receptors (v1ar, v2r and otr) are membrane-embedded proteins belonging to the large family a of g protein-coupled receptors (gpcrs). they are involved in crucial physiological functions such as the regulation of water metabolism, control of blood pressure and stimulation of labor and lactation, mediated via v2r, v1ar and otr, respectively. 
as such, they are involved in a number of pathological conditions and are important drug targets. understanding their inhibition and activation mechanisms may improve the design of ligands capable of selective stimulation or blockade of the respective receptors that are the therapeutic targets. to investigate the otr, v1ar and v2r interactions with agonists and antagonists, thirty computer models of receptor-ligand complexes have been built via docking and molecular dynamics (md) and analyzed in detail. the receptor models were built on the rd crystal structure template or using the coordinates of mii-gtα(338-350), for non-active and activated models, respectively. the ligands (arginine vasopressin, oxytocin, desmopressin, atosiban ([mpa1,d-tyr(et)2,thr4,orn8]ot) and barusiban (mpa1,d-trp2,ile3,allo-ile4,asn5,abu6,mol7)) were docked into the receptors. the complexes have been embedded into a hydrated popc bilayer and submitted to 1 ns of unconstrained md in the amber force field. the relaxed systems have been obtained and analyzed in detail. the receptor residues responsible for agonist/antagonist binding have been identified and a mechanism of binding involving the highly conserved residues has been proposed. three-dimensional models of the neurohypophyseal hormone receptors were constructed using a multiple sequence alignment and either the crystal structure of bovine rhodopsin or the complex of activated rhodopsin with gta c-terminal peptide of transducin rd*-gt(338-350) prototype to obtain non-active or activated receptor models, respectively. analogs were docked to v1ar, v2r and otr, both non-active and activated models. the low-energy receptor-ligand complexes with properly docked analogs were submitted to constrained simulated annealing (csa) in vacuo. the relaxed receptor-analog models were obtained. the residues responsible for analog binding to v1ar, v2r and otr have been identified and the presumable biological activity of these compounds was determined. 
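the simulated annealing step mentioned above can be illustrated with a minimal sketch. it is not the constrained protocol or the molecular force field used in the study: the one-dimensional `toy_energy` function, the geometric cooling schedule and all parameter values are assumptions chosen only to show how a metropolis walk at a slowly decreasing temperature escapes a local minimum.

```python
import math
import random


def toy_energy(x):
    # Hypothetical tilted double well: local minimum near x = +1,
    # global minimum near x = -1 (not a molecular energy function).
    return (x * x - 1.0) ** 2 + 0.3 * x


def simulated_annealing(energy, x0, t_start=5.0, t_end=0.01,
                        steps=5000, step_size=0.5, seed=0):
    """Metropolis walk with a geometric cooling schedule; returns the best point seen."""
    rng = random.Random(seed)
    x, e = x0, energy(x0)
    best_x, best_e = x, e
    for k in range(steps):
        # Temperature decays geometrically from t_start to t_end.
        t = t_start * (t_end / t_start) ** (k / (steps - 1))
        x_new = x + rng.uniform(-step_size, step_size)
        e_new = energy(x_new)
        # Always accept downhill moves; accept uphill moves with Boltzmann probability.
        if e_new <= e or rng.random() < math.exp(-(e_new - e) / t):
            x, e = x_new, e_new
            if e < best_e:
                best_x, best_e = x, e
    return best_x, best_e


if __name__ == "__main__":
    # Start in the higher (local) well and let the annealing find a lower one.
    print(simulated_annealing(toy_energy, x0=1.0))
```

in a real application the scalar `x` would be replaced by the atomic coordinates of the receptor-ligand complex, the energy by the force field, and restraints would be added to keep selected parts of the model in place.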
n-methyldehydroamino acids belong to the non-standard amino acids found in nature. n-methyl-(z)-dehydrophenylalanine was found in tentoxin, a selective weed killer produced by several phytopathogenic fungi of the alternaria genus. n-methyl-(z/e)-dehydrobutyrine and n-methyldehydroalanine are components of nodularins and microcystins, families of hepatotoxins produced by species of freshwater cyanobacteria, primarily nodularia spumigena and microcystis aeruginosa. the simplest n-methyl dehydropeptides, ac-delta(me)xaa-nhme (where xaa = ala, (z/e)-abu, (z/e)-phe, and val) and, for comparison, the saturated ac-l-(me)ala-nhme analogue were investigated using computational methods. cis-trans b3lyp/6-31+g**//hf/3-21g ramachandran potential energy surfaces were created. the conformers found were optimised at the b3lyp/6-31+g** level. the effect of the electrostatic solute/solvent (water) interaction on the solute energies was investigated within the scrf method using the polarisable continuum model (pcm) on the geometries of the solutes in vacuo. it was found that for all the studied dehydropeptide molecules the lowest-energy conformer (phi, psi = ~ -109°, 10°) has the cis n-methyl amide bond. this feature seems to be independent of the dehydroamino acid moieties, the c-beta substituent and the z/e configuration. the pi-electron conjugation as well as the n-h···n hydrogen bond play the dominant role in the stability of this conformer (see figure). the preliminary nmr investigations into the conformational preferences of the studied molecules in solution confirm the theoretical results obtained. the strong tendency of the n-methyl amide bond to adopt the cis configuration seems to be the reason why n-methyldehydroamino acids are found in small natural cyclic peptides, where they ensure the conformational flexibility necessary for biological action. 
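the phi and psi values quoted above are backbone dihedral angles. as a minimal sketch of how such an angle is obtained from four atomic positions, the function below computes a signed dihedral with numpy. the function name `dihedral` and the test coordinates are illustrative assumptions and are not taken from the study; for a residue i, phi would use the atoms c(i-1), n(i), c-alpha(i), c(i) and psi the atoms n(i), c-alpha(i), c(i), n(i+1).

```python
import numpy as np


def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle in degrees for four points p0-p1-p2-p3."""
    p0, p1, p2, p3 = (np.asarray(p, dtype=float) for p in (p0, p1, p2, p3))
    b0 = p0 - p1          # vector from the central bond back to the first atom
    b1 = p2 - p1          # central bond
    b2 = p3 - p2          # vector from the central bond on to the last atom
    b1 /= np.linalg.norm(b1)
    # Project b0 and b2 onto the plane perpendicular to the central bond.
    v = b0 - np.dot(b0, b1) * b1
    w = b2 - np.dot(b2, b1) * b1
    x = np.dot(v, w)
    y = np.dot(np.cross(b1, v), w)
    return float(np.degrees(np.arctan2(y, x)))


if __name__ == "__main__":
    # Idealised geometries: eclipsed (cis), anti (trans) and a quarter turn.
    print(dihedral([1, 0, 0], [0, 0, 0], [0, 0, 1], [1, 0, 1]))
    print(dihedral([1, 0, 0], [0, 0, 0], [0, 0, 1], [-1, 0, 1]))
    print(dihedral([1, 0, 0], [0, 0, 0], [0, 0, 1], [0, 1, 1]))
```

scanning phi and psi over a grid and evaluating the energy at each point is what produces a ramachandran potential energy surface of the kind described in the abstract.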
the purpose of this study was to determine the potentials of mean force (pmfs) of the interactions between models of nonpolar amino acid side chains in water. the orientation-dependent pmfs were determined for systems forming hydrophobic and diagonal complexes composed of side-chain models of alanine, valine, leucine, proline and isoleucine, respectively, in water. for each hydrophobic pair in water a series of umbrella-sampling molecular dynamics simulations with the amber force field and explicit solvent (tip3p water model) was carried out and the pmfs were calculated by using the weighted histogram analysis method (wham). in all cases a characteristic shape of the pmf plots for hydrophobic association was found, which was manifested as the presence of contact minima and solvent-separated minima. the depths of the contact minima for all systems studied were about 1 kcal/mol. in this work we compared the ability of two theoretical methods of ph-dependent conformational calculations to reproduce experimental potentiometric-titration curves of two models of peptides: ac-k5-nhme in a 95% methanol (meoh)/5% water (h2o) mixture and ac-xx(a)7oo-nh2 (xao) (where x is diaminobutyric acid, a is alanine, and o is ornithine) in water, methanol (meoh) and dimethylsulfoxide (dmso), respectively. in theory, in all three solvents, the first pka of xao is strongly downshifted compared to the value for the reference compounds; the water and methanol curves have one, and the dmso curve has two, jumps characteristic of remarkable differences in the dissociation constants of acidic groups. the predicted titration curves of ac-k5-nhme are in good agreement with the experimental ones; better agreement is achieved with the md-based method. 
the titration curves of xao in methanol and dmso, calculated using the md-based approach, trace the shape of the experimental curves, reproducing the ph jump, while those calculated with the edmc-based approach, and the titration curve in water calculated using the md-based approach, have smooth shapes characteristic of the titration of weak multifunctional acids with small differences between the dissociation constants. quantitative agreement between theoretically predicted and experimental titration curves is not achieved in any of the three solvents. the poorer agreement obtained for water than for the nonaqueous solvents suggests a significant role of specific solvation in water, which cannot be accounted for by the mean-field solvation models.
m337 a. papakyriakou 1, g.f. vlachopoulos 2, g.a. spyroulias 2, e. manessi-zoupa 3, p. cordopatis 2
angiotensin-i converting enzyme (ace) belongs to the m2 family of the ma clan of zinc metallopeptidases and can act either as a dipeptidyl carboxypeptidase, which catalyses the proteolytic cleavage of dipeptides from the carboxy terminus of a wide variety of peptides, or as an endopeptidase, which hydrolyses peptides bearing amidated c-termini. among the former category of ace peptide substrates, the most distinguished are those involved in blood pressure regulation, such as angiotensin i (angi) and bradykinin (bk). in the latter category falls the gonadotropin-releasing hormone (gnrh). in an attempt to analyze molecular interactions at the atomic level we simulated the ace-substrate complexes, using the recently determined 3d crystal structure of the ace testis isoform and a knowledge-based docking method in order to insert the peptide substrates (angi, bk and gnrh) of ace into its catalytic cleft. 
in order to introduce the effect of protein mobility and gain information about enzyme-substrate recognition and interaction we have sampled the conformational space of these complexes via molecular dynamics simulations with explicit solvent representation. we have also performed molecular dynamics calculations with tace-inhibitor complexes, such as lisinopril, as well as with tace mutated at specific sites, such as the ligands of the two buried chloride ions that have been shown to affect substrate activity. our results provide new insights into the role of specific domains of tace and their implication in the enzyme activity, which is not readily apparent from the available crystal structures. two main mechanisms for the propagation of the action potential in myocytes are: 1) the free flow of local circuit current through gap junctions and 2) the effect of the electrical field. here we study the effect of each mechanism and their importance during action potential propagation. method: we simulated the cardiac myocyte with the orcad software, then used a model of the sinoatrial node to stimulate the myocyte model and studied the propagation of the action potential with and without gap junctions. result: our results show that, although gap junctions alone are not able to mimic the physiological condition, they are necessary for normal cardiac functioning. on the other hand, the electric field is not sufficient for successful propagation of the action potential, and the existence of gap junctions is necessary. anthrax is a disease of animals and humans, caused by the bacterium bacillus anthracis. anthrax toxin (at) consists of three proteins, one of which is the anthrax lethal factor (alf). alf is a gluzincin zn-dependent highly specific metalloprotease (~90 kda), which belongs to the m34 family of the ma clan of zinc metalloproteases. 
alf cleaves most isoforms of mitogen-activated protein kinase (mapk)-kinases (meks) close to their amino termini, leading to the inhibition of one or more signaling pathways. no data are available on the enzyme-substrate interaction at the molecular level. therefore, we performed classical molecular dynamics simulations on the alf-mkk/mek complexes in order to probe protein-substrate interactions. the simulations pinpointed specific hydrophobic as well as electrostatic alf-peptide substrate interactions and these data were exploited in the building of virtual combinatorial libraries of di- and tri-peptides using the twenty native amino acids. by applying docking simulations to the anthrax zn-metalloprotease, around 1,000 peptide substrates were virtually screened according to their binding affinity. the data suggest that complexes of alf with peptide substrates bearing arg, trp, lys and phe amino acids exhibit the highest binding affinity, providing evidence for electrostatic interactions between negatively charged residues of alf's active site and positively charged side-chains of the di/tri-peptides. new libraries of substrates were built incorporating non-protein residues, organic moieties and chelating groups. alf-substrate complexes with the best score (in terms of binding energy) are further analysed. in the present studies we designed and synthesised seven new bradykinin (bk) analogues and evaluated them in the in vivo rat uterotonic assay using a modified holton method in munsick solution on a strip of rat uterus and in the blood pressure test. we used [arg0, hyp3, thi5, 8, d-phe7]bk, the b2 antagonist of vavrek and stewart, as a model when designing our analogues. in all cases, the n-terminus of our peptides is acylated with a bulky substituent. we previously reported that acylation of the n-terminus of several known b2 antagonists with various kinds of bulky acyl groups has consistently improved their antagonistic potency in the rat blood pressure assay. 
on the other hand, our earlier results seem to suggest that the effects of acylation on the contractility of isolated rat uterus depend substantially on the chemical character and size of the acyl group, as we observed that this modification may either change the range of antagonism or even transform it into agonism. the peptides were synthesized by the solid-phase method using the fmoc strategy. the modifications proposed either preserved or increased the antagonistic potency in the rat blood pressure test. on the other hand, the seven substituents influenced the interaction with the rat uterine receptors differently and, except for one, led to a decrease of antiuterotonic activity. in both cases acylation of the n-terminus led to enhancement of antagonistic potencies. our results may be of value in the design of new b2 agonists and antagonists. the formation of amyloid fibrils has been reported for various amyloidoses. several structural models of fibrils have been proposed for the respective proteins so far. however, their common basic structures and the universal features that induce amyloid fibril formation are not known in detail. previously, we examined intermolecular interactions among several amino acid residues in barnase, which is known to form amyloid-like fibrils. based on the experimental results using a series of mutant barnases, we discovered that the interactions between hydrophobic side-chains are the most essential driving force to form the fibrils and that both intermolecular and inter-sheet interactions in the fibril maintain highly ordered molecular packing. in the present paper, we describe a novel prediction method for the core regions of various fibril-forming proteins and show the verification of the above possible structural principle. at first, we calculated the interaction score between side-chains in the antiparallel orientation of beta-strands. 
next, the peptides with the predicted sequences of fibril cores, a couple of high-scored regions with a designed turn moiety to induce a hairpin-like form, were chemically synthesized by spps. as a result, the formation of amyloid fibrils was confirmed for most of the high-scored sequences. in addition, we also applied this method to the prion protein, and we could predict 4 possible beta-strands with hetero-paired orientation. some synthetic peptides involving these strands were proved to have fibril-forming ability. thus, we have developed a novel method to predict the core regions that induce amyloid fibrils. principal factor analysis (pfa) is a very efficient way of identifying patterns in data sets even if the patterns are hard to find (e.g. in high dimensional data sets). this is the reason why the pfa method can be a powerful tool for analyzing molecular dynamics (md) trajectories. it is possible to reduce the trajectory size dramatically without losing significant structural information by applying the pfa procedure. we used this tool for the interpretation of results from molecular dynamics simulations of a model of the transcription factor nf-kb. nf-kb is a protein involved in numerous biological processes such as the regulation of the immune response, inflammation and various autoimmune diseases, and is used by many viruses, including human immunodeficiency virus (hiv), to activate transcription of their own genes. only the trajectory of the backbone atoms of nf-kb was subjected to further analysis. peptides contain many basic sites such as side chains of basic amino acid residues, oxygen and nitrogen atoms of amide groups, and terminal amino groups. these parts can interact with protons. this interaction can change the conformational behaviour of peptides and, consequently, their biological functions. the interaction becomes even stronger in the gas phase. 
in that case, the stability of the peptide chain is influenced, which may have an impact on peptide fragmentation during mass spectrometry analysis of peptide structures. in this study, we will present the interaction of a proton with the carbonyl oxygens in a model of alanine tripeptide. quantum chemical calculations employing density functional theory with the hybrid b3lyp functional and the 6-31++g** basis set were used to describe this interaction and also to find possible pathways of proton transfer among the interaction sites. two different mechanisms of proton transfer were found. the first mechanism is represented by an isomerization of the proton around the double bond of the carbonyl group. the second mechanism is based on the large conformational flexibility of the tripeptide model, where all carbonyl oxygens cooperate. the latter mechanism exhibits nearly half the energy barrier of the rate-determining step compared to the first one. we focus our attention on the situation in which methyl groups attached to the alpha atoms in the tripeptide model influence the conformational behavior. results will be presented for all four possible stereochemical configurations.
a. papakyriakou 1, p. galanakis 2, p. gazonis 2, g.a. spyroulias 2
the p53 protein is one of the most effective defensive weapons of the human body against carcinogenesis, due to its tumor suppression properties. it has been noticed, in many types of cancer, that the functions of p53 are downgraded or even suppressed, and this is due to the presence of mutated forms of p53 or to the complete absence of the protein. the suppression of p53 levels is indirectly regulated by the protein itself, which activates the expression of a gene, the oncogene mdm2 (murine double minute 2), which expresses the mdm2 protein, known as human-mdm2 or just hdm2. the hdmx protein is a homologue of hdm2 and is implicated, through various biological processes, in the suppression of p53. 
however, recent experimental evidence suggests that hdm2 and hdmx proteins are not the only ubiquitin ligases that negatively regulate p53 through the ubiquitin pathway. two recently discovered e3 ligases, cop1 and pirh2, are also proposed to promote p53 degradation. all these proteins function as e3 ligases bearing a ring finger domain. these domains are characterized by their high content of cysteines and the binding of two zn(ii) ions, while they catalyze the latter stage of protein signaling for proteolysis by the 26s proteasome, through the ubiquitin pathway. the structure variation and the stability of these ring fingers are studied through molecular dynamics simulations of 2-5 ns, and structure variations are analyzed on a structure-function correlation basis. semax is a synthetic analogue of adrenocorticotropic hormone acth 4-10. it is a nootropic agent containing seven amino acids met-glu-his-phe-pro-gly-pro without hormonal (adrenocorticotrophic) activity. semax is neuroprotective via a mechanism involving the regulation of nitric oxide (no) and lipid peroxidation. semax proved to be highly effective in abating the rise in no and restoring neurologic functioning [1] . it was found to improve intellect and memory in healthy humans. it is effective in rehabilitation of people with memory and motor disorders, parkinson's and huntington's diseases, and after cerebral stroke and head trauma [2] . to study conformation dynamics in connection with the in vivo activity of semax, the molecular dynamics method with a standard protocol was applied [3] . semax and about twenty of its analogs were studied. using the cluster analysis method, semax was found to be more labile than the various synthesized analogs (met-gln-his-phe-pro-gly-pro; gly-glu-his-phe-pro-gly-pro; lys-glu-his-phe-pro-gly-pro; glu-his-phe-pro-gly-pro; his-phe-pro-gly-pro). because of a collective degree of freedom it has one more stable configuration that is unreachable in the analogs. 
singularities of semax and its analogs were studied using 2-d and 3-d poincare maps and auto- and cross-correlation functions of a special type, in terms of the topological structure of the energy hypersurface. this work was supported by rfbr (pr. 04-04-49645), russian ministry of education and science, moscow government and crdf. rhodopsin (rd) is the only representative of g-protein coupled receptors (gpcrs) whose structure has been described with high resolution. thus, it has become the structural prototype for other gpcrs. these receptors are involved in transduction of various signals into the cell and in the actions of many hormones and neurotransmitters. about 50% of all drugs act through gpcrs. growing evidence that rd and related gpcrs form functional dimers/oligomers, followed by direct proof (using atomic force microscopy, afm) that in the retina rd associates into a paracrystalline network of rows of dimers, calls for models of the rd-transducin (gt, a heterotrimeric protein) complex that would envision an optimal rd dimer/oligomer able to satisfy all well documented interactions with gt. the current model includes a tetramer built of two activated (metaii) and two inactive rd molecules, ligands stabilising metaii: gtα(ile338-phe350) and gtγ(asp60-cys71)farnesyl, a lipid bilayer built of 36 pc (phosphatidylcholine head groups), 6 ps (phosphatidylserine) and 30 pe (phosphatidylethanolamine) (all three types of phospholipids contain the polyunsaturated docosahexaenoyl chain, dha) and water. experimental data concerning the shape of the oligomer, conformational changes in metaii, proper interactions and distances among residues have been taken into account. the poster shows results of the molecular dynamics simulation carried out in the amber force field for ~6000 ps in a periodic box. conformational changes which took place during the simulation caused proper mutual adaptation of the monomers in the tetramer and of the ligands to the activated receptors. the human cystatin c (hcc) is one of the known domain-swapping proteins. 
during this process, one of the hcc β-hairpins (β2-l1-β3) changes its conformation, forming a long β-strand. this conformational transition destabilizes the monomer structure and leads to a domain-swapped dimer. the causative force for changing the β-hairpin conformation is assumed to be the alleviation of distortions of the backbone of the l1-loop val57 amino acid residue. following the above assumption and our previous conformational studies of the hcc β-hairpin peptide, we investigated the influence of the point mutations v57d, v57p and v57n of the val57 residue on the β-hairpin peptide structure. the conformational studies were performed by means of cd spectroscopy and molecular dynamics. the study revealed that the hcc peptide with the wild-type sequence has the strongest tendency of all studied peptides to form a β-hairpin structure. on the basis of these results we conclude that the presence of distortions in the val residue of the l1-loop is unlikely to cause the 3d domain swapping of the human cystatin c. acknowledgments: this work was financially supported by the ministry of scientific research and information technology of poland under grant 1t09a10430. temporin a (ta) (flpligrvlsgil-nh2) and temporin l (tl) (fvqwfskflgril-nh2) are small, basic, hydrophobic, linear antimicrobial peptide amides found in the skin of the european red frog, rana temporaria. these peptides have variable antibiotic activities against a broad spectrum of microorganisms, including clinically important methicillin-sensitive and resistant staphylococcus aureus as well as vancomycin-resistant enterococcus faecium strains. to gain further insight into the mechanism of action of these small antimicrobial peptides, we have investigated their conformational behaviour in different environmental conditions. 
more specifically, we investigated them in depth by solution nmr spectroscopy, using water and water/dmso (8:2) as isotropic solutions and a 200 mm aqueous solution of dpc (dodecylphosphocholine) as a membrane-mimetic environment. understanding the basis of the interactions of temporins with membranes could be crucial for the design and synthesis of potent antimicrobial agents. cripto is the founding member of a family of soluble and cell bound growth factors known as egf-cfc [1] distinguished by the presence of an n-terminal signal peptide, two distinct cysteine-rich domains (crd) and a c-terminal hydrophobic region involved in cell surface attachment by a post-translational gpi modification. the characteristic crds, known as egf-like and cfc domains (from the first members cripto, frl1 and cryptic), both span about 40 residues with 3 disulfide bridges [2] each, which, presumably, besides a possible functional modularity, also confer structural independence on them. in this work we have focused our attention on the cfc domain of mouse cripto. the domain has been produced by spps, along with variants bearing mutations at h104 and w107, which have been described as crucial for alk4 receptor recognition. the two variants have been purified and refolded, achieving the correct disulfide bridges, and then comparatively analyzed by cd spectroscopy under different ph conditions, thus obtaining experimental insights on the structural arrangements of this new class of protein domains. furthermore, the binding properties of wild-type and mutant cfc domains to the alk4 receptor have been determined by using an elisa-based assay. our results demonstrated that the cfc domain alone can directly bind alk4 in the absence of additional ligands and, furthermore, confirmed a role of h104/w107 in the cripto/alk4 interaction. 
there is considerable interest in the pharmacology of the two cholecystokinin (cck) receptors ccka-r (or cck-1) and cckb-r (or cck-2) that mediate the biological action of the cck hormone. they are membrane receptors belonging to the superfamily of g-protein coupled receptors (gpcr) and are predominantly located in the gastrointestinal tract and in the central nervous system, respectively. a library of 14 cyclic peptide analogues derived from the octapeptide c-terminus sequence of the human cholecystokinin hormone [cck(26-33), or cck8] has been designed, synthesised and characterised. the 14 peptide analogues have been rationally designed to specifically interact with the cck type b receptor (cckb-r) on the basis of the structure [2] of the bimolecular complex between cck8 and the third extracellular loop of cckb-r [namely, cckb-r(352-379)]. the new ligands showed binding affinities generally lower than that of the parent cck8. nevertheless, structure-activity relationship data underline that preservation of the trp30-met31 motif is essential, and that the phe33 side chain and a carboxylic group close to the c-terminal end must both be present. the nmr conformational study in dpc micelles of the compound endowed with maximal binding affinity (cyclo-b11, ic50=11 m) shows that this compound presents the turn-like conformation, centred at the trp30-met31 segment, as planned by rational design, and that such conformation is stabilised both by the cyclic constraint and by interaction with the micelle. cripto is the founding member of a family of extracellular growth factors called egf-cfc found in mouse, human, chicken, xenopus and zebrafish [1] . these proteins are characterized by the presence of an n-terminal signal peptide, a c-terminal hydrophobic region and two highly conserved cysteine-rich domains, the egf-like (epidermal growth factor) and the cfc (cripto/frl1/cryptic). 
cripto is strictly required in early embryonic development and contributes to deregulated growth of cancer cells in adults, since it is highly over-expressed in many solid carcinomas. it has been proposed that each single domain of cripto could bind different protein partners, playing different functional roles [2] . on these grounds, investigation of the 3d structures of the single domains can also have strong functional implications. we present here an extensive conformational analysis of the mouse cfc domain (96-134 sequence) and of the w107a mutant based on nmr data. sequences have been synthesized by spps and refolded, reconstituting the correct disulfide bridges [3] . the molecular models have been built by computational methods using the nmr data collected under both acidic (ph 3) and nearly physiological (ph 6) conditions. both domains show a globally extended folding with three strands linked by the three disulfide bridges and two connecting loops, in which h104 and w107, key residues in receptor binding, are exposed to the solvent. urantide, a selective antagonist. thus, we carried out a study aiming at the characterization of the conformational arrangement and affinity properties of ut extracellular segments. we measured by surface plasmon resonance (spr) technology the binding affinities of the three ligands, u-ii, urp and urantide, towards the three extracellular loops of ut. furthermore, the secondary structures of the synthetic receptor fragments in the presence of dodecylphosphocholine micelles and their interaction with ut ligands were analysed using nmr spectroscopy. spr data showed that the ec loop ii was able to recognize the ligands u-ii, urp and urantide with similar affinities, while none of these ligands was able to interact with the extracellular loop i. 
furthermore, the absence of binding of urantide, a peptide antagonist, suggested strongly that loop iii would be involved in the signal transduction process and implies that u-ii and urp, but not urantide, would bind to ut according to a common pattern. moreover, the results indicate that potent ut antagonists could be designed by producing high-affinity ligands targeting the extracellular loop ii. also, the spr and nmr studies revealed that the synthetic structural ut domains contained some of the conformational and chemical features essential for the binding of hu-ii, urp and urantide to hut. synthetic cysteine-rich replicates of naturally occurring peptides such as hormones, neurotransmitters, enzyme inhibitors, defensins and toxins often can be oxidatively folded in high yields to their native structure. the presence of identical cysteine patterns in the sequence was found to lead to identical disulfide connectivities and homologous spatial structures despite significant variability in the non-cysteine positions. therefore, it is generally accepted practice to assign disulfide connectivities on the basis of the homology of the cysteine pattern. minicollagen-1 from the nematocysts of hydra is a trimeric protein containing n- and c-terminal cysteine-rich domains involved in the assembly of an intermolecular disulfide network. examination of three-dimensional structures of peptides corresponding to these folded domains by nmr spectroscopy revealed a remarkable exception to the generally accepted rule [1] . despite an identical cysteine pattern, they form different disulfide bridges and exhibit distinctly different folds. additionally, comparative analysis of the oxidative folding revealed for the c-terminal domain a fast and highly cooperative formation of a single disulfide isomer, with the n-terminal domain proceeding mainly via an intermediate that results from the fast quasi-stochastic disulfide formation according to the proximity rule. 
to our knowledge, this is the first case where two short peptides with identical cysteine pattern fold uniquely and with high yields into defined, but differing, structures. therefore, these cysteine-rich domains may well represent ideal targets for structure calculations to learn more about the elementary information encoded in such primordial molecules. the conformational change of the cellular prion protein, prpc, to its virulent "scrapie" form, prpsc, is believed to be responsible for prion infectivity, and several studies suggest that the prion disulfide bond is important for the stability, structure, and propagation of prion oligomers. to test this hypothesis, we selected two conserved peptides flanking the disulfide bond in the sheep prion protein, and measured the secondary structure of these peptides with circular dichroism, hydrogen/deuterium exchange, and molecular dynamics simulations. our preliminary data suggest that the two peptides do not adopt stable secondary structure, native or otherwise. thus, the folding intermediate of a prion protein seems unlikely to comprise local structure around the disulfide bond. the conformationally labile cα-tetrasubstituted α-amino acid residue bip possesses non-isolable (r) and (s) atropoisomers. we have previously reported that in the linear dipeptides boc-bip-α-xaa*-ome with α-xaa* = ala, val, leu, phe, (αme)val and (αme)leu residues at the c-terminal position of bip, the onset of an equilibrium between diastereomeric conformers with unequal populations could be observed by cd and 1h nmr. the phenomenon of induced circular dichroism (icd) represents the basis for the "bip method", an easy and fast configurational assignment for chiral α-amino acids. in search of an extension of the bip method, we investigated the boc-bip-β-xaa*-ome dipeptide series with β-xaa* = β3-hala, β3-hval, β3-hleu, β3-hpro, β3-hphe, or the cyclic β2,3-amino acids (1s,2s)/(1r,2r)-achc and (1s,2s)/(1r,2r)-acpc. 
low-temperature (233 k) 1h nmr spectra in cd3od revealed the presence of two conformers. significant d.r. (diastereomeric ratio) values were observed for all combinations of bip with both β3- and cyclic β2,3-amino acids. cd analysis in meoh solution of the boc-bip-β-xaa*-ome dipeptides allowed us to conclude that the cd resulting from the induced axial chirality in the biphenyl core of the bip residue gives clear information on the β-xaa* configuration for both β3- and cyclic β2,3-amino acids (except the aromatic β3-hphe), with a p torsion of the biphenyl axial bond of bip being preferentially induced by (l)-β3-xaa* as well as cyclic (1s,2s)-β2,3-xaa* c-terminal residues. we have recently reported that the induced circular dichroism (icd) of the biphenyl core of boc-bip-xaa*-ome dipeptides based on the conformationally labile cα-tetrasubstituted α-amino acid residue bip could allow an easy and fast configurational assignment for both α- and β-xaa* amino acid residues. in search of other biphenyl/xaa* architectures in which a transfer of central to axial chirality could result in a potentially useful icd, we considered n-substituted 6,7-dihydro-5h-dibenz[c,e]azepine (daz) derivatives from α- and β-amino acids as interesting candidates. in the present communication, we report the syntheses, and the 1h nmr and cd analyses of a series of (daz)xaa*-ome amino esters derived from α-, β3-, and cyclic β2,3-xaa* residues, namely dβ-peptide molecules possess interesting conformational characteristics and biological properties. they may represent a new class of rigid foldamers potentially useful as templates or spacers. 3d-structures of β-peptides have been experimentally investigated using x-ray diffraction and various spectroscopic techniques, but they have never been doubly spin labelled and studied by epr. 
a terminally protected β-hexapeptide, based on trans-(3r,4s)-β-toac and trans-(1s,2s)-achc, synthesized using classical solution methods, was found by ft-ir absorption and cd techniques to adopt the 3-14-helical conformation. a set of four, terminally blocked, hexapeptide sequences, each characterized by four strongly helicogenic aib residues and all combinations of the two isomeric ile/allo-ile residues at positions 2 and 5 was synthesized by solution methods and fully characterized. a detailed solution (by ft-ir absorption, nmr, and cd) and solid (crystalline)-state (by cd and x-ray diffraction) conformational investigation allowed us to validate our assumption that all four peptides are folded in well developed 3-10-helical structures. however, the most relevant conformational conclusion extracted from this 3d-analysis is that the handedness of the 3-10-helical structures formed does not seem to be sensitive to the configurational change at the β-carbon atom of the constituent ile versus the diastereomeric allo-ile residues (in other words, the dominant control on this important structural parameter appears to be exerted by the chirality of the amino acid α-carbon atom). these results complement published findings on the diverging relative stabilities of the intermolecularly h-bonded β-sheet structures generated by ile versus allo-ile homo-oligopeptides. taken together, these data represent an experimental proof for the intuitive view that potentially different conformational properties are magnified in a strongly self-aggregated homo-peptide system (as compared to weakly self-aggregated, helical, host-guest peptides such as those investigated in this work). in a first approach to β-sandwich proteins the hydrophobic core between two symmetrical sheets each with four antiparallel β-strands was computationally designed by packing of amino acid side chain conformations (rotamers) in an initially given backbone structure. 
the proteins were synthesized by coupling four peptides with β-hairpin structure to a cyclic decapeptide template (tasp). an aggregation observed by equilibrium ultracentrifugation with the first designed proteins was decreased to a dimer by increasing the surface charge in two further variants of this protein from -1 to +3 and +5. replacement of l-pro by d-pro in the loops and the template proved to stabilize the β-structure. these results led us to an improved design of an asymmetric core with algorithms for selection of proteins with a minimal number of atom clashes and cavities in the core, and a maximum number of hydrogen bonds after energy minimization. this protein, termed beta-mop (modular organized protein), was synthesized in amounts sufficient to allow a characterization by cd, ftir, tryptophan fluorescence during reversible unfolding, and by high resolution nmr. nmr measurements of diffusion indicate a dimeric structure. the β-structure is stable up to 80 °c (353 k) as determined by 1d 1h nmr showing sharp resonance lines. the 2d 1h,1h dqf-cosy spectrum at 750 mhz shows a typical β-sheet distribution extending well into the characteristic regions >8.5 ppm (for amide protons) and >5.0 ppm (for hα signals). all data indicate a well folded protein with β-structure. a.s. galanis 1 , z. spyranti 1 , n. tsami 1 , g.a. spyroulias 1 , e. manessi-zoupa 2 , g. pairas 1 , i.p. gerothanassis 3 , p. cordopatis 1 angiotensin-i converting enzyme (ace) belongs to the m2 family of the ma clan of zinc metallopeptidases and can act either as a dipeptidyl carboxypeptidase, or as an endopeptidase. among the ace peptide substrates, the most distinguished are angiotensin i (angi) and bradykinin (bk) due to their role in blood pressure regulation. 
despite the fact that biological data strongly suggest that the two active sites exhibit different selectivity and activity towards physiological and exogenous substrates, no experimental evidence for the interaction of angi and bk with ace catalytic sites is available so far. a dual approach for studying the structure and physicochemical determinants of the ace-angi/bk interaction has been performed. the first involves the application of molecular dynamics simulations (presented elsewhere in this book) and the second, presented herein, makes use of the solid-phase synthesized 36-46 aa ace catalytic site maquettes (csm) bearing the native sequence and the application of nmr spectroscopy. therefore, high-resolution multinuclear nmr spectroscopy was applied to analyze the conformational features of the ace substrates angi and bk in dmso or aqueous mixtures. then titration experiments were conducted in which ace csms were titrated with angi/bk peptides while monitored by nmr. 2d 1h-1h tocsy and noesy experiments were used in order to map the interaction site of both substrates and csm through chemical shift perturbation and comparison of noe signal differentiation. competitive binding studies were also carried out through titration studies of csm-angi/bk and known ace inhibitors. a. carotenuto 1 , p. grieco 1 , l. auriemma 1 , e. novellino 1 , v.j. hruby 2 the melanocortin receptors are involved in many physiological functions, including pigmentation, sexual function, feeding behavior, and energy homeostasis, making them potential targets to treat obesity, sexual dysfunction, etc. understanding the conformational basis of the receptor-ligand interactions is crucial for the design of potent and selective ligands for these receptors. 
the conformational preferences of the cyclic melanocortin agonists and antagonists mtii, shu9119, [pro6]mtii, and pg911 (table 1) when two chromophores are chirally oriented and close enough to one another in space, their excited states couple and become non-degenerate. this phenomenon, termed exciton coupling, produces a typical bisignate cd curve. the intensity of the cd couplet is dependent on the molar extinction coefficient and the distance between the interacting chromophoric moieties, while the sign is governed by the angle between the effective electron transition moments. in particular, exciton coupling over a long distance can be observed only with strongly absorbing chromophores, e.g. porphyrin derivatives, characterized by their extremely intense and sharp soret band near 415 nm. in this work we examined by the exciton coupled cd method the combined distance and angular dependencies, generated by the seven conformationally restricted β-turn and 3-10-helical spacer peptides -l-ala-[l-(αme)val]n-(n = 1-7) on a system formed by two intramolecularly interacting 5-carbamido-5,10,15,20-tetraphenylporphyrin chromophores. these porphyrin derivatives are confirmed to be excellent reporter groups. we find that not only the center-to-center separation (from 19 to 34 å) between the two chromophores, but the orientation (roughly parallel or perpendicular) between the directions of their effective transition moments as well, are responsible for the onset or even for the modulation of the intensity of the exciton coupling phenomenon. in particular, the porphyrin…porphyrin interaction is still clearly detectable over the long distance of ca. 30 å when the two chromophores are approximately perpendicularly oriented. a. hetényi 1 , g.k. tóth 2 , c. somlai 2 , t.a. martinek 1 , f. fülöp 1 β-peptides are probably the most thoroughly investigated peptidomimetic oligomers. 
to extend the field of β-peptides towards the construction of possible new secondary structures, the replacement of the cα and cβ atoms of the β-amino acid with heteroatoms could be an attractive modification, for example replacement of the cβ atom of β-peptides by an nr moiety, leading to hydrazine peptides. in the literature, there are only a few studies [1] [2] [3] about hydrazine peptides, and hydrazine peptides with cyclic side chains have not been studied yet. in order to determine the secondary structure preference of 1-amino-pyrrolidine-2s-carboxylic acid homo-oligomers (figure 1 ), their potential energy hypersurface was probed at the ab initio b3lyp/6-311g** level. the calculations predicted the 8-strand to be the most stable structure. the hydrazino-peptides in question were synthesized on solid support, and their structures were characterized by nmr and cd methods. the results were found to be in good accordance with the 8-strand structure. cathepsin c [ec 3.4.14.1](1), which belongs to the family of cysteine proteases, catalyzes hydrolysis of n-terminal peptides, preferentially gly-phe. this enzyme may play a part in chronic airway diseases (2) . an increased level of the enzyme was also found in cancer, rheumatism and muscular dystrophy (3) (4) (5) . for this reason we have undertaken investigations of peptides containing two dehydroamino acid residues, which could act as alkylating inhibitors of this enzyme. to define the structure and conformation of the investigated peptides we used different methods of nmr spectroscopy, including standard 1d experiments, proton-proton correlations, proton-carbon correlations, and 2d noe experiments. to complete the structural research, computational chemistry methods were used. in order to predict the biological activity of the investigated peptides, the docking of these peptides to the enzyme active site was simulated and then correlated with the results of enzymatic tests. 
the obtained results suggest that the investigated peptides containing two ∆phe residues (z and e isomers respectively) have a bent conformation in solution, which is stabilized by intermolecular hydrogen bonds. these results are confirmed by the results of theoretical calculations. simulation of the docking process also showed two possible peptide orientations in the active site of cathepsin c and allowed a rational interpretation of the biological test results. turns are important elements of secondary structure in peptides and proteins. different types of turns are distinguished according to the number of residues involved. the most abundant is the β-turn, which involves four consecutive amino acids with the co at position i hydrogen-bonded to the i+3 nh. the γ-turn is centred at a single residue and is generally stabilized by a hydrogen bond between the i co and the i+2 nh. model dipeptides rco-l-pro-xaa-nhr' are the smallest systems able to adopt the β-turn conformation, which is favoured by the presence of proline at i+1. a peptide of this series, incorporating a cyclopropane amino acid (xaa), has been shown to accommodate two consecutive γ-turns in the solid state [1] , instead of the expected β-turn conformation. the double γ-turn encountered is unique among crystalline short linear peptides. in fact, the γ-turn is observed almost exclusively in low-polarity solvents, and only a few oligopeptides of cyclic structure exhibit a γ-turn in the crystal. this is the first time that the strong tendency of pro-xaa dipeptides to adopt a β-turn in the solid state has been switched to the γ-turn. theoretical calculations [2] also show the high preference of this cyclopropane amino acid for the γ-turn conformation. oxidative stress plays an important part in the development of cardiovascular disease (cvd). haptoglobin is a hemoglobin-binding protein that has a major role in providing protection against heme-driven oxidative stress. 
there are two common alleles for haptoglobin (1 and 2), and the three phenotypes, haptoglobin 1-1, haptoglobin 2-1, and haptoglobin 2-2, differ in their ability to function as antioxidants. we determined whether there was a relation between the haptoglobin phenotype and the development of coronary artery diseases. haptoglobin (hp) phenotypes were determined in iranian patients with coronary artery diseases. we performed haptoglobin (hp) genotyping by polymerase chain reaction (pcr) using allele-specific primer pairs. in multivariate analyses controlling for conventional cvd risk factors, haptoglobin phenotype was a highly statistically significant, independent predictor of cvd. the odds of having cvd in patients with the haptoglobin 2-2 phenotype were 5.0 times greater than in patients with the haptoglobin 1-1 phenotype. an intermediate risk of cvd was associated with the haptoglobin 2-1 phenotype. these results suggest that haptoglobin phenotype is an important risk factor in determining susceptibility to cardiovascular disease, which may be mediated by the decreased antioxidant and anti-inflammatory actions of the haptoglobin 2 allelic protein product. the special feature of proteins involved in alzheimer's or prion diseases is their ability to adopt at least two different stable conformations. the conformational transition that shifts the equilibrium from the functional to the pathological isoform can happen sporadically. it can also be triggered by mutations in the primary structure, changes of different environmental conditions, or the action of chaperones. elucidation of the molecular interactions that occur during the transformation from α-helix to β-sheet and the consecutive formation of amyloids on a molecular level is still a challenge. therefore, the development of small peptide models that can serve as tools for such studies is of paramount importance. 
we succeeded in generating model peptides that, without changes in their primary structure, predictably react to changes of diverse environmental parameters by adopting different defined secondary structures. these de novo designed peptides strictly follow the characteristic heptad repeat of the α-helical coiled coil structural motif. furthermore, domains that favour β-sheet formation and aggregation can be generated [1] . alternatively, those peptides can be equipped with functionalities that allow either the binding of metal ions or the interaction with membranes. as proof of our concept we showed that the resulting secondary structure of such peptides will strongly depend on environmental parameters. thus, this system allows systematic study of the interplay between peptide / protein primary structure and environmental factors for peptide and protein folding on a molecular level. the pathogenesis of alzheimer's disease (ad) is strongly linked to neurotoxic assemblies of the amyloid β protein (aβ). aβ is a soluble component of human plasma which by an unknown mechanism becomes aggregated and neurotoxic. some genetic mutations within the aβ sequence cause very early onset of ad-like diseases, probably by facilitation of aβ assembly into neurotoxic species. recently, it was found that not amyloid fibrils, but smaller aβ assemblies initiate a pathogenic cascade resulting in ad. therefore, preventing the folding of nascent aβ monomers would have therapeutic benefit. to uncover details of structural changes accompanying the aggregation process, especially its initial stage, we have decided to study the aβ(11-28) fragment and its mutation-related variants. our recent studies on this aβ fragment using the cd method and the aggregation test have proved it a good model for structural studies. the obtained results confirmed that the aggregation process follows the scheme with an α-helical intermediate and pointed out differences in the behaviour of aβ variants. 
to further confirm the scheme of the structural changes accompanying aggregation we have applied ftir spectroscopy and analysed aggregation-induced changes of the amide i band which is directly related to peptide backbone conformations. the ftir spectra analysis indicates that the conformational changes provoked by water addition are strongly dependent on the aβ(11-28) variant and in some cases the formation of the α-helical intermediate seems to be preceded by 3₁₀ helix formation. to verify this hypothesis the temperature dependent atr ftir spectra will be analysed. supported by ug bw grant. amino acid octarepeats present in the prion protein bind to cu²⁺ and are considered potential periplasmic copper ion transporters. this octarepeat is located in the unstructured region of the prion protein, which is supposedly not intricately involved in prion aggregation. our group is involved in exploring the function of octarepeats with a special emphasis on their possible role in amyloid fibril formation and aggregation.1 in this context, we have prepared truncated peptide constructs derived from the prion protein octarepeat phgggwgq and have reported their fibrillation activity. we will present the aggregative behavior of a truncated bis-pentapeptide, containing the gggwg segment, when tethered with a flexible diaminobutane linker. fibrillar architectures were observed for this bis conjugate after incubation in water, which were probed by different microscopic and steady state fluorescence techniques. further investigations with ki revealed a homogeneous environment of the two tryptophan moieties in the conjugate. in the absence of other side-chains, it is likely that fibril formation involves hydrophobic interaction between tryptophan indole moieties and main chain backbone interactions. interestingly, a facilitator role for aromatic-glycine motifs for amyloid aggregation has been proposed based on bioinformatics search of the swiss-prot and trembl databases. 
collagens are known to fold into a highly ordered rod-shaped triple helix with stretches of lower and higher suprastructural stability and even disruptions to modulate recognition by other proteins that interact with the extracellular matrix [1] . to increase understanding of folding and stability of the collagen triple helix, we have addressed the design of photocontrolled collagenous peptides. our aim was to crosslink two side chains of the repetitive (xaa-yaa-gly) sequence motifs of collagen model compounds via an azobenzene chromophore in analogy to our previous studies on photomodulation of the conformational preferences of cyclic peptides and more recently of hairpin-peptide model systems [2] . molecular modeling experiments suggested appropriate sequence positions that could result in triple-helical peptides with conformational stabilities that can be modulated by cis/trans isomerization of the azobenzene moieties. as a light-switchable crosslinker, azobenzene-4,4'-n-(4-iodo-2-butynenyl)carboxyamide was synthesized for reaction with two (4s)-mercaptoproline residues placed in suitable xaa and yaa positions, respectively. by this approach a fully folded triple helix was obtained upon thermal relaxation, and unfolding was induced by irradiation at 350 nm. the favorable optical properties of the azobenzene derivative together with the regular suprastructure resulted in a valuable model system that allows for ultrafast time-resolved studies of collagen folding and unfolding. amyloid formation is connected with alzheimer's disease, parkinson's disease and finnish familial amyloidosis. after protein misfolding, short peptide sequences act as "hot spots" providing the driving force for protein aggregation in amyloid fibrils. we have identified one of these sequence stretches in the abl-sh3 domain of drosophila (dlsfmkge) whereas the human homologous region (dlsfkkge) is predicted to be less amyloidogenic. 
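the two octapeptides quoted above differ at a single position, which is easy to overlook when reading the lowercase sequences. as a minimal sketch (the function name `compare_sequences` is illustrative and not part of the original study), the difference can be located by a position-wise comparison:

```python
def compare_sequences(seq_a, seq_b):
    """Compare two equal-length peptide sequences position by position.

    Returns the fractional identity and a list of substitutions written as
    (1-based position, residue in seq_a, residue in seq_b).
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    substitutions = [
        (i + 1, a, b)
        for i, (a, b) in enumerate(zip(seq_a.upper(), seq_b.upper()))
        if a != b
    ]
    identity = 1.0 - len(substitutions) / len(seq_a)
    return identity, substitutions


# Drosophila stretch versus the homologous human region, as given in the text.
identity, substitutions = compare_sequences("dlsfmkge", "dlsfkkge")
print(identity, substitutions)
```

the comparison only reports where the sequences differ (methionine versus lysine at position 5); it says nothing by itself about amyloid propensity, which the abstract addresses by prediction and simulation.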
the possible reason for the difference of amyloid formation propensities of the two peptides was investigated by molecular dynamics (md) of β-sheet structures. the antiparallel alanine β-sheets consisting of two and ten strands were constructed, minimized, and mutated to the sequences dlsfmkge and dlsfkkge. all four systems: 1) dlsfmkge -two strands, 2) dlsfkkge -two strands, 3) dlsfmkge -ten strands, 4) dlsfkkge -ten strands, were surrounded by 10 å layer of water molecules over the solute and subjected to md, amber 8.0 force field, ntp protocol. the md runs were started at the temperature of 10 k and the temperature was elevated stepwise by 10 degrees till 300 k. the results show considerably higher hydrogen bond percentage for dlsfmkge than that one for dlsfkkge during the course of the simulation, thus suggesting that dlsfmkge is a potential fibril-maker, but dlsfkkge is not. two strand β-sheet systems were stable until 170 k. the ten strand β-sheets are more stable. angiotensin-i converting enzyme (ace) has a critical role in cardiovascular function, which consists of cleaving the carboxy terminal his-leu dipeptide from angiotensin-i producing a potent vasopressor octapeptide, angiotensin-ii. there are two isoforms of ace. the somatic isoform is present in all human cells except the testis cells, where the testicular isoform is produced. the major difference between these two types is that, the somatic form has two active sites, at the n-and c-end respectively while the testicular has only one, which is almost identical to the somatic c-terminal active site. 
here we report the structural study of a 108aa peptide (previously expressed in bacteria), which corresponds to an extended domain of the human somatic n-terminal active site of ace (ala361-gly468) by circular dichroism experiments, and the overexpression in bacteria, purification and structural study, using circular dichroism techniques, of a 108aa peptide which corresponds to an extended domain of the human somatic c-terminal active site of ace (ala958-gly1065). following the subcloning into an appropriate expression vector and the expression, the peptide was isolated from the inclusion bodies using chromatography techniques. the recombinant protein fragment had a molecular weight, measured by esi-ms, of 12102 da, which was consistent with the theoretical calculation based on the dna sequence. the recombinant peptides acquired their theoretically calculated secondary structure only when 1,1,1-trifluoroethanol was present at a concentration of ~70%. in order to elucidate their structures, solutions of these peptides, labeled with 15n and/or 13c, will be studied by nmr spectroscopy. aggregation of peptides is believed to trigger various degenerative diseases but it also plays an important role for the preparation of peptide fibres and peptide-based biomaterials. it is therefore extremely important to understand the mechanisms involved in peptide aggregation and be able to control them. studies were performed on a library of amphiphilic peptides, designed around the sequence of a model antimicrobial peptide rich in leucine and lysine. the library also included peptide hybrids in which natural amino acids were replaced by non-proteinogenic omega-amino acids, such as 6-aminohexanoic acid and 9-aminononanoic acid. the aim was to estimate the aggregation and its correlation with the biological activity by using a fluorescence technique commonly employed to calculate the cmc (critical micelle concentration) of surfactants. 
peptides and peptide hybrids were synthesized on solid support using the fmoc polyamide protocol. they were purified by semi-preparative rp-hplc and characterized by esi-ms and analytical rp-hplc. the aggregation behaviour of the synthesized molecules was investigated in water by steady-state fluorescence measurements using pyrene as fluorescent probe. peptides were dissolved in water/pyrene or water/pyrene/0. fluorescence spectroscopy has become an extremely valuable technique for conformational studies of biopolymers, the development of peptide-based chemosensors, and biochemical research in general. in this connection, synthetic amino acids as fluorescent probes to be incorporated into a peptide chain may exhibit significant advantages over the related protein (trp and tyr) residues in terms of potentially different and ameliorated properties. we recently designed and prepared a new fluorescent amino acid, antaib, based on a planar anthracene core and belonging to the class of achiral, ciα↔ciα cyclized, cα-tetrasubstituted α-amino acids (strong β-turn and helix inducers in peptides). peptides based on antaib combined with (l)-ala residues were synthesized and subjected to a conformational analysis. more specifically, the protected derivatives boc-antaib-oet (oh) and fmoc-antaib-otbu (oh) were prepared in seven steps from 1,2, 4 conformational transitions in peptides and proteins emerge to play the major role in the genesis and evolution of prion related diseases and alzheimer's disease (ad).1 in this context, conditions influencing this transition and the following aggregation process are of paramount interest. peptides and proteins that are involved in aggregation processes contain potential metal binding sites. the concentration of metal ions in the brain tissue is naturally high and zn in the mm range has been found in ad amyloid plaques. 
thus, it is widely accepted that metal complexation is one of the key incidents that lead to conformational transitions and aggregation. we present here a coiled coil based model peptide system with an intrinsic amyloid forming tendency which can be used to study the impact of different metal ions on secondary structure and aggregation. metal complexing histidine residues were incorporated to create potential binding sites which, depending on their position and the nature of the metal ion, dictate folding and aggregation. the time dependent conformational transition was monitored by cd-spectroscopy. aggregates were characterized by cryo tem. high resolution fticr-ms experiments revealed information on the stoichiometry of the peptide-metal complexes. in the absence of metal ions the presented peptides formed amyloids in a time range of weeks. depending on the his positions and milieu conditions, the nature of the metal ion determines folding and aggregate morphology. furthermore, metal binding was shown to inhibit the amyloid formation. a challenge to our understanding of protein folding is the design of a protein from first principle, i.e. starting from geometric restraints and applying properties of amino acids expected to be essential for folding to a defined structure. we developed a program to calculate the backbone coordinates of antiparallel strands to match the surface of an elliptical cylinder. various parameters like the number and shearing of the strands and the ellipticity of the structure can be varied. the relative orientation of the β-strands and the geometrical features of the hydrogen-bonds were derived from statistical analyzes of natural β-sheet structures. iterative cycles of core-packing with amino acid rotamers, molecular dynamics simulation and energy minimization with the charmm forcefield are used to include backbone movements and to minimize the risk of trapping an energetically unfavorable structure. 
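the geometric step of the design described above, placing antiparallel strands on the surface of an elliptical cylinder, can be illustrated with a short sketch. this is not the authors' program: the even spacing of strands along the perimeter, the 3.3 å rise per residue, and the use of a single tilt angle to stand in for strand shearing are all simplifying assumptions, and the function names are hypothetical.

```python
import math


def ellipse_arc_table(a, b, n=3600):
    """Tabulate cumulative arc length of x = a*cos(t), y = b*sin(t)."""
    ts = [2.0 * math.pi * i / n for i in range(n + 1)]
    s = [0.0]
    for i in range(1, n + 1):
        dx = a * (math.cos(ts[i]) - math.cos(ts[i - 1]))
        dy = b * (math.sin(ts[i]) - math.sin(ts[i - 1]))
        s.append(s[-1] + math.hypot(dx, dy))
    return ts, s


def angle_at_arc(ts, s, target):
    """Invert the arc-length table by bisection and linear interpolation."""
    target %= s[-1]  # wrap around the perimeter
    lo, hi = 0, len(s) - 1
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if s[mid] <= target:
            lo = mid
        else:
            hi = mid
    frac = (target - s[lo]) / (s[hi] - s[lo])
    return ts[lo] + frac * (ts[hi] - ts[lo])


def barrel_ca_trace(n_strands, n_res, a, b, tilt_deg=0.0, rise=3.3):
    """Return approximate C-alpha positions, one list of (x, y, z) per strand.

    a, b     -- semi-axes of the elliptical cross-section (angstrom)
    tilt_deg -- assumed strand tilt relative to the cylinder axis
    rise     -- assumed distance between consecutive residues (angstrom)
    """
    ts, s = ellipse_arc_table(a, b)
    spacing = s[-1] / n_strands  # assumption: strands evenly spaced on the perimeter
    tilt = math.radians(tilt_deg)
    strands = []
    for k in range(n_strands):
        coords = []
        for i in range(n_res):
            d = i * rise  # distance travelled along the strand
            t = angle_at_arc(ts, s, k * spacing + d * math.sin(tilt))
            coords.append((a * math.cos(t), b * math.sin(t), d * math.cos(tilt)))
        if k % 2 == 1:
            coords.reverse()  # antiparallel: odd strands run the other way
        strands.append(coords)
    return strands


strands = barrel_ca_trace(n_strands=8, n_res=6, a=9.0, b=6.0, tilt_deg=20.0)
```

in the work described, the relative strand orientation and hydrogen-bond geometry were derived from statistics on natural β-sheets; the sketch replaces those with fixed numbers only to show how the number of strands, the shear and the ellipticity enter as parameters.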
the quality of residue packing is assessed with the help of criteria which proved to be successful in our design of β-sandwiches. at the protein surface, a network of salt-bridges with an excess of positive charges has been designed to increase the stability and the solubility of the protein. the final sequence is synthesized by standard solid phase fmoc-chemistry. insights gained from the analysis of the synthesized structure with ftir and cd spectrometry should help us to refine the parameters for subsequent designs. with this strategy, we hope to contribute to a better understanding of protein folding. the immunoglobulin binding protein g (1igd) from streptococcus species consists of 61 amino acid residues, which form two antiparallel-packed beta-hairpins and an alpha-helix in the middle of the sequence packed to the beta-sheet. the second hairpin was found to be stable in isolation. this fragment is therefore likely to be the first folding initiation site of the protein which could provide an adequate nucleation center on which the rest of the polypeptide chain would find a favorable environment to fold. thus, among the two beta-hairpins, the 48-59 fragment of 1igd corresponding to the c-terminal beta-hairpin was synthesized. in our studies, we investigated different environmental and temperature conditions for formation of the 48-59 beta-hairpin structure. its structure was examined by means of cd spectroscopy in water, buffer solutions (ph = 3 to 9) and in aqueous solutions of trifluoroethanol. additionally, its structure was investigated in the solid state by ftir spectroscopy. the cd studies revealed that the 48-59 fragment of 1igd in water forms mainly a statistical-coil structure, whereas the ftir technique shows formation of a regular beta-sheet structure. nmr spectroscopy and calorimetric measurements were carried out at various temperatures. 
our studies show that the 48-59 fragment at low temperature exists in an equilibrium between two conformations: a regular beta-hairpin and a statistical-coil. although increasing temperature resulted in shifting the equilibrium in the direction of the statistical-coil structure, the overall beta-hairpin shape of the 48-59 fragment was maintained. prion diseases are characterized by the conversion of the physiological cellular form of the prion protein (prpc) into an insoluble, protease-resistant abnormal scrapie form (prpsc) with a high beta-sheet content [1] . the analyses of intrinsic structural propensities of the prp c-terminal domain showed a high conformational flexibility for the α-helix 2 fragment which indicates that this region may be particularly important in the prpc→prpsc transition [2] . therefore conformation-based approaches focused on the helix 2 region appear to be the most promising for the study of prion protein misfolding. recent studies on tetracycline properties showed that this molecule binds and disrupts prp peptide aggregates and inactivates the pathogenic forms of prp [3, 4] . a fluorometric titration of the fluoresceinated peptide corresponding to prion protein helix-2 with tetracycline has been carried out to determine the value of the apparent dissociation constant of this interaction (estimated to be 189 ± 7 nm). remarkably, the fluoresceinated peptide exhibits in water a canonical α-helical cd spectrum, that is maintained even in the presence of tetracycline. accordingly, docking calculations and molecular dynamics simulations suggest that tetracycline interacts preferentially with the c-terminal end (residues 183-195) of helix-2 with a significant involvement of the threonine-rich region. in the last decades, a series of discoveries have shed light on the role played by the carbohydrate moiety in glycoproteins. 
it has been shown that covalently linked sugar moieties influence peptide/protein properties such as hydrophobicity, conformation, biostability and bioactivity. the design of carbohydrate-peptide analogs with increased, retained or modified biological activity requires an understanding of their conformational preferences both in solution and in the receptor-bound state. in our recent work we have created two classes of well-structured linear and cyclic carbohydrate-modified analogs of opioid peptides, leu-enkephalin and leuenkephalin amide. the first class represents a group of compounds in which the linear peptide is alkylated at the n-terminal position by 1-deoxy-d-fructose unit, while its cyclic analog possesses an ester bond between the c-6 hydroxy group of the sugar moiety and the c-terminal carboxy group of peptide, 1-deoxy-dfructofuranose acting as a bridge between the leu-enkephalin terminal parts. the rigid 5-membered imidazolidinone ring is characteristic for the second class of compounds. in these adducts an imidazolidinone moiety connects the acyclic sugar residue with the linear peptide chain. in the corresponding bicyclic imidazolidinone analogs 19-membered ring is formed through an ester bond between the primary hydroxyl group of the d-gluco-pentitolyl residue and the cterminus of peptide. this work reports the comparative cd and ftir spectroscopic properties of the prepared glycopeptides in comparison with data on the non-modified flexible parent peptides performed in different solvents in order to expose the structural and conformational differences caused by a keto-sugar, rigid 5membered imidazolidinone ring and/or cyclization. the α-helical coiled coil structural motif consists of two to five α-helices which are wrapped around each other with a slight superhelical twist. the simplicity and regularity of this motif have made it an attractive system to study the role of complementary interactions for protein folding. 
here we present a systematic study showing that intermolecular electrostatic interactions between positions e and g of the helices are in competition with the intramolecular interactions between positions e/b and g/c. those competitive interactions affect folding and stability of the motif which were monitored by temperature dependent cd-spectroscopy. incorporation of oppositely charged amino acids in positions e/b and g/c reduced considerably electrostatic repulsion between equally charged amino acids in positions e and g. in addition coiled coil stability can be increased by the alkyl part of the amino acid side chains in positions e and g. studies with natural and unnatural amino acids showed that the longer this alkyl part the better is the hydrophobic core protected from solvent. therefore the repulsion of equally charged amino acids in positions e and g can be overruled by involving them either into attractive intrahelical electrostatic interactions or into hydrophobic core formation. human cystatin c (hcc) is a 120 amino acid residues protein that reversibly inhibits papain-like cysteine proteases. this inhibitor belongs to the amyloidogenic proteins shown to oligomerize through 3d domain swapping mechanism. the crystal structure of hcc reveals the way the protein refolds to produce symmetric dimer while retaining the secondary structure of the monomer. the monomeric form of hcc consists of a core with a five-stranded antiparallel β-sheet wrapped around a central α-helix. the hcc dimerization is preceded by an opening movement of l1 loop from β2-l1-β3 hairpin and separation of the β1-helix-β2 fragment from the remaining part of the molecule. the amino acid sequence of β1-helix region suggests additional possible partial unfolding in the n-terminal part of helix. 
in order to investigate the structural stability of the β1-helix region, the peptide corresponding to the helix and its n-terminal truncated analogs were synthesized along with the peptide analogs of the helix containing point mutations that could stabilize the helical structure of the n-terminus. the peptides were synthesized by the solid-phase method using fmoc/tbu tactics. purified products were identified by maldi-tof. the secondary structure content was calculated from the cd spectra using selcon3. the random coil was the predominant structure of the peptide corresponding to the α-helical fragment of hcc and its n-truncated analogs. however, an increase of α-helix content was observed in some of the peptides corresponding to the helix containing point mutations. we expect that these mutations could stabilize the hcc monomer and suppress dimerization. the tumor suppressor protein p53 is a transcription factor that triggers cell-cycle arrest and apoptosis in response to genotoxic stress signals. the tetrameric structure of p53, which is essential for its activity as a transcription factor, is formed as a dimer of dimers. while the primary dimer is constructed from intermolecular formation of a two-stranded anti-parallel β-sheet and a two anti-parallel α-helix bundle, the secondary dimer is stabilized through interactions between residues on the surface of the primary dimer. from various substitution experiments on p53, it has been shown that hydrophobicity of phe341 is critical for the tetramer formation of p53. also we have substituted three phenyl groups of p53 with cyclohexylalanine (cha) and showed that phe341cha is dramatically stabilized against temperature, chemical denaturant, and organic solvent by cd measurements. here, to clarify the mechanism of the stabilization of phe341cha, we analyzed its three dimensional structure using x-ray crystallography. 
we obtained two kinds of crystals: one is a hexagonal bipyramid crystal in the space group p6422 with a=50.12 å, b=50.12 å, c=48.18 å that diffracted to about 2.0 å, and the other is a tabular crystal in the space group c2 with a=77.25 å, b=50.04 å, c=55.10 å that diffracted to about 2.1 å. in these crystals, the peptides formed tetramers which are very similar to those observed in the wild-type. the structure of the pocket where the side chain of cha341 is incorporated was defined to elucidate the hydrophobic interaction that determines the stability. helices found in proteins, as a secondary structure, almost always form a right-handed screw sense. this right-handedness is believed to result from the chiral center at the α-position of proteinogenic l-α-amino acids. among proteinogenic amino acids, l-isoleucine and l-threonine possess an additional chiral center at the side-chain β-carbon besides the α-carbon. however, no attention has been paid to how the side-chain chiral centers affect the secondary structures of their peptides. recently, we have reported that side-chain chiral centers of the chiral cyclic α,α-disubstituted amino acid (s,s)-ac(5)c(dom) affected the helical secondary structure of its peptides, and the helical-screw direction could be controlled without a chiral center at the α-carbon atom. herein, we synthesized a chiral bicyclic α,α-disubstituted amino acid (r,r)-ab(5,6)c and its homopeptides, and studied the relationship between the chiral centers and the helical-screw handedness of peptides. contrary to the left-handed helices of (s,s)-ac (5) four threefinger-toxins (tfs) have been purified from the pooled venom of golden krait (bungarus fasciatus, i.e. bf) from thailand and studied previously. these peptide toxins contain 60-65 residues and 4 or 5 pairs of disulfide bonds, and are rich in β-structure. 
we herein analyzed the tf-isoforms in bf venoms from kolkata (eastern india), hunan province (eastern china) and indonesia to study the geographic variations and structure-function relationships of the venom polypeptide family. a total of five or six tfs of low lethality were purified from each of the geographic venom samples, the n-terminal sequences and accurate masses of the peptides were determined. the cdnas encoding some of these tfs were also cloned and sequenced. full peptide sequences were deduced and match with those of the tfs purified from crude venom. intra-species variations of the venom tfs were found to be surprisingly high since sequence-identities between the majorities of orthologous toxins in different geographic samples are only 75-80 %. most of the bf proteins were not neurotoxic by electrophysiological assays using chick biventer-cervicis and mouse diaphragm neuromuscular tissues. the toxins appear to be associated with weak toxins or non-conventional snake venom tfs as analyzed by a phylogenetic tree. the reason behind their lack of neurotoxicity would be discussed. v. moussis, e. panou-pomonis, c. sakarellos, v. tsikaris peptides involved in neurodegenerative diseases can adopt at least two different stable secondary structures. amyloid-forming proteins can experience a conformational transition from the native, mostly α-helical structure, into a ß-sheet rich isoform. the latter conformation is probably present in intermediates for the formation of amyloids. the conformational change can be triggered by protein concentration or environmental changes. therefore, our aim was to generate a de novo designed peptide that contains structural elements for both, stable α-helical as well as ß-sheet formation. this model peptide can be used to elucidate the conformational changes dependent on concentration and ph.1 the design is based on the well studied α-helical coiled coil folding motif. 
the conformation and structure of the resulting aggregates were characterized by cd-spectroscopy and cryo transmission electron microscopy. as a result, three distinct secondary structures can be induced at will by adjustment of ph or peptide concentration. low concentrations at ph 4.0 yield globular particles of the unfolded peptide whereas, at the same ph but higher concentration, defined ß-sheet ribbons are formed. in contrast, at high concentrations and ph 7.4, the peptide prefers highly ordered α-helical fibers. in conclusion, we successfully generated a model peptide that, without changes in its primary structure, predictably reacts on environmental changes by adopting different defined secondary structures. thus, this system allows to systematically study now the consequences of the interplay between peptide primary structure and environmental factors for conformation on a molecular level. cells respond simultaneously to a multitude of different signals. inside a cell signals from activated receptors are integrated by networks of enzymatic reactions and molecular interactions, leading to a spectrum of cellular responses. in order to understand the relationship between a specific cellular stimulus and a cellular response, methods are required to detect in parallel the pattern of molecular interactions. a large number of molecular interactions is mediated by protein domains, binding to linear peptide motifs. lysates of activated and resting cells were incubated on peptide microarrays carrying peptides corresponding to such binding motifs of signalling domains. binding of proteins to a spot of the array was probed by immunofluorescence. jurkat t cells were stimulated either with the phosphatase inhibitor pervanadate or with antibodies directed against cell surface receptors. upon activation of t cells, numerous changes in the pattern of molecular interactions were detected for a total of 10 proteins and 30 peptides. 
these changes were caused either (i) by masking or unmasking of a binding domain which resulted in a reduced or increased binding of a protein to the microarray or (ii) by recruitment of a protein into a complex that in turn bound to the microarray. the changes were dependent on the nature of the stimulus. the human melanocortin receptor 1 (hmc1r) was constructed to contain a flag epitope and a hexahistidine tag at the amino-terminus as well as at the carboxyl terminus to facilitate purification. stably transfected hmc1r in human embryonic kidney (hek293) cell lines that expressed the receptor resulted in a kd value of 0.1 and 0.2 nm respectively in each case when the super potent agonist mtii was competed with [125i]ndp-α-msh. treatment of the tagged receptors in the hek293 cells with agonist resulted in down-regulation which indicates that these tagged receptors retain their biological functions. the hmc1r was solubilized from cell membranes with n-dodecyl-β-d-maltoside and purified at a nickel chelating resin and a newly constructed affinity column. the purified hmc1r was a glycoprotein that migrated on sds/page with a molecular mass of 58 kda. the results from matrix-assisted laser desorption ionization-time of flight (maldi-tof) mass spectrometry was used to identify and characterize peptides derived from the hmc1r following in-gel digestion with chymotrypsin. the phosphorylation sites were identified on the purified human melanocortin receptor 1 with agonists (peptide vs small molecule) treatment. the discovery of antibiotics in the 1930s has been one of the most revolutionary events in the history of medicine. however, during last decades, the increase of antibiotic resistance has significantly hampered the application of antibiotics. therefore, further scientific effort to find new antibiotics with novel mechanisms of action is of high importance. insects are the largest and the most diverse group of living animals on earth. 
they have potentially been confronting a high variety of microorganisms. as a result, they have evolved a powerful defense system, thus representing a vast source of novel potential therapeutics. we chose larvae of the fleshfly neobellieria bullata for identification and characterization of new promising molecules, peptides or proteins, which participate in the immune response against microbial infections. the hemolymph of the third-instar larvae of neobellieria bullata was used for isolation. the larvae were injected with a bacterial suspension of escherichia coli or staphylococcus aureus to induce an antimicrobial response. the hemolymph was separated into crude fractions, which were subdivided by rp-hplc. isolated fractions were characterized by uv-vis spectroscopy, amino-acid analysis, mass spectroscopy, 1-d and 2-d sds electrophoresis, capillary zone electrophoresis, ion-exchange hplc, tryptic digests and n-terminal sequencing. we found significant antimicrobial activities against escherichia coli, staphylococcus aureus or pseudomonas aeruginosa in several fractions. using real-time pcr, we followed and compared levels of mrna of different proteins and peptides in induced and non-induced larvae. despite the fact that many technological advances are currently involved in proteome analysis, like two-dimensional gel electrophoresis and mass spectrometry, there is still a great need for the development of novel engineered chemical probes for proteomics and interactomics. here, we describe our approach concerning the study of the proteome and interactome of proteins involved in cell-matrix interactions. it relies on the use of a small synthetic inhibitor chemically modified to allow for its immobilisation to magnetic beads or affinity chromatography materials. proteins will be detected together with their native interaction partners because of nondenaturing conditions. 
this general procedure is applied for the enrichment of metalloproteinases, especially matrix metalloproteinases, which are potential target in tumour therapy. hydroxamic acids are known to be potent inhibitors of metalloproteinases. marimastat is a reversible inhibitor with a good potency and shows activity towards a wide range of metalloproteinases. the synthesis of new marimastat derivatives will be reported. the parent compound is modified with a linker to allow immobilisation on a solid surface. binding studies were performed using surface plasmon resonance. this approach is not only appropriate for the generation of metalloproteinase proteome subsets by affinity column or using magnetic beads, but also to enrich and isolate interaction partners of the target proteins. the present report summarizes the latest data devoted to theoretical and experimental investigation of the high temperature solid state catalytic isotope exchange reaction (hscie) that takes place in peptides and proteins by the action of deuterium and tritium [1] . the available ms-procedures, designed to estimate the amount of protein, are aimed at derivatization at different stages of sample preparation, and as the best result, it is only possible to achieve quality comparison of the objects involved. the hscie reaction allows the production of evenly deuterium labelled proteins and peptides, and their application makes it possible to create a qualitative mass spectrometry method for protein analysis. introduction of definite amounts of these deuterium-labeled proteins into biological objects, prior to isolation, separation and trypsinolysis, will generate quantitative information concerning the composition of the proteins under study. tritium labelled proteins produced at a temperature of 100-120oc carry the isotopic label in all the peptide fragments and completely retain their enzymatic activity. the proteins' reactivity is dependent on their three-dimensional structure. 
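the quantitative use of deuterium-labelled proteins described above amounts to an internal-standard (isotope dilution) calculation: a known amount of labelled protein is added before sample work-up, and the unlabelled protein is estimated from the ratio of unlabelled to labelled peptide signals. the following is a minimal sketch under stated assumptions, not the authors' procedure: the function name is hypothetical, each intensity is assumed to be integrated over the full isotopic envelope of a tryptic peptide, and labelled and unlabelled forms are assumed to be recovered and ionised equally.

```python
from statistics import median


def estimate_protein_amount(spiked_amount, peptide_pairs):
    """Estimate the amount of unlabelled protein in a sample.

    spiked_amount -- known amount of labelled standard added (any unit)
    peptide_pairs -- (unlabelled_intensity, labelled_intensity) per peptide

    The median ratio over peptides is used so that a single poorly
    measured peptide does not dominate the estimate.
    """
    ratios = [light / heavy for light, heavy in peptide_pairs if heavy > 0]
    if not ratios:
        raise ValueError("no peptide with a measurable labelled signal")
    return spiked_amount * median(ratios)


# Hypothetical intensities for three tryptic peptides of one protein,
# with 10 units of labelled standard added before digestion.
amount = estimate_protein_amount(10.0, [(200.0, 100.0), (410.0, 200.0), (95.0, 50.0)])
print(amount)
```

because the standard is added before isolation, separation and trypsinolysis, losses during those steps affect both forms alike and cancel in the ratio, which is the main reason for introducing it that early.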
the hscie reaction has been shown to be useful both in the production of tritium labelled proteins and in the investigation of spatial interactions in protein complexes. in addition, evenly deuterium and tritium labelled peptides can be used in studies of the kinetics and transformation paths of peptides in the organism's tissues. immunization of mice with type ii collagen (cii) from rat leads to development of collagen-induced arthritis (cia). susceptibility to cia is associated with the major histocompatibility class ii protein h-2aq that binds the glycopeptide epitope cii260-267 (1) and presents it to helper t cells. [1] to explore the interactions in the system and to stabilize 1 towards in vivo degradation, amide bond isosteres have been introduced in its backbone. glycopeptide 1 was virtually docked into the binding site of a comparative model of h-2aq. based on the hydrogen bonding network between the peptide backbone and h-2aq, the amide bond between ala261-gly262 was chosen for isosteric replacement. to vary the geometric and hydrogen bonding properties, mimetics of the dipeptide were synthesized with the amide bond replaced by ψ[ch2nh], ψ[coch2] and ψ[(e)-ch=ch], respectively. these were introduced in 1 using solid-phase synthesis to give glycopeptidomimetics that were biologically tested for their ability to bind to the h-2aq protein and for recognition by t cells. shown to be a potent neuroprotective factor in various pathophysiological models. despite its therapeutic potential in diverse neurodegenerative diseases, its short in vivo half-life limits its utility as a clinical agent. moreover, the development of a peptidomimetic that reproduces the pharmacological activity of pacap is unlikely since the pharmacophores are spread throughout the entire peptide chain. therefore, the development of pacap analogues with lower susceptibility to proteolysis represents a first step toward clinical applications. 
in the present study, derivatives of both pacap27 and pacap38 with particular chemical modifications were developed targeting specific peptidase sites of action. results indicate that the incorporation of an acetyl or a hexanoyl group at the n-terminus and modifications at the ser2 residue contributed to increased stability against dipeptidyl peptidase iv, the major enzyme involved in pacap degradation. moreover, after determination of pacap metabolites in human plasma, the amide bond between residues 21 and 22 was substituted by a ch2nh surrogate and this derivative showed increased plasma stability. all modified peptides were tested for their ability to induce pc12 differentiation. the effects of pacap analogs on pc12 cells are mediated through the pac1 receptor, which is the major receptor involved in the neuroprotective effects of pacap. this study exposes interesting data concerning pacap metabolism in isolated human plasma and demonstrates the possibility of increasing the metabolic stability of pacap without significantly reducing its biological activity. species of the fungal genus trichoderma are commercially used as bioprotective agents against fungal plant diseases. more than 400 strains were collected from their natural habitats and evaluated for biocontrol properties. seven of the most active isolates exhibiting strong biological activity towards eutypa dieback and esca disease of grapevine were classified as trichoderma brevicompactum, or shown to be closely related to that species. these strains were screened for production of peptaibiotics. the formation and synergistic action of hydrolytic enzymes and peptaibiotics were to play an important role in mycoparasitism. background and aims: conotoxins are short, disulfide-rich neurotoxins that target various ion channels and receptors. 
these peptides have desirable pharmacological properties to become therapeutics for neurological disorders; several conotoxins have already reached the clinical development stage. our long-term goal is to improve bioavailability, metabolic stability and pharmacokinetics of conotoxins using a variety of chemical modifications. methods: we designed and chemically synthesized conotoxin analogs containing two distinct types of backbone modifications: (1) peptide-peptoid chimeras (conopeptoids) of alpha-conotoxin imi and (2) peptide chimeras of mu-conotoxin kiiia containing non-peptidic "backbone spacers". results: conopeptoid-imi, in which ala9 is replaced by n-methyl glycine, potently blocked activity of nicotinic acetylcholine receptors. in mu-conotoxin kiiia, aminohexanoic acid or amino-3-oxapentanoic acid was inserted as part of the peptide backbone. the two oxidized analogs containing "backbone prosthesis" differed in their hydrophobicity profile, but they both potently inhibited neuronal sodium channels. conclusion: our results suggest that backbone engineering may become an effective method of producing conotoxin analogs with modified bioavailability. to increase the stability and the therapeutic efficacy of peptide sequences from myelin oligodendrocyte protein (mog) that act as multiple sclerosis (ms) antigens, we grafted them onto a framework of a particularly stable class of peptides, the cyclotides. they are a recently discovered family of cyclic plant peptides with superb intrinsic stability. the limitations of linear peptides as drugs due to their instability and poor bio-availability can be overcome by using the cyclotide scaffold as a framework for novel drug design. peptide epitopes from mog protein were incorporated onto the framework of the model cyclotide kalata b1 by means of a boc-spps approach. after successful backbone cyclisation and oxidation of the cysteine residues, the peptides were purified to high purity with rp-hplc. 
nmr chemical shift analysis was used to assess whether the grafted analogues have a stable scaffold, similar to that of kalata b1. a structure of a representative peptide was determined and it shows remarkable resemblance to the native scaffold of kalata b1. the activity of the bioengineered peptides has been tested in vivo. a group of mice injected with one of the peptides have shown a depression in the clinical score and have not fallen ill. this is an exciting result that shows the first active bioengineered cyclotide in an animal model of disease. the structural information from nmr studies will be used in conjunction with the results from the activity studies in a feedback loop to design second-generation lead molecules. a.d. de araujo, p.f. alewood conotoxins are small, disulfide-rich peptide neurotoxins produced in the venom of marine cone snails that enable these molluscs to capture their prey. these compounds exhibit a high degree of selectivity and potency for different ion channels and their respective subtypes, making them useful tools in the investigation of the nervous system. due to the key role of ion channels in many physiological processes, conotoxins are also excellent drug lead candidates in drug discovery, with some examples currently undergoing clinical trials and one recently approved drug (prialt®). like other peptide drugs, the use of conotoxins in drug development is limited by the low oral bioavailability of these compounds due to pre-systemic enzymatic degradation and poor penetration through the intestinal mucosa. peptide analogs that mimic the native structure and incorporate rationally designed non-natural modifications in the disulfide framework or peptide backbone may exhibit increased resistance to proteolytic degradation. in biological environments, disulfide linkages are susceptible to attack by enzymes and reducing agents (such as glutathione). 
therefore we carried out the synthesis of redox-stable conotoxin analogs in which intrinsic disulfide bonds were replaced by thioether linkages. in this work, we have explored two solid-phase synthetic routes in the preparation of thioether conotoxin mimetics: the first route based on peptide assembly using a tetra-orthogonally protected lanthionine building block, and the second route based on intramolecular on-bead cyclisation between cysteine and beta-bromoalanine residues. thyrotropin releasing hormone (trh, l-pglu-l-his-l-pro-nh2), a tripeptide synthesized in the hypothalamus, operates in the anterior pituitary to control levels of tsh (thyroid-stimulating hormone) and prolactin. the thyrotropin-releasing hormone (trh) receptor (trh-r) belongs to the rhodopsin/β-adrenergic receptor subfamily of seven transmembrane (tm)-spanning, g protein-coupled receptors (gpcrs). the two g-protein-coupled receptors for trh, trh receptor type 1 (trh-r1) and trh receptor type 2 (trh-r2), have been cloned from mammals and are distributed differently in the brain and peripheral tissues. the trh receptor subtype-1 appears to mediate the hormonal and visceral effects, whereas trh receptor subtype-2 has been implicated in its central stimulatory actions. identification of critical features of trh and separation of its multiple activities through design of selective analogues and affinity labels have been elusive and unfulfilled goals for more than 30 years. this presentation will highlight our studies on the effect on the biological activity of trh of introducing alkyl groups of varying sizes at the n-1 and c-2 positions of the centrally placed histidine residue of the trh peptide. the requisite n--boc-dialkyl-l-histidines as building scaffolds have been synthesized in six steps from l-histidine methyl ester dihydrochloride and used for the synthesis of various trh analogues. 
ghrelin, the natural ligand for the growth hormone secretagogue receptor (ghsr-1a), has received a great deal of attention due to its ability to stimulate growth hormone secretion and to control food intake. in recent years, ghrelin analogues or mimetics have gained interest for their implication in food intake regulation. in the course of our studies concerning ghrelin analogs, we developed new ligands of the ghsr-1a based on heterocyclic structures. interestingly, one heterocyclic family presented high affinity for the ghrelin receptor. a structure-activity study was performed within this family and led to potent ghsr-1a agonists and antagonists. the binding affinities were determined by displacement of radiolabelled ghrelin and the agonist or antagonist character was measured by intracellular calcium mobilisation. the first in vivo results revealed that in vitro activities and in vivo responses were not correlated. in particular, binding to ghsr-1a and in vivo gh release and food intake control were not fully correlated. these results strongly suggest the presence of receptor subtypes that modulate ghrelin actions. some examples will be reported. further investigations are under way in the laboratory. background and aims: alzheimer's disease (ad) related beta-amyloid (aβ) peptides possess a high propensity towards aggregation. nowadays one of the major directions of drug design against ad is the synthesis of putative amyloid aggregation inhibitor molecules (aai) which are able to hinder the formation of these toxic amyloid aggregates. in the present work we report the design, synthesis and investigation of putative aais derived from the peptide sequence aiigl identical to aβ(30-34). methods: aβ(1-42) peptide and putative aggregation inhibitors were synthesized manually using fmoc-chemistry and dcc/hobt activation. studies on both the aβ aggregation and the effect of the aais were performed with several instrumental techniques. 
the total amount of the aggregates was determined by thioflavine-t binding assay, while their size distribution was characterized with dynamic light scattering (dls). aggregated aβ forms were visualized with transmission electron microscopy (tem). the binding affinity of the aais to aβ fibrils was studied in saturation transfer nmr experiments, while in vitro viability assays were performed on cultured human sh-sy5y neuroblastoma cells to monitor the neuroprotective effect of the amyloid aggregation inhibitors. results and conclusions: 32 derivatives of aβ(30-34) were synthesized and tested concerning their neuroprotective effect against aβ-mediated toxicity. the most promising candidates were examined with physico-chemical methods to characterize their aggregation altering ability. the pentapeptide riigl-amide proved to be the most powerful neuroprotective compound, and it was also able to alter considerably the aggregation kinetics of aβ(1-42). this molecule might serve as a lead compound in drug design against ad in the future. m. mattiuzzo, a. bandiera, m. benincasa, i. samborska, m. scocchi, r. gennaro antimicrobial peptides (amps) are ancient molecules of the innate immune defence system. most of them kill bacteria by lysing their membranes. the proline-rich group of amps represents an exception, as they act via a permeabilization-independent mechanism that is likely based on recognition of molecular targets/transporters. however, specific internal targets have not yet been identified for most of these amps. bac7 is a pro-rich amp isolated from bovine neutrophils that is capable of translocating across membranes to target hypothetical intracellular proteins. in this study, we used a molecular approach to identify putative targets for bac7 by selection of peptide-resistant mutants followed by identification of the genes responsible for this resistance. to this aim, an e. 
coli strain susceptible to the fully active fragment bac7(1-35) was subjected to chemical mutagenesis and a number of peptide-resistant mutants were obtained. a pool of genomic dna from these mutants was used to construct a plasmid library that was used to transform a susceptible strain. this approach allowed the identification of 14 different clones that provided a high level of resistance to bac7. sequence analysis revealed the presence of genes originating from five different regions of the e. coli chromosome. among them, one clone contained ptrb, a gene coding for a serine peptidase broadly distributed among gram-negative bacteria, which could be involved in resistance by degrading the peptide. other resistance-conferring clones, which encode membrane proteins that may be involved in peptide translocation across the membrane, are currently under investigation. lycotoxins are peptides from the venom of the wolf spider that were predicted to have an amphipathic alpha-helical structure and confirmed to possess significant antimicrobial and pore-forming activities. [1] we became interested in these peptides as potential leishmanicidal agents against leishmania donovani promastigotes and leishmania pifanoi amastigotes. thus, lycotoxin i (lyi) and lycotoxin ii (lyii), [1] and shortened analogues lyi1-19, lyi1-15, lyii1-21 and lyii1-17, were synthesized as c-terminal carboxamides. short- and long-term parasite proliferation measurements showed all peptides except lyi1-5 to be active against both promastigotes and amastigotes in the micromolar range, and suggest that peptide effects on parasites are irreversible. lyii, which showed the lower activity, became more leishmanicidal upon residue trimming, whereas the most active lyi displayed the opposite behaviour. a set of complementary techniques showed that lycotoxin peptides are membrane-disruptive to promastigotes. 
electron microscopy showed that two populations of promastigotes, one with intact parasites and the other extensively damaged, are formed upon addition of the peptides at their ec50. all peptides were non-hemolytic for sheep erythrocytes below 20 micromolar. tissue transglutaminase (tg2) is an enzyme that plays a key role in the pathogenesis of celiac disease. tg2 is the main autoantigen recognized by the antiendomysium antibodies and, furthermore, catalyzes the deamidation of strategic glutamine to glutamic acid within the sequence of immunodominant gliadin epitopes. recently, another unexpected role for surface tg2, in the innate immune response in celiac disease, has been suggested. it follows that tg2 inhibitors might represent a potentially attractive pharmacological alternative to the gluten-free diet which, at present, is the only possible therapy for celiac patients. starting from the sequence of the heptapeptide pqpqlpy, known to be a high affinity substrate of human tg2, we have synthesized new analogs replacing pro3 with different constrained amino acids (d-pro, pip, chg, ind, deg, inp, hyp, thz) with the aim of developing specific inhibitors of tg2. indeed, proline residues present in the gluten epitopes are important in determining the immunogenicity of the epitopes and the specificity of tg2. herein, we describe the preliminary conformational studies of the synthesized analogs by cd and nmr spectroscopy and molecular modeling methods. the structural features of the peptides have been analyzed in different environments. given the role of the domain pqpqlpy in the gliadin proteins, structural analyses of its analogs are of considerable interest. the results of our studies might be useful to clarify the role of the proline residues in the interaction of the gluten epitopes with tg2 and, consequently, to gain new insight into the molecular mechanism of celiac disease. 
carnosine and related histidine-containing dipeptides are known to react with high efficiency with the products of lipid peroxidation, namely 4-hydroxy-trans-2,3-nonenal (hne) and other alpha, beta-unsaturated aldehydes, preventing their reaction with nucleophilic residues in proteins and nucleic acids. histidyl hydrazide alone or conjugated with amino acids, long chain fatty acids, cholesterol and alpha-tocopherol synthesized in our laboratories demonstrated higher aldehyde-sequestering efficiency than carnosine, and were also efficient in protecting sh-sy5y neuroblastoma cells and rat hippocampal neurons from hne-mediated death. the cytoprotective efficacy of these compounds suggests their potential as therapeutic agents for disorders that involve excessive membrane lipid peroxidation. lantibiotics are antibiotic natural products that are synthesised ribosomally and undergo extensive post-translational modification, resulting in multiple thioether bridges and dehydro amino acids in the mature peptide. one such lantibiotic is mersacidin, which is produced by bacillus sp. and exhibits extremely promising in vitro and in vivo efficacy versus a number of clinically relevant pathogens, most notably methicillin-resistant staphylococcus aureus (mrsa). mersacidin is believed to function by binding to the bacterial cell wall precursor lipid ii, thereby preventing its incorporation into the growing peptidoglycan network. in an attempt to better understand this mode of action and develop more active analogues, we have embarked upon its chemical derivatisation. some of these modifications resulted in an altered antibacterial spectrum, permitting some insight into tentative structure-activity relationships. the characterisation of these structurally complex compounds via a combination of multidimensional nmr and tandem mass spectrometry will also be presented. 
conjugation of peptides with different moieties, such as peg, lipids, carrier proteins and toll-like receptor ligands, is an established approach to improve their pharmacokinetic profile for drug use and/or to enhance their immunogenicity as subunit vaccines. the development of suitable conjugation strategies for peptides of any complexity is therefore fundamental for their effective use in human therapeutic applications. here we describe our strategy to engineer a trimeric coiled coil to obtain a covalently linked, structurally stable construct endowed with extra functionality for further derivatization. we previously showed that covalent stabilization of the designed trimeric coiled coil izn17, by interchain disulfide bonds, yielded an extremely potent and broad inhibitor of viral infection, (ccizn17)3 [1]. this potent inhibitory activity makes (ccizn17)3 attractive not only as an antiviral compound, but also as an immunogen to elicit a neutralizing antibody response [2]. we have now developed an alternative synthetic strategy to obtain the covalently-linked izn17 trimer, which allows the presence in the molecule of a free thiol for subsequent chemoselective reactions. first, we showed that stable interchain thioether bonds can be effectively substituted for the disulfides. second, we devised an orthogonal cysteine protection scheme which allows formation of the thioether bonds, while leaving an extra free cysteine on demand. this thiol group can be used for conjugation of the trimeric coiled coil to adjuvant/carrier proteins or, as reported more specifically here, to a peg40kda moiety. human igg is the most abundant type of immunoglobulin in serum, and most antibody drugs applied for therapy of cancer and autoimmune diseases also belong to this group of human immunoglobulins. a peptide that binds specifically to human igg is very useful as an affinity ligand for detection and purification of human igg. 
although some previous reports described such peptides, we tried here to isolate highly specific, high-affinity ligands by employing a random peptide-displayed t7 phage library. our t7 random peptide phage library possesses a total diversity of 10^10, consisting of sequences of different lengths constrained by a disulfide bond. by biopanning against human igg, we isolated several igg-specific clones from our library. the peptides displayed on these phages shared some common sequences in the limited region surrounded by cys residues, which suggests they are essential for binding. these clones bound only to the fc region of igg, and bound neither other types of human immunoglobulin nor igg of other animals. a synthetic peptide derived from a phage clone showed a sub-nanomolar kd value in binding to human igg fc on spr analysis. these results indicate that the peptide motifs obtained here are strong candidates for human igg-specific affinity ligands for detection and purification of igg. therefore, we are now constructing a detection and purification system using modified and improved peptide motifs. synthesis of glp-1 analogues as potential agents for blood glucose control p. kanda, r.p. sharma, c.p. hodgkinson in this study, we synthesised a panel of glp-1 analogues stabilised against dpp-iv degradation through either selective amino acid substitutions for ala8, or introduction of amide bond surrogates into the peptide backbone between ala8 and glu9. each was made by standard fmoc or boc chemistry, purified by hplc, and characterized by electrospray mass spectroscopy. all derivatives except one bearing a hydrazine modification were stable to dpp-iv proteolysis for up to 48 hours at ph 7.5, 37 °c. each was tested for its ability to augment insulin release from a glucose-sensitive, murine insulinoma-derived tc-6 cell line in culture. 
it was found that each compound acted as a glp-1 agonist to varying degrees, with some exhibiting higher activity than native glp-1 toward promoting insulin release. the most active analogues have been chosen as candidates for stabilisation against renal clearance in efforts to develop new glp-1 analogues with therapeutic potential. the high propensity of the glucose regulatory hormone human islet amyloid polypeptide (iapp) to misfold and aggregate into cytotoxic beta-sheets and fibrils is strongly associated with beta-cell degeneration in type ii diabetes (t2d) and precludes its pharmacological use for the treatment of diabetes. iapp analogs that combine solubility, lack of cytotoxicity, and bioactivity with the ability to block iapp aggregation and cytotoxicity could thus be of high biomedical interest. here we apply a minimalistic conformational restriction strategy to redesign the extremely insoluble and amyloidogenic 37-residue iapp sequence into a soluble, nonamyloidogenic, noncytotoxic, and bioactive iapp mimic (yan et al., pnas (2006)). the designed mimic has nearly the same sequence as iapp but is highly soluble, nonamyloidogenic, noncytotoxic and a full iapp receptor agonist. in addition, the mimic binds iapp with high affinity and blocks and reverses its cytotoxic self-assembly with nanomolar activity, which makes it the most potent known inhibitor of iapp cytotoxic self-assembly. due to its bifunctional nature, the mimic might find therapeutic application for the treatment of diabetes both as an inhibitor of amyloid cytotoxicity and as a soluble iapp receptor agonist. our findings offer a proof-of-principle of a chemical design approach for generating a novel class of highly potent inhibitors of polypeptide cytotoxic self-assembly which are also nonamyloidogenic mimics of the native amyloidogenic sequence. 
such reengineered biomolecules (the design of novel mimics is in progress) are of high biomedical significance for understanding the mechanism of protein aggregation diseases and for the development of prospective therapeutic treatments. peptides that are capable of targeting abnormal changes of living tissue can be very useful in early detection or diagnosis of, e.g., cancer. conjugating a functional agent, an effector unit, to such a peptide may provide the agent with improved pharmacodynamic properties. the specificity of a thiol group for reactive groups offers a unique way to attach effector units to cysteine-free linear or cyclic peptides. tumor targeting peptides were synthesized by fmoc-type solid phase methods. peptides cyclic by cystine were modified by lactam bridges. fluorescein, metal (e.g. lanthanide) chelates and cytotoxins were coupled to tumor targeting peptides via e.g. a peg-type spacer, or the conjugates were immobilized on plates for adhesion assays. the method was uncomplicated and gave stable conjugates with good solubility. the approach is useful in making stable peptide-effector conjugates and sets of them for e.g. detection assays such as the delfia method and has prospective use in the development of diagnosis and therapy. antibacterial proline-rich peptides -synthesis and antibacterial activity d. knappe, r. hoffmann the antibacterial proline-rich peptide family, originally isolated from insects, shows remarkable activity against diverse bacterial and fungal pathogens. while more and more bacterial pathogens become resistant to common drugs, this part of insect innate immunity provides a promising new approach to develop future peptide drugs. proline-rich peptides possess significant sequence homology and share a common mechanism of action. 
in addition, o-glycosylated threonine residues of drosocin and formaecin appear to be necessary for full antimicrobial activity, although the significance of the carbohydrate moiety in interaction with intracellular targets is still unknown. we synthesized analogues of different antibacterial peptides on solid phase by the fmoc/tbu strategy. the combination or insertion of sequence regions from different native antibacterial peptide sequences offers several advantageous effects, including further reduction of toxicity and broadening of the antimicrobial activity. furthermore, mimicking the o-glycosylation site and changing the carbohydrate moieties may yield new synthetic approaches to increase both the activity and the selectivity of these oligopeptides. new immunotherapeutic approaches have been developed for treatment of neurodegenerative diseases of the alzheimer's dementia (ad) type. the identification of a ß-amyloid-plaque specific epitope, aß(4-10) (frhdsgy) [1], recognised by therapeutically active antibodies from transgenic ad mice, provides the basis for development of new ad vaccines and for molecular ad diagnostics. in order to produce immunogenic conjugates, the aß4-10 epitope was attached via thioether linkage to synthetic carriers of well-defined structures, such as tetratuftsin derivative (ac-[tkpk(clac)g]4-nh2) and its elongated version by a helper t-cell epitope (ac-fflltriltipqsld-[tkpk(clac)g]4-nh2); sequential oligopeptide carrier (ac-[lys(clac)-aib-gly]4-oh) and multiple antigenic peptide (clac-lys(clac)-lys(clac-lys(clac))-arg-arg-ßala-nh2). the epitope conjugates containing a cysteine residue either at the c- or n-terminus, and the chloroacetylated carriers were prepared by spps according to boc/bzl strategy. conjugation reactions were performed in solution under slightly alkaline conditions, and monitored by hplc and high resolution-ms. 
structures and molecular homogeneities of all epitope peptides, carriers and conjugates were ascertained by hplc, maldi and esi-fticr-ms. conformational preferences of the synthesized compounds in water and in tfe were examined by cd spectroscopy. comparative binding studies of the conjugates with a mouse anti-amyloid protein beta-(1-17) monoclonal antibody were performed by indirect elisa. experimental data showed that the chemical nature of the carrier, the epitope topology and the presence of a pentaglycine spacer between the epitope peptide and the carrier had significant effects on the antibody recognition and on the secondary structures of the conjugates. the new cla analogue 1, containing an ethylene bridge between phe nitrogen atoms, was found to exhibit an unexpected stimulatory effect in the model of the in vitro humoral immune response in mice. to disclose the structure-activity relationship, an nmr solution conformational analysis was carried out. the solid-state and solution conformational analysis of native cla indicated the existence of this cyclic system as a complex mixture of conformations [1]. the nmr spectra recorded at 600 mhz in chloroform at 214 k showed different conformational behaviour of the two cyclopeptides: cla exists as one isomer [1], whereas peptide 1 is in an equilibrium among at least three conformers. the picornaviruses are small nonenveloped rna viruses with a single positive strand rna. the virus replication cycle starts after the penetration of the virus into the cytoplasm of the host cell. there are several stages of the virus life cycle that can be used for attack. one of the most useful strategies for attacking the virus is inhibition of enzymes important for the virus life cycle. the key enzymes in the replication of the picornaviruses are the 3c and 2a proteases. 
changes in the active center of these enzymes make them incapable of producing polyprotein in vitro, therefore inhibition by low molecular weight molecules could stop viral replication in vivo. 3c protease plays a major role in the enzymatic proteolysis of the initial viral polyprotein. the target compounds were based on structural modifications of the dipeptide phe-gln, known to be crucial for 3cp inhibitory activity, by incorporation of an additional amino acid and a pyrrole moiety. the synthesis was carried out as multiple-peptide synthesis in parallel using stepwise spps, fmoc-strategy. the obtained compounds were tested for antiviral activity by agar-diffusion plaque inhibition test against coxsackievirus b1 replication in fl cells, and on this basis some structure-activity interpretations were made. histone deacetylases (hdac) play important roles in various aspects of regulation such as proliferation, differentiation, and aging by counteracting histone acetyl transferases. the hdacs categorized in class-i and ii have a metalloprotease-related mechanism in their catalytic activity. these enzymes could be inhibited by small molecules bearing various zinc ligands such as hydroxamic acid and mercaptan. based on the structure of chlamydocin, which has a cyclic tetrapeptide framework, cyclo(-l-aoe-aib-l-phe-d-pro-), where aoe is (2s,9s)-2-amino-8-oxo-9,10-epoxydecanoyl, we have developed potent hdac inhibitors as shown in the figure. in the present study, we examined the effects of the chlamydocin hydroxamic acid (1) and ss-hybrid (2) on the diabetes model mice, kk-ay. the peptide (1) exhibited satisfactory activity in reducing both the blood glucose and blood insulin levels, comparable or even superior to that of pioglitazone, a pparγ agonist. the ss-hybrid (2), which is expected to be reduced inside cells to generate the corresponding thiol-containing cyclic tetrapeptide, also showed a significant effect but less than (1). 
the effect was dose-dependent from 3 mg/kg to 30 mg/kg. the effects of hdac inhibitors were also confirmed by the observation of in vivo histone hyperacetylation induced in the lymphocyte cells. synthetic peptides have a number of advantages over current vaccines. however, exploitation of synthetic peptides as vaccines has been limited by their small size, copy number, inefficient delivery, poor immunogenicity and mhc restriction. we have developed chemical methodologies that overcome these limitations by synthesising and polymerising vinyl-peptides. protective b-cell and t-cell peptide epitopes from the oral pathogen porphyromonas gingivalis were identified for three different mhc restricted mouse strains using pepscan techniques. fmoc chemistry for spps was used to synthesize these peptides, and a vinyl moiety was incorporated at the n-terminus using acryloyl chloride. after rp-hplc purification, free radical polymerisation using ammonium persulphate and temed was used to polymerise the vinyl-peptides in the presence of either acrylamide or other vinyl-monomers to produce single peptide or multi-peptide polymers. size exclusion chromatography indicated that the peptide polymers were >2 million da. the peptide polymers were used to immunise each mouse strain (balb/c, cba and c57bl6), and the t-cell response induced was evaluated using proliferation and cytokine (elispot) assays. the peptide polymers were found to be highly immunogenic; the single peptide polymers induced a response only in their respective mouse strain, whereas the multi-peptide polymer containing all of the t-cell epitopes induced a response in all three mouse strains. in conclusion, our data show that the polymerisation method overcomes all of the limitations in developing a peptide vaccine and, most importantly, that of mhc restriction. cytotoxic substances are promising weapons in tumour therapy.
these compounds inhibit cell division and proliferation; hence, they affect all cells that are able to divide. however, all these compounds act intracellularly, i.e. they first have to enter the tumour cell efficiently. this is a serious obstacle when using highly effective cytostatics and the cause of severe adverse effects at higher doses. our aim is to overcome this problem by using synthetic hybrid molecules composed of the cytostatic agent, in our case derivatives of arglabin, covalently linked to shuttle peptides. in order to identify the most effective approach, we tested two different strategies. by using the peptide hormone npy, whose specific y receptors are often overexpressed by tumour tissues, we intended to direct the chemotherapeutic selectively to y receptor expressing tumour cells via receptor-mediated endocytosis. on the other hand, the cytostatic agent was covalently bound to a cell-penetrating peptide derived from human calcitonin (hct). recently, a c-terminal fragment of human calcitonin, which lacks the receptor-activating n-terminal part, was found to internalize into excised nasal epithelium. for the class of hct-derived carrier peptides, previous studies suggested a receptor-independent, "lipid raft" mediated endocytotic mechanism of uptake. here we present comparative data investigating both strategies, the highly selective receptor-mediated delivery and the highly efficient receptor-independent delivery. we investigated the peptide uptake by various cell lines and examined the release of the cytostatic agents inside the cells and their toxic effects. background and aims: synthetic peptides provide straightforward access to rationally designed inhibitors of molecular interactions based on structural information of proteins. poor membrane permeability may be overcome via cell-penetrating peptides, but low stability remains a major drawback.
the aim is to increase the stability and bioactivity of peptides by coupling them to polymers. methods: for the development of peptide-functionalized cell-penetrating polymer conjugates, peptides were coupled chemoselectively by native chemical ligation to hpma (n-(2-hydroxypropyl)methacrylamide, average mw 28.5 kda) as an inert backbone polymer. hpma is water soluble and its capacity in drug delivery has been demonstrated. peptide-functionalized polymers and free peptides were incubated with proteases or cell lysates, and proteolytic breakdown was determined by fluorescence correlation spectroscopy (fcs). fcs derives information on the number and size of fluorescent particles from temporal fluctuations of a fluorescence signal caused by diffusion of particles through a femtoliter-size confocal detection volume. the diffusion time depends on molecular size; therefore this technique is suited for the detection of proteolytic fragments. results: efficient chemoselective conjugation of unprotected fluorescein-labelled peptides was accomplished by means of native chemical ligation. our fcs investigations revealed that conjugation of peptides to hpma increased their biostability. these data also indicate that this effect is peptide-dependent. a proapoptotic peptide coupled to hpma and introduced into mammalian cells by electroporation retained its bioactivity. cofunctionalization of hpma with this peptide and nonaarginine yielded efficient cellular import. macrolides, a rather large group of natural or semi-synthetic antibiotics, are widely known translation inhibitors whose structure is based on 14-16-membered lactones with carbohydrate substituents attached. macrolides bind to the ribosomal tunnel (rt) in such a way that their lactone ring is located orthogonally to the long axis of the rt, covering most of its cleft. carbohydrate residues of the macrolides are spread along the walls of the rt.
hence, the mechanism of protein synthesis inhibition by macrolides relies on the mechanical obstruction they provide to the passage of the nascent polypeptide chain through the rt. the goal of this study was to design and obtain peptide derivatives of macrolides that are of interest both as antibacterial agents and as potential probes for investigating nascent peptide chain topography in the rt. tylosin ( in order to reach target sites inside the cns, neurotherapeutic candidates must overcome the blood-brain barrier (bbb). while several transport mechanisms occur at the bbb, this work has focused on the passive diffusion mechanism. the prediction of a peptide's ability to cross the bbb is not a simple task; hence there is a need for the rational study of the relevant factors that affect movement across this physiological barrier. here we present two new approaches based on in vitro non-cellular assays for this type of study. the first is the evaluation of compound mixtures in parallel artificial membrane permeability assays (pampa). this approach increases the throughput of the study, and structure-activity relationships can be easily established. the second is the evaluation of transport and biological activity in a single assay. this second approach has been applied to the search for bbb-crossing inhibitors of cns proteases involved in different neurological diseases (such as prolyl oligopeptidase for schizophrenia). these two new approaches allow a compound's permeability to be assayed in the early stages of a drug development project, so that novel analogues with improved bbb transport properties can be designed or blood-brain-barrier shuttles can be used for their delivery. two newly synthesized compounds are derivatives of l-valine and are positional isomers of nicotinic (m-6) and isonicotinic acid (p-6). these functional groups, as well as the established good lipid solubility, suggest that the main target for their biological action will probably be the central nervous system.
the presence of the amino acid l-valine suggests low toxicity, which was confirmed by our earlier experiments. the aim of the present work is to study their pharmacological activity as putative drugs. methods: male albino mice (18-20 g, 10 per group) were used. the following neuropharmacological parameters were studied: analgesic effect (acetic acid test), neuromuscular coordination (rota-rod test), orientation ("hole board" test) and learning and memory (passive avoidance step-down test). results: a significant analgesic effect of both compounds was established (more pronounced at a dose of 250 mg/kg i.p.). a slight depressing effect on the orientation reactions of the animals was registered, but the neuromuscular coordination and locomotor activity of treated animals were not changed. a good dose-dependent effect on learning and memory was established, and m-6 had a stronger effect than p-6. the compounds modified the effects of some model substances with central nervous activity. hexobarbital sleeping time was significantly prolonged by p-6, but was antagonized by m-6. the pentylenetetrazole threshold was increased significantly, which suggests some anticonvulsant activity of both compounds. conclusion: as positional isomers, m-6 and p-6 demonstrated some variations in their pharmacological activity, probably due to differences in their kinetics, metabolism and excretion. the registered significant neuropharmacological activity, accompanied by low toxicity, motivates new syntheses and future experimental studies. the glyproline family of regulatory peptides includes pro-gly-pro, gly-pro, pro-gly and also the simplest peptides with hyp substituted for pro. a distinctive feature of these peptides is that they exhibit a broad spectrum of biological activities: antiulcer activity, inhibition of blood clotting and thrombosis, reduction of the degranulation activity of mast cells, and normalization of stressogenic behavioral disorders.
alzheimer's (ad) and parkinson's (pd) diseases are progressive neurodegenerative disorders characterized by amyloid plaques. the main components of the plaques are β-amyloid peptides (aβ1-40 and aβ1-42) and α-synuclein. we have previously shown that small peptides structurally related to the sequence of aβ(1-42) protect against the neurotoxicity of aβ peptides. recent studies by other groups have shown that β-synuclein can counteract the aggregation of α-synuclein in the neurodegenerative process of pd and thereby might protect the central nervous system from the neurotoxic effects of α-synuclein. it was found that a tetrapeptide (kegv) protected against the neurotoxicity of aβ peptides in vivo. based on the previous findings, the following sequences of β-synuclein have been synthesized: kegv-nh2 and kegv-oh. after comparing the sequences of α- and β-synuclein, we found common sequences: keqv, regv, kqgv and keqa. we have synthesized these tetrapeptides as c-terminal amides. the peptides have been synthesized on mbha resin using manual solid phase peptide synthesis equipment and boc chemistry. the neuroprotective effects of the peptides have been investigated in vitro in the mtt test in differentiated neuroblastoma cell culture (sh-sy5y) and in vivo in an electrophysiological test on rats using multibarrel electrodes. the neuroprotective peptides might stop neuronal death and could influence ad and pd progression. the present study was carried out to determine the in vivo antinociceptive effect of the plant peptide hormone phytosulfokine-alpha (h-tyr(4-so3h)-ile-tyr(4-so3h)-thr-gln-oh) (i) and its selected analogues, such as h-phe(4-no2)-ile-tyr(4-so3h)-thr-gln-oh (ii), h-d-phe(4-so3h)-ile-tyr(4-so3h)-thr-gln-oh (iii), h-tyr-ile-tyr(4-so3h)-thr-gln-oh (iv), h-tyr(4-so3h)-ile-tyr-thr-gln-oh (v) and h-tyr-ile-tyr-thr-gln-oh (vi) in rats. peptides were injected into the lateral brain ventricle at a dose of 100 nmol.
in the preliminary investigation we found that psk-α as well as analogues ii and iii induced a significant antinociceptive effect in the hot plate test. the probable mechanism of this effect is discussed. a.b. bozhilova-pastirova 1 , b.a. landzhov 1 , p.v. yotovski 1 , e.b. dzambazova 2 , a.i. bocheva 2 the tyr-mif-1 family of peptides (tyr-mif-1's) includes mif-1, tyr-mif-1, tyr-w-mif-1 and tyr-k-mif-1, which have been isolated from bovine hypothalamus and human brain cortex. all these peptides interact with opioid receptors and in addition bind to non-opiate sites specific for each of the peptides. data in the literature suggest that tyr-mif-1's have antiopioid and opioid-like effects. we used wistar rats to study the distribution and density of tyrosine hydroxylase (th) immunoreactive fibres and nadph-d reactive neurons in the ventral and dorsal striatum. our results showed that the neuropeptides mif-1 and tyr-mif-1 may affect them. opioid peptides have been recognized as modulators of reactive oxygen species (ros) in mouse macrophages and human neutrophils. data in the literature suggest that peptides of the tyr-mif-1 family, mif-1 and tyr-mif-1, have antiopioid and opioid-like effects. these neuropeptides have been isolated from bovine hypothalamus and human brain cortex. so far no data about direct scavenging properties of tyr-mif peptides have been available. in this study we tested the hypothesis that they may scavenge ros in vitro. the antioxidant activity of these two substances was studied in the concentration range of 10^-6 to 10^-4 mol/l. we used luminol-dependent chemiluminescence to test their ability to scavenge the biologically relevant oxygen-derived species hydroxyl radical, superoxide radical and hypochlorous acid in vitro. we found that tyr-mif-1 was a powerful scavenger in all tested systems. the effects were stronger for hypochlorous anion and weaker for superoxide radical.
mif-1 had no scavenging activity against the hydroxyl and superoxide radicals and showed a moderate scavenging effect on hypochlorous anion. we have compared different strategies to increase the immunogenicity of an antigenic hiv peptide as a vaccine candidate. our selected b-cell epitope comprises 15 amino acids (317-331) of the v3 region of hiv-1, jy1 isolate (subtype d), and is in tandem with a t-helper epitope corresponding to the 830-844 region of tetanus toxoid. several presentations, including oligomerization, map dendrimer, and conjugation to dextran beads or to other macromolecular carriers, have been synthesized and evaluated. murine sera from the different presentations of the v3 epitope have been compared with regard to antibody titers and cross-reactivity with heterologous hiv subtypes. the map dendrimer version of the peptide, conjugated to recombinant hepatitis b surface antigen protein, was a better immunogen than the dendrimer alone, and showed higher immunogenicity than other multimeric presentations or than the peptide alone conjugated to dextran beads. the map dendrimer version, either alone or conjugated to hbsag, enhanced cross-reactivity towards heterologous v3 sequences relative to monomeric peptide. group a streptococci (gas), responsible for critical diseases (e.g. acute rheumatic fever and rheumatic heart disease), are classified into over 100 serotypes according to their surface virulence m proteins. development of a vaccine to prevent infection with gas is hampered by the widespread diversity of circulating gas strains and m protein serotypes, and a multivalent vaccine strategy would contribute to the prevention of various gas infections and provide better protective immunity. we have studied the efficacy of incorporating four different epitopes derived from the gas m protein into a single synthetic lipid core peptide (lcp) construct in inducing broadly protective immune responses against gas following parenteral delivery to mice.
the peptide vaccine was synthesized on mbha resin by manual spps using in situ neutralization/hbtu activation in boc chemistry. immunisation with the mono- or multi-epitopic lcp vaccine led to high titers of antigen-specific systemic igg responses and the production of broad protective immune responses, as demonstrated by the ability of immune sera to opsonise multiple gas strains. systemic challenge of mice after primary vaccination showed that mice were significantly protected against gas infection in comparison with control mice, demonstrating that vaccination stimulated long-lasting protective immunity. these data support the usefulness of the lcp system in the development of a synthetic multiepitope vaccine to prevent gas infection. glycoprotein d represents a major immunogenic component of the virion envelope of herpes simplex virus and is able to induce high titres of neutralizing antibodies. one of its optimal epitopes is the 9-22 region (9lkmadpnrfrgkdl22). several cyclic peptides possessing a thioether bond and different ring sizes have already been prepared, and some of them were conjugated with a tetratuftsin derivative (ac-[tkpkg]4-nh2) by thioether bond formation using selectively removable cys protecting groups. antibody recognition results suggested that the size of the cycle has considerable influence on antibody recognition; however, the replacement of met in position 11 by nle is permitted. conjugation of the cyclic peptide might increase the antibody recognition, but the binding depends on the structure and/or conjugation site of the cyclic peptide. the conjugate with the best binding capacity (7.2 pmol/100 µl) as well as the conjugate containing the linear (9-22c) epitope (0.7 pmol/100 µl) were selected for immunization. in order to increase the production of antibodies, a new group of conjugates was prepared.
in these constructs a promiscuous t cell epitope peptide derived from tetanus toxoid (ysyfpsv) was attached to both amino groups of a lysine residue coupled to the n-terminus of the carrier (ac-ysyfpsv-k(ac-ysyfpsv)-[tkpk(clac)g]4-nh2). the cys-containing linear and cyclic epitope peptides were conjugated to the carrier in solution (0.1 m tris buffer, ph 8). this work was supported by grants of the spanish-hungarian intergovernmental program and cost chemistry action. background. primary hyperparathyroidism (pht) is characterized by increased parathyroid hormone (pth) secretion and, in 70% of pht patients, by hypertension. it was previously shown that a pro-analogue of pth with a reversed sequence (which includes the strongly basic sequence -arg-lys-lys-) induced a significant hypertensive response at a dose of 10^-10 m/kg b.w. one hypothesis attributes hypertension in pht patients to the presence of fragments of degraded pth possessing the -arg-lys-lys- sequence. aim. to compare the influence on mean arterial blood pressure (map) of an analogue of the 25-34 pth fragment (amide) and the 25-34 fragment of pth, which contains the -arg-lys-lys- sequence and is also responsible for binding to the pth receptor. methods. the chosen peptides were synthesized manually by solid phase peptide synthesis. the purity of the products was tested by reversed-phase high performance liquid chromatography. the synthesized peptides showed the expected molecular mass. the influence of the synthesized peptides on map was tested in wistar rats. sequentially increasing boluses of each peptide (10^-10, 10^-9 and 10^-8 m/kg b.w.) were given i.v. in the same animal. blood pressure was measured continuously in the carotid artery. results. injection of the synthesized analogue of the 25-34 fragment of pth showed no influence on map vs. the control group. the synthesized 25-34 fragment of pth increased map at 92 min of the experiment by 12 mmhg ± 3 mmhg vs. the time of administration of the first dose and by 17 mmhg vs. the control group. conclusion.
it seems possible that, in the case of alternative degradation of pth, accumulation of the 25-34 fragment of pth may play a partial role in the mechanism inducing hypertension in pht patients. biologically active domains of the high-affinity receptor for ige (fcεri) were determined: the fragments 111-114 and 111-117 of the receptor. the research program on the biological properties of the synthesized tetrapeptide rnwd and heptapeptide rnwdvyk included the study of their binding to ige contained in standard solutions and in patients' blood serum. the binding of the peptides to ige was explored by the ifa method using anti-ige antibodies labeled with horseradish peroxidase (hrpo). peptides at a concentration of 100 µg/ml were adsorbed onto immunological plates. a higher correlation between the ige concentration and the optical density of the solution after adding hrpo-labeled monoclonal antibodies and the chromogenic substrate mixture was found in the diagnostic system with adsorbed rnwdvyk peptide (r=0.99) than in the diagnostic system with adsorbed rnwd peptide (r=0.94). similar investigations were conducted with the diagnostic systems with adsorbed rnwd and rnwdvyk peptides, but rnwd peptides conjugated with hrpo were used in place of the hrpo-labeled monoclonal antibodies against immunoglobulin e. almost equal correlation was found between the ige concentration in standard serum and in serum of allergy patients with known ige concentration and the optical density of the solutions after adding the hrpo-conjugated rnwd peptide. after adding allergy patients' blood serum to the wells of the plate, the heptapeptide bound 23.79% more ige than the tetrapeptide. our experiments demonstrated a high ige-binding activity of the synthetic rnwd and rnwdvyk peptides.
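the correlation coefficients quoted above (r=0.99 and r=0.94) relate ige concentration to optical density. as a minimal sketch of how such a coefficient is computed, assuming pearson's r was the statistic used (the abstract does not say), the following python snippet evaluates r for a set of calibration points. the function name pearson_r and all numerical values are hypothetical illustrations, not data from the study.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Return Pearson's correlation coefficient for two equal-length samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length with at least 2 points")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-products and squared deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return sxy / sqrt(sxx * syy)


# Hypothetical calibration points (NOT from the abstract):
# IgE concentration of standards and the optical density read for each.
ige_concentration = [10.0, 50.0, 100.0, 200.0, 400.0]
optical_density = [0.08, 0.21, 0.40, 0.79, 1.55]

r = pearson_r(ige_concentration, optical_density)
print(f"r = {r:.3f}")
```

a value of r close to 1 indicates that optical density rises almost linearly with ige concentration over the tested range, which is the property that makes a plate-based assay usable for quantification.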
in this study, information spectrum method (ism) analysis of the ha1 subunit of the h5 hemagglutinin protein of the h5n1 influenza virus from different reference isolates was performed in order to identify possible antigenic determinants resistant to virus mutations. results of this analysis demonstrated that the primary structures of the ha1 subunit of h5 hemagglutinins encode common information corresponding to one characteristic frequency in their information spectra (iss), which is probably important for the biological function of these proteins, including their possible recognition by the immune proteins targeting this molecule. in addition, comparison of the iss of ha1 proteins of h1 "spanish flu" and h5n1 isolates demonstrated an informational similarity between them. based on these results, a segment of the n-terminus of ha1 h5n1 was identified as playing a crucial role in the inhibitory and immunological properties of all possible h5n1 variants. the identified core segment, being highly conserved in all h5 strains, was selected as an antigenic determinant and coupled to the lys-nεh2 groups of the sequential oligopeptide carrier (socn), (lys-aib-gly)n, in order to develop a diagnostic immunoassay and formulate a vaccine candidate for the highly pathogenic h5n1 influenza virus. background and aims: thrombin plays a key role in various disorders such as arterial thrombosis, atherosclerosis, restenosis, inflammation and myocardial infarction. insights into the way in which thrombin interacts with its many substrates and cofactors have been provided by crystal structure and site-directed mutagenesis analyses, but until recently there has been little consideration of how its non-proteolytic functions are performed. we investigated the cardiovascular effects of seven modified proline- and rgd-containing peptides designed from three surface-exposed sites of prothrombin, corresponding to residues 218-223, 332-347, and 445-454.
methods: cardioprotective effects of the synthetic peptides were tested in two rat models of heart failure produced by coronary artery occlusion (10 or 45 min) and reperfusion (30 or 240 min). arterial blood pressure from the left carotid artery, heart rate and ecg (standard lead ii) were measured throughout the experiment. at the end of the second experiment, hearts were examined morphologically by light and electron microscopy. results: in the animal model with short-term ischemia, the investigated peptides did not protect against myocardial ischemia during occlusion; however, tp-l13, bk-mc and tp-h7 protected rat hearts from ventricular fibrillation, contributed to a more significant functional recovery during reperfusion and raised the survival rate. in the model with prolonged ischemia, tp-h7 and bk-mc showed an acceptable cardioprotective effect. these peptides significantly diminished the necrotic zone of the left ventricle and protected hearts from ischemia-reperfusion induced functional and morphological changes. conclusions: the investigated proline-containing peptides showed activity on the cardiovascular system: a decrease in blood pressure, cardioprotective properties and improved recovery from ischemia. r. mansi, d. tesauro, c. pedone, e. benedetti, g. morelli the widespread use of compounds containing the gamma-emitting radionuclide 99mtc in nuclear medicine for scintigraphic imaging, as well as the recent introduction of the beta-emitting radionuclides 188re and 186re in radiotherapy, has led to a rapid development of their chemistries in order to produce novel radiopharmaceuticals. we have developed new peptide-based radiopharmaceuticals built on a scaffold in which the radioactive metal ion is complexed by two different peptides that are able to bind two target receptors (see figure). the 3+1 mixed-ligand approach has been used for the preparation of neutral oxotechnetium(v) and oxorhenium(v) peptide complexes.
the complex preparation requires the simultaneous action of a dianionic tridentate ligand and a monoanionic monodentate thiolato ligand on a suitable metal precursor. the dianionic tridentate ligand is based on the snn donor set, which is able to stabilize the metal complex. the chelating agent (hsc(ch3)2conhch2ch(co-r)nh2) was coupled step by step to a bioactive peptide synthesized on solid phase. the second ligand, based on a monodentate thiolato moiety, was coupled to the n-terminus of the second peptide. labelling procedures and biological tests on tumour cells overexpressing receptors are described for 99mtc(o) complexed by cck8 and octreotide peptide derivatives. background: endostatin inhibits the proliferation of endothelial cells and induces their apoptosis. the measurement of serum endostatin can predict tumor vascularity. tumor angiogenesis is a strong prognostic factor in patients with hepatocellular carcinoma (hcc). significantly high levels of endostatin were noted in patients with trabecular-type tumors and with hepatitis infection. methods: 20 patients with hcc, 16 patients with git malignancies (8 with liver metastasis and 8 without metastasis), and 10 normal persons were studied. all subjects were tested for alpha-fetoprotein (afp), ca19.9, carcinoembryonic antigen (cea) and endostatin by elisa. results: endostatin in normal control persons was 47.5 ± 14.22 ng/ml, with a significant elevation (p < 0.001) between the hcc group and all the other tested groups. afp was 1.9 ± 0.98 ng/ml in normal persons, with a significant elevation between hcc and all the other tested groups (p < 0.01). ca19.9 was 8.14 ± 1.89 u/ml in normal persons, with a significant elevation (p < 0.01) relative to hcc and a significance of p < 0.001 relative to git cancers with metastases. cea was 1.12 ± 0.71 µg/l in normal persons and had a significance of p < 0.001 relative to git metastases.
endostatin was elevated in 2 of 8 patients with git cancers not proved to have metastasis. conclusion: endostatin can be used to denote metaplasia and can also detect possibilities of metastasis or liver cell involvement even before the frank development of metaplasia. affibody® molecules are a novel class of affinity proteins which are generated by combinatorial engineering of the 58 aa three-helix bundle scaffold originating from the b domain of staphylococcal protein a. we have used fmoc/tbu chemistry for the total chemical synthesis of the affibody zher2:342, which binds with picomolar affinity to the cell surface receptor her2. the synthetic protein was investigated for molecular imaging of her2-overexpressing tumours. in vivo detection of her2 in malignant tumours provides important diagnostic information which may influence patient management. to enable gamma camera imaging of the tumours, a panel of potential 99mtc-chelating sequences was designed and introduced into the affibody. the well-studied tc-chelating sequence mercaptoacetyltriglycyl (mag3) was compared to serine-containing sequences with increased hydrophilicity, such as mercaptoacetyltriserinyl (mas3). the total synthetic yield was 14-16 % and the her2-binding affinities of the affibody conjugates were all in the range 200-400 pm. binding specificity of tc-labelled affibody molecules was determined on her2-expressing skov-3 ovarian carcinoma cells. all variants showed receptor-specific binding. the tumour-targeting properties were studied in skov-3 tumour-bearing nude mice. all conjugates demonstrated high tumour uptake, quick blood clearance and low uptake in most other organs. the biodistribution results further showed that the more hydrophilic, serine-containing chelators resulted in reduced hepatobiliary excretion, which significantly decreases the background in the abdominal area and provides for more sensitive detection.
gamma camera images of mice with grafted tumours showed clear visualization of her2-expressing tumours using the 99mtc-labelled mas3-affibody conjugate, suggesting a potential future application of this agent for diagnostic imaging. antimicrobial peptides are molecules with a unique mechanism of action. they are widespread in nature and serve as an effective weapon of the innate immune system against bacteria, fungi and viruses. the purpose of this study was to investigate the in vitro activity of natural antimicrobial peptides (citropin, piscidin, protegrin, temporin, uperin) and of analogues of antimicrobial peptides (iseganan, pexiganan and omiganan). the peptides were synthesized using the solid-phase method and purified by high-performance liquid chromatography. the peptides were subjected to microbiological tests [mic (minimal inhibitory concentration) and mbc (minimal bactericidal concentration)] on reference strains of bacteria, according to the procedures outlined by the national committee for clinical laboratory standards (nccls). for comparison, conventional antibiotics (vancomycin, rifampicin, piperacillin, chloramphenicol) were included in this research. both the natural antimicrobial peptides and the analogues inhibited the growth of bacteria, but at higher concentrations than did conventional antibiotics. nevertheless, both the natural origin of antimicrobial peptides and their low toxicity constitute a considerable advantage, which is an argument for considering them good candidates for medicines. the linear hexapeptide cypate-grdspk (compound 1; the cypate moiety is a near-infrared fluorescent label), whose rgd sequence was rearranged to grd, showed high uptake in αvβ3 integrin-positive tumor tissues in vitro and in vivo. despite the low affinity of 1 for the integrin in the binding assays, the uptake was inhibited by equimolar amounts of the cyclic peptide c(rgdfv), which possesses high affinity for αvβ3 integrin.
these observations led to the hypothesis that cell internalization of compound 1 may be mediated mostly by only one of the integrin subunits, namely β3. indeed, blocking of αv integrin by the specific antibody did not inhibit the internalization of 1 in tumor cells, in contrast to the successful blocking of cell internalization by the anti-β3 integrin antibody. similar results were obtained in immunocytochemical assays employing the anti-αv and anti-β3 integrin antibodies. also, studies utilizing β3-knockout and wild-type mouse cell lines demonstrated that deletion of the β3 subunit markedly decreased internalization of compound 1 in the β3-knockout cells. the preferential interaction of compound 1 with the β3 subunit of integrins relative to the αv subunit was also supported by molecular modeling studies. in summary, the bulk of our experimental and modeling data emphasizes interaction with the β3 integrin as the primary mechanism of the uptake of cypate-grdspk by tumors. since this compound showed a superior biodistribution profile in vivo, our results may provide a strategy to image and monitor the functional status of the β3 integrin in cells and live animals. background and aim: a growing tumor is accompanied by the development of tumor intoxication. intoxication independs on tumor size and intensity of its break-up. tumor intoxication is one variant of endogenous intoxication. the concentration of tyrosine-containing peptides (tcp) in blood plasma has been proposed as a biochemical marker of endogenous intoxication in cancers of different organs. our aim was to determine the tcp concentration in the blood plasma of patients with ovarian tumor and its association with the severity of the tumor. materials and methods: 178 patients with ovarian tumor (mean age 53 years) were studied. the control group consisted of 20 healthy women without tumor. patients were divided into 2 groups: those with non-malignant and those with malignant ovarian tumor.
tcp content in blood plasma was estimated by our technique. results: the tcp concentration in the control group was 0.32±0.13 mmol. the tested marker was present at increased concentration in the blood plasma of the patients with ovarian tumors. the mean tcp concentration in patients with non-malignant tumors was 0.53±0.16 mmol. the content of this marker in the blood plasma of patients from the second group was increased (1.32±0.20 mmol) compared with the healthy control group. after treatment a significant decrease in tcp content was observed. conclusions: the results indicate that the tcp content in blood plasma depends on the type of tumor. determination of the tcp concentration in blood plasma could be useful for improving the diagnosis of ovarian tumors and monitoring their progression.
c. strongylis 1 , ch. papadopoulos 1 , k. naka 2 , l. michalis 2 , k. soteriadou 3 , v. tsikaris 1
troponin is a structural protein complex that is responsible for the regulation of skeletal and cardiac muscle contraction. it consists of three components: troponin i (24 kda), troponin c (18 kda), and troponin t (37 kda), each of which carries out different functions in the striated muscles. cardiac troponins are released into the bloodstream of patients after the onset of cardiovascular damage. even minimal elevations of serum troponin t and i over normal values are used to diagnose acute myocardial infarction and also to rule out the patients' condition. the development and commercialization of highly specific biological assays for the detection of cardiac troponins is based on the production of specific antibodies against the whole complex or individual subunits. however, the specificity and sensitivity of these assays vary due to problems originating mainly from the fact that cardiac troponins have a high homology with the skeletal isoforms.
the aim of this work is to select and synthesize appropriate regions of the cardiac isoforms of troponin i, c and t, suitable for the production of more sensitive and specific cardiac troponin detecting reagents. in order to construct the immunogenic complexes, the selected sequences were conjugated to the tetrameric sequential oligopeptide carrier (soc4), either by the classic solid-phase step-by-step methodology or by chemoselective ligation reactions. using the carrier-conjugated troponin sequences, anti-cardiac troponin complex-specific antibodies were produced in high titers.
the increasing problems with the reproductive systems of man and animals have recently been linked to the presence of polluting chemicals with endocrine activity, the so-called endocrine disrupting chemicals or edcs. the family of edcs is heterogeneous and consists of natural and synthetic hormones (like estradiol, ethynylestradiol and diethylstilbestrol), phyto-estrogens (like genistein and coumestrol) and industrial chemicals (like bisphenol a, phthalates and various pesticides). because of the complexity of the environmental matrices and the low physiologically active concentrations of the edcs, there is still a need for an efficient routine analysis protocol. we want to develop a solid-phase bound receptor that possesses a high selectivity for edcs and thus can be used in a simple solid-phase extraction protocol. this receptor must have the right functional groups to bind the edcs with strong affinity and must be able to create a cavity in which the edcs can fit. by looking at nature's own estrogenic receptor in humans we have identified the different amino acids responsible for the specific interactions. in order to create a cavity that mimics the behaviour of the hormone-binding domain of the human estrogenic receptor we have made a tripodal scaffold.
this tripodal scaffold has three orthogonally protected amino groups that will allow the generation of three independent peptide chains.
milk proteins are a source of opioid peptides. these peptides are liberated from milk proteins during enzymatic hydrolysis. some of these peptides have agonistic (β-casomorphins) and some have antagonistic (casoxins) properties. the aim of the investigations was to determine the presence of opioid peptides with antagonistic properties in milk products. the experimental material included cheeses, yogurts and kefirs. peptides were extracted with a methanol-chloroform mixture (2:1 v/v). the peptide extracts were purified by the spe method on c18 or stratax columns and characterized by sds-page electrophoresis. the antagonist opioid peptides (casoxins) were identified by hplc using standard antagonist peptides. the opioid activity was measured by examining the effects of peptide extracts on the motor activity of isolated rabbit intestine. the results of sds-page electrophoresis indicated the presence of 5 to 9 fractions in peptide extracts derived from cheeses and yogurts and 17 to 20 in those from kefirs. the presence of casoxin a (0.22-0.68 µg/mg of extract) was demonstrated in all the examined milk products. lactoferroxin a (0.31-1.88 µg/mg of extract) was identified only in kefirs and yogurts. those products were also found to contain trace amounts of casoxin c. all peptide extracts showed antagonistic activity in relation to the motor activity of isolated rabbit intestine. the highest antagonistic activity was found for peptide extracts from kefirs (3.62-17.20%) and gouda cheese (15.68-16.36%), as compared to morphine. the physiological and nutritional function of these antagonist peptides requires elucidation.
a. péter 1 , r. berkecz 1 , f.
fülöp 2
the past decade has seen a growing interest in β-amino acids, which are important intermediates for the synthesis of compounds of pharmaceutical interest and can be used as building blocks for peptidomimetics. oligomers of β-amino acids (β-peptides) fold into compact helices in solution. recently, a novel class of β-peptide analogues adopting predictable and reproducible folding patterns (foldamers) was evaluated as a potential source of new drugs and catalysts. studies on synthetic β-amino acids can be facilitated by versatile and robust methods for determining the enantiomeric purity of starting materials and products. high-performance liquid chromatography (hplc) is one of the most useful techniques for the recognition and/or separation of stereoisomers including enantiomers. the aim of the present work was to evaluate hplc methods for the separation of enantiomers of eighteen 3-amino-3-aryl-substituted propanoic acids (β-amino acids). direct separations were carried out on different macrocyclic glycopeptide based stationary phases, such as ristocetin a containing chirobiotic r, teicoplanin containing chirobiotic t, teicoplanin aglycon containing chirobiotic tag, vancomycin containing chirobiotic v columns and on a chiral crown ether based column. the effects of different parameters on selectivity, such as the nature of the organic modifier, the mobile phase composition, the flow rate and the structure of the analytes are examined and discussed. the separation of the stereoisomers was optimized by variation of the chromatographic parameters. the efficiency of the different methods and the role of molecular structure in the enantioseparation were noted. the elution sequence of the enantiomers was determined in most cases.
a rational approach to evaluating peptide purity
a. swietlow 1 , r.
lax 2 1 amgen, inc., pharmaceutics, thousand oaks, ca 2 polypeptide laboratories, inc., torrance, ca, usa
recent years have seen an enormous increase in interest in peptide therapeutics. new peptide leads are often chosen by screening procedures using microgram to milligram quantities of peptides, frequently provided by specialized manufacturers utilizing automatic synthesizers to maximize output. the purity of the resulting compounds is often not very high. the use of spps synthetic procedures predetermines that most impurities are closely related and difficult to resolve by reverse-phase purification. these factors, combined with the use of generic analytical methods not specifically optimized for the peptide in question (e.g. the ubiquitous 0.1% tfa/water/acetonitrile system), lead to erroneous results that frequently severely overestimate the purity of the peptide. the use of poorly characterized materials in pharmaceutical development leads to significant risks of obtaining false negative or false positive results, which may cause potential leads to be overlooked or their pharmacological profiles to be misinterpreted. we describe a rapid, systematic and reliable hplc procedure for evaluation of peptide purity. exploiting the increased separation efficiency obtained by raising the column temperature and adjusting the gradient in two steps, in reverse-phase buffers containing tfa or naclo4 or in ion-exchange buffers containing kcl, we demonstrate that methods suitable for preclinical research can be developed rapidly. the proposed approach will be illustrated with examples of peptides ranging between 9 and 28 amino acids and a model peptide vypnga. it will be demonstrated that peptides showing an hplc purity close to 100% are often 10-25% less pure.
to approach a high-throughput cell assay format using peptides, we attempt to design and construct a peptide microarray for examination of cell activities of peptides including apoptotic cell death.
peptides were immobilized onto solid surfaces via a novel multi-functional linker. the linker enables us to examine various types of peptide cell assays in an array format. we also designed and synthesized peptidyl capture agents on the basis of the cell-active sequences suitable for the peptide microarray.
the utility of targeted nanoparticles as fluorescent probes for tissue imaging has recently attracted widespread interest. one exciting prospect is the further development of nanoparticles conjugated to both targeting peptides and cytotoxic cargoes. these nanodevices could preferentially bind to specific cells and/or tissues to provide effective tools for drug delivery. hence, such multifunctional nanoparticles could provide both diagnostic and therapeutic functions by acting as fluorescent probes that offer targeted delivery of therapeutic agents. we have coated the surface of quantum dots (qdots) with cell-penetrating peptides (cpp) to target and label u251m cells for fluorescence imaging. qdots were initially coupled to polyethylene glycol linkers via carboxyl functionalities on their surface. a heterogeneous mixture of poly-arginine peptides of varying lengths (arg(6)-arg(10)) was covalently coupled via amide bonds to the polyethylene glycol linkers, conferring a cell-penetrating capacity on the modified qdots. fluorescence imaging of u251m cells, after incubation with the conjugated qdots at concentrations of 20 nm, gave clear signals indicating cell binding and internalisation of the modified qdots across the plasma membrane. we aim to further expand this work by employing racemic mixtures of cpp and cytotoxic agents to engineer conjugates that will facilitate both imaging and the therapeutic delivery of cytotoxic moieties.
the ability of cell-penetrating peptides (cpp) to deliver biologically active cargoes into different cell types has been successfully applied in several experimental systems.
despite the progress and the growing number of described cpps, reports about the internalization mechanisms and the intracellular routes of cpps remain controversial. we have characterized the membrane interaction and cellular localization of proteins delivered into hela cells by the cell-penetrating peptide transportan (tp) and its shorter analogue tp10 at the ultrastructural level. our previous results obtained by transmission electron microscopy showed that complexes of transportans with gold-labeled streptavidin translocated into cells by inducing large invaginations of the plasma membrane, suggesting uptake by macropinocytosis. the complexes of transportan with gold-labeled neutravidin, in contrast, were taken into cells mostly via caveosomes and clathrin-coated vesicles. the cell-transduced transportan-protein complexes localized mainly in vesicular structures of different size and morphology. most of the complex-containing vesicles in the perinuclear area also contained lamp2 protein, a marker of late endosomes and lysosomes. still, the transportan-protein complexes were not confined to the membrane-surrounded vesicles, but spread into the cytosol, suggesting the escape of transportan-protein complexes from endosomes. our findings show the involvement of different endocytic pathways in the transportan-mediated uptake of proteins. the concentration of a cpp and the properties of the cargo protein seem to determine the pathway for the cellular uptake of a particular construct.
rgd peptides (r = arginine; g = glycine; d = aspartic acid) have been found to promote cell adhesion upon interaction with alphav-beta3 receptors, which are strongly overexpressed during neoangiogenesis by solid tumor associated cells compared to healthy cells. in this study we designed new targeting motifs aimed at delivering various antitumoral drugs specifically to cells involved in tumor vascularization. we inserted this short rgd sequence into cyclotetrapeptides closed by various means.
we expect these new cyclotetrapeptides to be more specific for the targeted receptor. moreover, this new type of cyclic peptide was multimerized on different scaffolds to further improve the receptor avidity. our purpose is first to scrutinize and quantify the cellular uptake of these molecules and second, to address the specific cell targeting of a fluorescent cargo using different tools such as fluorescence-activated cell sorting (facs) analysis or fluorescence microscopy. these new targeting units were evaluated on two different cell lines: human umbilical vein endothelial cells (huvec), which overexpress the alphav-beta3 integrin receptor, and a549 cells, which express a much lower level of this receptor. preliminary results on the selectivity and the efficacy of these new targeting units will be presented.
we have recently developed new approaches for obtaining highly immunogenic peptide conjugates: synthetic polyelectrolytes (pe) were used for conjugation with peptide molecules, in which the pe act as carrier and adjuvant simultaneously. in this study, 4 epitopes of antigenic parts of the surface antigen of hepatitis b virus (2-16, 22-35, 95-109 and 115-129 of the s gene) were synthesized. the synthesis of peptides was performed on an explorer pls ® automated microwave synthesis workstation (cem). peptide conjugates of synthetic anionic polyelectrolytes (copolymers of acrylic acid and n-vinylpyrrolidone) were synthesized by carbodiimide condensation following the modification procedures described earlier. the composition and structure of the bioconjugates were characterized by hplc (shimadzu), nanospr-3, zetasizer nano zs, steady state fluorescence spectrometer qm-4 and viscotek tda 302 size exclusion chromatography. it was found that a single immunization of mice with pe-peptide conjugates without classical adjuvant increased the primary and secondary peptide-specific immune response to hbsag.
moreover, these conjugates possess their own selectivity for recognizing the antibody in blood sera of people infected with hepatitis virus.
tissue engineering requires delivery of transplanted cells to organ sites needing repair/regeneration. we have demonstrated that several active laminin peptide-conjugated chitosan membranes enhanced the biological activity and promoted cell adhesion in a cell-type specific manner. the most active laminin peptide (ag73: rkrlqvqlsirt)-conjugated chitosan membrane could deliver keratinocytes to a wound bed. when human keratinocytes were seeded onto the ag73-chitosan membranes under serum-free conditions, more than 70% of the cells attached within 2 hrs. the membranes carrying keratinocytes were stable enough for handling with forceps and were inverted onto the muscle fascia exposed on the trunk of nude mice. keratinocyte sheets were observed after 3 days and colonies appeared after 7 days on the fascia of host mice. these cells were multilayered on day 3 and expressed various keratinocyte markers, including cytokeratin-1, involucrin, and laminin ?2-chain. these results suggest that the ag73-conjugated chitosan membrane is useful as a therapeutic formulation and is applicable as a cell delivery system, such as for delivering keratinocytes to the wound bed. the peptide-chitosan approach may be a powerful cell transplantation tool for various tissues and organs.
the fluorescent semi-conductive (cdse/zns) nanocrystals possess very attractive optical properties that could be used for tracking individual biological receptors in vivo. our aim is to design functionalized water-soluble semi-conductive nanocrystals (or quantum dots, qd) that interact selectively with lipidic or biological membranes. to validate our approach, the interaction between the decorated qd and giant vesicles was observed by optical fluorescence and dark-field microscopy.
in order to solubilize fluorescent nanocrystals and bind them selectively to a lipid membrane, heterobifunctionalized peptidic ligands (lipe) presenting an adhesion domain for the nanocrystal surface, a hydrophilic spacer and a terminal recognition function were synthesized. the colloidal stability of the water-soluble nanocrystals (nc-lipe) was checked by dynamic light scattering and by optical and electron microscopy. the interaction of grafted nanocrystals (nc-lipe) with positive or neutral giant vesicles was observed by optical fluorescence and dark-field microscopy. as shown in figure, negatively charged nanocrystals (nc-lipe) selectively adsorbed onto the surface of positively charged giant vesicles without altering the morphology of the vesicle. the nanocrystals appeared as fluorescent patches growing on the surface of the vesicle until it was completely covered. these ligands (lipe) therefore made it possible to functionalize the nanocrystals chemically while preserving their colloidal stability and their fluorescence in water. furthermore, it was possible to selectively label the vesicle membrane.
creatine analogues for treatment of obesity: us patent 5
bioactive peptides containing pairs of basic amino acids are rapidly metabolized as a result of cleavage by trypsin-like enzymes. to increase the metabolic stability of opioid peptides containing arg-arg and arg-lys pairs, the arg residues were replaced by homoarginine (har)
kenes international
leprince 1 , f. cavelier 2 , p. gandolfo 1 , m. diallo 1 , h. castel 1 , l. desrues 1
m304 new bradykinin analogues modified with 1-aminocyclopentane-1-carboxylic acid effect on rat blood pressure and rat uterus
o. labudda-dawidowska 1 , d. sobolewski 1 , m śleszyńska 1 , i derdowska 1
synthesis and heparin binding sites identification
f. baleux 1 , f.
arenzana-seisdedos
chemokines are small proteins involved in numerous biological processes (inflammation, immunity, morphogenesis, tissue repair, and tumour development). the general goal of our project is to elucidate the role that hs plays in vivo in the physiologic and pathologic activities of the chemokine cxcl12/stromal derived factor-1 (sdf), and to characterise the molecular and structural determinants accounting for the interaction. three sdf isoforms, alpha (68 aa), beta (72 aa) and gamma (98 aa), have been identified. we previously identified the major heparin binding site on sdf alpha and demonstrated the importance of the hs/sdf interaction in the inhibition of hiv cell entry (1,2). the sdf gamma amino acid sequence corresponds to the sdf alpha sequence extended by a c-terminal 30-amino-acid sequence containing putative heparin binding sites. in order to determine the sdf gamma heparin binding sites, wild-type and mutant proteins were synthesised by stepwise solid-phase peptide synthesis using fmoc chemistry
prusiner proc. natl. acad. sci
here we describe xenome's drug development process for the chi family conopeptide χ-mria[1] of the predatory marine snail conus marmoreus, leading to a suitable drug candidate (xen2174). xen2174 is highly selective for the norepinephrine transporter (net) compared to other transporters, such as those for dopamine and serotonin, and inhibits net via an allosteric mechanism. xen2174 is currently in a phase i/iia clinical trial for the treatment of severe pain. an intensive synthetic analogue and screening program around χ-mria, incorporating early stage animal data xen2174 isomers were synthesized via selective disulfide bond formation to identify the active connectivity. data from alanine scans, single amino acid mutations and probing of backbone interactions, combined with the full 3d nmr structure, led to the development of a pharmacophore for xen2174.
this model is refined from further studies where structure-activity relationships were developed utilising binding and functional assay data for a range of peptides
anti-cancer drug design
published structural data for hdac-like protein, a bacterial enzyme sharing high homology to the hdacs in its active site, confirmed that this protein contains a zinc atom in the active site. for the discovery of specific hdac inhibitors, a number of hydroxamic acids and related compounds have been designed based on the ligating function to the zinc atom. the mechanism also involves an appropriate nucleophile in the active site. chlamydocin is a cyclic tetrapeptide
on the other hand, we have been focusing on the cyclic tetrapeptide to develop potent and specific inhibitors of hdacs. in the present report, we employed the chlamydocin scaffold and successfully introduced a series of thioethers as the functional group on the cyclic tetrapeptide. it is well argued that strong inhibition of hdac requires the best combination of zinc ligand, capping group, and an appropriate spacer between them
jerusalem 3 department of biological regulation 4 department of organic chemistry 5 department of biological chemistry, weizmann institute of science, rehovot, israel
estrogen has a key role in the regulation of skeletal growth and maintenance of bone mass. the use of estrogen and selective estrogen receptor (er) modulators in the treatment of osteoporosis is limited due to substantial risks of breast cancer. recently, we developed peptides having estrogen-like activity
peptide emp-1 had no effect on bone growth in normal mice, and did not influence the ovx-induced bone loss. we then developed a new µct methodology to evaluate uncalcified and calcified growth-plate parameters. in the ovx mice, peptide emp-1 reduced the volume and thickness of the uncalcified growth plate, a possible cause of the inhibition of longitudinal bone growth.
based on a reported enhancement of er-in female mice
during protein biosynthesis, after the release of the nascent polypeptide chain, ribosome recycling factor (rrf) disassembles the post-translational complex. rrf has been shown to be essential for bacterial growth. thus, we are attempting to design suitable compounds to inhibit rrf function as candidates for a new type of antibiotic. we have determined the structure of rrf, which has 185 amino acid residues, by nmr and x-ray analyses and shown that rrf has two domains: domain i with a three-stranded helical bundle and domain ii with β/α/β
furthermore, we recently determined the structures of the 70s ribosome-rrf complex by cryo-electron microscopy and of the 50s ribosome-rrf domain i complex by x-ray analysis
using the results of these experiments, we elucidated the interaction profiles between rrf and the ribosome and found that the cationic center consisting of three arginine residues on the surface of the helical bundle, which we have shown to be essential for the activity of rrf, is bound to helix 69-71 of 23s ribosomal rna. we synthesized the rna and peptide fragments around this interacting site and characterized them by physico-chemical analyses. the results of cd and biacore experiments to investigate the details of the interactions between them showed that a 27-mer rna fragment is bound to rrf
biochemistry 42
causes an increase of pain by inhibiting the opioid response [1]. recent research has shown further that melanocortin receptors, mainly subtype mc4r, produce an increase in response to pain stimuli [2].
based on this previous work, we are developing chimeric ligands which will be of benefit to therapeutic pain treatment with enhanced opioid efficacy by acting as agonists at opioid receptors and antagonists at both cck and melanocortin receptors
cck (i) and melanocortin (ii) pharmacophores were overlapped by trp, and different profiles of opioid pharmacophores (iii) were linked to the n-terminal of the melanocortin pharmacophore (figure). the designed ligands showed moderate to high biological activity at all three receptors depending on their respective structures. design considerations and structure-activity relationships will be discussed in detail along with in vivo assay results
china
synthetic exendin-4 is a 39-amino acid peptide that exhibits potent anti-diabetic and dose-dependent glucose-regulatory activity. exendin-4 is susceptible to degradation in plasma, so its activity is limited. our aim is to find sites in exendin-4 that are susceptible to cleavage and to provide information for designing new exendin-4 analogues. in this study the stability of exendin-4 in human plasma was evaluated in vitro. exendin-4 was incubated in plasma at 37 °c, extracted with sep-pak octadecyl columns and subsequently analyzed using high performance liquid chromatography (hplc). the results showed that exendin-4 was slowly broken down in plasma with a half-life of 9.57 h. the degradation products were identified by quadrupole time of flight mass spectrometry (q-tof-ms) with electrospray ionization
pharmacology of exenatide (synthetic exendin-4): a potential therapeutic for improved glycemic control of type 2 diabetes
one of the proposed solutions for the pharmacotherapy of obesity, a major health problem in the western world, is to regulate the biochemical pathways that control the metabolic balance in the body. the melanocortin pathway regulates energy balance through binding of the catabolic endogenous neuropeptide αmsh to its mc4 receptor, which causes a decrease in food intake.
we have synthesized a backbone cyclic peptide library, based on the minimal αmsh sequence phe6-d-phe-arg-trp9 [1], that activates the mc4 receptor. all the members of the library shared the same sequence
analysis using colorimetric liposomes confirmed that bbc-1 penetrates the intestinal cells by the transcellular mechanism. moreover, bbc-1 has high metabolic stability to intestinal enzymes (100% recovery after 5 hr). ec50 analysis showed that bbc-1 selectively binds and activates the mc4 and mc5 receptors (ec50 3.97±0.63 and 7.27±0.40 respectively). oral administration of bbc-1 in mice showed reduced food intake
melanocortin tetrapeptides modified at the n-terminus, his, phe, arg and trp positions
2 department of experimental and health sciences
graduate school of pharmaceutical sciences 2 graduate school of agriculture 3 graduate school of medicine
consisting of 54 amino acids, is a product of the metastasis suppressor gene kiss-1. this c-terminally amidated peptide was identified as the endogenous ligand of an orphan g-protein coupled receptor
metastin-gpr54 signaling may regulate gonadotropin secretion and negatively regulate cancer metastasis. it is interesting that activation of gpr54 signaling negatively regulates the function of the sdf-1-cxcr4 axis in cho and hela cell transfectants
we conducted a structure-activity relationship (sar) study on kisspeptin-10 using the neuropeptide-derived rw-amide scaffold to identify five-residue peptide amides as novel gpr54 agonists equipotent to kisspeptin
pro-), where aoe is (2s,9s)-2-amino-8-oxo-9,10-epoxydecanoyl. in continuation of our study to design and synthesize analogues bearing a zinc ligand in order to develop potent histone deacetylase (hdac) inhibitors, we shifted the aromatic ring of phenylalanine to the aminoisobutyric acid (aib) position and also to the imino acid position. the aim is to explore the interaction of the cyclic scaffold with the rim of hdac paralogs.
we replaced the epoxyketone moiety of aoe with a sulfhydryl group, protected as a disulfide hybrid, as the zinc ligand. a benzene ring was introduced into the aib structure to design amino-1-indane carboxylic acid and amino-2-indane carboxylic acid. aromatic ring-containing imino acids, such as d-tic, were also employed to replace d-pro. the cyclic tetrapeptides were profiled by the inhibition of hdac1, hdac4, and hdac6
in adult goats, however, the infection remains unapparent and the virus may cause abortion, vulvovaginitis or balanoposthitis. the use of a vaccine could provide a powerful tool for the control of cphv-1 infection. synthetic peptide-based vaccines have the advantages of being selective, chemically defined and safe. in order to further localize immunogenic epitopes, glycoproteins b, c and d of cphv-1 were analyzed with several prediction programs. peptide conjugates incorporating t and b cell determinants in multiple copies in a branched architecture are better immunogens. for the preparation of goat vaccines, we synthesized peptide conjugates bearing a t cell epitope on the n-terminal of the core. b cell epitopes were conjugated via a thioether bond to the nε-amino group of four chloroacetylated lysine residues of the core. elisas confirm that the b cell epitopes and the conjugates t celltetratuftsin induce epitope-specific and antibody responses
two recording electrodes were implanted bilaterally. eeg was recorded after the mirror focus arose. trh was applied intranasally in ultra low doses (10^-9 m and 10^-12 m) or intravenously in high doses (25 mg, 50 mg, 100 mg). for eeg registration and analysis the computer system conan was used, together with a new modification of fractal analysis of quantized eeg. results: synchronization of epileptic activity between the primary and mirror foci was observed on the third day after the operation. intranasal application of trh reduced spontaneous focal epileptic activity both in the primary cobalt-damage focus and in the mirror focus for more than 1 h.
the inhibition of the mirror focus was more pronounced. in contrast, intravenous trh administration provoked epileptic discharges in both foci. intense synchronized generalized activity was recorded for 30-40 min
deraos 1 , t.v. tselios 1 , i. mylonas 2 , g.n. deraos 1
it is known that cyclic peptides are more stable against enzymatic degradation and more conformationally restricted than linear ones. the cyclization was achieved using o-benzotriazol-1-yl-n,n,n',n'-tetramethyluronium tetrafluoroborate (tbtu) and 1-hydroxy-7-azabenzotriazole, 2,4,6 collidine, allowing a fast reaction and a high yield of the final product. the purification was achieved using reversed phase high performance liquid chromatography (rp-hplc) and the peptide purity was assessed by analytical hplc and mass spectrometry (esi-ms). the synthesized cyclic plp analogue was found to exhibit lower agonist eae activity compared to the linear plp139-151 epitope in sjl/j mice. this implies that the conformation of the cyclic analogue does not trigger autoimmune reaction in the central nervous system and therefore encephalomyelitis
the netherlands 3 department of experimental and health sciences
on the other hand, in both the central and peripheral nervous system, cck acts as a neurotransmitter. recently, attention has focused especially on the modulatory effects of cck on feeding. in this study, we tried to establish a sensitive and specific enzyme immunoassay (eia) for detecting cck and to investigate the effect of some dopamine receptor antagonists. using this eia, we measured plasma cck-like immunoreactive substance (is) levels in five healthy human subjects after single oral administration of some prokinetics. the minimum amount of cck detectable by our eia system was 2.0 pg/ml, and the ic50 of the calibration curve was 75 pg/ml. we found that domperidone and itopride caused significant decreases in plasma cck-is levels but metoclopramide and sulpiride did not. we established a sensitive and specific eia for cck.
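the two calibration figures quoted for the cck eia (a detection limit of 2.0 pg/ml and an ic50 of 75 pg/ml) can be related through a calibration curve. the sketch below is illustrative only: it assumes a four-parameter logistic (4pl) curve, a model commonly used for competitive immunoassays but not stated in the abstract, and the values of `TOP`, `BOTTOM` and `SLOPE` as well as both function names are hypothetical. only the ic50 and the detection limit are taken from the text.

```python
# Hypothetical 4PL calibration curve for a competitive EIA.
# From the abstract: IC50 = 75 pg/ml, detection limit = 2.0 pg/ml.
# Assumed for illustration: TOP, BOTTOM, SLOPE and the 4PL model itself.

IC50 = 75.0     # pg/ml, from the abstract
TOP = 100.0     # % bound with no analyte present (assumed)
BOTTOM = 0.0    # % bound at saturating analyte (assumed)
SLOPE = 1.0     # Hill slope of the curve (assumed)


def percent_bound(conc):
    """Predicted signal (% of maximum binding) at a cck concentration in pg/ml."""
    return BOTTOM + (TOP - BOTTOM) / (1.0 + (conc / IC50) ** SLOPE)


def concentration(signal):
    """Invert the curve: concentration in pg/ml that gives this % bound signal."""
    if not BOTTOM < signal < TOP:
        raise ValueError("signal must lie strictly between BOTTOM and TOP")
    return IC50 * ((TOP - BOTTOM) / (signal - BOTTOM) - 1.0) ** (1.0 / SLOPE)


if __name__ == "__main__":
    # By definition the signal is halfway between TOP and BOTTOM at the IC50.
    print(percent_bound(IC50))
    # Under these assumed parameters, the signal at the reported detection limit.
    print(percent_bound(2.0))
```

under these assumed parameters the curve is steepest around the ic50 and nearly flat close to the detection limit, which is why a measured signal near the top of the curve translates into a less precise concentration estimate. a real assay would fit all four parameters to measured standards.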
furthermore pro-pro-phe-phe-), isolated from linseed oil [1], possesses a strong immunosuppressive activity comparable at low doses with that of cyclosporin a [2]. it has been postulated that the tetrapeptide sequence pro-pro-phe-phe is important for biological activity of cla on the basis of this information we have synthesized a series of cla analogues in which the alpha-proline residue was replaced by beta2-iso-proline and beta3-homo-proline residues, respectively (fig.1). the synthesis of titled beta-amino acids has been achieved according to the literature procedure italy synthetic peptides are largely used as antigens in solid-phase immunoenzymatic assays (elisa) for recognition of antibodies (abs) in biomedical research and, most importantly, in the set up of diagnostic methods. it is well known that the method of peptide immobilization on the solid support is very important for a correct ab recognition atherosclerosis in patients infected with helicobacter pylori (hp) we synthesized appropriate oligopeptides immobilized on cellulose via n-or c-termini, using standard -alanine linkers as well as a new linker, developed for this particular studies glc)ghsvflapygwmvk) we found the strongest recognition when the peptide was linked to the cellulose support via the c-terminus. however, in the case of ureb f8 hp urease smallest epitope (sikedvqf), and epitope ub-33 (321-339 hp fragment: chhldksikedvqfadsri) the strongest reactions with sera of atherosclerosis suffering patients were obtained for n-terminally anchored peptides the synthetic dipeptide gamma-d-glutamyl-l-tryptophan (scv-07) has been shown to stimulate t-lymphocyte differentiation and specific immune responses, and enhance il-2 and inf-gamma production. due to this preferential activation of th1 cytokine production, scv-07 may show utility in treatment of infectious diseases cba mice at a dose of 2,500 µg/kg. 
the same animals were used for all three methods of administration with a dosing interval of 2 weeks. blood samples were taken from the right retro-orbital sinus. for determination of the scv-07 concentration in blood samples, an "eia-scv-07" competitive solid-phase immuno-enzyme assay was developed (loq 20 ng/ml) with a mean residence time of 10 minutes. 5 minutes postdose, indicating very rapid absorption. mean concentrations then declined and were measurable through 1.5 and 3 hours postdose (mrt 20 and 50 minutes, for i/p and p/o, respectively). the estimated bioavailability of scv-07 after i/p and p/o administration was almost equivalent. gastrin-17 (g17) is a peptide which promotes gastric acid secretion, cell proliferation, and occasionally gastrointestinal cancer in the gastric antrum. g17 also promotes the growth of cancerous colonic epithelial cells, but the cck2/gastrin receptor, which mediates its activity, is largely not expressed on such cells. instead, our previous studies have shown that some other receptor mediates stimulation of proliferation of dld-1 and ht29 human colonic carcinoma cells by ncarboxymethyl gastrin (g17gly) at nanomolar concentrations and inhibition at micromolar concentrations, indicating separate binding sites. we have shown previously that g17(1-12)-oh stimulates cell proliferation of ht-29 cells -6)-nh2, in order to determine their selectivity for and activation of the putative proliferation-stimulatory receptor. the results revealed that g17(1-12)-oh is not selective for a single receptor, but binds both sites as do g17 and g17gly. g17(1-6)-nh2 promotes dose-dependent and non-biphasic proliferation of dld-1 cells and binds a single receptor with low affinity. 
m484 comparative study of ige-binding activity of synthetic peptides rnwd and rnwdvyk v th486 involvement of l-name in the antinociceptive effects of newly synthesized analogues of tyr-mif-1 peptide in rats tyr-mif-1 is able to interact with opioid receptors, with a higher potency at μ sites, as well as with its specific non-opiate receptors in the brain. nitric oxide and the tyr-mif-1 peptides are potent modulators of opioid activities. involvement of no in nociceptive effects is well documented in various physiological and pathological processes in the cns. l-name, when administered i.c.v. or systemically, exhibits antinociceptive activity in rats as evaluated by the pp test. in the present study, we investigated the involvement of l-name in the antinociceptive action of newly synthesized analogues of tyr-mif-1 peptide: nα-(me)tyr-mif-1, d-tyr-(me)-mif-1, tyr(cl2)-mif-1 and tyr(br2)-mif-1. the experiments were carried out on male wistar rats (180-200 g). the changes in the mechanical nociceptive greece paclitaxel is one of the most important anticancer drugs used mainly in treatment of breast, lung, and ovarian cancer and is being investigated for use as a single agent for treatment of lung cancer, advanced head and neck cancers, and adenocarcinomas of the upper gastrointestinal tract. however, the development of resistance to paclitaxel, the side effects and the low solubility of this drug remain major obstacles to its optimal use in clinical practice. in this work, we present the synthesis of various analogues in which paclitaxel is covalently bound to peptides or as multiple copies to synthetic carriers. these peptide-paclitaxel derivatives possess greater solubility in water, could be suitable for producing anti-paclitaxel antibodies and inhibit the proliferation of human breast, prostate and cervical cancer cell lines. 
although, no major differences were found concerning the extent of the antiproliferative effect between various paclitaxel derivatives and paclitaxel, the analogue with four molecules of paclitaxel covalently bound to synthetic carrier [ac-(lys-aib-cys)4-nh2] when used at low concentrations inhibited cell proliferation more potently than paclitaxel th489 involvement of the histaminergic system in the nociceptin-induced pain-related behaviors in the mouse spinal cord the purpose of the present study was to determine whether histamine-containing neurons in the spinal cord are involved in nociceptin-induced behaviors in mice. the i.t. injection procedure was adapted from the method of hylden and wilcox. immediately following the i.t. injection, the time spent for nociceptive behaviors including scratching, biting and licking were measured. the i.t. administration of nociceptin resulted in nociceptive behavioral responses, which were eliminated by the i.t. co-administration of opioid receptor like-1 (orl-1) receptor antagonists. the nociceptive behaviors were significantly attenuated by the i.t. co-administration of the h1 receptor antagonists, but not the h2 receptor antagonists. i.t. co-administration of the h3 receptor antagonist significantly increased the behavioral responses, whereas the behavioral responses were completely attenuated by the i.t. co-administration of the h3 receptor agonist. an antiserum against histamine injected i.t. reduced the nociceptin-induced behavioral responses. the same result was observed by i.p. pretreatment with histidine decarboxylase inhibitor. in conclusion, i.t.-administered nociceptin elicits the orl-1 receptor-mediated nociceptive behavioral responses. the activation of the orl-1 receptor by nociceptin may induce the release of histamine in previous studies, we demonstrated that highly constraint cyclic (s,s)-cxaac-containing peptides inhibit platelet aggregation and fg binding [1,2]. 
cyclization reduces the allowed conformations, of both the backbone and the side chains, and possibly induces a favourable for the biological activity orientation of the charged side chains. conformational studies revealed that orientation of the charged side chains toward the same side of the molecule increase the anti-aggregatory activity of the inhibitor. in this work we present the synthesis and the inhibitory activity of new cyclic compounds. for the design of the studied compounds we combined the available information from the -cdc-containing inhibitors and the gpiib 313-320 (ymesradr) sequence which has been shown to inhibit the adp induced human platelet activation however, its pharmacological effects and physiological functions are still unclear. the present study was designed to characterize the nociceptive behaviours induced by intrathecal (i.t.) administration of hk-1 in mice. the i.t. administration was made in conscious mice according to the method described by hylden and wilcox. immediately after the i.t. administration, the accumulated time for nociceptive behaviours was measured for 10 min. the i.t. administration of hk-1 dose-dependently produced characteristic nociceptive behaviours consisting of scratching, biting and licking, which peaked at 0-5 min and almost disappeared by 15 min after injection. the subcutaneous pretreatment with morphine dose-dependently attenuated the hk-1-induced nociceptive behaviours. the nociceptive behaviours elicited by low-dose of hk-1 were significantly inhibited by i.t. co-administration with nk1 receptor antagonist, however, the nociceptive behaviours elicited by high-dose of hk-1 were not affected by i.t. coadministration with nk1 receptor antagonist. on the other hand, nmda receptor antagonists significantly suppressed both high-and low-doses of hk-1-induced nociceptive behaviours in a dose-dependent manner. 
these results suggest that the nociceptive behaviours induced by low-dose of hk-1 may be mediated through both nk1 and nmda receptors, whereas high-dose of hk-1 may induce the nociceptive behaviours through nmda receptor the bacterial lp are strong modulators of the innate immune system. until recently, it was generally assumed that triacylated lp, like the synthetic pam3cys-sk4, are recognized by tlr2/tlr1 heteromers, whereas diacylated lp, like fsl-1, induce signalling through tlr2/tlr6 heteromers. contrary to this model, we could show that depending on the peptide moiety, diacylated lp also signal in a tlr6-independent and tlr1-dependent manner. the aim of this study was to analyse more closely the structural basis of this heteromer usage. the synthesis of lp was carried out by fully automated solid phase peptide synthesis and fmoc/tbu chemistry on tcp or rink amide resin. information on the structural factors determining the tlr2/tlr1 versus tlr2/tlr6 heteromer usage was obtained by testing of ligands with cells obtained from tlr2 , tlr6 , and tlr1-deficient mice. when stimulating b-lymphocytes of wild-type mice we found that ester-bound long-chain fatty acids are necessary to induce considerable responses. for triacylated lp with long chain length ester-bound acyl residues (like pam3c-ssnask4) the response in tlr1-deficient cells was only slightly decreased, whereas for lp with short length ester-bound fatty acids (like pamoct2c-ssnask4) the response was completely abolished. in summary, a tri-acylation pattern is necessary but not sufficient to render an lp tlr1-dependent and a di naples 2 department of organic chemistry "ugo schiff sera from patients suffering from autoimmune disorders often contain multiple types of autoantibodies, some of which can be exclusive of a disease and thus used as biomarkers for diagnosis. identification of these autoantibodies, as disease biomarkers, should be achieved using native antigens in simple biological assays. 
however, post-translational modifications, such as glycosylation, may play a fundamental role for specific autoantibody recognition. in line with these observations, we previously described synthetic glycopeptides able to detect high autoantibody titers in sera of patients affected by multiple sclerosis, an inflammatory, demyelinating disease of the central nervous system. we also demonstrated that glycopeptides able to reveal high antibody titers in multiple sclerosis sera are characterized by a type i we describe here the result of a conformationally driven rational design exercise, which led to the preparation of new, optimized glycopeptides endowed with enhanced antigenic properties. most importantly, the same approach, based on structure alignment, was used to shed light on the native antigen(s), target of pathogenetic autoantibodies involved in demyelination processes vitro and in vivo evaluation for cholecystokinin-b receptor imaging istituto nazionale per lo studio e la cura dei tumori ome (f0)
(aib, α-aminoisobutyric acid) to investigate its binding properties to tb(iii) ions. according to our published spectroscopic results f0 populate a set of ordered conformations involving 310/α-helical segments and compact structures generated by the formation of a turn around the flexible gly5-gly6 central motif. cd experiments showed that the binding of tb(iii) to f0 gives rise to a structural transition of the peptide chain from a helical to a folded conformation. peptide binding is also responsible for the dramatic increase in the tb(iii) fluorescence intensity, suggesting that the tb(iii)/f0 complex may represent an interesting system for imaging applications or bioanalytical sensing the 16 kda protein of mycobacterium tuberculosis provokes specific immune response, therefore related epitope peptides and peptide-conjugates can be considered as potential diagnostics. in our previous study we have determined the functional human t-cell epitope within the 91-110 region. based on this we synthesised two groups of peptides: a) nand c-terminal alanine and beta-alanine elongated variants of the 91-104 epitope and b) 91-104 peptides with alanine substitution at different position according to the hla dr and tcr binding sites. peptides were prepared by solid phase synthesis using boc/bzl or fmoc/tbu strategy. the homogeneity and the primary structure of peptides were checked by analytical rp-hplc, amino acid analysis and esi-ms. the t-cell stimulatory activity of the compounds was investigated using in vitro assays (proliferation and ifn-gamma production) on the 91-110 epitope specific human t-cell clones and pbmc (peripherial blood mononuclear cells) from patients and healthy (ppd+, ppd-) subjects. the effective peptides were conjugated to branched polypeptides with polylysine backbone (sak, eak), tetratuftsin derivative (h-[thr-lys-pro-lys-gly]4-nh2) and lysine dendrimer (h-lys-lys(h-lys)-arg-arg-beta-ala-nh2) (map) carrier via thioether bond formation. 
the substitution degree of the conjugates was determined by amino acid analysis. pbmc and human t-cell clones were stimulated with the free peptide alone or with peptide-conjugates containing an equimolar amount of peptide or with a mixture of free peptide and carrier italy we demonstrated, for the first time, that an aberrant post-translational modification (ptm, n-glucosylation) is possibly triggering autoantibody response in multiple sclerosis. this was possible because of a "reverse approach", which led to csf114(glc), a structure based designed glycopeptide, as the first multiple sclerosis antigenic probe accurately measuring high affinity autoantibodies (biomarkers of disease activity) in sera of a statistically significant patients' population universal peptide scaffold" to be modified with a series of glycosyl amino acids (different in sugars and linkages), with the aim of developing personalized diagnostic/prognostic tests. the csf114 beta-turn structure, optimally exposing the aberrant ptm specific for antibody-mediated forms of other autoimmune diseases, will lead to a family of antigenic probes to be used in diagnostic this information is encoded by the distribution of the electron-ion interaction potential (eiip) of amino acids along the sequence and is represented by the frequency components in is. proteins with the same biological functions or interacting proteins (e.g. antibody/antigen) share the information corresponding to the common frequency components in their iss. investigation of the hiv-1 envelope glycoprotein gp120, as a model system for hypervariable proteins, revealed that this information is strongly conserved and is not significantly affected by natural mutations. 
the c-terminus of the second conserved region (c2) of gp120, encompassing ntm peptide is important for infectivity and neutralization of hiv-1, while human natural anti-vasoactive intestinal peptide (vip) antibodies reactive with gp120 play an important role in control of hiv disease progression. ntm/vip multiple copies were coupled to an artificial sequential oligopeptide carrier for developing an immunoassay (elisa) as a reproducible, reliable and sensitive tool for the detection of anti-ntm/vip derived antibodies these peptides have been utilized in an immunopeptidometric assay for specific measurement of active, noncomplexed psa. however, this assay has not been sensitive enough for the measurement of active psa in clinical samples. therefore, we aimed to develop an improved assay utilizing the same principle as previously, but using a more sensitive detection method based on proximity ligation assay. methods: in the assay, psa is first captured on a solid phase by a psa antibody czech republic rapidly increasing knowledge of new gonadotropin-releasing hormones (gnrh)of different species of the animal kingdom induces the need to prepare new synthetic derivatives and fragments of these peptides with higher potency and metabolic stability and suitable for the formulation of new immunogens. the species related differences in the sequence of the native mammalian gnrh pglu-his-trp-ser-tyr-gly-leu-arg-pro-glynh2 concern predominantly the positions 5, 7 and 8, particularly tyr in position 5 is replaced for his or leu, leu in position 7 by val or trp, and arg in position 8 is substituted by lys, ser, asn or gln. several gnrh derivatives with with the above substitution and gnrh fragments were prepared by solid phase peptide synthesis and purified by rp-hplc. 
purity of the synthetic peptides was checked by capillary zone electrophoresis (cze); peptides were analysed as cations in acidic background electrolytes (ph 2.25 -2.quantitative analyses for determination of their effective electrophoretic mobilities and the estimation of their effective charges. supported by grant of ministry of agriculture of cr-nazv qf 3028 by grants of ga cr nos we use peptaibiomics for the structural determination of peptaibiotics from fungi grown on single agar plates, thus avoiding time-consuming isolation and purification procedures. the method comprises fast and effective solid-phase extraction followed by on-line rp-hplc coupled to tandem esi-ms. here we present a survey of the peptaibiome of hypocrea species. in extracts of hypocrea semiorbis, h. vinosa, h. dichromospora, h. gelatinosa, h. nigricans, h. muroiana and h. lactea a multitude of short and long-chain peptides containing aib could be characterized. the formation of new and known peptaibiotics could be established by comparison with sequences stored in databases japan process scale rp hplc purification of peptides and proteins is increasingly important in bio-pharmaceutical production. besides selectivity, other crucial factors are loadability, recovery loadability is believed to depend on the surface area of the packing material. consequently, smaller pores providing larger surface area should lead to increased loadability. this principle is misleading in the case of large molecules, because they cannot penetrate smaller pores. therefore the chromatographically accessible surface area has to be taken into account. recovery problems like irreversible adsorption or aggregation are frequently caused by the hydrophobic surface properties of ods phases. the less hydrophobic c8 is a substituent that considerably reduces these problems. however, c8 is less durable than ods under extremely acidic conditions. 
our new proprietary c8 modification technology combined with a perfect end-capping minimizes the presence of residual silanol groups and protects the silica surface sp-200-c8-bio demonstrates high mechanical stability, with no obvious alteration of back pressure and particle size after 10 repeated packing cycles in dac columns. by overcoming the common weaknesses of the conventional c8 rp silica phases, daisogel sp-200-c8-bio opens new avenues for process scale separation of peptides and proteins. m514 determination of peptide: protein molecular ratio in conjugates by seldi-ms method synthetic peptides are widely used as antigens in various research and practical areas of biology and medicine. peptides with molecular masses < 5000 da should be conjugated with carrier proteins in order to ensure their immunogenicity and protect them from proteolysis. in these cases the comparison of peptide immunogenicities and immunotest system development should be performed with the exact peptide-to-protein ratios in mind. 23 conjugates of peptide fragments of hepatitis c virus envelope protein e2 with ovalbumin, bovine serum albumin, and myoglobin were prepared using glutaraldehyde (ga), m-maleimidobenzoyl n-hydroxysuccinimide ester, dimethyl suberimidate (dms), and 1-ethyl-3-(3-dimethylaminopropyl)-carbodiimide as conjugating reagents. rough evidence of peptide-protein conjugate formation was obtained by page. the exact peptide:protein molar ratio was estimated in all 23 conjugates by seldi-ms. almost all conjugates had oligomeric structures due to the formation of intermolecular linkages between proteins. the peptide:protein molar ratio in conjugates varied from 1:1 to 13.6:1. 
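the abstract does not state how the peptide:protein ratio was derived from the seldi-ms spectra. a common way to estimate such a ratio is from the mass shift between the conjugate and the unmodified carrier; the sketch below illustrates that idea only. the function name, the linker-mass parameter, the carrier-count parameter for oligomeric species, and all masses in the usage note are hypothetical and are not taken from the study.

```python
def peptides_per_carrier(conjugate_mass_da, carrier_mass_da, peptide_mass_da,
                         linker_mass_da=0.0, n_carriers=1):
    """Estimate the average number of peptides per carrier protein.

    Assumption (not from the source): each attached peptide adds
    peptide_mass_da + linker_mass_da to the conjugate mass, and the
    observed species contains n_carriers carrier molecules.
    """
    added_per_peptide = peptide_mass_da + linker_mass_da
    if added_per_peptide <= 0:
        raise ValueError("mass added per peptide must be positive")
    if n_carriers < 1:
        raise ValueError("n_carriers must be at least 1")
    # Mass gained over the unmodified carrier(s) in the observed species.
    mass_shift = conjugate_mass_da - n_carriers * carrier_mass_da
    # Total peptides in the species, averaged over its carrier molecules.
    return mass_shift / added_per_peptide / n_carriers


# Hypothetical masses: a 45 000 Da carrier, a 2 000 Da peptide,
# and a conjugate peak observed at 51 000 Da.
ratio = peptides_per_carrier(51_000, 45_000, 2_000)
print(f"estimated peptide:protein ratio = {ratio:.1f}:1")
```

for a dimeric species, passing n_carriers=2 reports the ratio per carrier molecule rather than per oligomer, which matters here because most conjugates were oligomeric.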
conjugates obtained with the ga were more diversified in the number of peptide molecules linked to carrier proteins (peptide:protein ratios ranged from 3:1 to 13:1) than other conjugation reagents mazur-marzec 1 poland posttranslational modifications (ptms) like phosphorylation, acetylation, or methylation have been shown to play a significant role in directing the function of various proteins [1]. in eukaryotes, most of proteins have been shown to be posttranslationally regulated by a variety of different modifications. many effects of ptms include a change of enzymatic activity capillary electrophoresis (ce) has been used to study electrophoretic behavior of ptm-peptides gal-nh2, gal(1-15)-nh2 by capillary electrophoresis. using a phosphate buffer most of ptm-peptides were poorly separated at acidic or neutral ph. the best results were obtained using trifluoroethanol containing separation buffers. optimization of ce separation of maps of peptides containing ptms should allow to detect ptm-proteins and characterize their role in the living cell. comparison of modification events occurring in diseased and healthy cell may iran the purpose of this study was to use the application of multiplex reverse transcription polymerase chain reaction(rt-pcr) assay for detecting the two most common leukemia translocations t(1;19) and t(9;22) in childhood acute lymphoblastic leukemia in iranian children. 32 cases of leukemia patients were screened with the rt-pcr assay. this assay will identify the all type bcr-abl transcripts encoded by the t(9;22) and all described variants of the e2a-pbx1 transcript encoded by the t(1;19). rna was isolated from leukocyte cells of patient's samples. 
through the construction and optimization of specific primers for each translocation, we have been able to set up multiplex rt-pcr reactions. the pcr products were then electrophoresed on agarose gel and compared with size markers and expected fragments. key words: acute lymphoblastic leukemia - multiplex rt-pcr tu521 study on the syntheses and lc/esi-ms analyses of the glutathione conjugates of bile acids t. wakamiya 1 , m. sogabe 1 a carboxylic acid-containing drug, is metabolized to a glutathione (gsh) conjugate in vivo, and the conjugate is excreted in human urine [1]. although bile acids, compounds with a carboxylic acid in the molecule, are also expected to form gsh conjugates in the liver, no evidence has so far been obtained to confirm such metabolism, since there are no suitable standard samples for the research. in the present paper we report the syntheses of the gsh conjugates of the main bile acids in humans, i.e., cholic acid (ca), chenodeoxycholic acid (cdca), deoxycholic acid (dca), ursodeoxycholic acid (udca) and lithocholic acid (lca) as shown below, and the detailed analyses of these synthetic conjugates by means of linear ion trap lc/esi-ms. furthermore, the evidence for conversion of cholyl adenylate [2] and ca-coa thioester into ca-gsh conjugate will be presented these peptides do not exhibit such strong side effects as csa, but their practical application is hindered because of their poor solubility in water. the 49-57 fragment of tat protein and its analogs, including oligoarginine sequences, are known for their unusual ability to cross cell membranes, skin, and blood-brain barriers. moreover, these peptides are able to transport other substances into cells. this strategy was successfully applied in cases of csa, taxol, and other drugs to improve their bioavailability. now we have synthesized a series of analogues of cyclolinopeptide a, clx, and the immunosuppressive fragment of ubiquitin, covalently bound to the cell-penetrating fragment of tat and its analogues. 
the ability to cross the biological membranes and the immunosuppressive activities of these conjugates were tested. the conformation of the peptides was determined by circular dichroism methods: we used fluorescein-labelled cationic cell-penetrating peptides and analyzed the uptake efficiency (flow cytometry) as well as the intracellular distribution (confocal laser scanning microscopy). the bioactivity of a proapoptotic cargo-peptide, delivered into the cells either via electroporation or via cpps was quantified using a caspase-3 activity assay and cellular assays. to address the integrity of cpps during their trafficking, a fluorescent double-labelled antp peptide was designed and used as an intracellular fret-sensor. results: endocytosis-mediated uptake of the cpp-cargo conjugate led to a significant reduction of cargo bioactivity compared to its direct transfer via transient membrane permeation. this finding was related to the sequestration of peptides within endocytic vesicles but also, in the case of the tnf response, to the induction of receptor internalization during cell entry. moreover, during endolysosomal passage peptides undergo significant proteolytical degradation. conclusions: the endocytosis-dependent uptake mechanism of cholesterol pullulan (cp), in which maltose moieties are partially modified by cholesterol, is unique in forming self-assembled nanoparticles (20-30 nm) in water. combination of these characteristics is considered to be promising for development of effective non-viral vectors without toxicity. a conjugate of hiv-tat and cp was synthesized and its gene expression efficiency was evaluated. fully protected hiv-tat-(48-57)-cys(snm)-gly-nh-r was obtained by conversion of the corresponding cys(acm) peptide which was synthesized by the solid-phase method [snm: (n-methyl-n-phenylcarbamoyl)-sulfenyl] [1]. 
the sulfhydryl function was introduced to the hydroxyl groups of cp by acylation with trt-3-mercaptopropionic acid followed by acid treatment. resulting 3-mercaptopropionyl-cp was coupled with cys(snm) peptide to form disulfide bridge and the protecting groups of the peptide were removed to give the cp-tat conjugate. cp-tat and pcmv-luc complex was transfected into cos7 cells and luciferase activity was analyzed after 24 h. cp-tat elicited remarkable cytoplasmic luciferase activity and low toxicity this finding provides a possibility to use gnrh-iii as a targeting moiety for intracellular drug delivery. therefore we have prepared methotrexate and doxorubicine conjugates of gnrh-iii. the drug molecules were attached to the lys side chain in position 8 of gnrh-iii by thioether bond formation through the gflg spacer elongated with cys either at the c-or n-terminus. since we found earlier that the dimer derivative of gnrh-iii was more effective, new dimer derivatives with a combination of antitumour agents were also prepared. branched gnrh-iii derivative (pyrhwshdwk(clac-glfgc(acm))pg-nh2]) was synthesised by spps. the drug molecules were attached to this compound by thioether bond and finally disulfide bridge was formed between two peptide chains. the cytotoxicity of new derivatives was characterised by mtt test th531 oligopeptide antifungals are exceptionally active against multidrug-resistant yeast previous studies have demonstrated that longer sirnas that are processed by dicer can result in more potent knockdown than the corresponding standard 21-mer sirnas. dicer-substrate 25-27-mer sirnas were conjugated with different structural classes of peptides and their cell uptake properties evaluated. peptides were conjugated to the 5' end of the sirna sense or antisense strand via a thioether bond under denaturing conditions to prevent aggregation and precipitation. 
the ability of conjugates to translocate fluorescently-labeled sirna across the plasma membrane was evaluated by flow cytometry. the results indicate that some peptides can mediate higher efficiency uptake of sirna into cells compared with lipofectamine or cholesterol-conjugated sirna. the peptide-sirna formulation with 27-mer sirna conjugates showed higher knockdown of tnf-alpha mrna and protein levels in activated human monocytes in vitro compared to the conjugated 21-mer sirna species. the products resulting from in vitro digestion of peptide-conjugated rna duplexes with recombinant human dicer were identified using esi-ms and consisted predominantly of the desired 21-mer sirna several peptides containing the sequence arg-gly-asp (rgd) were studied and developed for their nanomolar affinity to the membrane receptor alfa v beta3 and alfa v beta 5 integrins, which are over-expressed by endothelial cells during proliferation and by tumor cells. to improve the pharmacological profile of some camptothecin derivatives (cpts), five conjugates were designed, where the cytotoxic drugs were covalently attached to the rgd peptide analogues for preferential uptake into tumor cells. the peptides to be used have been selected among a series of new pentacyclic peptides bearing at 5-position a trifunctional pseudoamino acid with a carboxy-terminal side-chain. peptide analogues showing the highest affinity to alfa v beta 3 and alfa v beta 5 integrins were coupled with cpts at different positions. the conjugates have been optimized for binding to the receptors, proteolytic stability and an overall improvement in tumor selectivity. the nature of the linkage between rgds and cpts has a major impact on stability and biological activity of the conjugates. 
the conjugates with amide bond, but not those with ester bond, are sufficiently stable and show in vitro antitumor activity against a498 and a2780 cell lines combination of amaranth protein with other plant proteins (cereals) enables to formulate the composite protein (near to milk or beef protein, but exclusively on vegetal basis). it is shown on graphs. the aim of the projects' proposals is a development and realization of the technology for fractionation of amaranths defatted flour product is a top protein obtained by removing starches and next polysaccharides decomposed on soluble monosaccharide by specific enzymes. there are shown the chromatograph measuring. we can see complete disintegration of starch and the unchanged proteins. the separate solution monosaccharide is usable for others fermentative processes or as a nutrient solution for yeasts there are description methods for isolation amaranth protein -extraction processes, enzymatic removal starch. the product is a isolate protein rich in essential amino acids. the waste monosaccharide solution was used to production yeasts biomass rich in proteins vitamins amaranth protein isolate have high nutritional value and can be used as food ingredients, for functional, probiotic formulation to begin with, we made epitope mapping with the highly sensitive spot array method (2) in order to study antigenic regions of parvovirus b19 vp1 and vp2 capsid proteins. epitope mapping identified highly reactive, immunodominant early epitope on parvovirus surface that centered to kyvtgin residues of vp1. in the subsequent phases we developed the kyvtgin epitope type specific (ets) igg serodiagnostics. a correlation between enhancing igg avidity to b19 capsid and a transient reactivity with the point-of-care kyvtgin peptide was clear. 
together the two assays enhanced the value of early diagnosis of b19 infections (3).
acute phase-specific heptapeptide epitope for parvovirus b19 diagnosis. synthetic peptide arrays on membrane supports: principles and applications. human parvovirus b19 infection during pregnancy: value of modern molecular and serological diagnostics.
1 laboratory of peptide & protein chemistry & biology 2 department of organic chemistry "ugo schiff" and cnr-iccom 3 department of chemistry 4 department of pharmaceutical sciences
glc) able to recognize specific autoantibodies in multiple sclerosis (ms) patients' sera has been developed by the laboratory of peptide & protein chemistry and biology [1and bio-plex suspension array system, biorad. the biosensor technology and bio-plex suspension array system will offer advantages such as rapid analysis and high sensitivity for high throughput screening. immobilisation will be based on different strategies for anchoring the synthetic antigen on different solid supports such as polystyrene well plates (elisa). optimisation of the different techniques was performed with anti-csf114(glc) autoantibodies isolated using affinity chromatography from ms patients' sera. analytical parameters such as specificity, sensitivity, and matrix effect were evaluated. the different technologies have been used for a high throughput screening of ms sera.
which control specific cell adhesion [2]. here we discuss the route for preparation of amphiphilic block copolymers composed of hydrophobic polylactide and hydrophilic polyethylene oxide (peo) blocks, carrying various cell-adhesion oligopeptide sequences at the end of the peo block. fully protected peptide fragments were prepared by solid-phase peptide synthesis using the fmoc strategy and chlorotrityl resin. the side-chain protected peptides were cleaved from the resin by 25% hfip solution in dcm. 
the peptide-polyethyleneoxide copolymers were prepared by coupling of the activated peptide fragments with α-amino-ω-hydroxy-peo in dmf using pybop as an activation reagent. subsequently, the polylactide block was grafted to the ω-hydroxy end group of the peptide-peo copolymer via a controlled rop of lactide.
2r)-2-aminocyclopentanecarboxylic acid (acpc) and beta-methylphenylalanine (beta-mephe) were designed and synthesized to obtain more potent and selective mu-opioid receptor ligands with higher stability against proteolytic enzymes. we have prepared the peptides by spps methods using racemic amino acids. the diastereomeric peptides were separated by hplc. the configuration of the unnatural amino acids in the peptides was determined by chiral tlc using enantiomeric standards. radioligand binding assays and in vitro gpi and mvd assays indicated that several analogues showed high, subnanomolar affinity and high selectivity for mu-opioid receptors, with agonist or antagonist properties. the incorporation of alicyclic amino acids into the endomorphins resulted in enzyme-resistant peptides. the most promising analogues (dmt-pro-phe-phe-nh2 and tyr-(1s,2r)acpc-phe-phe-nh2) were labeled with tritium using precursor peptides containing dehydroproline or dehydro-(1s,2r)acpc amino acids, tritium gas and a pd/baso4 catalyst. the novel peptides and their radiolabelled analogues with high specific radioactivity (1.4-2.8 tbq/mmol) have become useful pharmacological and biochemical tools for opioid research.
iran
background and aims: injectable drug delivery based on polymer solution platforms has gained attention in recent years, particularly for protein-based therapies. the influence of polymer molecular weight (rg 502h, rg 504h) on the morphology and erosion of matrices and also on their in-vitro drug release behavior over a period of 28 days was assessed for leuprolide acetate in this study. 
methods: each formulation was composed of 33% (w/w) polymer and 3% (w/w) leuprolide acetate dissolved in nmp. release studies were performed in a home-made diffusion cell at 37°c. the polymer erosion was studied using two different methods: (a) l-lactic acid detection and (b) ph change study. the morphology of the matrices was then analyzed by scanning electron microscope. as is shown, the lower molecular weight polymer formulation shows higher porosity and pore diameter due to a rapid phase inversion phase i) can be divided into three more phases with different release rates. results showed that the burst effect for rg 502h (32%) was significantly (p<0.05) higher than for rg 504h (13%).
italy
fabrication of photocurrent-generating systems based on bioinspired organic-inorganic hybrid materials is currently of great interest. more specifically, the photoelectronic properties of nanometric films formed by peptide self-assembled monolayers have been actively investigated. in this work interdigitated gold microelectrodes were modified by covalently linking a hexapeptide ester functionalized by a lipoic acid (lipo) at the n-terminus. the peptide chain [lipo-(aib)4-trp-aib-otbu] comprises five α-aminoisobutyric acid (aib) residues and one trp, a fluorescent amino acid with strong absorptions in the uv region. due to the very high percentage of conformationally constrained aib residues in the chain, the peptide adopts a rigid 3-10-helical structure. cyclic voltammetry measurements indicate that the peptide forms a homogeneously and densely packed monolayer on the gold surface, while current/voltage curves exhibit interesting rectifying properties of the peptide sam. photocurrent generation experiments, performed on the peptide-layered microelectrode, show peculiar modifications of the spectrum. 
at 240 nm a notably higher photocurrent/voltage response was observed for the peptide-modified electrode, suggesting that a photoinduced electron transfer process from trp to gold does take place with high efficiency.
this may lead to randomly orientated enzymes and subsequently limited activity. the aim of this work is to selectively activate enzymes at their c-terminal position in order to allow specific immobilisation. we chose akr1a1, an enzyme of the aldo/keto reductase superfamily, for the synthesis of an artificially monolabeled redox protein. akr1a1 is a monomeric enzyme and catalyzes the nadph-dependent reduction of aliphatic/aromatic ketones and aldehydes. to produce monofunctionalized enzymes we applied the strategy of expressed protein ligation (epl). accordingly, we used the impact®-system and cloned the aldo/keto reductase as a fusion protein with an additional intein/chitin binding domain. through intein-mediated splicing we could produce the c-terminal thioester of akr1a1. in the next step, the thioester was coupled to a biotin-containing peptide by native chemical ligation. this specifically modified enzyme was immobilised on avidin-coated surfaces. the attachment on the surface was tested by tryptic digestion, followed by maldi-tof-ms analysis.
since safe organic solvent waste disposal is an important environmental problem, we aimed to perform peptide synthesis in water. we have reported solid phase peptide synthesis in water using water-soluble n-protected amino acids, such as 2-[phenyl(methyl)sulfonio]ethoxycarbonyl and 2-(4-sulfophenylsulfonyl)-ethoxycarbonyl amino acids. following our study of water-soluble n-protected amino acids, we developed a new technology based on nanochemistry for solid phase peptide synthesis in water. the new technology is based on the coupling reaction of suspended nanoparticle reactants in water. fmoc-amino acids are used widely in peptide synthesis, but most of them show poor water-solubility. 
we prepared well-dispersible fmoc-amino acid nanoparticles in water by pulverization using a planetary ball mill in the presence of poly(ethylene glycol) (peg). the size of the fmoc-amino acid particles was 300-500 nm. to evaluate the utility of this technique, leu-enkephalin was prepared using the nanoparticulate fmoc-amino acids on a peg-grafted rink amide resin in water.
supramolecular structures formed from n-lipidated oligopeptides immobilized in a regular pattern on the cellulose surface are able to bind ligand molecules, thus acting like artificial receptors. due to the conformational flexibility of the lipidated oligopeptide chains, the supramolecular structure is highly flexible, forming cavities with the shape and properties adjusted most effectively to the requirements of the guest molecule. the structural requirements for a peptide providing the most efficient fit to the guest molecules are not known, therefore an array of the artificial receptors has been synthesized and used in the studies. thus, even when a single receptor in an array does not necessarily have selectivity for a particular analyte, the combined fingerprint response can be extracted as a diagnostic pattern visually, or using chemometric tools. in order to improve the sensitivity of the competitive binding and to study the mechanism of molecular recognition, experiments involving fluoresceine and fluoresceine-marked acp fragment were performed. we found that the λmax and intensity of fluorescence depend on the structure of the peptide motif and the lipidic fragment of the receptor.
this mts was linked with dhhp-6 by a disulfide bond, and the new molecule was named mts~dhhp-6. the peroxidase activity of mts~dhhp-6 (2.1×10^3 u·µmol^-1) was tested and was similar to that of mp-11 (4.2×10^3 u·µmol^-1). mts~dhhp-6 coated with quantum dots (qds) [3] was observed to accumulate in neonatal rat cardiomyocytes (nrcms) of wistar rats and co-localized with mitotracker red in mt. 
these results suggest that mts~dhhp-6 is an excellent apx mimic and may have potential
proceedings of the twenty-eighth european peptide symposium, kenes international, israel, 2005, 551.
references: 1
elastin-based polypeptide, poly(val-pro-gly-val-gly), undergoes self-assembly called coacervation, in which microcoacervate droplets with approximately 1000 nm diameters are formed [1]. nanoparticles cross-linked by cobalt-60 γ-irradiation of these microcoacervate droplets are useful as drug release devices. to investigate the size optimization of nanoparticles, the stability of nanoparticles under enzyme treatment, and the drug release profiles from nanoparticles, the three copolymers; poly[10(val-pro-gly-val-gly), (val-ala-pro-gly-val-gly)], poly[4(val-pro-gly-val-gly)
application of polyelectrolytes and theoretical models
the synthetic heptapeptide rnwdvyk is a fragment of a high affinity receptor (fcεri) for immunoglobulin e (fragment 111-117). it is the active domain for binding with ige. the program of studies of the biological properties of the heptapeptide included the investigation of its binding to ige contained in standard solutions and in patients' blood serum. the binding of rnwdvyk with ige was investigated by the ifa method using ige antibodies labelled with horseradish peroxidase (hrpo). we determined the optimum sorption concentration of the peptide in this experimental immunoenzyme system to be 100 µg/ml. the ability of synthetic rnwdvyk peptides to bind ige was studied as a function of ige concentration in standard serum (0.47 to 60 ng/ml ige). a high correlation was found between the ige concentration and the optical density of the solution after introducing monoclonal antibodies labelled with hrpo and the substrate chromogenic mixture (r=0.99). similar investigations were conducted using the allergy patients' blood serum. 
the serum with a known concentration of ige was added to immunological plates with sorbed synthetic rnwdvyk peptides. a high correlation was also found between the concentration of ige in the patients' blood serum and the optical density of the solution after introducing monoclonal antibodies labelled with hrpo and the substrate chromogenic mixture (r=0.94). our experiments showed the high ige binding activity of synthetic rnwdvyk peptides. we demonstrated the possibility of constructing diagnostic systems for the quantitative determination of ige and ige-antibodies.
the fusion protein nucleocapsid-dutpase is present in virions of mason-pfizer monkey betaretrovirus and in virus-infected cells, where it potentially contributes to rna/dna folding and reverse transcription (barabas et al., 2003; bergman et al., 1994; berkowitz et al., 1995). in addition to the trimeric dutpase core, the protein possesses flexible n- and c-termini consisting of the nucleocapsid segment and a peptide motif conserved in dutpases. to analyze the function of the flexible c-terminal peptide segment, reconstitution experiments were designed with a truncated enzyme lacking the c-terminal 14mer oligopeptide and the synthetic oligopeptide prepared on rink amide resin by solid-phase peptide synthesis, using the fmoc strategy. the truncated enzyme proved to be practically inactive. addition of the synthetic 14mer (pyrgqgsfgssdiy) at 100-fold molar excess resulted in partial complementation of the catalytic activity (to 10% of original). a mixture of the truncated enzyme and the 14mer oligopeptide (the latter at 100-fold excess) was put to crystallization trials. we conclude that the c-terminal 14mer is essential for catalytic activity.
antifolate drugs are inhibitors designed to interfere with the folate metabolic pathway. methotrexate (mtx) and pemetrexed (alimta®) are known folic acid analogues used in cancer treatment. different peptide conjugates of mtx have been prepared for intracellular delivery. 
(1) in octaarginine conjugates one of the carboxylic groups of mtx was attached to the n-terminal of the peptides. (2) however, the results showed that both carboxylic groups are required for the biological effect of mtx. therefore we decided to synthesize peptide conjugates of folic acid analogues in which the carboxylic groups are untouched. octaarginine, penetratin and a cyclic peptide, cgnkrtrgc, which can deliver a cargo molecule in the lymphoid system, were used as delivery peptides. we introduced squaric acid or aminooxyacetic acid as a linker moiety between the peptides and cargo molecules. the conjugation was monitored by rp-hplc; the crude products were purified by rp-hplc and identified by mass spectrometry. the biological activity of the conjugates was evaluated in vitro on sensitive and resistant human leukemia (hl-60) cell lines.
besides its endocrine activity, trh (the tripeptide pglu-his-pro-nh2) has also long been recognized as a modulatory neuropeptide with a broad range of physiological and pharmacological activities in the central nervous system (cns). although numerous centrally active and metabolically stable analogues and peptidomimetics have been synthesized using trh as a template, selectivity of their cns action has remained an issue to be addressed. we aimed at discovering novel analogues with enhanced cns-selectivity by incorporating pyridinium building blocks. the design also allowed for enhancing transport across the blood-brain barrier and increasing residence time in the cns through a prodrug strategy. solid-phase chemistry was used to prepare the analogues, and novel methods, such as the zincke reaction, not previously used to incorporate pyridinium moieties into resin-bound peptides were also introduced. 
comprehensive evaluation included measurement of affinity to the trh receptor; acetylcholine-releasing, analeptic and antidepressant-like activity in animal models; as well as prediction of membrane affinity, determination of in vitro metabolic stability, and pharmacokinetics and brain uptake/retention studies that employed in vivo microdialysis sampling. a strong connection between acetylcholine-releasing potency and analeptic effect in animals was obtained for close analogues of trh, while pyridinium compounds designed from the structurally related pglu-glu-pro-nh2 maintained the antidepressant-like effect of the parent peptide and showed a significant decrease in analeptic action. in conclusion, an increase in the selectivity of the cns-activity profile was obtained by the incorporation of pyridinium moieties. we have also demonstrated the benefits of the prodrug approach on the pharmacokinetics, brain uptake and retention of the analogues upon systemic administration.
the use of small radiolabelled compounds such as peptides is a very attractive tool for the diagnosis of several different pathologies, especially cancer, through the use of nuclear medicine techniques [1]. among the various membrane receptors, the two cholecystokinin receptors ccka-r and cckb-r are very promising biological targets for radiolabelled compounds due to their overexpression in many tumours [2]. in order to develop radiolabelled peptide derivatives able to target these receptors, the binding mode of the c-terminal cholecystokinin octapeptide (cck8) toward the two cholecystokinin receptors ccka-r and cckb-r has recently been structurally characterized [3]. the structural data suggest that modifications on the n-terminal end of cck8 obtained by introducing chelating agents and their metal complexes should not affect the interaction of the derivatized cck8 peptides with either ccka-r or cckb-r. 
here we report the labelling procedures and the in vitro and in vivo characterization of new 99mtc cck8 derivatives. a stable 99mtc-nitrido complex is obtained by using the coordinative set formed by: 1) the n-terminal amino group and the cysteine sh group of the cck8 derivative cys-gly-cck8; and 2) a pnp aminodiphosphine used as coligand. several phosphines are used in order to define the best labelling procedures and to optimize the in vivo biodistribution properties of the 99mtc-labelled peptides.
various combinations of pore size and chemistry of silica-based materials were investigated for high performance liquid chromatography (hplc) separation of peptides. incorrect pore size and ligands have been suggested to cause peak broadening, poor resolution and poor recovery. our study suggests that an appropriate combination of pore sizes and ligands, chosen according to the molecular weights of the peptides and proteins, is necessary to obtain the most efficient usage of reversed-phase hplc columns. we will also show the possibilities of improved method development for the separation of complex peptide mixtures by ph or additives.
the development of new biopolymer materials as drug delivery systems is of enormous interest in biomedicine. dendrimers are polymers with particular properties; they are highly branched polymers with well-defined chemical composition, and show compact globular shape, monodisperse size and controllable surface functionalities. peptide dendrimers incorporate amino acids in their structures and have additional features such as biocompatibility and biodegradability. in previous studies we described the solid-phase synthesis of a new class of polyproline-based dendrimers. these biopolymers have the capacity to cross the mammalian cell membrane and show moderate toxicity. 
these promising results open up the possibility to explore these dendrimers as delivery systems. to design more versatile polyproline dendrimers, we have developed a methodology that involves the combination of solid-phase and solution strategies. diverse multivalent peg- and proline-based cores were synthesized to attain dendrimers with distinct topologies. dendrimers were synthesized by iterative building block addition [(glypro5)2imdoh] around an inner core, using a convergent peptide solution approach. a variety of coupling methodologies and protecting groups for the n-terminal function were explored.
the novel high throughput protein detection system using designed peptide arrays has successfully demonstrated its capabilities as a "protein-chip" [1] [2] [3] [4]. our concept has many advantages, especially for high-quality industrial production and practical applications, compared to arrays with antibodies or recombinant proteins. the deposited peptide solution can be dried without covalent immobilization, yet when the resulting arrays are exposed to protein solution they show the planned conformation [5]. based on these basic results, several hundred α-helical, β-loop and β-sheet peptides, which contained a cysteinyl residue for covalent immobilization and tamra as a fluorescent label, were successfully synthesized, characterized and used as capture molecules. a novel chip material made from amorphous carbon and suitable for our concept has been developed, which has significant advantages over conventional glass or polymer plates, such as no self-fluorescence, greater mechanical stability, and easy manufacturing in terms of precise and high throughput processing. thus, chip plates with nanolitre wells could also be easily manufactured. peptides were deposited on the surface of these chips covalently as well as non-covalently at 350 pl/spot (diameter: ca. 200 µm). the resulting chips were used for protein detection. 
a part of this work was funded by the okinawa-bio-project and nedo-grant.
key: cord-004879-pgyzluwp authors: nan title: programmed cell death date: 1994 journal: experientia doi: 10.1007/bf02033112 sha: doc_id: 4879 cord_uid: pgyzluwp nan
it is widely held that all developmental cell death is of a single type (apoptosis) and that neuronal death is primarily for adjusting the number of neurons in a population to the size of their target field through competition between equals for target-derived factors. we shall draw on our research and on that of others to criticize these views and replace them by the following. at least three types of neuronal death occur, only one of which resembles apoptosis; a neuron can choose between several self-destruct mechanisms depending on the cause of its death. the purpose of the death is to regulate connectivity, not neuron number. competitors for trophic factors are unequal, and many losers have made axonal targeting errors. a neuron's survival and differentiation depend on multiple anterograde and retrograde signals. activity affects retrograde signals and some but not all anterograde ones. the pattern of activity is more important than the overall amount.
in rodents, the period of naturally occurring cell death of motoneurons is followed by a period of supersensitivity to axonal injury. thus, in newborn rodents lesion of the facial nerve leads to a rapid degeneration of the injured motoneurons. we have tested whether overexpression, in vivo, of the bcl-2 proto-oncogene was capable of preventing death of axotomized motoneurons. to address this question we used transgenic mice whose motoneurons overexpress the bcl-2 protein. one of the two facial nerves of newborn mice was transected on the 2nd-3rd post-natal day. seven days after the lesion, the morphology of the facial nuclei was analyzed. in control mice, and when compared to the intact nucleus, 70 to 80% of axotomized motoneurons had disappeared. 
in contrast, in the transgenic animals, the number of motoneurons on the lesioned side remained unchanged when compared to the contralateral nucleus. furthermore, their axons remained visible up to the distal lesion site. these experiments show that, in vivo, motoneurons overexpressing the bcl-2 protein survive after axotomy, and suggest that, in vivo, bcl-2 protects neurons from experimentally induced cell death and could be a target for treatment of motoneuron degenerative diseases.
messmer s., mattenberger l., sager y., blatter-garin m-c., pometta d., kate a., james r.w. dépt. de médecine, dépt. de pharmacologie, div. de neurophysiologie clinique, faculté de médecine, genève.
clusterin is a widely expressed glycoprotein, highly conserved across species. numerous functions have been postulated for this protein. the most important are roles in lipid transport, as clusterin is associated with apolipoprotein ai in hdl, complement regulation and tissue remodelling, in particular during cell death and differentiation. using cultures of rat spinal cord neurones (90% neurons and 5-10% non-neuronal cells), we have studied the expression of clusterin and apo e in glutamate-induced neuronal cell death to examine potential roles in lipid management. up-regulation of the two proteins was observed. clusterin and apo e appear in the conditioned medium 15 h and 7.5 h, respectively, after incubation with glutamate. control studies in the presence of a noncompetitive nmda receptor agonist showed the secretion of clusterin and apo e to be diminished by >60%. no up-regulation of either protein was observed in complementary studies with exclusively non-neuronal cell cultures. the cellular origin of the two secreted proteins is presently under investigation.
programmed cell death and tissue remodelling are consequences of hormonally induced restructuring of the rat ventral prostate after castration and the rat mammary gland after weaning. 
we used the "differential display" method (liang and pardee, 1992, science 257:967) to detect and isolate cdna fragments whose corresponding rnas are regulated either coincidentally or in an organ-specific fashion during mammary gland involution and postcastrational prostate regression. partial sequencing of 12 clones revealed high, but not absolute, homology of 5 fragments with sequences previously characterized in different biological contexts. these five encode functions which could be anticipated to be important for cell growth and/or programmed cell death. we are presently investigating the functions of several of these transcripts in cell culture and in vivo. antisense oligos are being employed in vivo to determine whether these genes contribute to the phenotype of programmed cell death.
b epitopes derived from the envelope gp52 glycoprotein (ep3) or from the viral superantigen of mmtv have been incorporated into inert or live vaccines. the inert vaccine consists of purified chimeric proteins which contain the b epitopes alone or fused to multimeric promiscuous t helper epitopes from tetanus toxin. mice were immunized subcutaneously with these chimeric proteins. the live vaccine consists of an avirulent strain of salmonella typhimurium which expresses the mmtv epitopes in the form of chimeric proteins fused to the nucleocapsid protein of hepatitis b virus. this vaccine is given to mice in one oral dose. the level, duration and isotype of the immune response generated by each vaccine have been measured and compared. the level of protection has been investigated by systemically challenging immunized mice with the retrovirus.
a reduced binding of oxytocin (ot) occurs with aging in some, but not all, areas of the rat brain (arsenijevic et al., experientia 1993, 49, a75). the caudate putamen showed the most impressive loss of ot receptors. 
two other regions, the hypothalamic ventromedial nucleus (vmh) and the islands of calleja (icj), also had an important deficit of ot binding sites. on the other hand, these two regions were known to be sensitive to sex steroids. in the present work, we treated 20-month-old rats for one month with testosterone propionate (2 µg/kg s.c., once every 3 days) dissolved in oil. three rats of the same age injected with oil only served as controls. we labelled ot receptors throughout the brain of old rats using a 125i-labelled ligand specific for ot receptors. analysis of autoradiograms by an image analyzer revealed that the testosterone treatment increased ot binding sites in the vmh, in the icj, and, to a lesser extent, in the bed nucleus of the stria terminalis, a region also sensitive to sex steroids. by contrast, in the caudate putamen, the disappearance of ot receptors was not compensated. in conclusion, the decrease of ot receptors occurring in the vmh and icj with aging can be reversed by administration of gonadal steroids. in contrast, the loss of ot receptors in the striatum appears to depend on another mechanism.
vasopressin (avp) receptors are expressed transiently in the facial nucleus during development (tribollet et al., 1991, dev. brain res., 58, 13-24). avp may therefore play a role in the maturation of neuromuscular connexions in the neonate rat, and possibly in the restoration of these connexions after nerve lesion in the adult. in order to investigate the latter proposition, we have sectioned the facial nerve in adult rats and used quantitative autoradiography to look at avp binding sites in the facial nucleus at various postoperative times. we observed a massive and transient increase of avp binding sites on the operated side. the number of facial avp binding sites reaches a maximum about one week after nerve section, remains stable for 2-3 weeks, then begins to decrease towards the control level. 
the induction of avp receptors is markedly delayed if the proximal stump of the nerve is ligated. to assess whether other motor nuclei would also react to axotomy by up-regulating the expression of avp receptors, we have sectioned the hypoglossal nerve and the sciatic nerve. in both cases, the binding of avp receptor ligand increases massively in the respective motor nuclei, with a time-course similar to that found in the facial nucleus. altogether, our data suggest that central avp could be involved in the process of nerve regeneration.
cytotoxic t-cell mediated apoptosis
schaerer, e., karapetian, o., adrian, m. and tschopp, j. inst. de biochimie, univ. de lausanne, 1066 epalinges.
an apoptotic cell death mechanism is used by cytolytic t cells (ctl) to lyse appropriate target cells. ctl harbor cytoplasmic storage compartments, containing the lytic protein perforin and serine proteases (granzymes), whose content is released upon target cell interaction. we show that these granules are multivesicular bodies and that degranulation releases these intragranular vesicles (igv), which have granzymes, t-cell receptor and as yet undefined proteins associated. isolated igvs and perforin induce dna breakdown in target cells within 20 minutes. microscopic analysis demonstrates that igv specifically interact with the target cell via the t-cell receptor and that their content is taken up by the target cell. as early as 15 min after interaction, 3 distinct igv proteins are found in the nucleus of the target cell. one of the molecules has been identified as granzyme a, previously reported to be involved in apoptosis. we propose that lymphocytes transfer apoptosis-inducing proteins to the nucleus of the target cells using vesicles as vehicles for delivery.
cytotoxic t cells kill their targets by a mechanism involving membranolysis and dna degradation (apoptosis). recently, two sets of proteins have been proposed as dna breakdown-inducing molecules in t cells: granzyme a, b and tia-1. 
in this study, we cloned and further characterized the tia-1 mouse homologue. aa sequence comparison with the human tia-1 showed an overall identity of 93%. although devoid of a signal peptide, tia is localized to cytotoxic granules, probably targeted via a gly-tyr motif. like tia-1, its mouse homologue contains three rna-binding domains. expression of tia during development shows a very strong signal in the brain and weaker signals in thymus, heart and other organs. during embryonic development several structures that contribute to organogenesis form transiently and are later eliminated by apoptosis. this pattern of tia expression could indicate its involvement in apoptosis.
prostate involution occurs after castration in rats and is associated with the death by apoptosis of a large fraction of the epithelial cells. we have isolated several genes from a prostate involution bacteriophage lambda library using differential screening methods. among these clones, one demonstrated an especially strong signal when used as a probe against northern blots of prostate mrna obtained before, and at different times after, castration. this gene is down-regulated 40-fold within 5 days after castration. intramuscular injection of a testosterone depot resulted in complete restoration of expression within 24 hours. upon sequencing it became apparent that this clone has a high degree of homology to a known nadh dehydrogenase encoded in mitochondrial dna. the clone failed to hybridize to any transcripts from rat organs other than prostate. we are now in the process of isolating the human homolog of this gene for use as a biomarker in the study of benign hyperplasia and developing carcinoma. this gene is a possible indicator for testosterone-independent cell populations or of cells lacking functional testosterone receptor.
during the first three postnatal weeks the rat lung undergoes the last two developmental stages, the phase of alveolarization and the phase of microvascular maturation. 
the latter involves a decrease of the connective tissue mass in the alveolar septa and a merging of the two capillary layers into a single one. speculating that programmed cell death may play a role during this remodeling, we searched for the presence of apoptotic cells in rat lungs between days 10 and 24. lung paraffin sections were treated with y-terminal transferase, digoxigenin-dutp, and anti-digoxigenin-fluorescein f(ab) fragments, and the number of fluorescent nuclei was compared between sections at different days. while the number of apoptotic cells was low until the end of the second week and at day 24, we observed an approximately eightfold increase of fluorescent nuclei towards the end of the third week. we conclude that programmed cell death is involved in the structural maturation of the lung. brunner, a., wallrapp, ch., pollack, i., twardzik, t. and schneuwly, s. lehrstuhl genetik, biozentrum universität würzburg. mutants in the giant lens (gil) gene show a strong disturbance in ommatidial development. in the absence of any gene product, additional photoreceptors, cone cells and pigment cells develop. opposite effects can be seen in flies in which the gene product of the giant lens gene can be ectopically expressed by heat shock. a second very typical phenotype is the disturbance of photoreceptor axon guidance. molecular analysis of gil shows that it encodes a secreted protein of 444 aa containing three evolutionarily conserved cysteine motifs very similar to egf-like repeats. we propose that gil functions as a secreted signal, most likely a lateral inhibitor for the development of specific cell fates, and that gil, either directly or indirectly, is involved in targeting photoreceptor axons into the brain. the decrease in cellularity during scar establishment is mediated through apoptosis desmouliere, a., redard, m., darby, i., and g.
gabbiani department of pathology, cmu, 1 rue michel servet, 1211 genève 4 during the healing of an open wound, granulation tissue formation is characterized by replication and accumulation of fibroblastic cells, many of which acquire morphological and biochemical features of smooth muscle cells and have been named myofibroblasts (schürch et al., histology for pathologists, 1992). as the wound evolves into a scar, there is an important decrease in cellularity, including disappearance of myofibroblasts. the question arises as to which process is responsible for myofibroblast disappearance. during a previous investigation on the expression of α-smooth muscle actin in myofibroblasts, we observed that in late phases of wound healing, many myofibroblasts show signs of apoptosis and suggested that this type of cell death is responsible for the disappearance of myofibroblasts (darby et al., lab. invest. 63:21, 1990). we have tested this hypothesis by means of electron microscopy and morphometry and by in situ end-labeling of fragmented dna (wijsman et al., j. histochem. cytochem. 41:7, 1993). our results show that the number of apoptotic cells increases as the wound closes and suggest that this may be the mechanism for the disappearance of myofibroblasts as well as for the evolution of granulation tissue into a scar. (supported by the swiss national science foundation, grant n° s01-16 r. jaggi, a. marti and b. jehn. universität bern, akef, tiefenaustr. 120, 3004 bern at weaning the mammary gland undergoes a reductive remodelling process (involution) which is associated with the cessation of milk protein gene expression and apoptosis of milk-producing epithelial cells. this process can be reversed by returning the pups to the mother within 1 day. elevated nuclear protein kinase a (pka) activity was observed from one day post-lactation, paralleled by increased c-fos, junb, jund and to a lesser extent c-jun mrna levels.
ap-1 dna binding activity was transiently induced and the ap-1 complex was shown to consist principally of c-fos/jund. oct-1 dna binding activity and oct-1 protein were gradually lost from the gland over the first four days of involution, whereas oct-1 mrna levels remained unchanged. comparing nuclear extracts from normal mammary glands with nuclear extracts from glands which had been cleared of all epithelial cells three weeks after birth revealed that pka activation, ap-1 induction and oct-1 inactivation are all dependent on the presence of the epithelial compartment. the increased fos/jun expression and the inactivation of oct-1 may be consequences of the increased pka activity. when involution is reversed, both pka activity and ap-1 dna binding activity (and fos and jun mrna levels) are reduced to basal levels. our data suggest a role for pka and ap-1 in programmed cell death of mammary epithelial cells. bcl-2α does not require membrane attachment for its survival activity c. borner*, i. martinou†, c. mattmann*, m. irmler*, e. schaerer*, j.-c. martinou†, and j. tschopp*. * institute of biochemistry, university of lausanne, 1066 epalinges, † institute of molecular biology, glaxo inc., 1228 plan-les-ouates. bcl-2α is a mitochondrial or perinuclear-associated oncoprotein that prolongs the life span of a variety of cell types by interfering with programmed cell death. how it exerts this activity is unknown but it is believed that membrane attachment is required. to identify critical regions in bcl-2α for subcellular localization and survival activity, we created, by site-directed mutagenesis, various mutations in regions which are most conserved between the different bcl-2 species.
we show here that membrane attachment is not required for the survival activity of bcl-2α. a truncation mutant of bcl-2α lacking the last 33 amino acids (t3) including the hydrophobic domain is soluble, yet fully active in blocking apoptosis of sympathetic neurons induced by ngf deprivation or of l929 fibroblasts induced by tnfα treatment. we further provide evidence for a putative functional region in bcl-2 which lies in the conserved domains 4 and 5 upstream of the hydrophobic cooh terminal tail. the breakdown of nuclear dna is considered to be a hallmark of apoptosis. we previously identified the perinuclear membrane localized dnase i as the endonuclease involved in the formation of oligonucleosomal-sized fragments (dna ladder). it is not clear how the nuclease is activated and gains access to the dna. we show that in thymocytes induced to undergo apoptosis, lamin breakdown preceded dna laddering. upon transfecting hela cells with a constitutively active cdc2 mutant, nuclear envelope breakdown and typical apoptotic features (chromatin condensation) were observed. moreover, co-transfection with cdc2 mutant and dnase i led to dna degradation. we propose that apoptosis can be induced by wrongly timed and hence abortive mitosis leading to uncontrolled nuclear membrane disintegration. s02-01 s02-04 platelet-derived growth factor (pdgf) is thought to play an active role in fibrosing diseases. bronchiolitis obliterans-organizing pneumonia (boop) is a condition characterized by intraluminal proliferation of connective tissue inside distal air spaces. to evaluate pdgf expression in boop we performed immunohistochemistry on lung biopsies from 20 patients and 10 controls free of fibrosis. serial sections were stained with an antibody against either pdgf or the monocyte/macrophage marker cd68. in both groups the pdgf-positive cells were essentially tissue macrophages.
using point counting to measure volume fraction (vv), pdgf-positive cells represented 4.65±1.63% (mean±sd) of the volume occupied by lung tissue in the boop cases, and 2.12±0.65% in the controls (p<0.001). similarly, 10.73±4.69% of the lung tissue was occupied by cd68-positive macrophages in the boop cases, compared to 5.37±3.73% in the controls (p