key: cord-338980-pygykil7 authors: Rahaman, Jordon; Siltberg-Liberles, Jessica title: Avoiding Regions Symptomatic of Conformational and Functional Flexibility to Identify Antiviral Targets in Current and Future Coronaviruses date: 2016-11-09 journal: Genome Biol Evol DOI: 10.1093/gbe/evw246 sha: doc_id: 338980 cord_uid: pygykil7 Within the last 15 years, two related coronaviruses (Severe Acute Respiratory Syndrome [SARS]-CoV and Middle East Respiratory Syndrome [MERS]-CoV) expanded their host range to include humans, with increased virulence in their new host. Coronaviruses were recently found to have little intrinsic disorder compared with many other virus families. Because intrinsically disordered regions have been proposed to be important for rewiring interactions between virus and host, we investigated the conservation of intrinsic disorder and secondary structure in coronaviruses in an evolutionary context. We found that regions of intrinsic disorder are rarely conserved among different coronavirus protein families, with the primary exception of the nucleocapsid. Also, secondary structure predictions are only conserved across 50–80% of sites for most protein families, with the implication that 20–50% of sites do not have conserved secondary structure prediction. Furthermore, nonconserved structure sites are significantly less constrained in sequence divergence than either sites conserved in the secondary structure or sites conserved in loop. Avoiding regions symptomatic of conformational flexibility such as disordered sites and sites with nonconserved secondary structure to identify potential broad-specificity antiviral targets, only one sequence motif (five residues or longer) remains from the >10,000 starting sites across all coronaviruses in this study. The identified sequence motif is found within the nonstructural protein (NSP) 12 and constitutes an antiviral target potentially effective against the present day and future coronaviruses. On shorter evolutionary timescales, the SARS and MERS clades have more sequence motifs fulfilling the criteria applied. Interestingly, many motifs map to NSP12 making this a prime target for coronavirus antivirals. Severe Acute Respiratory Syndrome (SARS)-CoV and Middle East Respiratory Syndrome (MERS)-CoV are two closely related zoonotic coronaviruses. Both have successfully crossed the species barrier to allow animal-to-human transmission, and further to allow human-to-human transmission (Song et al. 2005; Reusken et al. 2016) . The SARS outbreak in 2003 had a mortality rate of 10% (Anderson et al. 2010) , and SARS-CoV was considered the most aggressive coronavirus compared to other human coronaviruses that commonly cause mild to moderate infection in their hosts (van der Hoek 2007) . MERS-CoV is the cause of an ongoing outbreak of the respiratory illness MERS (de Groot et al. 2013) . At the time of writing, 1791 MERS cases have been confirmed with a mortality rate of approximately 35% (World Health Organization 2016). Both MERS and SARS have higher mortality rates in elderly and immunosuppressed populations (Gralinski and Baric 2015) . The host changes by MERS-CoV and SARS-CoV suggest that other coronaviruses can potentially cross the species barrier, become zoonotic, and enable human-to-human transmission, ultimately causing high morbidity and mortality. SARS-CoV and MERS-CoV exploited mechanistically different approaches to overcome the human species barrier, but these two viruses have a lot in common (Lu et al. 2015) . Here, we aim to identify the vulnerable regions in the proteomes of coronaviruses that neither SARS-CoV nor MERS-CoV nor their contemporary and forthcoming relatives can proliferate without, and address how to mobilize a defense against the present and future coronaviruses by targeting these regions. SARS-CoV and MERS-CoV are positive (+)-strand RNA viruses encoding approximately 25 protein products. The MERS-CoV proteome is primarily composed of two polyproteins, ORF1a and ORF1ab; the latter is generated by a -1 ribosomal slippage frameshift. These proteins are cleaved into 16 nonstructural proteins (NSPs). NSPs 1-10 are products of both polyproteins, whereas NSPs 12-16 are only yielded by ORF1ab. NSP11 is unique to ORF1a ( van Boheemen et al. 2012) . Structural proteins envelope (E), spike (S), membrane (M), and nucleocapsid (N) are elements of the physical structure that encloses the viral genome and come from distinct reading frames, unlike ORF1a and ORF1ab, which come from overlapping reading frames. Additionally, the structural proteins are the product of subgenomic mRNAs that are joined during discontinuous negative RNA strand synthesis (van Boheemen et al. 2012) . Finally, NS3 protein (NS3), NS4A protein (NS4A), NS4B protein (NS4B), NS5 protein (NS5), and Orf8b protein encompass the remainder of the proteome and also arise from distinct reading frames (van Boheemen et al. 2012) . Our approach utilizes genomic sequence data, which is readily available for viruses known to cause disease. However, because most viruses pose no major threat to their host, they pass by unnoticed leaving the majority of virus genome space uncharted. With the availability of costefficient genome sequencing technology, and recent developments in the field of viral metagenomics, large-scale identification of viral genome space is on the rise (Rosario and Breitbart 2011; Mokili et al. 2012) . By exploring viral diversity, critical components constituting a viral genus' fitness can be evaluated. Examples such as the common influenza virus illustrate the rapidity of viral gene mutation and in order to maintain immune protection, an annual flu vaccination is recommended. Underway efforts aim to generate broadly neutralizing vaccines whose design accounts for the genomic sequences of multiple types of influenza virus to eliminate frequent re-vaccination against the flu Ross 2011, 2012) . Development of broadly neutralizing vaccines often relies on the consensus or ancestral sequences of extant viral sequences in order to provide greater coverage for related viruses (Kesturu et al. 2006) . Unfortunately, consensus sequences can be misleading, and ancestral sequence reconstruction is error-prone for quickly diverging sequences (McCloskey et al. 2014 ). In addition, viruses with compact genomes often express proteins with structural disorder that may undergo structural transformations. Although these transformer proteins, like VP40 in Ebola, are masters at changing their structure, and thus expanding their functional repertoire as needed for the life cycle of the virus (Bornholdt et al. 2013) , flexible regions are potentially important in rewiring protein-protein interactions between the virus and its host (Le Breton et al. 2011; Ortiz et al. 2013; Gitlin et al. 2014) . The flexibility trait of many viral proteins is a complicating factor in vaccine development. For instance, Dengue virus exhibits serotype-specific antibody affinity that causes antibodydependent enhancement, an obstacle in the development of Dengue vaccines that protects against all four serotypes (Flipse and Smit 2015) . To overcome the hurdle posed by structural flexibility, we propose an additional screening step in identifying potential vaccine or antiviral targets that considers the structural flexibility of the viral proteins. The Structural Genomics Initiatives increased their success rate by excluding proteins predicted to be structurally disordered (Slabinski et al. 2007) . A similar approach can perhaps benefit vaccine development. Furthermore, to make this approach robust to potential mutations, minimizing loss in efficacy or resistance, the evolutionary context of sequence and structure must be considered. Thus, we suggest expanding the concept of broadly neutralizing vaccines/antivirals by increasing the diversity of viruses considered if possible. Sites conserved for sequence, structure, and with low disorder propensity among diverse virus protein homologs are very likely to be constrained from 1) changing sequence on evolutionary time scales and 2) undergoing real-time structural transitions. These sites have potential as targets for broad-specificity antivirals or vaccines because conservation makes them broad-specificity and low dynamics avoids targeting a conformational ensemble, which is not only difficult (Yu et al. 2016) , but that may change as the sequence diverges (Siltberg-Liberles et al. 2011) . A recent large-scale study of structural disorder in >2,000 viral genomes in 41 viral families found the amount of disorder in different virus families varying from 2.9% to 23.1% (Pushker et al. 2013) . It was reported that Coronaviridae has very low disorder content (mean disorder 3.68%) (Pushker et al. 2013) . Coronaviridae contains two subfamilies: Coronavirinae and Torovirinae. SARS-CoV and MERS-CoV are part the Coronavirinae subfamily, from here on referred to as coronavirus (CoV). The lack of disorder is intriguing because it may be important for rewiring interactions between viral proteins and host proteins (Ortiz et al. 2013 ) and providing opportunities to acquire novel functional sequence motifs (Gitlin et al. 2014) . Structural disorder has also been proposed to be important for viral viability, enabling multifunctionality and vigor in response to changes in the environment (Xue et al. 2014) . Given the low fraction of structural disorder reported across Coronaviridae, we set out to investigate the conservation of structural disorder and secondary structure across CoV. Sites identified as conserved for structure and lacking disorder can be considered to be vulnerable and druggable in the proteomes of coronaviruses. The structural divergence capacity of these regions is limited, leaving a wider range of the present and emergent coronaviruses susceptible to the effects of potential broadly neutralizing anti-CoV therapies targeting these sites. We will refer to these sites as target sites. Protein sequences were identified by individual BLAST searches with MERS-CoV (Taxonomy ID: 1335626) proteins ORF1ab (YP_009047202.1; polyprotein), S protein (YP_009047204.1), M protein (YP_009047210.1), E protein (YP_009047209.1), and N protein (YP_009047211.1) against coronaviruses. BLAST searches of the ORF1ab protein were performed, using start and end positions as detailed in the ORF1ab NCBI Reference Sequence file, against the refseq_protein database. The sequences retrieved from the BLAST output maintained the following cutoff: >30% sequence identity and >50% coverage relative to MERS-CoV sequence query. The 30% sequence identity and 50% query coverage cutoff strikes a balance between alignment quality and at least 10 sequences for most protein families. NSP1 (YP_009047202.1; 1-193), NSP2 (YP_009047202.1; 194-853), NS3 (YP_009047205.1), NS4A (YP_009047206.1), NS4B (YP_009047207.1), NS5 (YP_009047208.1), ORF8b protein (YP_009047212.1), and NSP11 (YP_009047203.1; 4378-4391) are not included in this study due to <10 BLAST hits. Multiple sequence alignments were constructed for the selected BLAST hits using MAFFT (Katoh et al. 2002) . Phylogenetic trees were constructed using MrBayes 3.2.2 with a four category gamma distribution and the mixed model for amino acid substitution (Huelsenbeck and Ronquist 2001; Ronquist and Huelsenbeck 2003) . Each tree ran for five million generations, with a sample frequency of 100. The final tree was constructed from the last 75% of samples, discarding the first 25% of samples as the default burnin, and using the half-compatible parameter, to avoid weakly supported nodes (i.e., with a posterior probability <0.5). All trees were midpoint rooted. For every protein family, the amino acid substitution rate per site in its multiple sequence alignment was calculated using empirical Bayesian estimation as implemented in Rate4Site (Mayrose et al. 2004) . Substitution rates were calculated using 16 gamma categories, the JTT substitution matrix (Jones et al. 1992) , and the reconstructed phylogenies. The rates were normalized per protein family with an average across all sites equal to zero and SD equal to 1. This means that sites with a rate <0 are evolving slower than average, whereas sites with a rate >0 are evolving faster than average. Intrinsic disorder propensity was inferred using two different predictors: IUPred (default settings; "long" option) (Dosztá nyi et al. 2005a (Dosztá nyi et al. , 2005b and DISOPRED2 (Ward et al. 2004 ) for all proteins. For IUPred, the site-specific continuous disorder propensities for each protein were mapped onto their corresponding position in the multiple sequence alignment as raw disorder propensities and as binary states, order or disorder, using two cutoffs of 0.4 and 0.5. Disorder propensities below the cutoff were assigned order and disorder propensities at the cutoff or above were assigned disorder. For the DISOPRED2 predictions that were inferred using the nr database, the continuous disorder propensities for every site in a protein were mapped onto their corresponding position in the multiple sequence alignment as raw disorder propensities and as binary states, order or disorder, using a cutoff of 5. Consequently, for every protein family (a multiple sequence alignment and its corresponding phylogenetic tree), two continuous matrices and three binary matrices resulted: IUPred 0.4, IUPred 0.5, and DISOPRED2. An additional matrix was generated to indicate sites where the binary order and disorder assignments differ between IUPred 0.4 and DISOPRED2. A similar methodology was employed to analyze secondary structure predicted by PSIPRED (McGuffin et al. 2000) and JPred (Drozdetskiy et al. 2015) . For both predictors, the uniref90 database was used and sites were classified as loops, alpha helices, or beta strands and mapped back onto their corresponding sites in the multiple sequence alignment. This resulted in two three-state matrices for each protein family alignment, one for each predictor, and two binary matrices displaying secondary structure elements (alpha helix and beta strand) or loops. An additional matrix was generated to indicate sites where the secondary structure assignments differ between PSIPRED and JPred. For every protein family, the binary matrices resulting from the different disorder predictions and from the different secondary structure predictions were analyzed in the corresponding evolutionary context using GLOOME. GLOOME (Gain-Loss Mapping Engine) analyzes binary presence and absence patterns in a phylogenetic context . In this study, the Rate4Site option in GLOOME was used to analyze the binary matrices (IUPred 0.4, IUPred 0.5, DISOPRED2, PSIPRED, and JPred) with the corresponding phylogenetic trees to map change of state across sites in each individual protein phylogeny . GLOOME was run with 16 gamma categories and a substitution matrix set to equal rates within each state and transitions between states treated equally. From the binary disorder and order matrices, transition rates between disorder and order or vice versa (DOT) were estimated. From the binary structure and loop matrices, transition rates between structure and loop or vice versa (SLT) were estimated. Similar to Rate4Site, the rates were normalized per protein family with an average across all sites equal to zero and SD equal to 1. This means that sites with a rate <0 are evolving slower than average, while sites with a rate >0 are evolving faster than average. Protein families were visualized in an integrative manner with a phylogenetic tree, any matrix (multiple sequence alignment or predictor based) displayed as a heatmap, and site-specific sequence transition rates using Python packages ETE3 (Huerta-Cepas et al. 2016) and Matplotlib (Hunter 2007) . Amino acid evolutionary rates (SEQ) for all sites across all alignments were aggregated and binned into four possible categories characterized by the distribution of PSIPRED predicted secondary structure at each site. Sites predicted to have a loop across all sequences are "conserved loops; C(L)" and sites predicted to have a helix across all sequences or a strand across all sequences are "conserved helix-strand; C(HS)" (table 3) . Sites predicted to have all three states (helix, strand, and loop) or any combination of loop and one other state are "non-conserved helix, loop, strand; NC(HLS)" and sites predicted to have a mixture of helix and strand are "nonconserved helix-strand; NC(HS)" (table 3) . In all cases, gaps were ignored when classifying combinations of secondary structure at a site or if secondary structure conservation exists at a particular site. Phylogenies were built for all protein products encoded in the MERS-CoV single-stranded RNA genome, except for NSP1, NSP2, NS3, NS4A, NS4B, ORF8b protein, and NSP11, all of which had insufficient sequence data (<10 sequence hits with BLAST). NSP12 is often used as a measure for newly identified coronaviruses. According to the International Committee of Taxonomy of Viruses, a major criterion in determining if a coronavirus is considered novel is pairwise sequence identity below 90% for NSP12 in all comparisons to previously known coronaviruses (Bermingham et al. 2012) . Four main clades, alphacoronavirus, betacoronavirus, gammacoronavirus, and deltacoronavirus ( fig. 1) , are identified in agreement with the taxonomic classifications described by the ICTV (International Committee on Taxonomy of Viruses 2015). Coronaviruses not listed by the ICTV are assumed to be a part of the clade in which representatives with known classifications are situated in our NSP12 phylogeny. The MERS clade and SARS clade are sister clades in the NSP12 phylogeny. The HKU1 clade and EQU clade are also sister clades. Together these four clades form the Betacoronavirus clade, in accordance with the ICTV classification (International Committee on Taxonomy of Viruses 2015). Betacoronavirus is represented in all phylogenies although the order of the individual subclades varies. Alphacoronavirus is often found as the sister clade or outgroup to betacoronavirus. Deltacoronavirus or gammacoronavirus are the most distantly related to the betacoronavirus. In the nucleocapsid phylogeny, gammacoronavirus is the first outgroup clade to betacoronavirus, and alphacoronavirus is the most distant outgroup. Most NSP trees exhibit some unresolved nodes at junctures immediately preceding terminal nodes. As an effect of the 50% majority rule, most of the 546 resolved nodes are well supported with posterior probability >0.9 for 82% and >0.99 for 68% (supplementary fig. S1 , Supplementary Material online). Most trees follow the NSP12 topology for the main clades, with minor clade rearrangements. It should be noted that for NSP5, the entire alphacoronavirus clade is placed within the betacoronavirus clade, as a sister clade to the MERS clade (supplementary fig. S1 , Supplementary Material online). This may be due to increased sequence divergence rates or due to recombination. Recombination events are rather frequent in coronaviruses (Su et al. 2016) , and the MERS clade potentially underwent multiple recombination events as part of the host change (Zhang et al. 2016) . The phylogenies for membrane protein, spike protein, NSP5, and NSP8-NSP16 demonstrate (with the given BLAST cutoffs) recoverable protein homologs such that all coronaviruses are represented (i.e., all coronaviruses represented in the NSP12 phylogeny). Nucleocapsid, NSP4, and NSP7 have recoverable homologs in all clades except deltacoronavirus. NSP3 and NSP6 homologs are too divergent in deltacoronavirus and/or gammacoronavirus relative to MERS-CoV. Envelope appears specific to betacoronavirus ( fig. 1) , but it is a short protein that has been found to diverge rapidly and is likely present outside betacoronavirus (Fehr and Perlman 2015) . Because different protein families yield slightly different phylogenies, for the remaining evolutionary analyses, every protein family was analyzed in the context of its own phylogeny. For all protein families, structural disorder propensities were predicted using IUPred (Dosztá nyi et al. 2005a (Dosztá nyi et al. , 2005b and DISOPRED2 (Ward et al. 2004) . To verify the robustness of the binary IUPred and DISOPRED2 predictions, the binary assignments were compared on a site-by-site basis (table 1) . When converted to binary (i.e., two states per site disordered or ordered) IUPred 0.4 and IUPred 0.5 are in good agreement with the larger differences seen for NSP8, NSP9, and nucleocapsid (7.5%, 6.5%, and 19.0%, respectively) (table 1). Comparing IUPred 0.4 or IUPred 0.5 to DISOPRED2, large differences are in particular seen for nucleocapsid (38.7% and 29.7% respectively) and NSP8 (23.5% and 25.9%, respectively) (table 1) . For nucleocapsid, regions that are found to be disordered by IUPred 0.4 are found to be ordered by IUPred 0.5 and DISOPRED2 ( fig. 2 and supplementary fig. S2 , Supplementary Material online). For NSP8, regions that are only slightly disordered in a few sequences according to IUPred 0.4 and IUPred 0.5, DISOPRED2 predicts disorder to be conserved for all sequences (fig. 3) . To quantify the fraction of disordered sites per protein family, we report the IUPred 0.4 results only for simplicity (table 1). In general, IUPred 0.4 predicts more disorder than DISOPRED2, but several protein families have almost no disordered sites. NSP3 and NSP8-10 have some variation in disorder content for different viruses. Based on the fraction of disorder, nucleocapsid is the only highly disordered protein among the CoVs in this study, even if NSPs 8-10 have outliers that are >20% disordered. To compare the disorder-to-order transition rates (DOT) for all protein families where the binary matrices of disorder and order include both states, the quadrant count ratio (QCR) was estimated as a measure of association in assigning slower than average vs. faster than average transition rates. For IUPred 0.4 vs. IUPred 0.5, for IUPred 0.5 vs DISOPRED2, and for IUPred 0.4 vs. DISOPRED2, the QCRs (table 2) . For nucleocapsid and NSP8, the positive associations are weaker, suggesting that many sites have IUPred disorder propensity in the 0.4 to 0.5 range and large differences between IUPred and DISOPRED2, in accordance with the large disagreement between the binary assignment of these predictors (tables 1 and 2). For all protein families, secondary structure elements were predicted using PSIPRED (McGuffin et al. 2000) and JPred (Drozdetskiy et al. 2015) . For most protein families, the disagreement between secondary structure predictors is greater than for the disorder predictors (table 1). In fact, 15 of the 17 protein families compared disagree at more than 10% of alignment sites, and two of these disagree at more than 20% of sites. To compare the binary structureto-loop transitions (SLT), QCR was estimated as a measure of association for SLT based on the different predictors. In general, there is a moderate positive association between SLT for PSIPRED vs. SLT for JPred that is weaker than for the different DOT comparisons (table 2) . It should be noted that SLT does not differentiate between alpha helix and beta strand, but considers both as "structure." This is a correct assumption if protein structure is conserved and consistently predicted, but for some protein families that is not the case. Four protein families (NSP3, NSP12, NSP13, and SPIKE) have more than 40% of their sites found within the NC(HLS) category with non-conserved helix, strand, and loop (two or three states present at the same site) (table 3). For NSP13, JPred predicts 72% of all sites to be a mixture of helix, strand, and loop, or any combination of loop and one other structural element ( fig. 4) . Envelope and NSP6 have 13% and 12% of their respective sites in the NC(HS) category. Considering only the PSIPRED predictions, the NC(HS) category has 245 sites across all 17 protein families. That is one-tenth the size of the next smallest set which is C(HS) with 2275 sites. Next, C(L) has 3344 sites, and the largest category is NC(HLS) with 4257 sites. Comparing the evolutionary sequence rates for the sites in the different categories, based on PSIPRED predictions only, reveals that sites in the C(HS) category are evolving at a slower rate than all other categories. NC(HS) is only just significantly different (P = 4.62EÀ03) from C(HS), and is not significantly different from NC(HLS) and C(L) (P = 1.85EÀ02 and P = 8.33EÀ01, respectively). However, NC(HLS) and C(L) are significantly different from each other, and both are significantly different from C(HS) (P = 1.82EÀ46 and P = 2.33EÀ21, respectively) ( fig. 5 ). For regions with five or more consecutive sites that were 100% conserved in sequence across 1) all CoV or 2) across the MERS and SARS clades, the information of structural disorder prediction from IUPred and DISOPRED2 was used to identify all ungapped sites that were consistently predicted to have 100% conserved order. Next, the information of secondary structure prediction from PSIPRED and JPred was used to narrow down this list further by only including sites that are not changing their predicted secondary structure state for both predictors. Applying the aforementioned filters to the initial 10,000 sites resulted in one (1) region of five residues or more conserved across all CoV within the N-terminal domain of NSP12: DNQDL (table 4) . Interestingly, this region is in the vicinity of sites found important for nucleotidylating activity across the order Nidovirales (Lehmann et al. 2015) . Considering only the sequences in the SARS and MERS clades, 21 sequence regions of five residues or more were found in seven protein families (table 4) . For NSP5, NSP7, and NSP14, experimentally determined structures show that most regions are surface accessible ( fig. 6) . Some of the identified target sites are known for their functional importance. For instance, C145 in the middle of GSCGS in NSP5 is part of the catalytic dyad in the NSP5 protease (Yang et al. 2003) . For NSP12 and NSP13, which have the majority of all sites, no structures are available. The sites adjacent to DNQDL are also conserved in the SARS and MERS clades, and five additional target sites, conserved for the SARS and MERS clades, are found in the C-terminal direction relative to the DNQDL motif (table 4) . Continuing into the RNA-dependent RNA polymerase domain (RdRP) in NSP12, four additional regions of target sites are found, and the last three regions are found in the C-terminal part. Importantly, in RdRP and in the C-terminal part are sites that are also conserved across all CoVs in this study. NSP13 has four regions of target sites distributed across the protein. 3. -The evolutionary context of intrinsic disorder in NSP8. The phylogenetic tree was built using the multiple sequence alignments for NSP8. (A) The multiple sequence alignment is colored by amino acid according to scale, arranged based on TOP-IDP disorder promoting propensity of the amino acids (Campen et al. 2008) , and gray denotes gaps. (B) IUPred disorder propensity per site in the multiple sequence alignment. Blue-to-white-to-red shows disorder propensity according to the scale for IUPred 0.4. (C) IUPred disorder propensity per site in the multiple sequence alignment. Blue-to-white-to-red shows disorder propensity according to the scale for IUPred 0.5. (D) DISOPRED2 disorder propensity per site in the multiple sequence alignment. Blue-to-white-tored shows disorder propensity according to the scale. Above the multiple sequence alignment, the normalized evolutionary rates per site for amino acid substitution (SEQ) and the DOT for the binary transformations of B-D are shown. Heat maps visualized with the Python packages ETE3 (Huerta-Cepas et al. 2016) and Matplotlib (Hunter 2007) . See supplementary figures S2 and S4, Supplementary Material online for additional graphics for every protein family. We have analyzed the protein evolution of the genetic components that make up the MERS-CoV proteome. As previously established, MERS-CoV has the same genomic makeup as HKU4-CoV and HKU5-CoV in the MERS clade (Woo et al. 2012) . Some protein products are only found in the MERS clade, and these were excluded from this study due to insufficient data. Furthermore, for other protein products, some clades may not be represented in our protein families if their proteins were too divergent. This was an important factor in determining the applied BLAST hit cutoffs, as relaxing cutoffs produced alignments with more gaps and increasing stringency reduced the representative pool. Because alignment quality is important due to the sensitivity of both Rate4Site and for phylogenetic reconstruction, the chosen cutoffs are suitable. We note some clade-specific differences in recoverable homologs between different CoV, but many components are shared among them ( fig. 1 ). Viral proteins often possess multifunctionality, mediated by a conformational change in response to environment-specific factors (Xue et al. 2014) . Although conformational flexibility is important for function, it also offers flexibility in what sequence motifs are on display. If these sequences are rapidly diverging, different sequence motifs will be displayed, reinforcing the notion that flexible regions are potentially important in rewiring protein-protein interactions between virus and host (Gitlin et al. 2014) . Although most CoV proteins have almost no intrinsic disorder, several CoV protein families have homologous sites that display loop in some sequences, helix in others and strands in some (table 3, supplementary fig. S3 , Supplementary Material online). These sites are not necessarily disordered but they may be conformationally flexible in realtime (with secondary structure transitions in the same sequence, making them difficult to predict) or on evolutionary time-scales (so that different secondary structure elements actually are present in different sequences). The C(HS) and C(L) sites make up approximately 50-80% of most multiple sequence alignments. With the common expectation that protein structure is more conserved than sequence these numbers are surprisingly low. Neither PSIPRED nor JPred consistently predicts the same state for 20-50% of all sites in these multiple sequence alignments. The accuracy of PSIPRED and JPred's secondary structure predictions are about 80% (Bryson et al. 2005; Drozdetskiy et al. 2015) . PSIPRED has been found to rarely predict an alpha helix instead of a beta strand and vice versa, and most of the PSIPRED errors are due to secondary structure not being predicted (Li et al. 2014) . When secondary structure is not conserved for the same site in a multiple sequence alignment, it suggests that the secondary structure prediction may be 1) inaccurate, 2) not predicted with high confidence, or 3) the regions are indeed metamorphic; they can transition from one element to another. Although (1) is difficult to address without experimentally determined structures for all sequences, (2) and (3) are not necessarily incompatible interpretations because low confidence secondary structure prediction could indicate metamorphic secondary structure regions. Metamorphic secondary structure regions have interesting consequences for conformational and functional flexibility. It should be noted that, despite the low amount of disordered sites in most CoV proteins, several regions are not conserved in disorder propensity across all sequences, but sometimes the different predictors disagree as in the case of NSP8. Clade-specific disordered regions resulting from indel events suggest that they are not essential to the critical functions of the protein, but could cause gain-and-loss of interactions with its hosts. However, when disorder propensity is only mildly fading for a region that is present across the protein family, it may be important for the fundamental function of the protein. The virus structural proteins that interact to form the virion commonly include an envelope protein, a membrane protein, and a capsid protein that together form the machinery that encases, transports, and releases the virus. The interactions between the structural proteins are often regulated by conformational changes like VP40 in Ebola (Bornholdt et al. 2013 ) and Envelope protein from Dengue virus (Zheng et al. 2014 ). Conformational changes to another. It should also be noted that two MERS clade specific inserts around position 241 and toward the C-terminal are consistently predicted to be highly disordered. With inserts and changing structural dynamics between clades or viruses, the questions become 1) which sequence motif are displayed and 2) to what extent are these sequence motifs displayed? Furthermore, based on the inconsistent prediction of secondary structure elements, the possibility that CoVs are more conformationally flexible than their intrinsic disorder content implies is noteworthy. Altogether, this suggests that various mechanisms for rewiring conformational and functional space are operating in the coronaviruses studied here. If regions symptomatic of conformational and functional flexibility can be avoided in order to identify broad-specificity antiviral targets with potential to be effective against coronaviruses of today and in the future, coronaviruses as a group may become more attractive drug targets for the pharmaceutical industry in the event an additional coronavirus changes host to include humans or increase its virulence. Supplementary tables S1 and S2 and figures S1-S5 are available at Genome Biology and Evolution online (http://www. gbe. oxfordjournals.org/). Update on SARS research and other possibly zoonotic coronaviruses The Protein Data Bank Severe respiratory illness caused by a novel coronavirus Structural rearrangement of ebola virus VP40 begets multiple functions in the virus life cycle Flavivirus NS3 and NS5 proteins interaction network: a high-throughput yeast two-hybrid screen Protein structure prediction servers at University College London TOP-IDP-scale: a new amino acid scale measuring propensity for intrinsic disorder GLOOME: gain loss mapping engine Inference and characterization of horizontally transferred gene families using stochastic mapping Middle East respiratory syndrome coronavirus (MERS-CoV): announcement of the Coronavirus Study Group IUPred: web server for the prediction of intrinsically unstructured regions of proteins based on estimated energy content The pairwise energy content estimated from amino acid composition discriminates between folded and intrinsically unstructured proteins JPred4: a protein secondary structure prediction server Coronaviruses: an overview of their replication and pathogenesis The Complexity of a Dengue Vaccine: A Review of the Human Antibody Response Target sites shown in 3D context. (A) NSP5 dimer, based on PDB id 1UK4 (Yang et al. 2003). (B) NSP7, based on PDB id 5F22 (unpublished). (C) NSP14, based on PDB id 5C8T A computationally optimized broadly reactive antigen (COBRA) based H5N1 VLP vaccine elicits broadly reactive antibodies in mice and ferrets Computationally optimized antigens to overcome influenza viral diversity Rapid evolution of virus sequences in intrinsically disordered protein regions Molecular pathology of emerging coronavirus infections MRBAYES: Bayesian inference of phylogenetic trees ETE 3: Reconstruction, analysis, and visualization of phylogenomic data Virus taxonomy: classification and nomenclature of viruses: Ninth Report of the International Committee on Taxonomy of Viruses The rapid generation of mutation data matrices from protein sequences MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform Minimization of genetic distances by the consensus, ancestral, and center-of-tree (COT) sequences for HIV-1 variants within an infected individual and the design of reagents to test immune reactivity Discovery of an essential nucleotidylating activity associated with a newly delineated conserved domain in the RNA polymerase-containing protein of all nidoviruses Bayesian model of protein primary sequence for secondary structure prediction Bat-to-human: spike features determining 'host jump' of coronaviruses SARS-CoV, MERS-CoV, and beyond Structural basis and functional analysis of the SARS coronavirus nsp14-nsp10 complex Comparison of site-specific rate-inference methods for protein sequences: empirical Bayesian methods are superior An evaluation of phylogenetic methods for reconstructing transmitted HIV variants using longitudinal clonal HIV sequence data The PSIPRED protein structure prediction server Metagenomics and future perspectives in virus discovery Rapid evolutionary dynamics of structural disorder as a potential driving force for biological divergence in flaviviruses Marked variability in the extent of protein disorder within and between viral families Cross host transmission in the emergence of MERS coronavirus MrBayes 3: Bayesian phylogenetic inference under mixed models Exploring the viral world through metagenomics The evolution of protein structures and structural Ensembles under functional constraint The challenge of protein structure determination-lessons from structural genomics Cross-host evolution of severe acute respiratory syndrome coronavirus in palm civet and human Epidemiology, genetic recombination, and pathogenesis of coronaviruses Genomic characterization of a newly discovered coronavirus associated with acute respiratory distress syndrome in humans The DISOPRED server for the prediction of protein disorder Genetic relatedness of the novel human group C betacoronavirus to Tylonycteris bat coronavirus HKU4 and Pipistrellus bat coronavirus HKU5 World Health Organization. 2016. WHO j Middle East respiratory syndrome coronavirus Structural disorder in viral proteins The crystal structures of severe acute respiratory syndrome virus main protease and its complex with an inhibitor Structure-based inhibitor design for the intrinsically disordered protein c-Myc Evolutionary dynamics of MERS-CoV: potential recombination, positive selection and transmission A toggle switch controls the low pH-triggered rearrangement and maturation of the dengue virus envelope proteins We thank Joseph Ahrens, Janelle Nunez-Castilla, and Helena Gomes Dos Santos for assistance in the lab and for helpful discussions. The authors would also like to acknowledge the Instructional & Research Computing Center (IRCC) at Florida International University for providing HPC computing resources that have contributed to the research results reported within this article, web: http://ircc.fiu.edu.