key: cord-338055-2d6n4cve authors: Hassan, Sk. Sarif; Ghosh, Shinjini; Attrish, Diksha; Choudhury, Pabitra Pal; Seyran, Murat; Pizzol, Damiano; Adadi, Parise; El-Aziz, Tarek Mohamed Abd; Soares, Antonio; Kandimalla, Ramesh; Lundstrom, Kenneth; Tambuwala, Murtaza; Aljabali, Alaa A. A.; Lal, Amos; Azad, Gajendra Kumar; Uversky, Vladimir N.; Sherchan, Samendra P.; Baetas-da-Cruz, Wagner; Uhal, Bruce D.; Rezaei, Nima; Brufsky, Adam M. title: A unique view of SARS-CoV-2 through the lens of ORF8 protein date: 2020-08-26 journal: bioRxiv DOI: 10.1101/2020.08.25.267328 sha: doc_id: 338055 cord_uid: 2d6n4cve Immune evasion is one of the unique characteristics of COVID-19 attributed to the ORF8 protein of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). This protein is involved in modulating the host adaptive immunity through downregulating MHC (Major Histocompatibility Complex) molecules and innate immune responses by surpassing the interferon mediated antiviral response of the host. To understand the immune perspective of the host with respect to the ORF8 protein, a comprehensive study of the ORF8 protein as well as mutations possessed by it, is performed. Chemical and structural properties of ORF8 proteins from different hosts, that is human, bat and pangolin, suggests that the ORF8 of SARS-CoV-2 and Bat RaTG13-CoV are very much closer related than that of Pangolin-CoV. Eighty-seven mutations across unique variants of ORF8 (SARS-CoV-2) are grouped into four classes based on their predicted effects. Based on geolocations and timescale of collection, a possible flow of mutations was built. Furthermore, conclusive flows of amalgamation of mutations were endorsed upon sequence similarity and amino acid conservation phylogenies. Therefore, this study seeks to highlight the uniqueness of rapid evolving SARS-CoV-2 through the ORF8. Severe acute respiratory syndrome-coronavirus-2 (SARS-CoV-2) is a novel coronavirus whose first outbreak was reported in December 2019 in Wuhan, China, where a cluster of pneumonia cases was detected, and on 11th March, 2020, WHO declared this outbreak a pandemic [1, 2, 3] . As of 22nd August 2020, a total of 22.9 million confirmed COVID-19 cases have been reported worldwide with 7,99,350 deaths [4] . SARS-CoV-2 belongs to the family Coronaviridae and has 55% 5 nucleotide similarity and 30% protein sequence similarity with SARS-CoV, which caused the previous outbreak of SARS in 2002 [5, 6] . SASR-CoV2 is a single-stranded RNA virus of positive polarity whose genome is approximately 30 kb in length and encodes for 16 non-structural proteins, four structural and six accessory proteins [7, 8] . The SARS-CoV-2 has a total of six accessory proteins including 3a, 6, 7a, 7b, 8 and 10 [9, 10] . Among these accessory proteins, ORF8 is a significantly exclusive protein as it is different from another known SARS-CoV and thus associated with high efficiency 10 in pathogenicity transmission [11, 12] . The SARS-CoV-2 ORF8 displays arrays of functions; inhibition of interferon 1, promoting viral replication, inducing apoptosis and modulating ER stress [13, 14, 15] . The SARS-CoV-2 ORF8 is a 121 amino acid long protein, which has an N-terminal hydrophobic signal peptide (1) (2) (3) (4) (5) (6) (7) (8) (9) (10) (11) (12) (13) (14) (15) aa) and an ORF8 chain (16-121 aa) [16, 17] . The functional motif (VLVVL) of SARS-CoV-ORF8b, which is responsible for induction of cell stress pathways and activation of macrophages is absent from the SARS-CoV-2 ORF8 protein [18] . In 15 the later stages of the SARS-CoV epidemic it was found that a 29 nucleotide deletion in the ORF8 protein caused it to split into ORF8a (39 aa) and ORF8b (84aa) rendering it functionless while the SARS-CoV-2 ORF8 is intact [19] . Also, the SARS-CoV ORF8 had a function in interspecies transmission and viral replication efficiency as a reported 382 nucleotide deletion, which included ORF8ab resulted in a reduced ability of viral replication in human cells [20] . However, the SARS-CoV-2 ORF8 mainly acts as an immune-modulator by down-regulating MHC class I molecules, therefore protecting the 20 infected cells against cytotoxic T cell killing of target cells. Simultaneously it was proposed forth that it is a potential inhibitor of type 1 interferon signalling pathway which is a key component of antiviral host immune response [21, 22] . The ORF8 also regulates unfolded protein response (UPR) induced due to ER stress by triggering ATF-6 activation, thus promoting infected cell survivability for its own benefit [23] . Since this protein impacts various host pathogen processes and developed various strategies that allow it to escape through host immune responses, it becomes important to study the 25 mutations in the ORF8 to develop a better understanding of the viral infectivity and for developing efficient therapeutic drugs against it [24] . In this present study, we identified the distinct mutations present across unique variants of the SARS-CoV-2 ORF8 and classified them according to their predicted effect on the host, i.e disease or neutral and the consequences on protein structural stability. Furthermore, we compared the ORF8 protein of SARS-CoV-2 with that of Bat-RaTG13-CoV and 30 Pangolin-CoV ORF8 and tried to determine the evolutionary relationships with respect to sequence similarity and discussed the rising concern on its originality. In addition we made a study on polarity and charge of the SARS-CoV-2 ORF8 mutations in two of its distinct domains and explored the possible effect on functionality changes. Following this, we present the possible flow of mutations considering different geographical locations and chronological time scale simultaneously, validating with sequence-based and amino acid conservation-based phylogeny and thereby predicted the possible route 35 taken through the course of assimilation of mutations. As of 14th August, 2020, there were 10,314 complete genomes of SARS-CoV-2 available in the NCBI database and accordingly each genome contains one of the accessory proteins ORF8 and among them only 127 sequences were found to 40 be unique. The amino acid sequences of the ORF8 were exported in fasta format using file operations through Matlab [25] . Among these 127 unique ORF8 protein sequences, only 96 sequences possess various mutations and the remaining sequences do not either possess any mutations or possess ambiguous mutations. In this present study, we only concentrate on 96 ORF8 proteins, which are listed in Table 1 . In order to find mutations, we hereby consider the reference ORF8 protein as the ORF8 sequence (YP 009724396.1) of the SARS-CoV-2 genome (NC 045512) from China: Wuhan [26] . 45 Only the mutations L3F, T14A, K44R, F104Y and V114I are embedded in the ORF8 proteins of Bat-CoV. Mutations in proteins are responsible for several genetic orders/disorders. Identifying these mutations requires novel detection methods, which have been reported in the literature [27] . In this study, each unique ORF8 sequence was aligned using NCBI protein p-blast and sometimes using omega blast suites to determine the mismatches and thereby the missense mutations (amino acid changes) are identified [28, 29] . For the effect of identified mutations, a web-server 'Meta-SNP' was used and also for the structural effects of mutations, another web-server 'I-MUTANT' was used [30, 31] . The web-server 60 'QUARK' was used for prediction of the secondary structure of ORF8 proteins [32, 33] . A protein sequence of ORF8 is composed of twenty different amino acids with various frequencies. Per-residue disorder distribution in sequences of query proteins was evaluated by PONDR-VSL2 [40] , which is one of the more accurate standalone disorder predictors [41, 42, 43] . The per-residue disorder predisposition scores are on a scale from 0 to 1, where values of 0 indicate fully ordered residues, and values of 1 indicate fully disordered residues. Values above the threshold of 0.5 are considered disordered residues, whereas residues with disorder scores between 0.25 and 0.5 are considered highly flexible, and residues with disorder scores between 0.1 and 0.25 are taken as moderately flexible. The SARS-CoV-2 ORF8 protein (YP 009724396) is a 121 amino acid long protein which has an N-terminal hydrophobic signal peptide (1-15 aa) and an ORF8 chain (16-121 aa). A schematic representation of ORF8 (SARS-CoV-2) is presented in Fig.2 . It was observed that the total number (63) of hydrophilic residues is more than that (58) of the hydrophobic residues. However, from the predicted secondary structure (Fig.3) , it was found that the highest solubility score is four, indicating that although hydrophilic residues are higher in number, however, they are not highly exposed to the external environment since they are folded inside of the protein therefore this protein is poorly soluble. We further predicted the secondary structure as well as solvent accessibility of ORF8 proteins of SARS-CoV-2, Bat- RaTG13-CoV and Pangolin-CoV using the ab-initio webserver QUARK (Fig.3 ) and tried to perceive the differences. Comparing the ORF8 secondary structures of SARS-CoV-2, Bat-RaTG13-CoV and Pangolin-CoV, the following changes are found at four different locations as presented in Table 2 . Table 2 , it is inferred that the secondary structures of ORF8 (SARS-CoV-2) and that of Bat-RaTG13 are much closer related than the ORF8 of Pangolin-CoV. 90 It was obtained that the largest conserved region in the ORF8 protein of SARS-CoV-2 is 'PFTINCQE' (in D2), eight amino acids long (Table 3 ) as seen in Fig.4 . It turned out that the 6.6% region is 100% evolutionary conserved across all 96 distinct variants of the 121 amino acid long ORF8 protein. Most of the conserved regions in SARS-CoV-2 ORF8 (Table 3 and Fig.4 ) lie around helix-coil and strand-coil junctions signifying the functional importance of these regions. Helix regions also have conserved amino acids. It can be hypothesised 95 that these junctions are involved in protein-protein interactions. Therefore, these regions are conserved in nature. The ORF8 protein of SARS-CoV-2 has only 55.4% nucleotide similarity and 30% protein identity with that of SARS-CoV as shown in Fig.5 . Although the SARS-CoV-2 ORF8 has differentiating genome characteristics but exhibit high functional similarity with that of SARS-CoV ORF8ab: • The SARS-CoV ORF8ab original protein was found to have an N-terminal hydrophobic signal sequence that directs its transport to the endoplasmic reticulum (ER). However, after deletion of 29 nucleotides, which split the ORF8ab protein into ORF8a and ORF8b, it was found that only ORF8a was able to translocate to the ER while ORF8b remained distributed throughout the cell. Likewise, the SARS-CoV-2 ORF8 protein also contains an N-terminal hydrophobic signal peptide (1-15 aa), which is involved in the same function. • The ER has an internal oxidative environment akin to other organelles, which is necessary for protein folding and oxidation processes. Due to this oxidative environment, formation of intra or inter-molecular cysteine bonds between unpaired cysteine residues takes place as the SARS-CoV ORF8ab protein is an ER resident protein and there are ten cysteine residues present, which exhibit disulphide linkages and form homomultimeric complexes within the ER. Similarly, the ORF8 of SARS-CoV-2 has also seven cysteines, which may be expected to form these types of disulphide 110 linkages. • The SARS-CoV ORF8ab is characterized by the presence of an asparagine residue at position 81 with the Asn-Val-Thr motif responsible for N-linked glycosylation, whereas SARS-CoV-2 ORF8 has an N-linked glycosylation site at Asn (78) and the motif is Asn-Tyr-Thr. • The SARS-CoV-2 ORF8 is found to have both protein-protein and protein-DNA interactions while SARS-CoV ORF8ab shows only protein-protein interactions. Further, it was found that the ORF8 protein of SARS-CoV-2 is very much similar (95%) to that of Bat-CoV RaTG13 on the basis of sequence similarity as well as on the basis of phylogenetic relationships as shown in Fig.6 . As we can see from the sequence alignment, there are only six amino acid differences between the SARS-CoV-2 ORF8 and Bat-CoV RaTG13. All of these mutations in the ORF8 protein with respect to the reference ORF8 sequence of 120 Bat-RaTG13 were found to be neutral type as predicted through webserver Meta-SNP and all of them had a decreasing effect on stability of the protein as determined using the server (I-MUTANT). We have also aligned the Pangolin-CoV ORF8 sequence with that of SARS-CoV-2 and found that there is a sequence similarity of 88%, as depicted in Fig.7 . We observed a difference of 15 amino acid residues between the Pangolin-CoV and the SARS-CoV-2 ORF8. It was 125 analysed that in both the Bat-CoV RaTG13 and the Pangolin-CoV ORF8 protein the mutations L10I, V65A and S84L are occurred. So, it can be hypothesised that SARS-CoV-2 may have originated from Pangolin-CoV or Bat-CoV RaTG13 ORF8. Also, the comparison of frequencies of the hydrophobic, hydrophilic and charged amino acids was performed, which are present among the four different ORF8 proteins of SARS-CoV, SARS-CoV-2, Bat RaTG13-CoV and Pangolin-CoV was 130 made. As seen in Table 4 , SARS-CoV-2 ORF8, SARS-CoV ORF8ab, Bat-CoV RaTG13 ORF8 and Pangolin-CoV ORF8, are all similar in terms of hydrophobicity and hydrophilicity and it is known that hydrophobicity and hydrophilicity play an important role in protein folding which determines the tertiary structure of the protein and thereby affect the function. The ORF8 sequences of SARS-CoV-2, Bat-CoV RaTG13 and Pangolin-CoV have almost the same positive and negative charged amino acids, therefore we can say that probably they have similar kind of electrostatic and hydrophobic interactions, 135 which also contribute to the functionality of the proteins. Again, for the SARS-CoV ORF8ab it was found that the number of positive and negative charged amino acids are closely similar to the SARS-CoV-2 ORF8. Although, the SARS-CoV sequence bears less similarity with the SARS-CoV-2 but they are probably similar in terms of electrostatic and hydrophobic interactions. Furthermore, we analysed the three ORF8 sequences and checked for their molecular weight, isoelectric point (PI), 140 hydropathy, net charge and extinction coefficient using a peptide property calculator (https : //pepcalc.com/) and found that all the properties are almost identical, which shows that the ORF8 protein of Bat-CoV RaTG13, Pangolin-CoV and SARS-CoV-2 are very closely related from the chemical aspects of amino acid residues (Fig.8 ). indicating that ORF8 of the Pangolin-CoV is more negatively charged as compared to the SARS-CoV-2 ORF8. Also, since these three proteins have a similar molecular weight, PI, extinction coefficient and nature of hydropathy plot, it would be difficult to differentiate these three proteins by biophysical techniques on the basis of these properties. The ORF8 of SARS-CoV-2 is significantly different from the ORF8 of Pangolin-CoV and it seems that the ORF8 150 protein (SARS-CoV-2) is imitating the properties as well the structure of ORF8 of Bat RaTG13-CoV as a blueprint. The differences between the ORF8 of SARS-CoV-2 and Bat-CoV RaTG13 and Pangolin-CoV can be further demonstrated by the analysis of the per-residue intrinsic disorder predispositions of these proteins. Results of this analysis are shown in Fig.9A , which illustrates that the intrinsic disorder propensity of the ORF8 from SARS-CoV-2 is closer to that of the ORF8 from Bat-CoV RaTG13 than to the disorder potential of ORF8 from Pangolin-CoV. This is in agreement 155 with the results of other analyses conducted in this study. Analysis is conducted using PONDR-VSL2 algorithm [40] , which is one of the more accurate standalone disorder predictors [41, 42, 43] . A disorder threshold is indicated as a thin line (at score = 0.5). Residues/regions with the disorder scores > 0.5 are considered as disordered. Each of the ORF8 amino acid sequences (fasta formatted) are aligned with respect to the ORF8 protein (YP 009724396.1) from China-Wuhan using multiple sequence alignment tools (NCBI Blastp suite) and found the mutations and their associated positions were detected accordingly [29] . It is noted that a mutation from an amino acid A 1 to A 2 at a position p 160 is denoted by Based on observed mutations, it is noticed that amino acids threonine (T) and tryptophan (W) are found to be most vulnerable to mutate to various amino acids. It is noteworthy that the SARS-CoV-2 ORF8 is rapidly undergoing different type of mutations, indicating that it is a highly evolving protein, whereas the Bat-CoV ORF8 is highly conserved (Fig.6) and the Pangolin-CoV ORF8 is 100% conserved (Fig.7) . A pie chart presenting the frequency distribution of various mutations is shown in Fig.11 . The N-terminal signal peptide of ORF8 (D1) of SARS-CoV-2 is hydrophobic in nature. We further analysed (Table 5) the mutations and observed that hydrophobic to hydrophobic mutations are dominating, indicating that hydrophobicity of the domain is maintained and thus we can postulate that there is probably no functional change in the hydrophobic N-terminal signal peptide. Further, it was found that there was a change from hydrophilicity to hydrophobicity in two positions, thus enhancing Distinct non-synonymous mutations and the associated frequency of mutations, predicted effect (using Meta-SNP) as well as the predicted change of structural stability (using I-MUTANT) due to mutation(s) are presented in Table 6 . The most frequent mutation in the ORF8 proteins turned out to be L84S (hydrophobic (L) to non-charged hydrophilic (S)) which is a clade (S) determining mutation with frequency 23 [44] . It is observed from Based on predicted effects and change of structural stability, mutations are grouped into four classes as shown in Table 190 7. In Table 8 , the list of unique ORF8 protein IDs and their associated mutations with domain(s) and the predicted effects and changes of structural stability are presented. Further, based on the three different types of mutations viz. neutral, disease and mix of neutral & disease, all the ORF8 proteins are classified into three groups which are adumbrated in Table 9 . Also the corresponding pie chart is given in Fig.12 . Note that, other than these two strains, there are other sixty-four other ORF8 sequences, which do not possess any of these two strain-determining mutations. This clarifies that the ORF8 protein is certainly one of the fundamental proteins which directs the pathogenicity of a variety of strains of SARS-CoV-2. To study the evolution of mutations and to observe the relationships among three ORF8 proteins, we compared SARS- CoV-2 ORF8 with that of Bat-CoV RaTG13 and Pangolin-CoV ORF8. The detailed analysis of all mutations is presented in Table 12 . Based on Table 12 , it can be suggested that reverse mutations will lead to the same sequence for the SARS-CoV-2 ORF8 protein will have the same sequence as that of Bat-CoV RaTG13 and Pangolin-CoV ORF8 proteins in the near future. We can also conclude that SARS-CoV-2 is showing genetically reverse engineering when compared with Bat-CoV 220 RaTG13 and Pangolin-CoV ORF8. The reversal of mutation also happened here and it was found that the frequency of Leucine to Serine mutation at 84th position is quite high. predisposition in the vicinity of residue 50, whereas the lowest disorder propensity in the vicinity of residue 70 is found in variants QKV06506.1 and QKQ29929.1. Finally, although variant QMU91370.1 is almost indistinguishable from variant QJS56890.1 within the first 40 residues, its intrinsic disorder predisposition in the vicinity of residue 70 is one of the lowest among all the proteins analysed in this study. Interestingly, comparison of the Fig.9 (A) and Fig.9 (B) show that the variability in the disorder predisposition between many variants of the protein ORF8 from SARS-CoV-2 isolates is 235 noticeably greater than that between the reference ORF8 from SARS-CoV-2 and ORF8 proteins from Bat-CoV RaTG13 and Pangolin-CoV. Here we present five different possible mutation flows according to the date of collection of the virus sample from patients [45] . Sequence homology-and amino acid compositions-based phylogenies have been drawn for the ORF8 proteins 240 associated in each flow. In this flow of mutations (Fig.13 In order to support these mutation flows, we analysed the protein sequence similarity based on phylogeny and amino acid 255 composition. The reference ORF8 sequence YP 009724396 is found to be much more similar to the variants QMT48896.1 and QMI92505.1, which are more similar to each other as depicted in the sequence based phylogeny (Fig.14 (left) ). This sequence based similarity of the ORF8 proteins QMT48896.1 and QMI92505.1 is illustrated in the chronology of mutations as shown in Fig.13 . Similarly, the mutation flow of the sequences QMT96239.1 and QMU92030.1 is supported by the respective sequence-based similarity. The network of five ORF8 protein variants from the US is justified based on similar amino acid compositions/conservations across the five sequences as shown in Fig.14 (right) . In this flow of mutations (Fig.15 ), we observed one sequence with first-order mutations, i.e where only one mutation accumulated in the sequence. Additionally, four sequences (all are from the US) were identified with second order mutations 265 stating that four sequences were found to have two mutations. to the hydrophobic phenylalanine (F), so it may account for disrupting the ionic interactions. As it is a neutral mutation, the sequence accumulated two neutral mutations. The protein sequence QLH58953.1 acquired a second mutation, P38S, which was found to be of disease-increasing type and the polarity also changed from hydrophobic to hydrophilic, thus 270 indicating these mutations may have some significant importance. The protein sequence QLH58821.1 possesses a second mutation, V62L, which was found to be of neutral type with no change in polarity. Here, this sequence accumulated two neutral mutations, which may account for some functional changes. By comparing both the sequence-based phylogeny ( Fig.16 (left) ) and amino acid conservation-based phylogeny (Fig.16 (right)), we found that according to sequence-based phylogeny the Australian sequence is closely related to the ORF8 275 Wuhan sequence. However, according to the pathway, it should be closely related to both the Wuhan sequence and the sequences having second order mutation. This can be attributed to the presence of 119 amino acid residues instead of 121 aa residues. In this case, the sequence has a two amino acid deletion, therefore, it is present at first node. Here we analysed the US sequences considering the Wuhan sequence (YP 009724396.1) as the reference and found one sequence, QKC05159.1, with a single mutation and seven sequences with two mutations each (Fig.17 ). The first sequence, QKC05159.1, contained the L84S mutation (strain determining mutation), which was of neutral type. However, the polarity changed from hydrophobic to hydrophilic, which may account for some significant change of function. The sequences that accumulated 2nd mutation along with L84S are as follows: • QMS53022.1: This protein sequence acquired a second mutation at position 11, which changed the hydrophilic amino acid threonine (T) to hydrophobic amino acid isoleucine (I) therefore affecting the ionic interactions. This 290 mutation was found to be a diseased-increasing type, so it may affect the structure of the protein. • QMT50804.1: This sequence gained a second mutation, E19D, which was predicted to be of disease-increasing type with no change in polarity. The sequence first accumulated a neutral mutation then a disease-increasing mutation, signifying that these mutations may have some functional importance. • QJD48694.1: H112Q occurred as second mutation in this sequence, which was found to be of disease-increasing 295 type with no change in polarity. Consequently, these mutations may contribute to immune evasion property of the virus. • QKV06506.1: This sequence possesses the S67F mutation, which was predicted to be of neutral type, which changed the hydrophilic serine (S) to the hydrophobic phenylalanine (F), thus interfering with the ionic interactions that may increase or decrease the affinity of the viral protein for a particular host cell protein. • QKV40062.1: This sequence acquired a second mutation at Q72H, which was found to be a neutral mutation and no change in polarity was observed. As this sequence accumulated two neutral mutations, it can be assumed that neutral mutations also have a significant importance. • QKV07730.1: The T11A mutation occurred as the second mutation in this sequence, which was predicted to be of disease-increasing type and the polarity was changed from hydrophilic to hydrophobic, hence the structure and 305 function of the protein are expected to differ. From the sequence based phylogeny (Fig.18 (left) ) it was observed that the Wuhan sequence was the first to originate. Although, QKC05159.1 is the first sequence in our flow considering the time, it was found that in the phylogenetic tree it is present at fourth node instead of second node, which is probably due to the presence of ambiguous mutations in this sequence. It was also determined that QKV07730.1 is very similar to QMT50804.1 and again QMT28672.1 is more similar 310 to both of them. All the other sequences having second order mutations are closely related to each other and follow the chronology. From the amino acid based analysis (Fig.18(right) ) it was found that the Wuhan sequence has a high conservation similarity with that of QKV06506.1, thus proving that this sequence was identified chronologically after the Wuhan sequence followed by QKC05159.1 and QMT28672.1, which again are very similar to each other. In the possible flow of mutations (Fig.19) , we have found one sequence with a single mutation, six sequences with two mutations, and another two sequences with three mutations. The US sequence QKC05159.1 was identified to have the L84S mutation, which is a strain determining mutation and was predicted to be a neutral mutation where polarity was changed from hydrophobic to hydrophilic. The sequences that accumulated second mutations along with L84S are as following and 320 it should be noted that the mutational accumulation occurred in a single strain: • QLH01196.1: The A65S mutation occurred as a second mutation in this US sequence, which was found to be 325 of neutral type. However, the polarity changed from hydrophobic to hydrophilic, thus potentially influencing the function of the protein. • QMU91334.1: This US sequence possesses the D63G mutation, which was predicted to be of neutral type. However, the polarity changed from hydrophilic to hydrophobic so, this sequence accumulated two neutral mutations, which may allow the virus to evolve in terms of virulence. serine (S). The mutation was neutral, thus accumulating two neutral and one disease-increasing mutations, being of 335 significant importance for the evolution of the virus. We identified one more sequence QLH57924.1, which possesses another third mutation, F16L, which was predicted to be a neutral mutation and no polarity change was observed. This sequence acquired three neutral mutations that may promote virus survival. • QKU37052.1: This sequence with the W45L mutation was reported in Saudi Arabia, which was found to be of disease-increasing type with no polarity change. Therefore, this sequence also accumulated one neutral and one 340 disease-increasing mutation, which may affect both the structure and function of the protein. • QMT96539.1: The F104S mutation was reported in the US sequence, which was found to be of disease-increasing type and the polarity changed from hydrophobic to hydrophilic. Altogether, the sequence possesses one neutral and one disease-increasing mutation that may allow the virus to acquire new properties for better survival strategies. Sequence-based phylogeny (Fig.20 (left) ) suggested that the Wuhan sequence originated first. Due to the presence 345 of an ambiguous amino acid sequence, QKC05159.1 did not show close similarity to the Wuhan sequence. Based on sequence based phylogeny (Fig.20 (left) ) it was observed that Wuhan sequence originated first. Due to presence of ambiguous amino acids sequence QKC05159.1 was not observed in close proximity to Wuhan sequence. QKG86865.1 and QLH57924.1 were found to have third order mutations and they are assumed to be closely related by the flow and the same has been supported by amino acid conservation-based phylogeny (Fig.20 (right) ). Flow-V QJR88780.1 (Australia) possesses the mutation L84S with reference to the Wuhan ORF8 sequence YP 009724396.1 (Fig.21) . Another sequence QJR88936.1 was reported, which possesses a second mutation, V62L. This mutation was predicted to be neutral with no change in polarity. However, the hydrophobicity increased. This sequence belongs to a particular strain and acquired two neutral mutations, indicating that these mutations may play some important role in the function of ORF8a. As can be seen from both the sequence-based phylogeny and amino acid conservation-based phylogeny (Fig.22) , the Wuhan sequence has originated earlier and the sequences QJR88780.1 and QJR88936.1 are more closely related to each other than to the Wuhan sequence as both sequences have one common mutation not present in the Wuhan sequence. Among SARS-CoV-2 proteins, the ORF8 accessory protein is crucial because it plays a vital role in bypassing the host immune surveillance mechanism. This protein is found to have a wide variety of mutations and among them L84S (23) and S24L (7) have highest frequency of occurrence, which bears distinct functional significance as well. It has been reported that L84S and S24L show antagonistic effects on the protein folding stability of SARS-CoV-2 [44]. L84S destabilizes protein folding, therefore up-regulating the host-immune activity and S24L favours folding stability positively, thus enhancing the 365 functionality of ORF8 protein. L84S is already established as a strain determining mutation and since according to our studies both L84S and S24L do not occur together in a single sequence of SARS-CoV-2 ORF8 protein, it is proposed that virus with the S24L mutation is a new strain altogether. We also observed that hydrophobic to hydrophobic mutations are dominant in the D1 domain. Therefore, hydrophobicity is an important property for the N-terminal signal peptide. However, in the D2 domain, hydrophobic to hydrophilic mutations are observed more frequently, consequently making the 370 ionic interactions more favourable and allowing the protein to evolve in terms of better efficacy in pathogenicity. The ORF8 sequence of SARS-CoV-2 shows 93% similarity with the Bat-CoV RaTG13 and 88% similarity with that of the Pangolin-CoV ORF8. Thus, the ORF8 protein of SARS-CoV-2 can be considered as a valuable candidate for evolutionary deterministic studies and for the identification of the origin of SARS-CoV-2 as a whole. We also analysed a wide variety of mutations in the SARS-CoV-2 ORF8, where we compared it with the ORF8 of Bat-CoV RaTG13 and 375 Pangolin-CoV in relation to charge and hydrophobicity perspectives and we found that the Bat-CoV RaTG13 ORF8 protein exhibits exactly the same properties as that of the SARS-CoV-2 ORF8 protein, whereas the properties of the Pangolin-CoV ORF8 are relatively less similar to the SARS-CoV-2 ORF8. Furthermore, to study the evolutionary nature of mutations in the ORF8, we aligned three bat sequences and found that two of them were exactly the same and there were only six amino acid differences in the third with respect to the other two sequences. So, only two variants were identified for Bat-CoV. Therefore, it shows that the rate of occurrence of mutations is slow in the Bat-CoV RaTG13 ORF8. However, for pangolins no differences were observed among four Pangolin-CoV ORF8 sequences and therefore only a single variant of ORF8 is present. Based on sequence alignment, biochemical characteristics and secondary structural analysis, the Bat-CoV, the Pangolin-CoV and the SARS-CoV-2 ORF8 displayed a high similarity index. Additionally, in the ORF8 of SARS-CoV-2, certain mutations were found to exhibit exact reversal with respect to bats and pangolins and therefore pointing towards the genomic origin of SARS-CoV-2. However, unlike Bat-CoV and Pangolin-CoV, the mutational distribution of the ORF8 (SARS-CoV-2) is widespread ranging from the position 3 to 121, having no defined conserved region. This surprises the scientific community enormously. Further this property differentiates the SARS-CoV-2 ORF8 from that of Bat-CoV and Pangolin-CoV, thus raising the question over the natural trail of evolution of mutations in SARS-CoV-2. We further predicted the types and effects of mutations of 95 sequences and grouped them into four domains and found that diseased type mutations with decreasing effect on stability are more prominent. Consequently, it is hypothesized that these mutations are promoting the viral survival rate. Furthermore, we tracked the possible flow of mutations in accordance to time and geographic locations and validated our proposal with respect to sequence-based and amino acid conservation-based phylogeny and therefore putting forward the order of accumulation of mutations. The authors do not have any conflicts of interest to declare. The epidemiology and pathogenesis of coronavirus disease (covid-19) outbreak The explosive epidemic outbreak of novel coronavirus disease 2019 (covid-19) and the persistent threat of respiratory tract infectious diseases to global health security The effect of human mobility and control measures on the covid-19 epidemic in china Coronavirus disease 2019 (covid-19): situation report Editorial-differences and similarities between severe acute respiratory syndrome (sars)-coronavirus (cov) and sars-cov-2. would a rose by another name smell as 420 sweet? Systematic comparison of two animal-tohuman transmitted human coronaviruses: Sars-cov-2 and sars-cov A genomic perspective on the origin and emergence of sars-cov-2 Genomic diversity 425 of sars-cov-2 in coronavirus disease 2019 patients The orf6, orf8 and nucleocapsid proteins of sars-cov-2 inhibit type i interferon signaling pathway The architecture of sars-cov-2 transcriptome The orf8 protein of sars-cov-2 mediates immune evasion through potently downregulating mhc-i, bioRxiv Sars-cov-2 orf8 and sars-cov orf8ab: Genomic divergence and functional convergence Genomic characterization of a novel sars-cov-2 Novel immunoglobulin domain proteins provide insights into evolution and pathogenesis of sars-cov-2-related viruses Extended orf8 gene region is valuable in the epidemiological investigation of sars-similar coronavirus Functional pangenome 440 analysis provides insights into the origin, function and pathways to therapy of sars-cov-2 coronavirus Functional pangenome analysis shows key features of e protein are preserved in sars and sars-cov-2 Severe acute respiratory syndrome coronavirus 445 2: From gene structure to pathogenic mechanisms and potential therapy Sars coronavirus accessory gene expression and function Understanding genomic diversity, pan-genome, and evolution of sars-cov-2 Immune evasion via sars-cov-2 orf8 protein? Host immune response and immunobiology of human sars-cov-2 infection The 8ab protein of sars-cov is a luminal er membraneassociated protein and induces the activation of atf6 Mechanisms of severe acute respiratory syndrome pathogenesis and innate immunomodulation Nosocomial outbreak of 2019 novel coronavirus pneumonia in wuhan, china Current methods of mutation detection Magic-blast, an accurate rna-seq aligner for long and short reads The embl-ebi search and sequence analysis tools apis in 2019 Collective judgment predicts disease-associated single nucleotide variants Casadio, I-mutant2. 0: predicting stability changes upon mutation from the protein 470 sequence or structure Ab initio protein structure assembly using continuous structure fragments and optimized knowledgebased force field Toward optimal fragment generations for ab initio protein structure assembly Use of amino acid sequence data in phylogeny and evaluation of methods using computer simulation Phylogeny of the serpin superfamily: implications of patterns of amino acid conservation for structure and function Dna barcode analysis: 480 a comparison of phylogenetic and statistical classification methods Molecular conservation and differential mutation on orf3a gene in indian sars-cov2 genomes Molecular conservation and differential mutation on orf3a gene in indian sars-cov2 genomes Molecular conservation and differential mutation on orf3a gene in indian sars-cov2 genomes Exploiting heterogeneous sequence properties improves prediction of protein disorder Comprehensive review of methods for prediction of intrinsic disorder and its 490 molecular functions Comprehensive comparative assessment of in-silico predictors of disordered regions Accurate prediction of disorder in protein chains with a comprehensive and empirically designed consensus Implications of sars-cov-2 mutations for genomic rna structure and host microrna targeting Pathogenetic perspective of missense mutations of orf3a protein of sars-cov2 Authors thank Prof. Bidyut Roy of Human Genetics Unit, Indian Statistical Institute, Kolkata, Indis for his kind support for structure predictions.