International Journal of Computers, Communications & Control
Vol. I (2006), No. 3, pp. 25-32

Pathogen Variability. A Genomic Signal Approach

Paul Dan Cristea

Abstract: The conversion of genomic symbolic sequences into digital signals has been applied to the analysis of pathogen variability. Results are given on the variability of Human Immunodeficiency Virus, type 1, subtype F, isolated in Romania, and of the type A avian influenza virus H5N1, for which sequences have been downloaded from GenBank [1]. Nucleotide sequence analysis is corroborated with techniques based on the genomic signal approach to detect pathogen resistance to antiretroviral treatment. In the case of protease (PR) inhibitors, it is found that the treatment induces single nucleotide polymorphisms (SNPs) at specific sites. For moderate resistance, the changes affect the PR enzyme only at the level of the protein, whereas for multiple drug resistance, the RNA gene secondary structure also changes.

Keywords: Genomic signals, Pathogen variability, HIV, Influenza, Orthomyxoviridae, Drug resistance

1 Introduction

As shown in a series of previous papers [2-4], the conversion of nucleotide and amino acid sequences into digital signals makes it possible to apply signal processing methods to the analysis of genomic data. The genomic signal conversion used in our work is a one-to-one mapping of symbolic genomic sequences into complex signals, as described in [2]. The idea is to conserve all the information in the initial symbolic sequence, while bringing to the foreground some features significant for the subsequent processing and analysis. This direct method has proven its potential in revealing large scale features of DNA sequences, maintained at the scale of whole genomes or chromosomes, including both coding and non-coding regions.
One of the most conspicuous results is that the unwrapped phase of DNA complex genomic signals varies almost linearly along all investigated chromosomes, for both prokaryotes and eukaryotes. The slope is specific to each taxon and chromosome. This regularity of the genomic signals reveals a corresponding large scale regularity in the distribution of pairs of successive nucleotides, which is similar to Chargaff’s first order rules for the frequencies of occurrence of the nucleotides [5].

We applied the same genomic signal approach to study the variability of several pathogens, including the Human Immunodeficiency Virus, type 1 (HIV-1), subtype F, isolated from Romanian patients at the National Institute of Infectious Diseases “Prof. Dr. Matei Bals”, Bucharest [3], and the avian influenza virus type A, based on genomic sequences downloaded from GenBank [1]. We have used mainly the phase analysis of the complex genomic signals attached to the nucleotide sequences describing viral genes, as well as the analysis of the corresponding secondary RNA structure and of the phylogenetic neighbor-joining trees for some of these genes. The focus of the study is primarily on the enzyme changes involved in generating pathogen resistance to multiple drug treatment.

A novel methodology for describing sets of related genomic signals, based on a common reference and on individual differences, has been developed. Variability signals with respect to average, median and maximum flat references, and digital derivatives of genomic signals, are applied for this purpose. Applying this method, it has been found that the mutations in the genes of the analyzed viruses occur only in some specific, well defined locations, while the largest part of their genome remains unchanged. The mutations conferring drug resistance are a subset of all mutations occurring in the studied viruses.
On the other hand, for the case of HIV protease, it has been shown that the changes in response to the antiretroviral drug treatment occur not only at the level of the final enzyme product, preventing the action of the drug on the active protease catalytic site, but also at the level of the protease gene RNA secondary structure. This type of change has been found only for multiple drug resistant viruses.

Copyright © 2006 by CCC Publications. Selected from ICCCC 2006 (invited paper).

2 Symbolic Sequence Conversion

For convenience, we repeat here the mapping used in our work for the representation of the nucleotides [2]:

$$a = 1 + j, \quad c = -1 - j, \quad g = -1 + j, \quad t = 1 - j \qquad (1)$$

Apart from the mapping of the four nucleotides (a, c, g, t), the complete genomic signal representation of nucleotide sequences also comprises the mapping of all the other IUPAC symbols for nucleotide classes: s = {c, g} - strongly bonded, w = {a, t} - weakly bonded, r = {a, g} - purines, y = {c, t} - pyrimidines, m = {a, c} - amine, k = {g, t} - ketone, b = {c, g, t} = ¬a, d = {a, g, t} = ¬c, h = {a, c, t} = ¬g, v = {a, c, g} = ¬t, and n = {a, c, g, t} [2]. These symbols occur in the nucleotide sequences generated by genotyping because of the multiplicities determined either by the variability within the virus population or by noise. This is not the case for the consensus sequences downloaded from GenBank [1], which are curated to contain only the (a, c, g, t) nucleotide symbols. The mapping in equation (1) is bijective, so it conserves all the information in the initial symbolic sequence, while being as little biased as possible.
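The mapping in equation (1) can be sketched in a few lines of Python. This is a minimal illustration, not the author's implementation: the function names `to_signal` and `to_sequence` are hypothetical, and only the four nucleotide symbols are handled, since the values assigned to the IUPAC class symbols are given in [2] and are not reproduced here.

```python
# Equation (1): each nucleotide maps to one corner of a square in the complex plane.
NUCLEOTIDE_TO_SIGNAL = {
    "a": complex(1, 1),
    "c": complex(-1, -1),
    "g": complex(-1, 1),
    "t": complex(1, -1),
}

# The mapping is bijective, so it can be inverted without loss of information.
SIGNAL_TO_NUCLEOTIDE = {value: symbol for symbol, value in NUCLEOTIDE_TO_SIGNAL.items()}


def to_signal(sequence):
    """Convert a nucleotide string over (a, c, g, t) into a list of complex samples."""
    return [NUCLEOTIDE_TO_SIGNAL[symbol] for symbol in sequence.lower()]


def to_sequence(signal):
    """Recover the nucleotide string from a complex genomic signal."""
    return "".join(SIGNAL_TO_NUCLEOTIDE[sample] for sample in signal)
```

Converting a sequence with `to_signal` and back with `to_sequence` returns the original string, which is the sense in which the representation conserves all the information. A symbol outside (a, c, g, t) raises a `KeyError` in this sketch.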
3 Representation by Reference and Variation

To study the variability of the genomic signals in a given set, for example the signals for multiple resistant viruses, it is convenient to use a description comprising two types of components: (1) the reference - a signal considered to best describe the common variation of all components in the considered cluster; (2) the difference of each signal in the cluster with respect to the common reference. In such an approach, it is important to include in the common reference as much as possible of the variation shared by all the signals, and to keep in the individual difference of each signal only the variation that actually belongs to that signal, without external variation.

The reference can be chosen as one of the following:

• average (mean) of the signals, or another linear combination of the signals;
• median - the signal in the central position, or the average of the pair of signals placed centrally;
• maximum flat signal - a modified median that better keeps local variations on the signals where they occur, avoiding spurious transfers to other signals.

When the reference equals the average, the dispersion of the cluster of signals is minimum, i.e., the sum of the squares of the individual differences between each signal and the reference is minimized. But the average, like any other linear combination, has the important disadvantage that a localized variation of only one of the signals is transmitted to the reference, so that all the other signals show an apparent variation of opposite sign at that point. The median reference performs better: being a nonlinear function of the signals in the cluster, it decouples the common reference from the local variations of each individual signal. The median reference minimizes the sum of the absolute values of the differences between each signal and the reference.
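The crosstalk introduced by an average reference, and its absence with a median reference, can be illustrated on toy data. The sketch below is only an illustration under stated assumptions: the three signals are made-up values, not genomic data, and the helper names `reference` and `differences` are hypothetical.

```python
from statistics import mean, median


def reference(signals, kind="median"):
    """Pointwise reference of a cluster of equal-length signals."""
    pick = mean if kind == "average" else median
    return [pick(column) for column in zip(*signals)]


def differences(signals, ref):
    """Difference of each signal with respect to the common reference."""
    return [[x - r for x, r in zip(signal, ref)] for signal in signals]


# Toy cluster: the third signal carries a variation localized at index 2.
signals = [
    [0, 1, 2, 3, 4],
    [0, 1, 2, 3, 4],
    [0, 1, 5, 3, 4],
]

avg_ref = reference(signals, "average")
med_ref = reference(signals, "median")
avg_diff = differences(signals, avg_ref)
med_diff = differences(signals, med_ref)
```

With the average reference, the localized variation leaks into the reference, so the two unchanged signals acquire an apparent variation of opposite sign at index 2 (`avg_diff[0][2]` is negative). With the median reference, their differences stay at zero and the whole variation remains on the third signal.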
A variation localized on only one of the signals is no longer transmitted to the reference, so it does not affect the variation of the other signals with respect to the reference. The exception occurs when the signal carrying the localized variation is itself the median.

The maximum flat (MaxFlat) reference is equal to the median wherever the median has no variations that are not shared by other signals. Elsewhere, the MaxFlat reference assumes the minimal variation that corresponds to its trend, remaining constant if possible. Consequently, the variation signals show the changes that occur in each individual signal more clearly, with less "crosstalk". The digital derivatives of the variation signals show only the actual changes, caused by the variability in each of the signals and, for genomic signals, correspond directly to the SNPs.

4 HIV-1 Subtype F Variability

A phase analysis has been performed on a segment of about 1302 base pairs, approximately aligning with the standard sequence of HIV-1 (NC001802) in GenBank [1] over the interval 1799..3100 bp. This segment, which is currently used for the standard identification and assessment of HIV-1 strains, comprises the protease (PR) gene and almost two thirds of the reverse transcriptase (RT) gene. The PR and RT segments are contiguous and have been analyzed both together, as one entity, and independently, as two distinct encoding regions. The PR gene has a length of 297 bp and is located in the first interval (1..297 bp) of the sequenced DNA segment, corresponding to the 1799..2095 bp region of the NC001802 sequence. The RT encoding segment that has been analyzed has a length of 1005 bp and is located in the second interval (298..1302 bp) of the analyzed DNA segment, corresponding to the 2096..3100 bp region of the NC001802 sequence. The entire RT gene has 1680 bp, located in the interval 2096..3775 of the sequence.
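The coordinate bookkeeping for the PR and RT regions can be checked with a short sketch. It assumes, as the intervals in the text imply, that position 1 of the sequenced segment aligns with position 1799 of NC001802 and that the alignment has no gaps; the names `to_reference`, `region_in_reference` and `length` are hypothetical.

```python
# Position in NC001802 that aligns with position 1 of the sequenced segment (from the text).
SEGMENT_START_IN_REFERENCE = 1799

# Regions in segment coordinates, 1-based and inclusive (from the text).
REGIONS = {
    "PR": (1, 297),
    "RT (analyzed part)": (298, 1302),
}


def to_reference(position):
    """Map a 1-based segment position to NC001802 coordinates (gap-free alignment assumed)."""
    return SEGMENT_START_IN_REFERENCE + position - 1


def region_in_reference(name):
    """Return the inclusive NC001802 interval covered by a named region."""
    first, last = REGIONS[name]
    return to_reference(first), to_reference(last)


def length(first, last):
    """Number of bases in an inclusive interval."""
    return last - first + 1
```

Under these assumptions, `region_in_reference("PR")` gives (1799, 2095), `region_in_reference("RT (analyzed part)")` gives (2096, 3100), and `length(2096, 3775)` gives the 1680 bp of the complete RT gene.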
Figures 1 and 2 show the cumulated and unwrapped phase of genomic signals for the protease (PR) genes from nine instances of HIV type 1, F clade [1, 6]. Three cases come from treatment naïve patients (S - sensitive), three from patients that developed resistance to one of the drugs (R), and three with multiple resistance to the antiretroviral treatment (M). The cumulated phase is proportional to the unbalance in the number of nucleotides (statistics of first order) along the nucleic acid strand, given by $3(n_G - n_C) + (n_A - n_T)$ up to a $\pi/4$ factor, whereas the unwrapped phase is proportional to the difference between the number of direct and inverse nucleotide transitions (statistics of second order) along the nucleic acid strand, $n_+ - n_-$, with a $\pi/2$ factor [2]. Figures 3 and 4 give the same information for the segment comprising 1005 bp of the reverse transcriptase (RT) gene, out of the total of 1680 bp in this gene, for the same isolates as in Figs. 1 and 2.

As expected, the cumulated phase varies less than the unwrapped phase for these instances, as all mutations are of the SNP type and affect the nucleotide pair distribution more than the nucleotide distribution itself. Even for the unwrapped phase, the variation of the signal along the strand is quite similar for most of the sequences, but the local changes cumulate along the strands. Because the mutations are local, the general shapes of the phase signals are similar. Note also that all the genomic material in these sequences is encoding and uses the same reading frame.

The vertical strips in these figures mark the positions of the mutations (SNPs) that induce resistance to protease inhibitors (Indinavir, Ritonavir, Saquinavir, Nelfinavir, and Amprenavir) [1]. The mutations that lead to multiple drug resistance are concentrated in several sites. The sequences also display mutations in several other locations, but in most of the remaining genome the viruses have the same longitudinal structure.
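The two phase signals described above can be sketched as running counts, assuming the mapping a = 1 + j, c = −1 − j, g = −1 + j, t = 1 − j. The code is a minimal illustration rather than the procedure of [2]: it works in units of π/4 and π/2, the function names are hypothetical, and treating repeats and the opposite-corner exchanges (a↔c, g↔t) as neutral is an assumption of this sketch, chosen to match the count n+ − n−.

```python
from itertools import accumulate

# Phase of each nucleotide in units of pi/4, from the mapping assumed above.
PHASE_UNITS = {"a": 1, "g": 3, "c": -3, "t": -1}

# Counterclockwise order of the four points in the complex plane.
CYCLE = "agct"


def cumulated_phase_units(sequence):
    """Running sum of nucleotide phases, in units of pi/4."""
    return list(accumulate(PHASE_UNITS[symbol] for symbol in sequence.lower()))


def transition_sign(previous, current):
    """+1 for a direct (+pi/2) transition, -1 for an inverse (-pi/2) one, 0 otherwise."""
    step = (CYCLE.index(current) - CYCLE.index(previous)) % 4
    return {1: 1, 3: -1}.get(step, 0)


def unwrapped_phase_units(sequence):
    """Running value of n+ - n-, in units of pi/2, starting from 0 at the first base."""
    sequence = sequence.lower()
    signs = [0] + [transition_sign(p, c) for p, c in zip(sequence, sequence[1:])]
    return list(accumulate(signs))
```

The last value of `cumulated_phase_units` equals 3(nG − nC) + (nA − nT) for the whole sequence, and the last value of `unwrapped_phase_units` equals n+ − n−. A single substitution shifts the cumulated phase by a constant from that position onward, while it changes the two transitions adjacent to it in the unwrapped phase.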
The effect of the mutations can be seen more easily on the unwrapped phase, which is more sensitive to SNPs. Successive mutations of the SNP type do not make the signals diverge from one another, as might be expected. On the contrary, the signals tend to cluster, as the variations tend to compensate each other, so that the overall span of the signals does not increase directly with the number of mutations and the number of signals. This is further evidence that, from the structural point of view, a genomic sequence satisfies more restrictions than a "plain text", which must just correspond to a certain semantics and to certain grammar rules, and more closely resembles a “poem”, which additionally obeys rules of symmetry, giving its “rhythm” and “rhyme”. The recurrence of such patterned structures is reflected in simple mathematical rules satisfied by the corresponding genomic signals.

The representation can be improved by using the reference-difference description, choosing the maximum flat (MaxFlat) reference, as shown in Fig. 5 for the unwrapped phase in Fig. 4. In this case, the largest possible part of the common behavior of the signals is included in the reference signal, whereas each individual variation signal keeps only the changes occurring in that particular signal, or in the class it belongs to. The reference signal is no longer necessarily equal in each interval to one of the signals, even when the number of signals is odd. The digital derivatives of the difference signals, shown in Fig. 6, show only the actual changes caused by the variability in each of the signals. In the case of HIV, these changes correspond directly to the SNPs. For multiple resistant strains, the pulses correspond to the sites known from the literature to confer resistance to various drugs.
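The way digital derivatives of the variation signals isolate SNPs can be sketched as follows. Several choices here are assumptions made for illustration only: the MaxFlat construction is not specified in enough detail to reproduce, so the plain median reference is used in its place; the cumulated phase (in units of π/4) is used instead of the unwrapped phase; the sequences are toy strings of equal length with substitutions only; and all function names are hypothetical.

```python
from itertools import accumulate
from statistics import median

# Phase of each nucleotide in units of pi/4 (a = 1+j, g = -1+j, c = -1-j, t = 1-j).
PHASE_UNITS = {"a": 1, "g": 3, "c": -3, "t": -1}


def cumulated_phase(sequence):
    """Running sum of nucleotide phases, in units of pi/4."""
    return list(accumulate(PHASE_UNITS[symbol] for symbol in sequence.lower()))


def variation_signals(signals):
    """Difference of each signal from the pointwise median (stand-in for MaxFlat)."""
    ref = [median(column) for column in zip(*signals)]
    return [[x - r for x, r in zip(signal, ref)] for signal in signals]


def digital_derivative(signal):
    """First difference of a signal; the first sample is kept as is."""
    return [signal[0]] + [b - a for a, b in zip(signal, signal[1:])]


def change_positions(sequences):
    """1-based positions where each sequence departs from the common reference."""
    variations = variation_signals([cumulated_phase(s) for s in sequences])
    return [
        [i + 1 for i, d in enumerate(digital_derivative(v)) if d != 0]
        for v in variations
    ]


# Toy example: the third sequence has a single t -> a substitution at position 4.
sequences = ["acgtacgt", "acgtacgt", "acgaacgt"]
positions = change_positions(sequences)
```

In this toy case the variation signal of the third sequence is a step starting at position 4, and its digital derivative is a single pulse at that position, so `positions` is empty for the first two sequences and contains only position 4 for the third.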
Figure 1: Cumulated phase expressed by $3(n_G - n_C) + (n_A - n_T)$ [2] for the protease (PR) gene of nine isolates of HIV-1, subtype F, showing sensitivity (S), resistance (R) and multiple resistance (M) to drugs.

Figure 2: Unwrapped phase expressed by $n_+ - n_-$ [2] for the protease gene of the isolates of HIV-1 in Fig. 1.

Figure 3: Cumulated phase of RT genomic signals for the isolates shown in Figs. 1 and 2.

Figure 4: Unwrapped phase for the RT gene in the isolates shown in Figs. 1 and 2.

HIV-1 makes many of its proteins in one long chain, and protease (PR) has the essential role of cutting this 'polyprotein' into the proper pieces, with the proper timing. Consequently, PR has been chosen as an important target for current anti-HIV drug therapy. PR is a small enzyme, comprising two identical peptide chains, each 99 amino acids long, which are encoded by the same gene of 297 nucleotides. The two chains form a tunnel that holds the polyprotein, which is cut at an active site located in the center of the tunnel. Drugs bind to PR, blocking its action.

Studying the estimated secondary structure of the PR RNA for the nine virions previously analyzed, it can be shown [3] that the structures are quite similar for drug sensitive viruses and for viruses resistant to a single drug. This result is consistent with the generally accepted model stating that the genomic changes of HIV that induce resistance to drugs operate at the level of the protein (the final protease enzyme), preventing the blocking of its catalytic site. On the other hand, it is remarkable that, for multiple drug resistant strains, there is a significant change in the RNA secondary structure. Large loops and bulges are replaced with similar, but smaller, less vulnerable, closed-loop structures.
These results indicate that the drug has a certain action at the level of the protease RNA, an effect that becomes evident when mutations conferring multiple drug resistance occur.

Figure 5: The unwrapped phase in Fig. 4 shown with respect to the MaxFlat reference.

Figure 6: Digital derivatives of the variation signals in Fig. 5.

5 Variability of the Hemagglutinin Gene of Influenza H5N1 Virus

The influenza virus envelope embeds two specific antigenic glycoproteins that project out of the virion surface, the Hemagglutinin (HA) and the Neuraminidase (NA). Many different combinations of HA and NA proteins are possible, but only the H1N1 (Spanish epidemic), H2N2 (Asian epidemic), and H3N2 (Hong Kong epidemic) subtypes have circulated worldwide among humans. The HA protein selectively binds to the sialic acid of the host cell surface receptors, thus recognizing the cells that the virus can invade [4, 6].

Figure 7 gives the cumulated phase of the HA gene for H5N1 viruses isolated from two humans (AF046080, AF046097) and one chicken (AF046088), in Hong Kong, in 1997 [6, 7]. The genes of viruses isolated close in time are similar, even when crossing the inter-species barrier, whereas a large variation can be seen for genes isolated at larger time intervals. Only a few SNPs are found in Fig. 8, which gives the differences of the cumulated phases with respect to the MaxFlat reference. The same result has been obtained for all the genes in the eight segments of the H5N1 virus [4, 6].

Figure 7: Cumulated phase of the HA gene, H5N1 virus (accessions AF046080, 88, 97 [1, 6]).

Figure 8: Differences of HA gene cumulated phases in Fig. 7 with respect to the MaxFlat reference.
6 Further Work

Further work will be focused on:

• the dynamics of Influenza Type A viruses that have so far crossed the species barrier from birds to humans, and which hold the potential to become highly contagious and highly lethal in humans, including the H5N1 subtype;
• extending the study from the nucleotide to the amino acid level, which could be more significant from the phenotypic point of view;
• using genomic signals to help cluster viruses into classes.

Acknowledgments

The sequences of HIV presented in this paper have been genotyped by Dr. Dan Otelea from the National Institute of Infectious Diseases “Prof. Dr. Matei Bals”, Bucharest, Romania. Results referring to the study of HIV variability have been previously jointly published [3].

References

[1] National Center for Biotechnology Information, National Institutes of Health, National Library of Medicine, GenBank, http://www.ncbi.nlm.nih.gov/genoms

[2] P. D. Cristea, "Representation and Analysis of DNA sequences", in Genomic Signal Processing and Statistics, Editors E. R. Dougherty, I. Shmulevich, Jie Chen, Z. J. Wang, Book Series on Signal Processing and Communications, Hindawi, 2005, pp. 15-65.

[3] P. D. Cristea, D. Otelea, Rodica Tuduce, "Study of HIV Variability Based on Genomic Signal Analysis of Protease and Reverse Transcriptase Genes", EMBC’05, Sept. 2005, Shanghai, China.

[4] P. D. Cristea, "Genomic Signal Analysis of Pathogen Variability", SPIE, BO24, paper 5699-52, San Jose, Jan. 2005, 12 pp.

[5] E. Chargaff, "Structure and function of nucleic acids as cell constituents", Fed. Proc., 10, pp. 654-659, 1951.

[6] D. L. Suarez et al., "Comparisons of highly virulent H5N1 influenza A viruses isolated from humans and chickens from Hong Kong", J. Virology, vol. 72 (8), pp. 6678-6688, 1998. (AF046080-99).

[7] E. Ghedin, N. Sengamalay, M. Shumway et al., "Large-scale sequencing of human influenza reveals the dynamic nature of viral genome evolution", Nature, vol. 437, Oct. 2005, pp. 1162-1166.

[8] M. S. Hirsch et al., "Antiretroviral Drug Resistance Testing in Adult HIV-1 Infection", in Proc. Recommendations of an International AIDS Society - USA Panel, JAMA, vol. 283, no. 18, May 10, 2000, pp. 2417-2426.

Paul Dan Cristea
University POLITEHNICA of Bucharest
Biomedical Engineering Center
Address: Spl. Independentei 313, sect. 6
060042 Bucharest, Romania
E-mail: pcristea@dsp.pub.ro