Adi Priyana 129 Exon prediction on DNA-genes of Plasmodium falciparum based on coding sequence structure using hidden Markov model Suhartati Agoes*a, Dadang Gunawan**, Sardy S**, and Hoedojo*** UNIVERSA MEDICINA Juli-September 2007Juli-September 2007Juli-September 2007Juli-September 2007Juli-September 2007 Vol.26 - No.3 Vol.26 - No.3 Vol.26 - No.3 Vol.26 - No.3 Vol.26 - No.3 ABSTRACT * Electrical Engineering Department, Faculty of Industrial Technology, Trisakti University ** Electrical Engineering Department, Faculty of Engineering, Indonesia University *** Department of Parasitology, Medical Faculty, Trisakti University Korespondensi aIr. Suhartati Agoes, MT Electrical Engineering Department, Faculty of Industrial Technology, Trisakti University Jl. Kyai Tapa No.1, Grogol Jakarta 11440 Telp. 08164836613 Email: suhartati_agoes@yahoo.com Universa Medicina 2007; 26: 129- 36. BACKGROUND A hidden Markov model (HMM) is used for exon prediction on DNA of genes Plasmodium falciparum that has a model structure based on exon region structure in coding sequence (CDS). The objective research was to develop a new structure model to predict exon on DNA-genes of Plasmodium falciparum based on CDS structure using the HMM system. METHODS Model design in CDS, between two exon regions can be found one intron region and the model state number is used for its region. Its state number is used by separating start codon from first exon region and stop codon from the last exon region up to 9. The Viterbi algorithm and the backward-forward method for transition as well as emission states are used for training process. Furthermore, Viterbi and Baum-Welch algorithms are used for the testing process. The correlation coefficient (CC) was used as performance indicator, as the ratio of the estimated state in the output and the original state in the input of the model. RESULTS The simulation results has shown that the CC values depend on the given of the backward-forward transition state values randomly. The model with state number 9 showed the highest average of CC values of 0.7289 for Viterbi algorithm, and is 0.7166 for Baum-Welch algorithm. However, the lowest average of CC values has been found for the model with state number five. Its values are 0.6735 by using Viterbi algorithm and 0.6661 by using Baum-Welch algorithm. CONCLUSION The new structure model based on HMM system was valid to predict exon on DNA-genes of Plasmodium falciparum. Keywords: Exon Prediction, DNA-gene, coding sequence, Hidden Markov Model 130 Agoes, Gunawan, Sardy, et al Exon prediction by using HMMExon prediction by using HMMExon prediction by using HMMExon prediction by using HMMExon prediction by using HMM Prediksi ekson DNA-gen Plasmodium falciparum berdasarkan struktur coding sequence dengan menggunakan model hidden Markov Suhartati Agoes*a, Dadang Gunawan**, Sardy S**, dan Hoedojo*** LATAR BELAKANG Sebuah hidden Markov model (HMM) yang digunakan untuk memprediksi ekson gen DNA Palsmodium falciparum mempunyai struktur model berdasarkan struktur gen DNA pada coding sequence (CDS). Penelitian ini bertujuan untuk mengembangkan sebuah model stuktur baru untuk prediksi ekson gen DNA Plasmodium falciparum berdasarkan struktur CDS dengan menggunakan sistem HMM. METODE Rancangan model pada CDS, di antara dua lokasi ekson dapat diketahui sebuah lokasi intron dan jumlah state model dapat dilakukan pada lokasi tersebut. Jumlah state dilakukan dengan memisahkan codon start dari ekson pertama dan codon stop dari ekson terakhir hingga mencapai 9. Algoritma Viterbi dan metode backward-forward transisi state serta emisi state digunakan untuk proses training. Sedangkan untuk proses testing menggunakan algoritma Viterbi dan Baum-Welch. Sebagai kinerja model digunakan Correlation Coefficient (CC) yang didapat dari perbandingan state estimasi pada output dan state asli pada input model. HASIL Hasil simulasi menunjukkan bahwa nilai CC tergantung pada pemberian nilai acak state transisi backward-forward. Model dengan jumlah state 9, menunjukkan nilai CC rata-rata tertinggi adalah 0,7289 untuk algoritma Viterbi dan 0,7166 untuk algoritma Baum-Welch. KESIMPULAN Struktur model berdasarkan sistem HMM sahih untuk memprediksi ekson gen DNA Plamodium falciparum Kata kunci : Prediksi ekson, gen-DNA, coding sequence, model hidden Markov * Jurusan Teknik Elektro Fakultas Teknologi Industri Universitas Trisakti ** Jurusan Teknik Elektro, Fakultas Teknik Universitas Indonesia *** Bagian Parasitologi Fakultas Kedokteran Universitas Trisakti Korespondensi aIr. Suhartati Agoes, MT Jurusan Teknik Elektro Fakultas Teknologi Industri Universitas Trisakti Jl. Kyai Tapa No.1, Grogol Jakarta 11440 Telp. 08164836613 Email: suhartati_agoes@yahoo.com Universa Medicina 2007; 26: 129-36. ABSTRAK INTRODUCTION I n c o d i n g s e q u e n c e ( C D S ) , t h e e x o n prediction on DNA-gene is very important to find the protein structure. Genome P. falciparum belongs to genome eukaryotic and has a long DNA genome, consisting of several exons and introns alternately located. After the splicing process in DNA sequence, some regions of exon in CDS will be found to produce the protein.(1) 131 T h e h i d d e n M a r k o v m o d e l ( H M M ) structure is one of the model used to predict the exon in desoxyribonucleai acid (DNA) which is based on the exon region structure in CDS. The HMM structure of the model: start codon (Methyanine/ATG), exon, intron and stop c o d o n ( TA A , o r T G A , o r TA G ) h a s b e e n d e m o n s t r a t e d b y R a b i n e r, H e n d e r s o n a n d Krogh.(2-4) This research developed a new model of structures for exon prediction on DNA of genes Plasmodium falciparum was based on the work of Nicorici and Anantharaman.(5,6) The significant difference of this model is it at least has two exon regions used for exon prediction like in CDS. Usually the 5’ boundary of introns in most eukaryotes contains the dinucleotide Guanine-Thymine (GT), and the 3’ boundary contains the dinucleotide Adenine-Guanine (AG). Furthermore, in intron regions of the dinucleotide GT were separated to be nucleotide Guanine (G) and Thymine (T), and dinucleotide AG to be nucleotide A and G. METHOD Research design The HMM framework in this experiment used Viterbi algorithm for the training process and both Viterbi and Baum-Welch algorithms for the testing process. The same sequence has been done for the training and testing process. The program is written in Matlab 7.0, and Bio- i n f o r m a t i c s ’ t o o l b o x t o g e n e r a t e D N A sequences in Genbank format and has the functions of HMM training and testing. The first experiment for the model based on CDS has been done with separated start codon (codon: ATG) from first exon and stop codon (codon: TAA, or TAG, or TGA) from the last exon, its model has the state number of 5 (Figure 1). Furthermore, the second experiment u s e d s e p a r a t e d d i n u c l e o t i d e G T a n d dinucleotide AG from the intron region, its model has the state number of 7 (Figure 2). The last experiment in the model also used separated d i n u c l e o t i d e G T i n t o G a n d T s t a t e s a n d dinucleotide AG into A and G states; the state number of the model is 9 (Figure 3). The emission values of the models can be performed a s t h e m a t r i x , t h e c o l u m n s a r e t h e b a s e s numbered on DNA sequences and the rows are the states number of the models designed. The state transition values of the models can be shown also in the figures. Samples A genome sequence of P. falciparum using 3D7 clone has 23-megabase nuclear genomes consisting of 14 chromosomes, encodes about 5,300 genes, and it has the most (A+T)-rich genome sequenced to date.(8-10) The data set of experiments for this simulation has 152 DNA sequences of genes from genome Plasmodium f a l c i p a r u m a t : c h r o m o s o m e 1 ( L o c u s : N C _ 0 0 4 3 2 5 ) , c h r o m o s o m e 2 ( L o c u s : N C _ 0 0 0 9 1 0 ) , c h r o m o s o m e 3 ( L o c u s : N C _ 0 0 0 5 2 1 ) , c h r o m o s o m e 4 ( L o c u s : N C _ 0 0 4 3 1 8 ) , c h r o m o s o m e 5 ( L o c u s : N C _ 0 0 4 3 2 6 ) , a n d c h r o m o s o m e 9 ( L o c u s : NC_004330) in Genbank format.(11,12) Genbank format describes the CDS and original DNA sequences of genes P. falciparum. In CDS of genes contains at least two exon regions and maximum 10 exon regions. The minimum length sequence of genes is 684 base pairs (bp) and maximum length is 10095 bp. HMM provides a good probability method for discrete sequences model of data like DNA sequences (alphabet of four letters: A, C, G and T). Prediction of exon in DNA of genes P. falciparum is based on exon region in CDS and it has at least two exon regions. The structure of the model based on CDS structure as shown in Figure 4, which separated is start codon from the first exon and separated stop codon from the last exon. Universa Medicina Vol. 26 No.3 132 Agoes, Gunawan, Sardy, et al Exon prediction by using HMMExon prediction by using HMMExon prediction by using HMMExon prediction by using HMMExon prediction by using HMM In the inputs are DNA sequences and set the state number of the sequences depends on the structure model. Based on HMM method, the Viterbi algorithm used for the training and needs the transition state values and emission values for the process. The value of transition state is random and the value of emission can be found with the distribution number of DNA nucleotide in each states. The result of HMM training is the estimated transition state and emission state values. The estimated transition and emission values were used for HMM testing p r o c e s s w i t h Vi t e r b i a n d B a u m - We l c h algorithms. The results of HMM testing process are the estimated states of the model.(2) The parameter of performance indicator used the Figure 3. The HMM structure with 9 states Exon Intron ExonStart Stop Figure 4. The HMM structure based on CDS 9 Start E G T I A G 1 2 3 4 5 6 7 E Stop 8 1 1 0.9 0.1 0.9 0.9 0.1 0.1 0.9 0.9 0.1 0.1 0.9 0.1 0.8 0.1 0.1 5 0 .9 1 E IS ta rt S to p 2 3 4 1 0 .5 0 .5 0 .1 0 .8 1 0 .1 9 1 1 0 -6E Figure 1. The HMM structure with 5 states S tart E G T I A G E Stop 1 2 3 4 5 6 7 1 10.85 0.1 0.05 0.7 0.3 0.9 0.1 0.7 0.9 0.1 0.3 Figure 2. The HMM structure with 7 states 133 correlation coefficient (CC), as a ratio of the estimated states from the output and the original states from the input of the model(1,7) and the formula is the following: Where: TP= True Positive; FP= False Positive; TN= True Negative; FN= False Negative Statistical analysis The emission matrix of the model with state number 5 is: e = [ 0.3333 0 0.3333 0.3333 ; 0.4342 0.1094 0.1507 0.3056 ; 0.4210 0.0599 0.0703 0.4487 ; 0.4215 0.1173 0.1622 0.2990 ; 0.5592 0 0.1075 0.3333 ] The emission matrix of the model with state number 7 is: e = [ 0.3333 0 0.3333 0.3333 ; 0.4342 0.1094 0.1507 0.3056 ; 0 0 0.5000 0.5000 ; 0.4210 0.0599 0.0703 0.4487 ; 0.5000 0 0.5000 0 ; 0.4215 0.1173 0.1622 0.2990 ; 0.5592 0 0.1075 0.3333 ] The emission matrix of the model with state number 9 is: e = [ 0.3333 0 0.3333 0.3333 ; 0.4342 0.1094 0.1507 0.3056 ; 0 0 1.0000 0 ; 0 0 0 1.0000 ; 0.4210 0.0599 0.0703 0.4487 ; 1.0000 0 0 0 ; 0 0 1.0000 0 ; 0.4215 0.1173 0.1622 0.2990 ; 0.5592 0 0.1075 0.3333 ] FN)FP).(TNFP).(TPFN).(TN(TP (FP.FN)-(TP.TN) CC ++++ = The transition state values are randomly, but the first state on the model has the minimum transition state value of 0 and the last state has the maximum transition state value of 1. On the other hand, the emission state values were found from the distribution number of DNA nucleotide i n e a c h s t a t e d e p e n d i n g o n t h e m o d e l s . Calculation of the CC is by using the above equation with the assumptions that the exon is positive and intron is negative. RESULTS Based on exon regions structure in CDS, the simulation results for the model with state number 5 above and transition state values randomly has found the average of CC value in Table 1. Table 2 and Table 3 shows the average of CC values of the model with state number 7 and state number 9. The transition state value in the first state of all structures on the models is 0 and the last state is 1. The other transition state values except in Table 2 are 0.1 and the h i g h e s t a v e r a g e o f C C v a l u e s f o r t h i s experiments are shown in Figure 5. The model based on exon regions structure in CDS for state number nine resulted in the highest average of CC. Its values are 0.7289 by using Viterbi algorithm and 0.7166 by using Baum-Welch algorithm. However, the lowest average of CC values has been found for the model with state number five. Its values are 0.6735 by using Viterbi algorithm and 0.6661 by using Baum-Welch algorithm. DISCUSSION All these were based on the model of exon regions structure in CDS, at the state number nine resulted in the highest average of CC values. It is also shown that the emission state value of t h e m o d e l s h a v e t h e ( A + T ) - r i c h g e n o m e sequences of genes Plasmodium falciparum. Universa Medicina Vol. 26 No.3 134 Agoes, Gunawan, Sardy, et al Exon prediction by using HMMExon prediction by using HMMExon prediction by using HMMExon prediction by using HMMExon prediction by using HMM Transition state values Corr.Coef.(CC) Exps tr11 tr22 tr33 tr44 tr55 Viterbi B-Welch 1 0 0.5 0.9 0.80 1 0.7054 0.6750 2 0 0.5 0.9 0.81 1 0.7061 0.6759 3 0 0.5 0.9 0.82 1 0.7054 0.6750 4 0 0.5 0.9 0.83 1 0.7054 0.6750 5 0 0.5 0.9 0.84 1 0.7045 0.6741 6 0 0.5 0.9 0.85 1 0.7054 0.6750 7 0 0.5 0.9 0.86 1 0.7062 0.6744 8 0 0.5 0.9 0.87 1 0.0012 0.0015 9 0 0.5 0.9 0.88 1 0.0011 0.0013 10 0 0.5 0.9 0.89 1 0.0011 0.0013 Table 1. The average of CC values for the model with state number 5 Transition state values Corr.Coef.(CC) Exps tr11 tr22 tr33 tr44 tr55 tr66 tr77 Viterbi B-Welch 1 0 0.4 0.2 0.8 0.10 0.7 1 0.6713 0.6643 2 0 0.4 0.2 0.8 0.20 0.7 1 0.6713 0.6643 3 0 0.4 0.2 0.8 0.30 0.7 1 0.6733 0.6660 4 0 0.4 0.2 0.8 0.40 0.7 1 0.6731 0.6660 5 0 0.4 0.2 0.8 0.50 0.7 1 0.6735 0.6661 6 0 0.4 0.2 0.8 0.60 0.7 1 -0.0642 -0.0729 7 0 0.4 0.2 0.8 0.59 0.7 1 -0.0568 -0.0639 8 0 0.4 0.2 0.8 0.58 0.7 1 0.6735 0.6661 9 0 0.4 0.2 0.8 0.56 0.7 1 0.6735 0.6661 10 0 0.4 0.2 0.8 0.55 0.7 1 0.6735 0.6661 Table 2. The average of CC values for the model with state number 7 Transition state values Corr.Coef.(CC) Exps tr11 tr22 tr33 tr44 tr55 tr66 tr77 tr88 tr99 Viterbi B-Welch 1 0 0.3 0.2 0.2 0.80 0.10 0.10 0.70 1 0.7242 0.7124 2 0 0.3 0.1 0.1 0.79 0.10 0.10 0.70 1 0.7242 0.7124 3 0 0.3 0.1 0.1 0.78 0.09 0.09 0.70 1 0.7242 0.7124 4 0 0.3 0.1 0.1 0.78 0.20 0.20 0.70 1 0.7258 0.7120 5 0 0.3 0.1 0.1 0.80 0.20 0.20 0.80 1 0.7258 0.7120 6 0 0.3 0.1 0.1 0.80 0.15 0.15 0.74 1 0.7258 0.7120 7 0 0.3 0.1 0.1 0.80 0.15 0.15 0.80 1 0.7258 0.7120 8 0 0.3 0.1 0.1 0.80 0.15 0.15 0.70 1 0.7258 0.7120 9 0 0.8 0.2 0.2 0.80 0.10 0.10 0.70 1 0.6787 0.6755 10 0 0.9 0.2 0.2 0.80 0.10 0.10 0.70 1 0.0451 0.0437 Table 3. The average of CC values for the model with state number 9 135 This can also be showed in their matrix values described above. In order to match with Markov’s chain in eukaryotic gene structure for c o n n e c t i n g t h e e x o n a n d i n t r o n , t h e n t h e backward-forward method are used in this HMM system. The experiments were done by Viterbi algorithm for the training process and the both Viterbi and Baum-Welch algorithms for the testing process. A set of 152 sequences is then used as data to train the model and a different HMM is produced for each set of sequences. HMM have been previously used in sequence analysis to produce an HMM that represents a sequence profile (a profile HMM), to analyze sequence composition and patterns, to locate genes by p r e d i c t i n g o p e n r e a d i n g f r a m e o r c o d i n g sequence, and to produce protein structure predictions. This HMM generates sequences 0.62 0.64 0.66 0.68 0.7 0.72 0.74 C orrelation C oefficient (C C ) T he highest average of CC values Viterbi 0.7061 0.6735 0.7258 Ba um-We lch 0.6759 0.6661 0.712 5 7 9 w i t h v a r i o u s c o m b i n a t i o n s o f m a t c h e d , mismatched, insertions and deletions, and gives these a probability, depending on the values of the various parameters in the model. A total of 424 predicted protein-coding g e n e s a n d 3 t R N A g e n e s o f P l a s m o d i u m falciparum have been identified on chromosomes 2 and 3, the two chromosomes for which s e q u e n c i n g h a v e b e e n c o m p l e t e d . ( 1 3 , 1 4 ) Approximately 37% (158 genes) of three genes have a readily identifiable homologue in another species. Such similarity allows a function to be implied, with many of those identified being involved in parasite metabolism. Interestingly, a comparison of the predicted protein sequences with their homologues in other species showed that the majority of Plasmodium falciparum proteins have insertion of low complexity sequences, often runs of a single amino acid Figure 5. The highest average of CC values of the experiments Universa Medicina Vol. 26 No.3 136 Agoes, Gunawan, Sardy, et al Exon prediction by using HMMExon prediction by using HMMExon prediction by using HMMExon prediction by using HMMExon prediction by using HMM r e s i d u e ( t y p i c a l l y a s p a r a g i n e s , l y s i n e o r glutamine acid), or tandem arrays of a short peptide repeat sequence. Examples of such regions have been identified previously, and are polymorphic between different parasite isolates. Perhaps it will open up new avenues for drug and vaccine development. Perhaps it will make a genetically modified mosquito that cannot act as vector a reality once processes are in place to deal with ethical, legal and social issues. But any new developments are unlikely to happen before the 2010 deadline set for Roll Back Malaria. New drugs based on knowledge about the genome will probably not be affordable to those countries that need it the most.(15) CONCLUSION This research with a new structure model was using for exon prediction on DNA-genes of Plasmodium falciparum based on CDS structure by using the HMM system. ACKNOWLEDGMENT T h e a u t h o r s w o u l d l i k e t o t h a n k d r. Herawati Sudoyo, PhD, dr. Helena MS and Hidayat Trimarsanto B.Sc from the Eijkman Institute, for their valuable suggestions. References 1. Samatova N.F. Computational gene finding using HMMs. UT-Battelle Information Center, 1201 Oak Ridge Turnpike, Suite 100, Oak Ridge, TN 37830. Institute Oak Ridge National Laboratory; 2003. 2. Lawrence R. A tutorial on hidden Markov models and selected applications in speech recognition. Proceeding of The IEEE. 1989; 77: 257-68. 3. Henderson J, Salzberg S, Fasman K.H. Finding genes in DNA with a hidden Markov model. J Comput Biol 1997; 4: 127-42. 4. Krogh A. An introduction to hidden Markov models for biological sequences. In Computational methods in molecular biology. Salzberg SL, Searls DB, Kasif S. editors. Denmark. Center for Biological Sequence Analysis, Technical University of Denmark. 1998. p. 45-63. 5. Nicorici D, Astola J, Tobus I. Computational identification of exons in DNA with a hidden Markov model. Tampere International Center for Signal Processing, Tampere University of Technology, Finland. Available at: http:// www.gensips.gatech.edu/processings/contributed/ CP2-06.pdf. Accessed May 12, 2005. 6. Anantharaman T. Finding genes in genomic DNA The GENESCAN System. Available at : http:// www.biostat.wisc.edu/bmi776. Accessed June 6, 2005. 7. Vaisman I. Bioinformatics and gene discovery. Bioinformatics Tutorial. North Carolina, United State: University of North Carolina at Chapel Hill. 1998. 8. Hall N, Gardner M.J, Hyman RW, Lasonder E, Wilson RJM, Scherf A, et al. Sequence of plasmodium falciparum chromosomes 1, 3-9 and 13. Nature 2002; 419: 527-31. 9. Gardner M.J, Hall N, Hyman RW, Hinterberg K, Mattei D, Wellem TE, et al. Sequence of plasmodium falciparum chromosomes 2, 10, 11 and 14. Nature 2002; 419: 531-4. 10. Hyman RW, Gardner M.J, Hall N, De Bruin D, Scherf A, Day KP, et al. Sequence of plasmodium falciparum chromosome 12. Nature 2002; 419: 534-7. 11. Anastassiou D. Genomic signal processing. IEEE Signal Processing Magazine 2001; 18: 4. 12. Alphey L. DNA sequencing. Manchester UK : Bios Scientific Publishers Limited; 1997. 13. Gardner M.J, Hall N, Fung E, White O, Berriman M, Hyman RW, et al. Chromosome 2 sequence of the human malaria parasite Plasmodium falciparum. Science 1998; 282: 1126-32. 14. Bowman S, Lawson D, Basham D, Brown D, Chillingworth T, Churcher C. M, et al. The complete nucleotide sequence of chromosome 3 of Plasmodium falciparum. Nature 1999; 400: 532- 8. 15. Anonymous. Malaria after the genomes. The Lancet 2002; 360: 1107. 129 Exon prediction on DNA-genes of Plasmodium falciparum based on coding sequence structure using hidden Markov model Suhartati Agoes*a, Dadang Gunawan**, Sardy S**, and Hoedojo*** UNIVERSA MEDICINA Juli-September 2007Juli-September 2007Juli-September 2007Juli-September 2007Juli-September 2007 Vol.26 - No.3 Vol.26 - No.3 Vol.26 - No.3 Vol.26 - No.3 Vol.26 - No.3 ABSTRACT * Electrical Engineering Department, Faculty of Industrial Technology, Trisakti University ** Electrical Engineering Department, Faculty of Engineering, Indonesia University *** Department of Parasitology, Medical Faculty, Trisakti University Korespondensi aIr. Suhartati Agoes, MT Electrical Engineering Department, Faculty of Industrial Technology, Trisakti University Jl. Kyai Tapa No.1, Grogol Jakarta 11440 Telp. 08164836613 Email: suhartati_agoes@yahoo.com Universa Medicina 2007; 26: 129- 36. BACKGROUND A hidden Markov model (HMM) is used for exon prediction on DNA of genes Plasmodium falciparum that has a model structure based on exon region structure in coding sequence (CDS). The objective research was to develop a new structure model to predict exon on DNA-genes of Plasmodium falciparum based on CDS structure using the HMM system. METHODS Model design in CDS, between two exon regions can be found one intron region and the model state number is used for its region. Its state number is used by separating start codon from first exon region and stop codon from the last exon region up to 9. The Viterbi algorithm and the backward-forward method for transition as well as emission states are used for training process. Furthermore, Viterbi and Baum-Welch algorithms are used for the testing process. The correlation coefficient (CC) was used as performance indicator, as the ratio of the estimated state in the output and the original state in the input of the model. RESULTS The simulation results has shown that the CC values depend on the given of the backward-forward transition state values randomly. The model with state number 9 showed the highest average of CC values of 0.7289 for Viterbi algorithm, and is 0.7166 for Baum-Welch algorithm. However, the lowest average of CC values has been found for the model with state number five. Its values are 0.6735 by using Viterbi algorithm and 0.6661 by using Baum-Welch algorithm. CONCLUSION The new structure model based on HMM system was valid to predict exon on DNA-genes of Plasmodium falciparum. Keywords: Exon Prediction, DNA-gene, coding sequence, Hidden Markov Model 130 Agoes, Gunawan, Sardy, et al Exon prediction by using HMMExon prediction by using HMMExon prediction by using HMMExon prediction by using HMMExon prediction by using HMM Prediksi ekson DNA-gen Plasmodium falciparum berdasarkan struktur coding sequence dengan menggunakan model hidden Markov Suhartati Agoes*a, Dadang Gunawan**, Sardy S**, dan Hoedojo*** LATAR BELAKANG Sebuah hidden Markov model (HMM) yang digunakan untuk memprediksi ekson gen DNA Palsmodium falciparum mempunyai struktur model berdasarkan struktur gen DNA pada coding sequence (CDS). Penelitian ini bertujuan untuk mengembangkan sebuah model stuktur baru untuk prediksi ekson gen DNA Plasmodium falciparum berdasarkan struktur CDS dengan menggunakan sistem HMM. METODE Rancangan model pada CDS, di antara dua lokasi ekson dapat diketahui sebuah lokasi intron dan jumlah state model dapat dilakukan pada lokasi tersebut. Jumlah state dilakukan dengan memisahkan codon start dari ekson pertama dan codon stop dari ekson terakhir hingga mencapai 9. Algoritma Viterbi dan metode backward-forward transisi state serta emisi state digunakan untuk proses training. Sedangkan untuk proses testing menggunakan algoritma Viterbi dan Baum-Welch. Sebagai kinerja model digunakan Correlation Coefficient (CC) yang didapat dari perbandingan state estimasi pada output dan state asli pada input model. HASIL Hasil simulasi menunjukkan bahwa nilai CC tergantung pada pemberian nilai acak state transisi backward-forward. Model dengan jumlah state 9, menunjukkan nilai CC rata-rata tertinggi adalah 0,7289 untuk algoritma Viterbi dan 0,7166 untuk algoritma Baum-Welch. KESIMPULAN Struktur model berdasarkan sistem HMM sahih untuk memprediksi ekson gen DNA Plamodium falciparum Kata kunci : Prediksi ekson, gen-DNA, coding sequence, model hidden Markov * Jurusan Teknik Elektro Fakultas Teknologi Industri Universitas Trisakti ** Jurusan Teknik Elektro, Fakultas Teknik Universitas Indonesia *** Bagian Parasitologi Fakultas Kedokteran Universitas Trisakti Korespondensi aIr. Suhartati Agoes, MT Jurusan Teknik Elektro Fakultas Teknologi Industri Universitas Trisakti Jl. Kyai Tapa No.1, Grogol Jakarta 11440 Telp. 08164836613 Email: suhartati_agoes@yahoo.com Universa Medicina 2007; 26: 129-36. ABSTRAK INTRODUCTION I n c o d i n g s e q u e n c e ( C D S ) , t h e e x o n prediction on DNA-gene is very important to find the protein structure. Genome P. falciparum belongs to genome eukaryotic and has a long DNA genome, consisting of several exons and introns alternately located. After the splicing process in DNA sequence, some regions of exon in CDS will be found to produce the protein.(1) 131 T h e h i d d e n M a r k o v m o d e l ( H M M ) structure is one of the model used to predict the exon in desoxyribonucleai acid (DNA) which is based on the exon region structure in CDS. The HMM structure of the model: start codon (Methyanine/ATG), exon, intron and stop c o d o n ( TA A , o r T G A , o r TA G ) h a s b e e n d e m o n s t r a t e d b y R a b i n e r, H e n d e r s o n a n d Krogh.(2-4) This research developed a new model of structures for exon prediction on DNA of genes Plasmodium falciparum was based on the work of Nicorici and Anantharaman.(5,6) The significant difference of this model is it at least has two exon regions used for exon prediction like in CDS. Usually the 5’ boundary of introns in most eukaryotes contains the dinucleotide Guanine-Thymine (GT), and the 3’ boundary contains the dinucleotide Adenine-Guanine (AG). Furthermore, in intron regions of the dinucleotide GT were separated to be nucleotide Guanine (G) and Thymine (T), and dinucleotide AG to be nucleotide A and G. METHOD Research design The HMM framework in this experiment used Viterbi algorithm for the training process and both Viterbi and Baum-Welch algorithms for the testing process. The same sequence has been done for the training and testing process. The program is written in Matlab 7.0, and Bio- i n f o r m a t i c s ’ t o o l b o x t o g e n e r a t e D N A sequences in Genbank format and has the functions of HMM training and testing. The first experiment for the model based on CDS has been done with separated start codon (codon: ATG) from first exon and stop codon (codon: TAA, or TAG, or TGA) from the last exon, its model has the state number of 5 (Figure 1). Furthermore, the second experiment u s e d s e p a r a t e d d i n u c l e o t i d e G T a n d dinucleotide AG from the intron region, its model has the state number of 7 (Figure 2). The last experiment in the model also used separated d i n u c l e o t i d e G T i n t o G a n d T s t a t e s a n d dinucleotide AG into A and G states; the state number of the model is 9 (Figure 3). The emission values of the models can be performed a s t h e m a t r i x , t h e c o l u m n s a r e t h e b a s e s numbered on DNA sequences and the rows are the states number of the models designed. The state transition values of the models can be shown also in the figures. Samples A genome sequence of P. falciparum using 3D7 clone has 23-megabase nuclear genomes consisting of 14 chromosomes, encodes about 5,300 genes, and it has the most (A+T)-rich genome sequenced to date.(8-10) The data set of experiments for this simulation has 152 DNA sequences of genes from genome Plasmodium f a l c i p a r u m a t : c h r o m o s o m e 1 ( L o c u s : N C _ 0 0 4 3 2 5 ) , c h r o m o s o m e 2 ( L o c u s : N C _ 0 0 0 9 1 0 ) , c h r o m o s o m e 3 ( L o c u s : N C _ 0 0 0 5 2 1 ) , c h r o m o s o m e 4 ( L o c u s : N C _ 0 0 4 3 1 8 ) , c h r o m o s o m e 5 ( L o c u s : N C _ 0 0 4 3 2 6 ) , a n d c h r o m o s o m e 9 ( L o c u s : NC_004330) in Genbank format.(11,12) Genbank format describes the CDS and original DNA sequences of genes P. falciparum. In CDS of genes contains at least two exon regions and maximum 10 exon regions. The minimum length sequence of genes is 684 base pairs (bp) and maximum length is 10095 bp. HMM provides a good probability method for discrete sequences model of data like DNA sequences (alphabet of four letters: A, C, G and T). Prediction of exon in DNA of genes P. falciparum is based on exon region in CDS and it has at least two exon regions. The structure of the model based on CDS structure as shown in Figure 4, which separated is start codon from the first exon and separated stop codon from the last exon. Universa Medicina Vol. 26 No.3 132 Agoes, Gunawan, Sardy, et al Exon prediction by using HMMExon prediction by using HMMExon prediction by using HMMExon prediction by using HMMExon prediction by using HMM In the inputs are DNA sequences and set the state number of the sequences depends on the structure model. Based on HMM method, the Viterbi algorithm used for the training and needs the transition state values and emission values for the process. The value of transition state is random and the value of emission can be found with the distribution number of DNA nucleotide in each states. The result of HMM training is the estimated transition state and emission state values. The estimated transition and emission values were used for HMM testing p r o c e s s w i t h Vi t e r b i a n d B a u m - We l c h algorithms. The results of HMM testing process are the estimated states of the model.(2) The parameter of performance indicator used the Figure 3. The HMM structure with 9 states Exon Intron ExonStart Stop Figure 4. The HMM structure based on CDS 9 Start E G T I A G 1 2 3 4 5 6 7 E Stop 8 1 1 0.9 0.1 0.9 0.9 0.1 0.1 0.9 0.9 0.1 0.1 0.9 0.1 0.8 0.1 0.1 5 0 .9 1 E IS ta rt S to p 2 3 4 1 0 .5 0 .5 0 .1 0 .8 1 0 .1 9 1 1 0 -6E Figure 1. The HMM structure with 5 states S tart E G T I A G E Stop 1 2 3 4 5 6 7 1 10.85 0.1 0.05 0.7 0.3 0.9 0.1 0.7 0.9 0.1 0.3 Figure 2. The HMM structure with 7 states 133 correlation coefficient (CC), as a ratio of the estimated states from the output and the original states from the input of the model(1,7) and the formula is the following: Where: TP= True Positive; FP= False Positive; TN= True Negative; FN= False Negative Statistical analysis The emission matrix of the model with state number 5 is: e = [ 0.3333 0 0.3333 0.3333 ; 0.4342 0.1094 0.1507 0.3056 ; 0.4210 0.0599 0.0703 0.4487 ; 0.4215 0.1173 0.1622 0.2990 ; 0.5592 0 0.1075 0.3333 ] The emission matrix of the model with state number 7 is: e = [ 0.3333 0 0.3333 0.3333 ; 0.4342 0.1094 0.1507 0.3056 ; 0 0 0.5000 0.5000 ; 0.4210 0.0599 0.0703 0.4487 ; 0.5000 0 0.5000 0 ; 0.4215 0.1173 0.1622 0.2990 ; 0.5592 0 0.1075 0.3333 ] The emission matrix of the model with state number 9 is: e = [ 0.3333 0 0.3333 0.3333 ; 0.4342 0.1094 0.1507 0.3056 ; 0 0 1.0000 0 ; 0 0 0 1.0000 ; 0.4210 0.0599 0.0703 0.4487 ; 1.0000 0 0 0 ; 0 0 1.0000 0 ; 0.4215 0.1173 0.1622 0.2990 ; 0.5592 0 0.1075 0.3333 ] FN)FP).(TNFP).(TPFN).(TN(TP (FP.FN)-(TP.TN) CC ++++ = The transition state values are randomly, but the first state on the model has the minimum transition state value of 0 and the last state has the maximum transition state value of 1. On the other hand, the emission state values were found from the distribution number of DNA nucleotide i n e a c h s t a t e d e p e n d i n g o n t h e m o d e l s . Calculation of the CC is by using the above equation with the assumptions that the exon is positive and intron is negative. RESULTS Based on exon regions structure in CDS, the simulation results for the model with state number 5 above and transition state values randomly has found the average of CC value in Table 1. Table 2 and Table 3 shows the average of CC values of the model with state number 7 and state number 9. The transition state value in the first state of all structures on the models is 0 and the last state is 1. The other transition state values except in Table 2 are 0.1 and the h i g h e s t a v e r a g e o f C C v a l u e s f o r t h i s experiments are shown in Figure 5. The model based on exon regions structure in CDS for state number nine resulted in the highest average of CC. Its values are 0.7289 by using Viterbi algorithm and 0.7166 by using Baum-Welch algorithm. However, the lowest average of CC values has been found for the model with state number five. Its values are 0.6735 by using Viterbi algorithm and 0.6661 by using Baum-Welch algorithm. DISCUSSION All these were based on the model of exon regions structure in CDS, at the state number nine resulted in the highest average of CC values. It is also shown that the emission state value of t h e m o d e l s h a v e t h e ( A + T ) - r i c h g e n o m e sequences of genes Plasmodium falciparum. Universa Medicina Vol. 26 No.3 134 Agoes, Gunawan, Sardy, et al Exon prediction by using HMMExon prediction by using HMMExon prediction by using HMMExon prediction by using HMMExon prediction by using HMM Transition state values Corr.Coef.(CC) Exps tr11 tr22 tr33 tr44 tr55 Viterbi B-Welch 1 0 0.5 0.9 0.80 1 0.7054 0.6750 2 0 0.5 0.9 0.81 1 0.7061 0.6759 3 0 0.5 0.9 0.82 1 0.7054 0.6750 4 0 0.5 0.9 0.83 1 0.7054 0.6750 5 0 0.5 0.9 0.84 1 0.7045 0.6741 6 0 0.5 0.9 0.85 1 0.7054 0.6750 7 0 0.5 0.9 0.86 1 0.7062 0.6744 8 0 0.5 0.9 0.87 1 0.0012 0.0015 9 0 0.5 0.9 0.88 1 0.0011 0.0013 10 0 0.5 0.9 0.89 1 0.0011 0.0013 Table 1. The average of CC values for the model with state number 5 Transition state values Corr.Coef.(CC) Exps tr11 tr22 tr33 tr44 tr55 tr66 tr77 Viterbi B-Welch 1 0 0.4 0.2 0.8 0.10 0.7 1 0.6713 0.6643 2 0 0.4 0.2 0.8 0.20 0.7 1 0.6713 0.6643 3 0 0.4 0.2 0.8 0.30 0.7 1 0.6733 0.6660 4 0 0.4 0.2 0.8 0.40 0.7 1 0.6731 0.6660 5 0 0.4 0.2 0.8 0.50 0.7 1 0.6735 0.6661 6 0 0.4 0.2 0.8 0.60 0.7 1 -0.0642 -0.0729 7 0 0.4 0.2 0.8 0.59 0.7 1 -0.0568 -0.0639 8 0 0.4 0.2 0.8 0.58 0.7 1 0.6735 0.6661 9 0 0.4 0.2 0.8 0.56 0.7 1 0.6735 0.6661 10 0 0.4 0.2 0.8 0.55 0.7 1 0.6735 0.6661 Table 2. The average of CC values for the model with state number 7 Transition state values Corr.Coef.(CC) Exps tr11 tr22 tr33 tr44 tr55 tr66 tr77 tr88 tr99 Viterbi B-Welch 1 0 0.3 0.2 0.2 0.80 0.10 0.10 0.70 1 0.7242 0.7124 2 0 0.3 0.1 0.1 0.79 0.10 0.10 0.70 1 0.7242 0.7124 3 0 0.3 0.1 0.1 0.78 0.09 0.09 0.70 1 0.7242 0.7124 4 0 0.3 0.1 0.1 0.78 0.20 0.20 0.70 1 0.7258 0.7120 5 0 0.3 0.1 0.1 0.80 0.20 0.20 0.80 1 0.7258 0.7120 6 0 0.3 0.1 0.1 0.80 0.15 0.15 0.74 1 0.7258 0.7120 7 0 0.3 0.1 0.1 0.80 0.15 0.15 0.80 1 0.7258 0.7120 8 0 0.3 0.1 0.1 0.80 0.15 0.15 0.70 1 0.7258 0.7120 9 0 0.8 0.2 0.2 0.80 0.10 0.10 0.70 1 0.6787 0.6755 10 0 0.9 0.2 0.2 0.80 0.10 0.10 0.70 1 0.0451 0.0437 Table 3. The average of CC values for the model with state number 9 135 This can also be showed in their matrix values described above. In order to match with Markov’s chain in eukaryotic gene structure for c o n n e c t i n g t h e e x o n a n d i n t r o n , t h e n t h e backward-forward method are used in this HMM system. The experiments were done by Viterbi algorithm for the training process and the both Viterbi and Baum-Welch algorithms for the testing process. A set of 152 sequences is then used as data to train the model and a different HMM is produced for each set of sequences. HMM have been previously used in sequence analysis to produce an HMM that represents a sequence profile (a profile HMM), to analyze sequence composition and patterns, to locate genes by p r e d i c t i n g o p e n r e a d i n g f r a m e o r c o d i n g sequence, and to produce protein structure predictions. This HMM generates sequences 0.62 0.64 0.66 0.68 0.7 0.72 0.74 C orrelation C oefficient (C C ) T he highest average of CC values Viterbi 0.7061 0.6735 0.7258 Ba um-We lch 0.6759 0.6661 0.712 5 7 9 w i t h v a r i o u s c o m b i n a t i o n s o f m a t c h e d , mismatched, insertions and deletions, and gives these a probability, depending on the values of the various parameters in the model. A total of 424 predicted protein-coding g e n e s a n d 3 t R N A g e n e s o f P l a s m o d i u m falciparum have been identified on chromosomes 2 and 3, the two chromosomes for which s e q u e n c i n g h a v e b e e n c o m p l e t e d . ( 1 3 , 1 4 ) Approximately 37% (158 genes) of three genes have a readily identifiable homologue in another species. Such similarity allows a function to be implied, with many of those identified being involved in parasite metabolism. Interestingly, a comparison of the predicted protein sequences with their homologues in other species showed that the majority of Plasmodium falciparum proteins have insertion of low complexity sequences, often runs of a single amino acid Figure 5. The highest average of CC values of the experiments Universa Medicina Vol. 26 No.3 136 Agoes, Gunawan, Sardy, et al Exon prediction by using HMMExon prediction by using HMMExon prediction by using HMMExon prediction by using HMMExon prediction by using HMM r e s i d u e ( t y p i c a l l y a s p a r a g i n e s , l y s i n e o r glutamine acid), or tandem arrays of a short peptide repeat sequence. Examples of such regions have been identified previously, and are polymorphic between different parasite isolates. Perhaps it will open up new avenues for drug and vaccine development. Perhaps it will make a genetically modified mosquito that cannot act as vector a reality once processes are in place to deal with ethical, legal and social issues. But any new developments are unlikely to happen before the 2010 deadline set for Roll Back Malaria. New drugs based on knowledge about the genome will probably not be affordable to those countries that need it the most.(15) CONCLUSION This research with a new structure model was using for exon prediction on DNA-genes of Plasmodium falciparum based on CDS structure by using the HMM system. ACKNOWLEDGMENT T h e a u t h o r s w o u l d l i k e t o t h a n k d r. Herawati Sudoyo, PhD, dr. Helena MS and Hidayat Trimarsanto B.Sc from the Eijkman Institute, for their valuable suggestions. References 1. Samatova N.F. Computational gene finding using HMMs. UT-Battelle Information Center, 1201 Oak Ridge Turnpike, Suite 100, Oak Ridge, TN 37830. Institute Oak Ridge National Laboratory; 2003. 2. Lawrence R. A tutorial on hidden Markov models and selected applications in speech recognition. Proceeding of The IEEE. 1989; 77: 257-68. 3. Henderson J, Salzberg S, Fasman K.H. Finding genes in DNA with a hidden Markov model. J Comput Biol 1997; 4: 127-42. 4. Krogh A. An introduction to hidden Markov models for biological sequences. In Computational methods in molecular biology. Salzberg SL, Searls DB, Kasif S. editors. Denmark. Center for Biological Sequence Analysis, Technical University of Denmark. 1998. p. 45-63. 5. Nicorici D, Astola J, Tobus I. Computational identification of exons in DNA with a hidden Markov model. Tampere International Center for Signal Processing, Tampere University of Technology, Finland. Available at: http:// www.gensips.gatech.edu/processings/contributed/ CP2-06.pdf. Accessed May 12, 2005. 6. Anantharaman T. Finding genes in genomic DNA The GENESCAN System. Available at : http:// www.biostat.wisc.edu/bmi776. Accessed June 6, 2005. 7. Vaisman I. Bioinformatics and gene discovery. Bioinformatics Tutorial. North Carolina, United State: University of North Carolina at Chapel Hill. 1998. 8. Hall N, Gardner M.J, Hyman RW, Lasonder E, Wilson RJM, Scherf A, et al. Sequence of plasmodium falciparum chromosomes 1, 3-9 and 13. Nature 2002; 419: 527-31. 9. Gardner M.J, Hall N, Hyman RW, Hinterberg K, Mattei D, Wellem TE, et al. Sequence of plasmodium falciparum chromosomes 2, 10, 11 and 14. Nature 2002; 419: 531-4. 10. Hyman RW, Gardner M.J, Hall N, De Bruin D, Scherf A, Day KP, et al. Sequence of plasmodium falciparum chromosome 12. Nature 2002; 419: 534-7. 11. Anastassiou D. Genomic signal processing. IEEE Signal Processing Magazine 2001; 18: 4. 12. Alphey L. DNA sequencing. Manchester UK : Bios Scientific Publishers Limited; 1997. 13. Gardner M.J, Hall N, Fung E, White O, Berriman M, Hyman RW, et al. Chromosome 2 sequence of the human malaria parasite Plasmodium falciparum. Science 1998; 282: 1126-32. 14. Bowman S, Lawson D, Basham D, Brown D, Chillingworth T, Churcher C. M, et al. The complete nucleotide sequence of chromosome 3 of Plasmodium falciparum. Nature 1999; 400: 532- 8. 15. Anonymous. Malaria after the genomes. The Lancet 2002; 360: 1107.