Adi Priyana


129

Exon prediction on DNA-genes of Plasmodium
falciparum based on coding sequence structure using

hidden Markov model

Suhartati Agoes*a, Dadang Gunawan**, Sardy S**, and Hoedojo***

UNIVERSA MEDICINA
Juli-September 2007Juli-September 2007Juli-September 2007Juli-September 2007Juli-September 2007    Vol.26 - No.3   Vol.26 - No.3   Vol.26 - No.3   Vol.26 - No.3   Vol.26 - No.3

ABSTRACT

* Electrical Engineering
Department, Faculty of
Industrial Technology,
Trisakti University
** Electrical Engineering
Department, Faculty of
Engineering,
Indonesia University
*** Department of
Parasitology, Medical Faculty,
Trisakti University

Korespondensi
aIr. Suhartati Agoes, MT
Electrical Engineering
Department, Faculty of
Industrial Technology,
Trisakti University
Jl. Kyai Tapa No.1, Grogol
Jakarta 11440
Telp. 08164836613
Email:
suhartati_agoes@yahoo.com

Universa Medicina 2007; 26: 129-
36.

BACKGROUND
A hidden Markov model (HMM) is used for exon prediction on DNA of genes
Plasmodium falciparum that has a model structure based on exon region structure
in coding sequence (CDS). The objective research was to develop a new structure
model to predict exon on DNA-genes of Plasmodium falciparum based on CDS
structure using the HMM system.

METHODS
Model design in CDS, between two exon regions can be found one intron region
and the model state number is used for its region. Its state number is used by
separating start codon from first exon region and stop codon from the last exon
region up to 9. The Viterbi algorithm and the backward-forward method for
transition as well as emission states are used for training process. Furthermore,
Viterbi and Baum-Welch algorithms are used for the testing process. The
correlation coefficient (CC) was used as performance indicator, as the ratio of
the estimated state in the output and the original state in the input of the model.

RESULTS
The simulation results has shown that the CC values depend on the given of the
backward-forward transition state values randomly. The model with state number
9 showed the highest average of CC values of 0.7289 for Viterbi algorithm, and
is 0.7166 for Baum-Welch algorithm. However, the lowest average of CC values
has been found for the model with state number five. Its values are 0.6735 by
using Viterbi algorithm and 0.6661 by using Baum-Welch algorithm.

CONCLUSION
The new structure model based on HMM system was valid to predict exon on
DNA-genes of Plasmodium falciparum.

Keywords: Exon Prediction, DNA-gene, coding sequence, Hidden Markov Model


130

Agoes, Gunawan, Sardy, et al                                                                                                         Exon prediction by using HMMExon prediction by using HMMExon prediction by using HMMExon prediction by using HMMExon prediction by using HMM

Prediksi ekson DNA-gen Plasmodium falciparum berdasarkan
struktur coding sequence dengan menggunakan

model hidden Markov

Suhartati Agoes*a, Dadang Gunawan**, Sardy S**, dan Hoedojo***

LATAR BELAKANG
Sebuah hidden Markov model (HMM) yang digunakan untuk memprediksi ekson
gen DNA Palsmodium falciparum mempunyai struktur model berdasarkan struktur
gen DNA pada coding sequence (CDS). Penelitian ini bertujuan untuk mengembangkan
sebuah model stuktur baru untuk prediksi ekson gen DNA Plasmodium falciparum
berdasarkan struktur CDS dengan menggunakan sistem HMM.

METODE
Rancangan model pada CDS, di antara dua lokasi ekson dapat diketahui sebuah lokasi
intron dan jumlah state model dapat dilakukan pada lokasi tersebut. Jumlah state
dilakukan dengan memisahkan codon start dari ekson pertama dan codon stop dari
ekson terakhir hingga mencapai 9. Algoritma Viterbi dan metode backward-forward
transisi state serta emisi state digunakan untuk proses training. Sedangkan untuk
proses testing menggunakan algoritma Viterbi dan Baum-Welch. Sebagai kinerja model
digunakan Correlation Coefficient (CC) yang didapat dari perbandingan state estimasi
pada output dan state asli pada input model.

HASIL
Hasil simulasi menunjukkan bahwa nilai CC tergantung pada pemberian nilai acak
state transisi backward-forward. Model dengan jumlah state 9, menunjukkan nilai
CC rata-rata tertinggi adalah 0,7289 untuk algoritma Viterbi dan 0,7166 untuk
algoritma Baum-Welch.

KESIMPULAN
Struktur model berdasarkan sistem HMM sahih untuk memprediksi ekson gen DNA
Plamodium falciparum

Kata kunci : Prediksi ekson, gen-DNA, coding sequence, model hidden Markov

* Jurusan Teknik Elektro
Fakultas Teknologi Industri
Universitas Trisakti
** Jurusan Teknik Elektro,
Fakultas Teknik
Universitas Indonesia
*** Bagian Parasitologi
Fakultas Kedokteran
Universitas Trisakti

Korespondensi
aIr. Suhartati Agoes, MT
Jurusan Teknik Elektro
Fakultas Teknologi Industri
Universitas Trisakti
Jl. Kyai Tapa No.1, Grogol
Jakarta 11440
Telp. 08164836613
Email:
suhartati_agoes@yahoo.com

Universa Medicina 2007; 26: 129-36.

ABSTRAK

INTRODUCTION

I n  c o d i n g  s e q u e n c e  ( C D S ) ,  t h e  e x o n
prediction on DNA-gene is very important to find
the protein structure. Genome P. falciparum

belongs to genome eukaryotic and has a long
DNA genome, consisting of several exons and
introns alternately located. After the splicing
process in DNA sequence, some regions of exon
in CDS will be found to produce the protein.(1)


131

T h e  h i d d e n  M a r k o v  m o d e l  ( H M M )
structure is one of the model used to predict
the exon in desoxyribonucleai acid (DNA)
which is based on the exon region structure in
CDS. The HMM structure of the model: start
codon (Methyanine/ATG), exon, intron and stop
c o d o n  ( TA A ,  o r  T G A ,  o r  TA G )  h a s  b e e n
d e m o n s t r a t e d  b y  R a b i n e r,  H e n d e r s o n  a n d
Krogh.(2-4) This research developed a new model
of structures for exon prediction on DNA of
genes Plasmodium falciparum was based on the
work of Nicorici and Anantharaman.(5,6) The
significant difference of this model is it at least
has two exon regions used for exon prediction
like in CDS. Usually the 5’ boundary of introns
in most eukaryotes contains the dinucleotide
Guanine-Thymine (GT), and the 3’ boundary
contains the dinucleotide Adenine-Guanine
(AG). Furthermore, in intron regions of the
dinucleotide GT were separated to be nucleotide
Guanine (G) and Thymine (T), and dinucleotide
AG to be nucleotide A and G.

METHOD

Research design
The HMM framework in this experiment

used Viterbi algorithm for the training process
and both Viterbi and Baum-Welch algorithms
for the testing process. The same sequence has
been done for the training and testing process.
The program is written in Matlab 7.0, and Bio-
i n f o r m a t i c s ’  t o o l b o x  t o  g e n e r a t e  D N A
sequences in Genbank format and has the
functions of HMM training and testing.

The first experiment for the model based
on CDS has been done with separated start
codon (codon: ATG) from first exon and stop
codon (codon: TAA, or TAG, or TGA) from the
last exon, its model has the state number of 5
(Figure 1). Furthermore, the second experiment
u s e d  s e p a r a t e d  d i n u c l e o t i d e  G T  a n d
dinucleotide AG from the intron region, its

model has the state number of 7 (Figure 2). The
last experiment in the model also used separated
d i n u c l e o t i d e  G T i n t o  G  a n d  T s t a t e s  a n d
dinucleotide AG into A and G states; the state
number of the model is 9 (Figure 3). The
emission values of the models can be performed
a s  t h e  m a t r i x ,  t h e  c o l u m n s  a r e  t h e  b a s e s
numbered on DNA sequences and the rows are
the states number of the models designed. The
state transition values of the models can be
shown also in the figures.

Samples
A genome sequence of P. falciparum using

3D7 clone has 23-megabase nuclear genomes
consisting of 14 chromosomes, encodes about
5,300 genes, and it has the most (A+T)-rich
genome sequenced to date.(8-10) The data set of
experiments for this simulation has 152 DNA
sequences of genes from genome Plasmodium
f a l c i p a r u m  a t :  c h r o m o s o m e  1  ( L o c u s :
N C _ 0 0 4 3 2 5 ) ,  c h r o m o s o m e  2  ( L o c u s :
N C _ 0 0 0 9 1 0 ) ,  c h r o m o s o m e  3  ( L o c u s :
N C _ 0 0 0 5 2 1 ) ,  c h r o m o s o m e  4  ( L o c u s :
N C _ 0 0 4 3 1 8 ) ,  c h r o m o s o m e  5  ( L o c u s :
N C _ 0 0 4 3 2 6 ) ,  a n d  c h r o m o s o m e  9  ( L o c u s :
NC_004330) in Genbank format.(11,12) Genbank
format describes the CDS and original DNA
sequences of genes P. falciparum. In CDS of
genes contains at least two exon regions and
maximum 10 exon regions. The minimum length
sequence of genes is 684 base pairs (bp) and
maximum length is 10095 bp.

HMM provides a good probability method
for discrete sequences model of data like DNA
sequences (alphabet of four letters: A, C, G and
T). Prediction of exon in DNA of genes P.
falciparum is based on exon region in CDS and
it has at least two exon regions. The structure
of the model based on CDS structure as shown
in Figure 4, which separated is start codon from
the first exon and separated stop codon from
the last exon.

Universa Medicina                                                                                                                                                              Vol. 26 No.3


132

Agoes, Gunawan, Sardy, et al                                                                                                         Exon prediction by using HMMExon prediction by using HMMExon prediction by using HMMExon prediction by using HMMExon prediction by using HMM

In the inputs are DNA sequences and set
the state number of the sequences depends on
the structure model. Based on HMM method,
the Viterbi algorithm used for the training and
needs the transition state values and emission
values for the process. The value of transition
state is random and the value of emission can
be found with the distribution number of DNA

nucleotide in each states. The result of HMM
training is the estimated transition state and
emission state values. The estimated transition
and emission values were used for HMM testing
p r o c e s s  w i t h  Vi t e r b i  a n d  B a u m - We l c h
algorithms. The results of HMM testing process
are the estimated states of the model.(2) The
parameter of performance indicator used the

Figure 3. The HMM structure with 9 states

Exon Intron ExonStart Stop 

Figure 4. The HMM structure based on CDS

  9

Start E G T I A G

1    2 3 4 5 6 7

E Stop

  8

1

1

0.9

  0.1

0.9 0.9

  0.1    0.1

0.9   0.9

   0.1 0.1

0.9

   0.1

0.8

   0.1

0.1

5

0 .9

1

E IS ta rt S to p

2 3 4

1

0 .5

0 .5 0 .1

      0 .8 1

0 .1 9

 1

1 0 -6E

Figure 1. The HMM structure with 5 states

S tart E G T I A G E Stop

1    2 3 4 5 6 7

1

10.85

0.1

0.05

0.7

0.3

0.9

0.1

0.7

    
0.9

 0.1 0.3

Figure 2. The HMM structure with 7 states


133

correlation coefficient (CC), as a ratio of the
estimated states from the output and the original
states from the input of the model(1,7) and the
formula is the following:

Where: TP= True Positive; FP= False Positive;
TN= True Negative; FN= False Negative

Statistical analysis

The emission matrix of the model with state
number 5 is:
e = [ 0.3333          0        0.3333     0.3333 ;

0.4342     0.1094     0.1507     0.3056 ;
0.4210     0.0599     0.0703     0.4487 ;
0.4215     0.1173     0.1622     0.2990 ;
0.5592          0        0.1075     0.3333 ]

The emission matrix of the model with state
number 7 is:
e = [ 0.3333         0         0.3333     0.3333 ;

0.4342     0.1094     0.1507     0.3056 ;
           0             0         0.5000     0.5000 ;

0.4210     0.0599     0.0703     0.4487 ;
       0.5000          0         0.5000          0   ;

0.4215     0.1173     0.1622     0.2990 ;
0.5592          0         0.1075     0.3333 ]

The emission matrix of the model with state
number 9 is:
e = [ 0.3333         0         0.3333     0.3333 ;

0.4342     0.1094     0.1507     0.3056 ;
           0             0         1.0000          0    ;
           0             0              0         1.0000 ;

0.4210     0.0599     0.0703     0.4487  ;
       1.0000          0              0              0  ;
           0             0         1.0000          0    ;

0.4215     0.1173     0.1622     0.2990 ;
0.5592          0         0.1075     0.3333 ]

FN)FP).(TNFP).(TPFN).(TN(TP
(FP.FN)-(TP.TN)

CC
++++

=

The transition state values are randomly,
but the first state on the model has the minimum
transition state value of 0 and the last state has
the maximum transition state value of 1. On the
other hand, the emission state values were found
from the distribution number of DNA nucleotide
i n  e a c h  s t a t e  d e p e n d i n g  o n  t h e  m o d e l s .
Calculation of the CC is by using the above
equation with the assumptions that the exon is
positive and intron is negative.

RESULTS

Based on exon regions structure in CDS,
the simulation results for the model with state
number 5 above and transition state values
randomly has found the average of CC value in
Table 1. Table 2 and Table 3 shows the average
of CC values of the model with state number 7
and state number 9. The transition state value
in the first state of all structures on the models
is 0 and the last state is 1. The other transition
state values except in Table 2 are 0.1 and the
h i g h e s t  a v e r a g e  o f  C C  v a l u e s  f o r  t h i s
experiments are shown in Figure 5.

The model based on exon regions structure
in CDS for state number nine resulted in the
highest average of CC. Its values are 0.7289 by
using Viterbi algorithm and 0.7166 by using
Baum-Welch algorithm. However, the lowest
average of CC values has been found for the
model with state number five. Its values are
0.6735 by using Viterbi algorithm and 0.6661
by using Baum-Welch algorithm.

DISCUSSION

All these were based on the model of exon
regions structure in CDS, at the state number
nine resulted in the highest average of CC values.
It is also shown that the emission state value of
t h e  m o d e l s  h a v e  t h e  ( A + T ) - r i c h  g e n o m e
sequences of genes Plasmodium falciparum.

Universa Medicina                                                                                                                                                              Vol. 26 No.3


134

Agoes, Gunawan, Sardy, et al                                                                                                         Exon prediction by using HMMExon prediction by using HMMExon prediction by using HMMExon prediction by using HMMExon prediction by using HMM

Transition state values Corr.Coef.(CC) 
Exps 

tr11 tr22 tr33 tr44 tr55 Viterbi B-Welch 
1 0 0.5 0.9 0.80 1 0.7054 0.6750 
2 0 0.5 0.9 0.81 1 0.7061 0.6759 
3 0 0.5 0.9 0.82 1 0.7054 0.6750 
4 0 0.5 0.9 0.83 1 0.7054 0.6750 
5 0 0.5 0.9 0.84 1 0.7045 0.6741 
6 0 0.5 0.9 0.85 1 0.7054 0.6750 
7 0 0.5 0.9 0.86 1 0.7062 0.6744 
8 0 0.5 0.9 0.87 1 0.0012 0.0015 
9 0 0.5 0.9 0.88 1 0.0011 0.0013 
10 0 0.5 0.9 0.89 1 0.0011 0.0013 

 
Table 1. The average of CC values for the model with state number 5

Transition state values Corr.Coef.(CC) 
Exps 

tr11 tr22 tr33 tr44 tr55 tr66 tr77 Viterbi B-Welch 
1 0 0.4 0.2 0.8 0.10 0.7 1 0.6713 0.6643 
2 0 0.4 0.2 0.8 0.20 0.7 1 0.6713 0.6643 
3 0 0.4 0.2 0.8 0.30 0.7 1 0.6733 0.6660 
4 0 0.4 0.2 0.8 0.40 0.7 1 0.6731 0.6660 
5 0 0.4 0.2 0.8 0.50 0.7 1 0.6735 0.6661 
6 0 0.4 0.2 0.8 0.60 0.7 1 -0.0642 -0.0729 
7 0 0.4 0.2 0.8 0.59 0.7 1 -0.0568 -0.0639 
8 0 0.4 0.2 0.8 0.58 0.7 1 0.6735 0.6661 
9 0 0.4 0.2 0.8 0.56 0.7 1 0.6735 0.6661 

10 0 0.4 0.2 0.8 0.55 0.7 1 0.6735 0.6661 
 

Table 2. The average of CC values for the model with state number 7

Transition state values Corr.Coef.(CC) 
Exps 

tr11 tr22 tr33 tr44 tr55 tr66 tr77 tr88 tr99 Viterbi B-Welch 
1 0 0.3 0.2 0.2 0.80 0.10 0.10 0.70 1 0.7242 0.7124 
2 0 0.3 0.1 0.1 0.79 0.10 0.10 0.70 1 0.7242 0.7124 
3 0 0.3 0.1 0.1 0.78 0.09 0.09 0.70 1 0.7242 0.7124 
4 0 0.3 0.1 0.1 0.78 0.20 0.20 0.70 1 0.7258 0.7120 
5 0 0.3 0.1 0.1 0.80 0.20 0.20 0.80 1 0.7258 0.7120 
6 0 0.3 0.1 0.1 0.80 0.15 0.15 0.74 1 0.7258 0.7120 
7 0 0.3 0.1 0.1 0.80 0.15 0.15 0.80 1 0.7258 0.7120 
8 0 0.3 0.1 0.1 0.80 0.15 0.15 0.70 1 0.7258 0.7120 
9 0 0.8 0.2 0.2 0.80 0.10 0.10 0.70 1 0.6787 0.6755 
10 0 0.9 0.2 0.2 0.80 0.10 0.10 0.70 1 0.0451 0.0437 

 
Table 3. The average of CC values for the model with state number 9


135

This can also be showed in their matrix
values described above. In order to match with
Markov’s chain in eukaryotic gene structure for
c o n n e c t i n g  t h e  e x o n  a n d  i n t r o n ,  t h e n  t h e
backward-forward method are used in this HMM
system. The experiments were done by Viterbi
algorithm for the training process and the both
Viterbi and Baum-Welch algorithms for the
testing process.

A set of 152 sequences is then used as data
to train the model and a different HMM is
produced for each set of sequences. HMM have
been previously used in sequence analysis to
produce an HMM that represents a sequence
profile (a profile HMM), to analyze sequence
composition and patterns, to locate genes by
p r e d i c t i n g  o p e n  r e a d i n g  f r a m e  o r  c o d i n g
sequence, and to produce protein structure
predictions. This HMM generates sequences

0.62

0.64

0.66

0.68

0.7

0.72

0.74

C
orrelation C

oefficient (C
C

)

T he highest average of CC values

Viterbi 0.7061 0.6735 0.7258

Ba um-We lch 0.6759 0.6661 0.712

5 7 9

w i t h  v a r i o u s  c o m b i n a t i o n s  o f  m a t c h e d ,
mismatched, insertions and deletions, and gives
these a probability, depending on the values of
the various parameters in the model.

A total of 424 predicted protein-coding
g e n e s  a n d  3  t R N A  g e n e s  o f  P l a s m o d i u m
falciparum have been identified on chromosomes
2 and 3, the two chromosomes for which
s e q u e n c i n g  h a v e  b e e n  c o m p l e t e d . ( 1 3 , 1 4 )

Approximately 37% (158 genes) of three genes
have a readily identifiable homologue in another
species. Such similarity allows a function to be
implied, with many of those identified being
involved in parasite metabolism. Interestingly,
a comparison of the predicted protein sequences
with their homologues in other species showed
that the majority of Plasmodium falciparum
proteins have insertion of low complexity
sequences, often runs of a single amino acid

Figure 5. The highest average of CC values of the experiments

Universa Medicina                                                                                                                                                              Vol. 26 No.3


136

Agoes, Gunawan, Sardy, et al                                                                                                         Exon prediction by using HMMExon prediction by using HMMExon prediction by using HMMExon prediction by using HMMExon prediction by using HMM

r e s i d u e  ( t y p i c a l l y  a s p a r a g i n e s ,  l y s i n e  o r
glutamine acid), or tandem arrays of a short
peptide repeat sequence. Examples of such
regions have been identified previously, and are
polymorphic between different parasite isolates.
Perhaps it will open up new avenues for drug
and vaccine development. Perhaps it will make
a genetically modified mosquito that cannot act
as vector a reality once processes are in place
to deal with ethical, legal and social issues. But
any new developments are unlikely to happen
before the 2010 deadline set for Roll Back
Malaria. New drugs based on knowledge about
the genome will probably not be affordable to
those countries that need it the most.(15)

CONCLUSION

This research with a new structure model
was using for exon prediction on DNA-genes of
Plasmodium falciparum based on CDS structure
by using the HMM system.

ACKNOWLEDGMENT

T h e  a u t h o r s  w o u l d  l i k e  t o  t h a n k  d r.
Herawati Sudoyo, PhD, dr. Helena MS and
Hidayat Trimarsanto B.Sc from the Eijkman
Institute, for their valuable suggestions.

References

1. Samatova N.F. Computational gene finding using
HMMs. UT-Battelle Information Center, 1201 Oak
Ridge Turnpike, Suite 100, Oak Ridge, TN 37830.
Institute Oak Ridge National Laboratory; 2003.

2. Lawrence R. A tutorial on hidden Markov models
and selected applications in speech recognition.
Proceeding of The IEEE. 1989; 77: 257-68.

3. Henderson J, Salzberg S, Fasman K.H. Finding
genes in DNA with a hidden Markov model. J
Comput Biol 1997; 4: 127-42.

4. Krogh A. An introduction to hidden Markov
models for biological sequences. In
Computational methods in molecular biology.
Salzberg SL, Searls DB, Kasif S. editors.
Denmark. Center for Biological Sequence
Analysis, Technical University of Denmark. 1998.
p. 45-63.

5. Nicorici D, Astola J, Tobus I. Computational
identification of exons in DNA with a hidden
Markov model. Tampere International Center for
Signal Processing, Tampere University of
Technology, Finland. Available at: http://
www.gensips.gatech.edu/processings/contributed/
CP2-06.pdf. Accessed May 12, 2005.

6. Anantharaman T. Finding genes in genomic DNA
The GENESCAN System. Available at : http://
www.biostat.wisc.edu/bmi776. Accessed June 6,
2005.

7. Vaisman I. Bioinformatics and gene discovery.
Bioinformatics Tutorial. North Carolina, United
State: University of North Carolina at Chapel Hill.
1998.

8. Hall N, Gardner M.J, Hyman RW, Lasonder E,
Wilson RJM, Scherf A, et al. Sequence of
plasmodium falciparum chromosomes 1, 3-9 and
13. Nature 2002; 419: 527-31.

9. Gardner M.J, Hall N, Hyman RW, Hinterberg K,
Mattei D, Wellem TE, et al. Sequence of
plasmodium falciparum chromosomes 2, 10, 11
and 14. Nature 2002; 419: 531-4.

10. Hyman RW, Gardner M.J, Hall N, De Bruin D,
Scherf A, Day KP, et al. Sequence of plasmodium
falciparum chromosome 12. Nature 2002; 419:
534-7.

11. Anastassiou D. Genomic signal processing. IEEE
Signal Processing Magazine 2001; 18: 4.

12. Alphey L. DNA sequencing. Manchester UK : Bios
Scientific Publishers Limited; 1997.

13. Gardner M.J, Hall N, Fung E, White O, Berriman
M, Hyman RW, et al. Chromosome 2 sequence of
the human malaria parasite Plasmodium
falciparum. Science 1998; 282: 1126-32.

14. Bowman S, Lawson D, Basham D, Brown D,
Chillingworth T, Churcher C. M, et al. The
complete nucleotide sequence of chromosome 3 of
Plasmodium falciparum. Nature 1999; 400: 532-
8.

15. Anonymous. Malaria after the genomes. The
Lancet 2002; 360: 1107.


129

Exon prediction on DNA-genes of Plasmodium
falciparum based on coding sequence structure using

hidden Markov model

Suhartati Agoes*a, Dadang Gunawan**, Sardy S**, and Hoedojo***

UNIVERSA MEDICINA
Juli-September 2007Juli-September 2007Juli-September 2007Juli-September 2007Juli-September 2007    Vol.26 - No.3   Vol.26 - No.3   Vol.26 - No.3   Vol.26 - No.3   Vol.26 - No.3

ABSTRACT

* Electrical Engineering
Department, Faculty of
Industrial Technology,
Trisakti University
** Electrical Engineering
Department, Faculty of
Engineering,
Indonesia University
*** Department of
Parasitology, Medical Faculty,
Trisakti University

Korespondensi
aIr. Suhartati Agoes, MT
Electrical Engineering
Department, Faculty of
Industrial Technology,
Trisakti University
Jl. Kyai Tapa No.1, Grogol
Jakarta 11440
Telp. 08164836613
Email:
suhartati_agoes@yahoo.com

Universa Medicina 2007; 26: 129-
36.

BACKGROUND
A hidden Markov model (HMM) is used for exon prediction on DNA of genes
Plasmodium falciparum that has a model structure based on exon region structure
in coding sequence (CDS). The objective research was to develop a new structure
model to predict exon on DNA-genes of Plasmodium falciparum based on CDS
structure using the HMM system.

METHODS
Model design in CDS, between two exon regions can be found one intron region
and the model state number is used for its region. Its state number is used by
separating start codon from first exon region and stop codon from the last exon
region up to 9. The Viterbi algorithm and the backward-forward method for
transition as well as emission states are used for training process. Furthermore,
Viterbi and Baum-Welch algorithms are used for the testing process. The
correlation coefficient (CC) was used as performance indicator, as the ratio of
the estimated state in the output and the original state in the input of the model.

RESULTS
The simulation results has shown that the CC values depend on the given of the
backward-forward transition state values randomly. The model with state number
9 showed the highest average of CC values of 0.7289 for Viterbi algorithm, and
is 0.7166 for Baum-Welch algorithm. However, the lowest average of CC values
has been found for the model with state number five. Its values are 0.6735 by
using Viterbi algorithm and 0.6661 by using Baum-Welch algorithm.

CONCLUSION
The new structure model based on HMM system was valid to predict exon on
DNA-genes of Plasmodium falciparum.

Keywords: Exon Prediction, DNA-gene, coding sequence, Hidden Markov Model


130

Agoes, Gunawan, Sardy, et al                                                                                                         Exon prediction by using HMMExon prediction by using HMMExon prediction by using HMMExon prediction by using HMMExon prediction by using HMM

Prediksi ekson DNA-gen Plasmodium falciparum berdasarkan
struktur coding sequence dengan menggunakan

model hidden Markov

Suhartati Agoes*a, Dadang Gunawan**, Sardy S**, dan Hoedojo***

LATAR BELAKANG
Sebuah hidden Markov model (HMM) yang digunakan untuk memprediksi ekson
gen DNA Palsmodium falciparum mempunyai struktur model berdasarkan struktur
gen DNA pada coding sequence (CDS). Penelitian ini bertujuan untuk mengembangkan
sebuah model stuktur baru untuk prediksi ekson gen DNA Plasmodium falciparum
berdasarkan struktur CDS dengan menggunakan sistem HMM.

METODE
Rancangan model pada CDS, di antara dua lokasi ekson dapat diketahui sebuah lokasi
intron dan jumlah state model dapat dilakukan pada lokasi tersebut. Jumlah state
dilakukan dengan memisahkan codon start dari ekson pertama dan codon stop dari
ekson terakhir hingga mencapai 9. Algoritma Viterbi dan metode backward-forward
transisi state serta emisi state digunakan untuk proses training. Sedangkan untuk
proses testing menggunakan algoritma Viterbi dan Baum-Welch. Sebagai kinerja model
digunakan Correlation Coefficient (CC) yang didapat dari perbandingan state estimasi
pada output dan state asli pada input model.

HASIL
Hasil simulasi menunjukkan bahwa nilai CC tergantung pada pemberian nilai acak
state transisi backward-forward. Model dengan jumlah state 9, menunjukkan nilai
CC rata-rata tertinggi adalah 0,7289 untuk algoritma Viterbi dan 0,7166 untuk
algoritma Baum-Welch.

KESIMPULAN
Struktur model berdasarkan sistem HMM sahih untuk memprediksi ekson gen DNA
Plamodium falciparum

Kata kunci : Prediksi ekson, gen-DNA, coding sequence, model hidden Markov

* Jurusan Teknik Elektro
Fakultas Teknologi Industri
Universitas Trisakti
** Jurusan Teknik Elektro,
Fakultas Teknik
Universitas Indonesia
*** Bagian Parasitologi
Fakultas Kedokteran
Universitas Trisakti

Korespondensi
aIr. Suhartati Agoes, MT
Jurusan Teknik Elektro
Fakultas Teknologi Industri
Universitas Trisakti
Jl. Kyai Tapa No.1, Grogol
Jakarta 11440
Telp. 08164836613
Email:
suhartati_agoes@yahoo.com

Universa Medicina 2007; 26: 129-36.

ABSTRAK

INTRODUCTION

I n  c o d i n g  s e q u e n c e  ( C D S ) ,  t h e  e x o n
prediction on DNA-gene is very important to find
the protein structure. Genome P. falciparum

belongs to genome eukaryotic and has a long
DNA genome, consisting of several exons and
introns alternately located. After the splicing
process in DNA sequence, some regions of exon
in CDS will be found to produce the protein.(1)


131

T h e  h i d d e n  M a r k o v  m o d e l  ( H M M )
structure is one of the model used to predict
the exon in desoxyribonucleai acid (DNA)
which is based on the exon region structure in
CDS. The HMM structure of the model: start
codon (Methyanine/ATG), exon, intron and stop
c o d o n  ( TA A ,  o r  T G A ,  o r  TA G )  h a s  b e e n
d e m o n s t r a t e d  b y  R a b i n e r,  H e n d e r s o n  a n d
Krogh.(2-4) This research developed a new model
of structures for exon prediction on DNA of
genes Plasmodium falciparum was based on the
work of Nicorici and Anantharaman.(5,6) The
significant difference of this model is it at least
has two exon regions used for exon prediction
like in CDS. Usually the 5’ boundary of introns
in most eukaryotes contains the dinucleotide
Guanine-Thymine (GT), and the 3’ boundary
contains the dinucleotide Adenine-Guanine
(AG). Furthermore, in intron regions of the
dinucleotide GT were separated to be nucleotide
Guanine (G) and Thymine (T), and dinucleotide
AG to be nucleotide A and G.

METHOD

Research design
The HMM framework in this experiment

used Viterbi algorithm for the training process
and both Viterbi and Baum-Welch algorithms
for the testing process. The same sequence has
been done for the training and testing process.
The program is written in Matlab 7.0, and Bio-
i n f o r m a t i c s ’  t o o l b o x  t o  g e n e r a t e  D N A
sequences in Genbank format and has the
functions of HMM training and testing.

The first experiment for the model based
on CDS has been done with separated start
codon (codon: ATG) from first exon and stop
codon (codon: TAA, or TAG, or TGA) from the
last exon, its model has the state number of 5
(Figure 1). Furthermore, the second experiment
u s e d  s e p a r a t e d  d i n u c l e o t i d e  G T  a n d
dinucleotide AG from the intron region, its

model has the state number of 7 (Figure 2). The
last experiment in the model also used separated
d i n u c l e o t i d e  G T i n t o  G  a n d  T s t a t e s  a n d
dinucleotide AG into A and G states; the state
number of the model is 9 (Figure 3). The
emission values of the models can be performed
a s  t h e  m a t r i x ,  t h e  c o l u m n s  a r e  t h e  b a s e s
numbered on DNA sequences and the rows are
the states number of the models designed. The
state transition values of the models can be
shown also in the figures.

Samples
A genome sequence of P. falciparum using

3D7 clone has 23-megabase nuclear genomes
consisting of 14 chromosomes, encodes about
5,300 genes, and it has the most (A+T)-rich
genome sequenced to date.(8-10) The data set of
experiments for this simulation has 152 DNA
sequences of genes from genome Plasmodium
f a l c i p a r u m  a t :  c h r o m o s o m e  1  ( L o c u s :
N C _ 0 0 4 3 2 5 ) ,  c h r o m o s o m e  2  ( L o c u s :
N C _ 0 0 0 9 1 0 ) ,  c h r o m o s o m e  3  ( L o c u s :
N C _ 0 0 0 5 2 1 ) ,  c h r o m o s o m e  4  ( L o c u s :
N C _ 0 0 4 3 1 8 ) ,  c h r o m o s o m e  5  ( L o c u s :
N C _ 0 0 4 3 2 6 ) ,  a n d  c h r o m o s o m e  9  ( L o c u s :
NC_004330) in Genbank format.(11,12) Genbank
format describes the CDS and original DNA
sequences of genes P. falciparum. In CDS of
genes contains at least two exon regions and
maximum 10 exon regions. The minimum length
sequence of genes is 684 base pairs (bp) and
maximum length is 10095 bp.

HMM provides a good probability method
for discrete sequences model of data like DNA
sequences (alphabet of four letters: A, C, G and
T). Prediction of exon in DNA of genes P.
falciparum is based on exon region in CDS and
it has at least two exon regions. The structure
of the model based on CDS structure as shown
in Figure 4, which separated is start codon from
the first exon and separated stop codon from
the last exon.

Universa Medicina                                                                                                                                                              Vol. 26 No.3


132

Agoes, Gunawan, Sardy, et al                                                                                                         Exon prediction by using HMMExon prediction by using HMMExon prediction by using HMMExon prediction by using HMMExon prediction by using HMM

In the inputs are DNA sequences and set
the state number of the sequences depends on
the structure model. Based on HMM method,
the Viterbi algorithm used for the training and
needs the transition state values and emission
values for the process. The value of transition
state is random and the value of emission can
be found with the distribution number of DNA

nucleotide in each states. The result of HMM
training is the estimated transition state and
emission state values. The estimated transition
and emission values were used for HMM testing
p r o c e s s  w i t h  Vi t e r b i  a n d  B a u m - We l c h
algorithms. The results of HMM testing process
are the estimated states of the model.(2) The
parameter of performance indicator used the

Figure 3. The HMM structure with 9 states

Exon Intron ExonStart Stop 

Figure 4. The HMM structure based on CDS

  9

Start E G T I A G

1    2 3 4 5 6 7

E Stop

  8

1

1

0.9

  0.1

0.9 0.9

  0.1    0.1

0.9   0.9

   0.1 0.1

0.9

   0.1

0.8

   0.1

0.1

5

0 .9

1

E IS ta rt S to p

2 3 4

1

0 .5

0 .5 0 .1

      0 .8 1

0 .1 9

 1

1 0 -6E

Figure 1. The HMM structure with 5 states

S tart E G T I A G E Stop

1    2 3 4 5 6 7

1

10.85

0.1

0.05

0.7

0.3

0.9

0.1

0.7

    
0.9

 0.1 0.3

Figure 2. The HMM structure with 7 states


133

correlation coefficient (CC), as a ratio of the
estimated states from the output and the original
states from the input of the model(1,7) and the
formula is the following:

Where: TP= True Positive; FP= False Positive;
TN= True Negative; FN= False Negative

Statistical analysis

The emission matrix of the model with state
number 5 is:
e = [ 0.3333          0        0.3333     0.3333 ;

0.4342     0.1094     0.1507     0.3056 ;
0.4210     0.0599     0.0703     0.4487 ;
0.4215     0.1173     0.1622     0.2990 ;
0.5592          0        0.1075     0.3333 ]

The emission matrix of the model with state
number 7 is:
e = [ 0.3333         0         0.3333     0.3333 ;

0.4342     0.1094     0.1507     0.3056 ;
           0             0         0.5000     0.5000 ;

0.4210     0.0599     0.0703     0.4487 ;
       0.5000          0         0.5000          0   ;

0.4215     0.1173     0.1622     0.2990 ;
0.5592          0         0.1075     0.3333 ]

The emission matrix of the model with state
number 9 is:
e = [ 0.3333         0         0.3333     0.3333 ;

0.4342     0.1094     0.1507     0.3056 ;
           0             0         1.0000          0    ;
           0             0              0         1.0000 ;

0.4210     0.0599     0.0703     0.4487  ;
       1.0000          0              0              0  ;
           0             0         1.0000          0    ;

0.4215     0.1173     0.1622     0.2990 ;
0.5592          0         0.1075     0.3333 ]

FN)FP).(TNFP).(TPFN).(TN(TP
(FP.FN)-(TP.TN)

CC
++++

=

The transition state values are randomly,
but the first state on the model has the minimum
transition state value of 0 and the last state has
the maximum transition state value of 1. On the
other hand, the emission state values were found
from the distribution number of DNA nucleotide
i n  e a c h  s t a t e  d e p e n d i n g  o n  t h e  m o d e l s .
Calculation of the CC is by using the above
equation with the assumptions that the exon is
positive and intron is negative.

RESULTS

Based on exon regions structure in CDS,
the simulation results for the model with state
number 5 above and transition state values
randomly has found the average of CC value in
Table 1. Table 2 and Table 3 shows the average
of CC values of the model with state number 7
and state number 9. The transition state value
in the first state of all structures on the models
is 0 and the last state is 1. The other transition
state values except in Table 2 are 0.1 and the
h i g h e s t  a v e r a g e  o f  C C  v a l u e s  f o r  t h i s
experiments are shown in Figure 5.

The model based on exon regions structure
in CDS for state number nine resulted in the
highest average of CC. Its values are 0.7289 by
using Viterbi algorithm and 0.7166 by using
Baum-Welch algorithm. However, the lowest
average of CC values has been found for the
model with state number five. Its values are
0.6735 by using Viterbi algorithm and 0.6661
by using Baum-Welch algorithm.

DISCUSSION

All these were based on the model of exon
regions structure in CDS, at the state number
nine resulted in the highest average of CC values.
It is also shown that the emission state value of
t h e  m o d e l s  h a v e  t h e  ( A + T ) - r i c h  g e n o m e
sequences of genes Plasmodium falciparum.

Universa Medicina                                                                                                                                                              Vol. 26 No.3


134

Agoes, Gunawan, Sardy, et al                                                                                                         Exon prediction by using HMMExon prediction by using HMMExon prediction by using HMMExon prediction by using HMMExon prediction by using HMM

Transition state values Corr.Coef.(CC) 
Exps 

tr11 tr22 tr33 tr44 tr55 Viterbi B-Welch 
1 0 0.5 0.9 0.80 1 0.7054 0.6750 
2 0 0.5 0.9 0.81 1 0.7061 0.6759 
3 0 0.5 0.9 0.82 1 0.7054 0.6750 
4 0 0.5 0.9 0.83 1 0.7054 0.6750 
5 0 0.5 0.9 0.84 1 0.7045 0.6741 
6 0 0.5 0.9 0.85 1 0.7054 0.6750 
7 0 0.5 0.9 0.86 1 0.7062 0.6744 
8 0 0.5 0.9 0.87 1 0.0012 0.0015 
9 0 0.5 0.9 0.88 1 0.0011 0.0013 
10 0 0.5 0.9 0.89 1 0.0011 0.0013 

 
Table 1. The average of CC values for the model with state number 5

Transition state values Corr.Coef.(CC) 
Exps 

tr11 tr22 tr33 tr44 tr55 tr66 tr77 Viterbi B-Welch 
1 0 0.4 0.2 0.8 0.10 0.7 1 0.6713 0.6643 
2 0 0.4 0.2 0.8 0.20 0.7 1 0.6713 0.6643 
3 0 0.4 0.2 0.8 0.30 0.7 1 0.6733 0.6660 
4 0 0.4 0.2 0.8 0.40 0.7 1 0.6731 0.6660 
5 0 0.4 0.2 0.8 0.50 0.7 1 0.6735 0.6661 
6 0 0.4 0.2 0.8 0.60 0.7 1 -0.0642 -0.0729 
7 0 0.4 0.2 0.8 0.59 0.7 1 -0.0568 -0.0639 
8 0 0.4 0.2 0.8 0.58 0.7 1 0.6735 0.6661 
9 0 0.4 0.2 0.8 0.56 0.7 1 0.6735 0.6661 

10 0 0.4 0.2 0.8 0.55 0.7 1 0.6735 0.6661 
 

Table 2. The average of CC values for the model with state number 7

Transition state values Corr.Coef.(CC) 
Exps 

tr11 tr22 tr33 tr44 tr55 tr66 tr77 tr88 tr99 Viterbi B-Welch 
1 0 0.3 0.2 0.2 0.80 0.10 0.10 0.70 1 0.7242 0.7124 
2 0 0.3 0.1 0.1 0.79 0.10 0.10 0.70 1 0.7242 0.7124 
3 0 0.3 0.1 0.1 0.78 0.09 0.09 0.70 1 0.7242 0.7124 
4 0 0.3 0.1 0.1 0.78 0.20 0.20 0.70 1 0.7258 0.7120 
5 0 0.3 0.1 0.1 0.80 0.20 0.20 0.80 1 0.7258 0.7120 
6 0 0.3 0.1 0.1 0.80 0.15 0.15 0.74 1 0.7258 0.7120 
7 0 0.3 0.1 0.1 0.80 0.15 0.15 0.80 1 0.7258 0.7120 
8 0 0.3 0.1 0.1 0.80 0.15 0.15 0.70 1 0.7258 0.7120 
9 0 0.8 0.2 0.2 0.80 0.10 0.10 0.70 1 0.6787 0.6755 
10 0 0.9 0.2 0.2 0.80 0.10 0.10 0.70 1 0.0451 0.0437 

 
Table 3. The average of CC values for the model with state number 9


135

This can also be showed in their matrix
values described above. In order to match with
Markov’s chain in eukaryotic gene structure for
c o n n e c t i n g  t h e  e x o n  a n d  i n t r o n ,  t h e n  t h e
backward-forward method are used in this HMM
system. The experiments were done by Viterbi
algorithm for the training process and the both
Viterbi and Baum-Welch algorithms for the
testing process.

A set of 152 sequences is then used as data
to train the model and a different HMM is
produced for each set of sequences. HMM have
been previously used in sequence analysis to
produce an HMM that represents a sequence
profile (a profile HMM), to analyze sequence
composition and patterns, to locate genes by
p r e d i c t i n g  o p e n  r e a d i n g  f r a m e  o r  c o d i n g
sequence, and to produce protein structure
predictions. This HMM generates sequences

0.62

0.64

0.66

0.68

0.7

0.72

0.74

C
orrelation C

oefficient (C
C

)

T he highest average of CC values

Viterbi 0.7061 0.6735 0.7258

Ba um-We lch 0.6759 0.6661 0.712

5 7 9

w i t h  v a r i o u s  c o m b i n a t i o n s  o f  m a t c h e d ,
mismatched, insertions and deletions, and gives
these a probability, depending on the values of
the various parameters in the model.

A total of 424 predicted protein-coding
g e n e s  a n d  3  t R N A  g e n e s  o f  P l a s m o d i u m
falciparum have been identified on chromosomes
2 and 3, the two chromosomes for which
s e q u e n c i n g  h a v e  b e e n  c o m p l e t e d . ( 1 3 , 1 4 )

Approximately 37% (158 genes) of three genes
have a readily identifiable homologue in another
species. Such similarity allows a function to be
implied, with many of those identified being
involved in parasite metabolism. Interestingly,
a comparison of the predicted protein sequences
with their homologues in other species showed
that the majority of Plasmodium falciparum
proteins have insertion of low complexity
sequences, often runs of a single amino acid

Figure 5. The highest average of CC values of the experiments

Universa Medicina                                                                                                                                                              Vol. 26 No.3


136

Agoes, Gunawan, Sardy, et al                                                                                                         Exon prediction by using HMMExon prediction by using HMMExon prediction by using HMMExon prediction by using HMMExon prediction by using HMM

r e s i d u e  ( t y p i c a l l y  a s p a r a g i n e s ,  l y s i n e  o r
glutamine acid), or tandem arrays of a short
peptide repeat sequence. Examples of such
regions have been identified previously, and are
polymorphic between different parasite isolates.
Perhaps it will open up new avenues for drug
and vaccine development. Perhaps it will make
a genetically modified mosquito that cannot act
as vector a reality once processes are in place
to deal with ethical, legal and social issues. But
any new developments are unlikely to happen
before the 2010 deadline set for Roll Back
Malaria. New drugs based on knowledge about
the genome will probably not be affordable to
those countries that need it the most.(15)

CONCLUSION

This research with a new structure model
was using for exon prediction on DNA-genes of
Plasmodium falciparum based on CDS structure
by using the HMM system.

ACKNOWLEDGMENT

T h e  a u t h o r s  w o u l d  l i k e  t o  t h a n k  d r.
Herawati Sudoyo, PhD, dr. Helena MS and
Hidayat Trimarsanto B.Sc from the Eijkman
Institute, for their valuable suggestions.

References

1. Samatova N.F. Computational gene finding using
HMMs. UT-Battelle Information Center, 1201 Oak
Ridge Turnpike, Suite 100, Oak Ridge, TN 37830.
Institute Oak Ridge National Laboratory; 2003.

2. Lawrence R. A tutorial on hidden Markov models
and selected applications in speech recognition.
Proceeding of The IEEE. 1989; 77: 257-68.

3. Henderson J, Salzberg S, Fasman K.H. Finding
genes in DNA with a hidden Markov model. J
Comput Biol 1997; 4: 127-42.

4. Krogh A. An introduction to hidden Markov
models for biological sequences. In
Computational methods in molecular biology.
Salzberg SL, Searls DB, Kasif S. editors.
Denmark. Center for Biological Sequence
Analysis, Technical University of Denmark. 1998.
p. 45-63.

5. Nicorici D, Astola J, Tobus I. Computational
identification of exons in DNA with a hidden
Markov model. Tampere International Center for
Signal Processing, Tampere University of
Technology, Finland. Available at: http://
www.gensips.gatech.edu/processings/contributed/
CP2-06.pdf. Accessed May 12, 2005.

6. Anantharaman T. Finding genes in genomic DNA
The GENESCAN System. Available at : http://
www.biostat.wisc.edu/bmi776. Accessed June 6,
2005.

7. Vaisman I. Bioinformatics and gene discovery.
Bioinformatics Tutorial. North Carolina, United
State: University of North Carolina at Chapel Hill.
1998.

8. Hall N, Gardner M.J, Hyman RW, Lasonder E,
Wilson RJM, Scherf A, et al. Sequence of
plasmodium falciparum chromosomes 1, 3-9 and
13. Nature 2002; 419: 527-31.

9. Gardner M.J, Hall N, Hyman RW, Hinterberg K,
Mattei D, Wellem TE, et al. Sequence of
plasmodium falciparum chromosomes 2, 10, 11
and 14. Nature 2002; 419: 531-4.

10. Hyman RW, Gardner M.J, Hall N, De Bruin D,
Scherf A, Day KP, et al. Sequence of plasmodium
falciparum chromosome 12. Nature 2002; 419:
534-7.

11. Anastassiou D. Genomic signal processing. IEEE
Signal Processing Magazine 2001; 18: 4.

12. Alphey L. DNA sequencing. Manchester UK : Bios
Scientific Publishers Limited; 1997.

13. Gardner M.J, Hall N, Fung E, White O, Berriman
M, Hyman RW, et al. Chromosome 2 sequence of
the human malaria parasite Plasmodium
falciparum. Science 1998; 282: 1126-32.

14. Bowman S, Lawson D, Basham D, Brown D,
Chillingworth T, Churcher C. M, et al. The
complete nucleotide sequence of chromosome 3 of
Plasmodium falciparum. Nature 1999; 400: 532-
8.

15. Anonymous. Malaria after the genomes. The
Lancet 2002; 360: 1107.