Microsoft Word - BIOL090315-f - edited_checked_final _2_.doc


 SQU  Journal For  Science, 15(2010) 1-8  
© 2010 Sultan Qaboos University 

1 

An Approach to Analyze the Ambiguity 
in RNA Structure   

Shailendra Singh and *Amardeep Singh   

*Department of Computer Science and Engineering, PEC University of Technology, 
Chandigarh- 160012, India, Email: shailendra_sing@yahoo.com, Department of Computer 
Engineering, University College of Engineering (UCoE), Punjabi University, Patiala, India, 
Email: amardeep-dhiman@yahoo.com. 

    RNAطريقة لتحليل عدم الوضوح في شكل ال     

  سيالندرا سنج و أمارديب سنج

 المرتبة الثانية من حيث RNA هما من أهم الجزيئات في الخلية اإلنسانية حيث تحتل ال RNA  وال DNAان  :خالصة
العديد من  هنالك . وتكوين البروتينDNAاهمية الحامض النووي في خلية األنسان وهي تمثل عامل الوصل مابين ال 

 ثم استعمال المعلومات الجينيةتخزين ومن إن .  المتواجدة في مختلف الخاليا ولكل منها مهمة محددةRNAأنواع ال 
تلعب دوراً أساسياً في الخاليا لذ فإن  RNAإن ال .  RNAبواسطة الخلية يعتمد أساساً على المظهر الخارجي وشكل ال 

هنالك عدة طرق . نوع سلسلة النيوكلوتايداتلها لمعرفة العالقة السائدة لكل أو لبعض غالتيمكن اس RNAفهم تركيبة ال 
 والمهام التي RNA والتي يمكن أن تؤدي لمعلومات قد تساعد لمعرفة عمل ال  RNAوأنواع من التحاليل لشكل ال 

ع متعددة من األشكال الثانوية ل في هذه الورقة تمت دراسة ومناقشة أنوا. تؤديها وذلك بدرجات متفاوتة من التفاصيل
RNA وقد تم إختيار أشكال محددة لدراسة االحتماالت المختلفة والتي تبين وضوح الرؤية لشكل ال .RNA  

 
ABSTRACT: DNA and RNA are two very important bio-molecules of the human cell. RNA is 
the second major form of nucleic acid in human cells that plays an intermediary role between 
DNA and functional protein. Several classes of RNA’s are found in cells, each with a / its 
distinct function. Understanding of storage and utilization of a cell’s genetic information is 
based on the structure of RNA.  However,  many experimental results have shown that RNA 
plays another greater role in the cells. RNA sequences which contain signals at the structure 
level can be exploited to detect functional motifs common to all, or a portion of, those 
sequences. Different types of analysis of a structure can provide functional information in 
different degrees of detail. This paper discusses various types of RNA secondary structure 
representation and which structure can be adopted as appropriate for a probabilistic approach 
that avoids ambiguity.  
 
KEYWORDS:  Secondary structure, Stochastic context-free grammar (SCFG), Derivation tree. 

 
SHAILENDRA SINGH and AMARDEEP SINGH  

 2

1.   Introduction 

NA is a biological polymer consisting of monomers called nucleotides. Each nucleotide consists of a 
(ribose) sugar, a phosphate group and a base. There are four  main types of bases: Adenine (A), Cytosine 

(C), Guanine (G), and Uracile (U). The base-paired structure formed by the Watson-Crick base-pairs A-U and C-
G and the wobbling base-pair G-U can be divided into loops, also known as ‘structure elements’. A loop is a 
formation of a base-pair that encloses a chain of nucleotides or other base-pairs. RNA primary structure is 
commonly represented by a string, S, over the alphabet ∑ = {A, G, C, U}. RNA is mostly involved in the 
biological machinery that expresses the genetic information from DNA to RNA. Information is encoded in RNA 
by the linear arrangement of the four different constituent nucleotides. RNA molecules perform a number of 
critical functions. Many of these functions are related to protein synthesis. Some RNA molecules bring genetic 
information from a cell’s chromosomes to its ribosome's, where proteins are assembled. 

The RNA plays a very important role in bio cells. Determining RNA shapes has gained considerable 
importance in the last decade because it is essential for researchers to know the shape of a molecule, in order to 
understand its role within a cell.  A lot of work has been done in the structural analysis of RNA in the 
bioinformatics field, but there exist a large number of challenging problems like the analysis of ambiguities in 
RNA structure, prediction of structure, and predictions of functions performed (Hiroshi, et al. 2005 and, Jizhen, 
et al. 2006).   The structure of an RNA molecule is closely related to its function (Yinglei, S. et al. 2004). For 
this reason, predicting the secondary structure of an RNA molecule based on its primary sequence has been of 
interest to many researchers.  Since RNA structure is essentially governed by base pairing of nucleotides, many 
computational methods and algorithms have been proposed for finding the “optimal base pairing” of RNA in an 
efficient manner (Keum, Y. S. 2006, Mount, D. W. 2004, and Rafael, G. 2006).  

For the computer science community, the primary structure of bio-molecules is just a very long string of 
commands forming long programs written in any computer programming language. This long program in the 
form of a long string is to be processed by compilers, translators etc. using regular expressions, grammars, and 
similar other techniques. According to Noam Chomsky, the Context Free Grammar (CFG) has great importance 
in the Linguistic field, Computer Science, Engineering and Bioinformatics. It is a more powerful class of formal 
grammars than the regular grammar. CFGs are often used to define the syntax of programming languages 
(Byung-Jun, Y. and Vaidyanathan, P.P. 2007). A CFG, also called a Type 2 Grammar, is similar to a regular 
grammar, but it permits a greater variety of production rules. The other methods used for the analysis of RNA 
structure are the free energy based model and conditional log-linear models (CLLMs).  CLLMs are a 
generalization of grammar based models (Dowell, R.D. and Eddy S. R. 2004).  According to the evaluation done 
by its authors, it has accuracies that are better than those of the current probabilistic and physics based models. 
One purpose of this paper is to present an effective method for analyzing the ambiguity in RNA structure and 
estimating a stochastic context-free grammar to model a family of RNA sequences (Yuki, K., Hiroyuki, S. and 
Tadao, K. 2003). 

2.   Analysis of RNA structure 

The importance of grammars in compliers is well-known. The grammars are useful tools to model 
character sequences and, in a certain way, these tools are useful to model molecular biological sequences (Yan, 
D. and Yulei, Z. 2005). Many bioinformatics problems can be reformulated in terms of formal languages, 
producing the corresponding grammar from the available data. Among several utilities contributed by grammars, 
the main contribution is the ability to test by derivations if a sequence is syntactically correct, i.e. if it belongs to 
a determined language. A derivation can be represented as a tree-like structure known as a ‘derivation tree’. This 
tree reflects the syntactical structure of a sequence. It is possible that for a given sequence there may be more 
than one derivation tree. In this case, we say that the grammar is ambiguous. In ambiguous grammar, complexity 
of the derivation rises given that the number of possible trees grows exponentially with the length of the 

R 


AN APPROACH TO ANALYZE THE AMBIGUITY IN RNA STRUCTURE  

 3

sequence to be derived. Stochastic syntactic analysis algorithms for the class of stochastic context free grammars 
(SCFG) have been proposed and their application has been demonstrated in pattern classification problems. 

3.   Context free grammar for RNA 

Type-2 grammars, or CFGs, are used to identify the secondary structure of RNA molecules from the given 
nucleotide sequence when we consider an RNA sequence as a string (or a valid sentence) of a programming 
language. The grammar is a major tool for a parser to build a parse tree to check if the given string is a valid 
sentence. The whole leaves of a parse tree constitute a sentence of the language defined by the grammar. As the 
name ‘context-free grammar’ implies, the non-terminals on the left-hand side of a production rule do not 
consider the context in which it is situated.  

For example, one of the applications of productions as shown in Figure 1 can generate the RNA sequence 
“AGCGUCAGUGACUUGAUGCU” by the following derivation, and the equivalent derivation tree is shown 
in Figure 7. 
3.1   Productions 

P = {  S0→ S1,            S7 → AS8U 
S1→ AS2U,  S8 →GS9U 
S2 → GS3C, S9 → US10 
S3→ CS4G, S10 → GS11 
S4 →GS5U, S11 → AS12 
S5→ US6A        S12 → C     } 

S6 → CS7G 
 

Figure 1. Set of production rules ‘P’ 
    
Figure 1 shows a set of production rules P that generates an RNA sequence for a certain restricted structure, in 
which S0,S1……S12 are non-terminals. A, G, C and U are terminals. Beginning with the start symbol S0, any 
production with S0 left of the arrow can be chosen to replace S0.  If the production S0 → S1 is selected, then the 
symbol S1 replaces S0. This derivation step is written as S0 → S1, where the arrow signifies application of a 
production. Next, if the production S1→ A S2 U is selected, the derivation step is S1→A S2 U. Continuing with 
the same procedure of replacing left-hand side with the right-hand side of an appropriate production, we obtain 
the following derivation terminating with the desired sequence: 

3.2  Derivation 

S0→S1 

→AS2U        (S1→ AS2U)  

→AGS3CU    (S2 → GS3C) 

→AGCS4GCU    (S3→ CS4G) 

→AGCGS5UGCU    (S4 →GS5U) 

→AGCGUS6AUGCU    (S5→ US6A) 

 → AGCGUCS7GAUGCU   (S6 → CS7G)   


SHAILENDRA SINGH and AMARDEEP SINGH  

 4

→AGCGUCAS8UGAUGCU    (S7 → AS8U) 

→AGCGUCAGS9UUGAUGCU    (S8 →GS9U) 

→ AGCGUCAGUS10UUGAUGCU    (S9 → US10) 

→ AGCGUCAGUGS11UUGAUGCU    (S10 → GS11) 

→ AGCGUCAGUGAS12UUGAUGCU    (S11 → AS12) 

→ AGCGUCAGUGACUUGAUGC          (S12 → C) 

4.  Different secondary structure for RNA 

RNA secondary structures can be displayed in different kinds of representations. Depending on the use of 
the RNA molecules, specific representations are more or less useful. The bracket notation as shown in Figure 2 is 
a text-based representation. The structure has been reflected in a string of dots and brackets. Dots denote non-
bonding bases and a pair of brackets indicates a base-pair. A more convenient representation, which expands in 
all directions in a plane and thus is closer to a spatial representation, is the squiggle plot as shown in Figure 3. It 
is the most appropriate plot to easily describe the approximate spatial structure of RNA. Base-pairs are given as 
two bases connected through either a straight line (Watson-Crick base-pairs) or a circle indicating the so-called 
wobbling base-pair, G-U.  

Considering RNA in a more theoretical way, the representations as trees or as arc-annotated sequences are 
well-accepted. In recent years, tree-representations of RNA secondary structures have occurred in the literature, 
and algorithmic applications on trees are performed successfully. Arc-annotated sequences focus on representing 
sequences as straight lines. Arcs indicate base-pairings. This kind of representation is used in this paper mainly 
due to its beneficial representation of single base and base-pair operations. A similar representation to the arc-
annotated sequence is the drawing of this sequence on a circle as shown in Figure 5. Arcs are plotted as curved 
lines inside this circle. The mountain plot as shown in Figure 6 is useful for large RNAs. Plateaus represent 
unpaired regions, and the heights of these mountains are determined by the number of base-pairs in which the 
partial sequences are embedded. Figure 7 shows a derivation tree for a given sequence and Figure 8 shows the 
appropriate way representation of a sequence. 

4.1 Dot-bracket representation 

 
Figure 2. Dot-bracket representation 

 
AN APPROACH TO ANALYZE THE AMBIGUITY IN RNA STRUCTURE  

 5

4.2  Squiggle plot 

 
Figure 3.  Squiggle plot 

4.3   Arc-annotated sequence  

 
Figure 4. Arc-annotated sequence 

4.4  Circle representation 

 
Figure 5. Circle representation 

4.5   Mountain plot representation 

 
Figure 6.  Mountain plot representation 


SHAILENDRA SINGH and AMARDEEP SINGH  

 6

4.6  Derivation Tree Representation 

 
Figure 7. Derivation tree representation 

4.6   Most Appropriate Way Representation 

 
Figure 8. Most appropriate way representation 
 

AN APPROACH TO ANALYZE THE AMBIGUITY IN RNA STRUCTURE  

 7

5.    Assignment of probability on productions 

A SCFG extends the definition of context free grammars by associating a probability to every production 
in the grammar. Consequently every string that the grammar can generate is assigned a probability which is 
equal to the product of the probabilities of the productions used in the string’s derivation. The probability of a 
parse tree can be calculated as a product of the probabilities of the production instances in the tree. There are 
various methods used to determine such probabilities, and using one such method, the assignment of 
probabilities is as shown in Table 1. To derive the trained grammar, the initial grammar was designed by using 
some prior knowledge about the RNA family. 
 
Table 1. Probabilities for the Type 2 grammar, with uniform distribution placed over each set of the same type of 
production. 
 

Category of 
Productions 

Productions Probabilities 

C#1 S0→S1 0.3000 
C#2 S1→AS2U 0.0250 
C#2 S2 →GS3C 0.0250 
C#2 S3→CS4G 0.0250 
C#2 S4 →G S5U 0.0250 
C#2 S5 →US6A 0.0250 
C#2 S6 → CS7G 0.0250 
C#2 S7 →AS8 U 0.0250 
C#2 S8 →GS9U 0.0250 
C#3 S9   → US10 0.0666 
C#3 S10 →GS11 0.0666 
C#3 S11 → AS12 0.0666 
C#4 S12 →  C 0.3000 

 
6.   Conclusion 

A detailed understanding of the functions and interactions of RNA requires knowledge of their structures. 
For many RNA molecules, the secondary structure is highly important to the correct function of the RNA, often 
more so than the actual sequence. One of the problems with CFGs is that they generally have an ambiguity in the 
grammar that results in more than one parse tree for a sequence, and alternative parse trees reflect alternative 
secondary structures.   Thus a grammar often gives several possible secondary structures for one RNA sequence. 
The SCFG is used to overcome the problem of ambiguity. One of the advantages of a SCFG is that it can provide 
the most likely parse tree.  If the grammar and their probabilities are carefully designed, the correct secondary 
structure will appear as the most likely parse tree among the alternatives. The grammar itself may be a valuable 
tool for representing a RNA family or domain. For (long-chain) RNA there are exponentially many possible 
structures which may be assigned to RNA, but assigning the correct one can only be done on the basis of a 
probability distribution. However, the most challenging future problem is to model a family of longer RNA 
sequences, and also the variations of RNAs like mRNA, tRNA, and siRNA. 
 

SHAILENDRA SINGH and AMARDEEP SINGH  

 8

6.  References  

BYUNG-JUN, Y. and VAIDYANATHAN, P.P. 2007. Computational identification and analysis of noncoding 
RNAs. IEEE Signal Processing Magazine, 24(1): 64-74. 

DOWELL, R.D. and EDDY S.R. 2004. Evaluation of Several Lightweight Stochastic Context Free Grammars 
for RNA Secondary Structure Prediction. BMC Bioinformatics, 5: 71-78. 

HIROSHI, M., KENGO, S. and YASUBUMI S. 2005. Pair Stochastic Tree Adjoining Grammars in Yan, D. and 
Yulei, Z. 2005Aligning and Predicting Pseudoknot RNA Structures. Journal of Bioinformatics, 21: 2611-
2617. 

HOPCROFT, J.E. and ULLMAN, J.D. 1979. Introduction to Automata Theory, Languages and Computation. 
Addison Wesley. 

JIZHEN, Z., LIMING, C. and RUSSELL L.M. 2006. Learning the Parameters of Stochastic Grammar Models 
for RNA Structures with Pseudoknots. IEEE International Conference on Granular Computing 2006, 
Atlanta, May 10-12: 170-175.  

KEUM, Y.S. 2006. Recognition and Modeling of RNA Pseudoknots Using Context-Sensitive Pattern Matching.  
International Conference on Hybrid Information Technology (ICHIT-2006), Korea, November 09-11, 
2006, 1: 660-665. 

MOUNT, D.W. 2004. Bioinformatics: Sequence and Genome Analysis. Cold Spring Harbor Laboratory Press, 
Cold Spring Harbor, New York USA. 

RAFAEL, G. 2006. Prediction of RNA Pseudoknotted Secondary Structure using Stochastic Context Free 
Grammars.  CLEI Electronic Journal, 9(2): 221-228. 

ROBIN, D.D. and SEAN, R.E. 2006. Efficient Pairwise RNA Structure Prediction and Alignment Using 
Sequence Alignment Constraints. Journal of BMC Bioinformatics, 7: 400-437. 

SAAD, M. 2007. On the Approximation of Optimal Structures for RNA-RNA Interaction. IEEE Transactions on 
Computational Biology and Bioinformatics, 6(4): 682-688. 

SEARLS, D.B. 2002. The Language of Genes. Nature, 420: 211-217. 
WOODS, D.A. and BATZOGLOU, S. 2006.  CONTRAfold: RNA Secondary Structure Prediction Without 

Physics-Based models. Bioinformatics 22(14): 90-98. 
YAN, D. and YULEI, Z. 2005. Statistical Parser For RNA Secondary Structure Prediction. Proceedings of the 

Fourth International Conference on Machine Learning and Cybernetics, Guangzhou, 18-21 August 2005. 
YINGLEI, S. 2004. Tree Decomposition Based Fast Search of RNA Structures Including Pseudoknots in 

Genomes. Proceedings of IEEE Computational Systems Bioinformatics Conference, Stanford, August 16-
19, 2004. 

YUKI, K., HIROYUKI, S. and TADAO, K. 2003. A Comparative Study on Formal Grammars for Pseudoknots, 
Proceedings of Genome Informatics, 14: 470-471. 

 
Received   15 March 2009               
Accepted   17 May 2010