Introduction



The development of suboptimal algorithms for RNA secondary structure prediction [Williams & Tinoco, 1986; Zuker, 1989b; Zuker, 1989a] helped to mitigate the uncertainty of predicting a secondary structure of a single RNA sequence from thermodynamic data. The mfold algorithm [Zuker, 1989a; Zuker, 1994] predicts suboptimal foldings as well as an ``energy dot plot'', which is a dot plot showing all possible base pairs that can participate in foldings within a specified increment of the predicted minimum folding energy. The collection of suboptimal folding predictions and, in the case of the Zuker algorithm, the energy dot plot, combine to give the user an idea of how well determined a given prediction is.
A number of heuristic descriptors have been developed in conjunction with the mfold package that describe the propensity of individual bases to participate in base pairs and whether or not a predicted helix is ``well determined''. These descriptors are P-num, S-num and H-num. The first two were introduced early [Jaeger et al., 1989; Jaeger et al., 1990], and are computed for individual bases.
P-num is defined from the energy dot plot, and therefore depends on an (arbitrary) energy increment and on whether or not the dot plot has been filtered to eliminate isolated base pairs or short helices. For the i-th base in a molecule with n bases, P-num(i) is the total number of dots in the i-th row and column of the dot plot. In simple words, P-num(i) is the total number of different base pairs that can be formed using the i-th base in all foldings within the prescribed energy increment. If P-num(i) is large, and this is a relative term, then the i-th base is promiscuous in its association with other bases. We say that it is ``poorly determined''. In an ensemble of foldings, it will be single stranded or paired with many different bases. In a particular folding, we cannot say with any certainty how this base will pair. If P-num(i) is 0, then the i-th base must be single stranded. Otherwise, P-num gives no information about the propensity to be single stranded. This is furnished by S-num, defined next.
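The counting rule for P-num can be sketched in a few lines of Python. This is only an illustration, not the mfold implementation: the function name `p_num`, the representation of the dot plot as a list of 1-based (i, j) index pairs, and the example data are all assumptions made for this sketch.

```python
from collections import Counter


def p_num(dots, n):
    """Count, for each base 1..n, the distinct dots (candidate base pairs)
    that involve it. `dots` is an iterable of 1-based (i, j) index pairs."""
    # Normalize so that (i, j) and (j, i) are treated as the same dot.
    unique = {(min(i, j), max(i, j)) for i, j in dots}
    counts = Counter()
    for i, j in unique:
        counts[i] += 1
        counts[j] += 1
    # Bases that appear in no dot get a count of 0.
    return {k: counts[k] for k in range(1, n + 1)}


# Made-up dot plot for an 8-base molecule (illustrative data only).
dots = [(1, 8), (2, 7), (2, 6), (3, 6)]
p = p_num(dots, 8)
print(p)
```

In this toy example bases 2 and 6 each take part in two candidate pairs, while bases 4 and 5 have a count of 0 and so must be single stranded within the chosen energy increment.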
S-num is defined from a collection of optimal and suboptimal foldings, and is thus independent of any dot plot computations. In a group of m foldings, S-num(i) is the number of foldings in which base i is single-stranded, divided by m. Thus S-num(i) is a sample probability that the i-th base is single-stranded. A value of S-num that is close to 0 or 1 is ``good'' in the sense that it tells us with a high degree of confidence whether the base is paired, or not paired, respectively. We say that a base is ``well-determined'' if S-num is near 1 (almost certainly single stranded) or if S-num is near 0 (almost certainly base paired) and P-num is low.
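As a rough illustration of the S-num definition, the following Python sketch computes the fraction of foldings in which each base is unpaired. The function names, the representation of a folding as a set of 1-based (i, j) pairs, and the example foldings are assumptions. The numeric thresholds `s_tol` and `p_max` in `is_well_determined` are purely hypothetical, since the text only says ``near'' and ``low''.

```python
def s_num(foldings, n):
    """Fraction of foldings in which each base 1..n is unpaired.
    Each folding is a collection of 1-based (i, j) base pairs."""
    m = len(foldings)
    if m == 0:
        raise ValueError("need at least one folding")
    unpaired = {i: 0 for i in range(1, n + 1)}
    for pairs in foldings:
        paired = {base for pair in pairs for base in pair}
        for i in unpaired:
            if i not in paired:
                unpaired[i] += 1
    return {i: count / m for i, count in unpaired.items()}


def is_well_determined(s, p, s_tol=0.1, p_max=1):
    """Hypothetical cutoffs: S-num near 1, or S-num near 0 with low P-num."""
    return s >= 1 - s_tol or (s <= s_tol and p <= p_max)


# Four made-up foldings of an 8-base molecule (illustrative data only).
foldings = [
    {(1, 8), (2, 7)},
    {(1, 8), (3, 6)},
    {(2, 7), (3, 6)},
    {(1, 8)},
]
s = s_num(foldings, 8)
print(s)
```

Here base 1 is unpaired in one of the four foldings, so its S-num is 0.25, while bases 4 and 5 are never paired and have an S-num of 1.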
At a later date [Zuker & Jacobson, 1995], we introduced the notion of H-num, which is an extension of P-num to helices. For a single base pair, i.j, we define H-num(i,j) to be P-num(i) + P-num(j) - 1. This is simply the total number of base pairs that can be formed using the i-th or j-th bases, in all foldings within the chosen energy increment. For a helix, H-num is the average of these values for all the base pairs in the helix. Helices with relatively low H-num values are said to be ``well-determined'' and those with relatively high values are said to be ``poorly determined''. We used the H-num measure to demonstrate that ``well-determined'' helices in optimal foldings are more likely to be correct than ``poorly determined'' ones [Zuker & Jacobson, 1995]. It is worth adding here that a ``well-determined'' helix does not have to be in an optimal folding.
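The helix average can be sketched as follows. The sketch takes the H-num of a single pair i.j to be P-num(i) + P-num(j) - 1, which counts the pair i.j itself only once. This form is an assumption chosen to match the verbal description above (the number of base pairs that use base i or base j), and it presumes that i.j is itself a dot in the plot. The function names and the P-num values are illustrative only.

```python
def h_num_pair(p, i, j):
    """H-num of the single base pair i.j, given P-num values in dict `p`.
    Assumes i.j is itself a dot, so it is counted once rather than twice."""
    return p[i] + p[j] - 1


def h_num_helix(p, helix):
    """Average of the single-pair H-num values over the pairs of a helix."""
    if not helix:
        raise ValueError("a helix needs at least one base pair")
    return sum(h_num_pair(p, i, j) for i, j in helix) / len(helix)


# Made-up P-num values for an 8-base molecule (illustrative data only).
p = {1: 1, 2: 2, 3: 1, 4: 0, 5: 0, 6: 2, 7: 1, 8: 1}
helix = [(1, 8), (2, 7)]
h = h_num_helix(p, helix)
print(h)
```

A lower value of `h` would mark the helix as better determined relative to other helices in the same molecule. No absolute threshold is implied.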
The recursive computation of rigorous partition functions for the RNA secondary structure model [McCaskill, 1990] led to the development of rigorous statistics to describe uncertainties in RNA folding predictions. The original work computes base pair probabilities and, as a direct consequence, probabilities that any base will be single or double stranded. Base pair probabilities are plotted in what is called a ``boxplot''. This is similar to the mfold energy dot plot, except that base pairs are plotted as black squares whose areas are proportional to the probability of that base pair. There is a probability cutoff below which base pairs are not plotted. These ideas have been taken up and expanded on by a theoretical chemistry group at the University of Vienna. The resulting software has come to be known as the ``Vienna (RNA) package'' [Hofacker et al., 1994].
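To make the step from base pair probabilities to single-strand probabilities concrete, here is a small Python sketch. It does not call the Vienna package. The function names, the dictionary representation of the probabilities, the example values and the cutoff of 0.1 are all assumptions for illustration. Because a base can pair with at most one partner in any one secondary structure, the probability that base i is paired is the sum of its pair probabilities, and the single-strand probability is one minus that sum. A square whose area is proportional to the probability has a side proportional to its square root.

```python
import math


def unpaired_probabilities(pair_probs, n):
    """Probability that each base 1..n is single stranded, given a dict
    mapping 1-based (i, j) pairs to base pair probabilities."""
    paired = {i: 0.0 for i in range(1, n + 1)}
    for (i, j), prob in pair_probs.items():
        paired[i] += prob
        paired[j] += prob
    return {i: 1.0 - total for i, total in paired.items()}


def boxplot_squares(pair_probs, cutoff):
    """Side length (relative units) of the square drawn for each pair whose
    probability is at least `cutoff`; area is proportional to probability."""
    return {
        pair: math.sqrt(prob)
        for pair, prob in pair_probs.items()
        if prob >= cutoff
    }


# Made-up probabilities for an 8-base molecule (illustrative data only).
pair_probs = {(1, 8): 0.9, (2, 7): 0.6, (2, 6): 0.3, (3, 6): 0.05}
q = unpaired_probabilities(pair_probs, 8)
squares = boxplot_squares(pair_probs, cutoff=0.1)  # hypothetical cutoff
print(q)
print(squares)
```

With this hypothetical cutoff the pair (3, 6) is dropped from the plot, although it still contributes to the single-strand probabilities of bases 3 and 6.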
Although an experienced user of the mfold package can extract information relatively easily from the dot plot superposition of optimal and suboptimal foldings, we find that a color annotation of individual foldings simplifies the interpretation of the results that are obtained from these plots. The major innovation described in this work is the annotation of foldings with ``well-definedness'' information. The latter can be the P-num, S-num or H-num measures computed from the current mfold package, or base pair probabilities and other measures as computed by the Vienna package.
Our annotation method was developed to annotate predicted secondary structures. However, structural models that have been created from comparative sequence analysis can also be annotated.
In addition to color, we also use another form of annotation to show how closely two structural models are related to one another. The short line segments that denote base pairs in a structure plot can be thickened to denote base pairs that are conserved in a reference folding. This feature can be used to visualize conformational differences between wild-type and mutant genomes, between predicted alternative foldings, and between the phylogenetic and predicted models of an RNA molecule. The application of these approaches to the analysis of several molecules of RNase P RNA and 16S rRNA is given below.
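Selecting which base pairs to thicken amounts to a set intersection between a folding and the reference folding. The sketch below assumes both structures are given as 1-based (i, j) pairs over the same sequence numbering. The function names and the example data are hypothetical, and the plotting step itself is not shown.

```python
def normalize(pairs):
    """Return the pairs as a set with each pair ordered as (smaller, larger)."""
    return {(min(i, j), max(i, j)) for i, j in pairs}


def conserved_pairs(folding, reference):
    """Base pairs of `folding` that also occur in `reference`."""
    return normalize(folding) & normalize(reference)


# Made-up foldings of the same 8-base sequence (illustrative data only).
folding = [(1, 8), (2, 7), (3, 6)]
reference = [(1, 8), (7, 2), (4, 5)]
thick = conserved_pairs(folding, reference)
print(sorted(thick))
```

Comparing a mutant with a wild-type sequence of different length would first require mapping the two numberings onto each other, which this sketch does not attempt.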
Michael Zuker, Institute for Biomedical Computing, Washington University in St. Louis, August 21, 1998.