PhD Thesis Defense - Archive

Probabilistic Computational Methods for Structural Alignment of RNA Sequences

Arif Harmanci

Prof. Guarav Sharma/Prof. David Mathews

Wednesday, April 21, 2010
1 p.m.

Goergen 110

Abstract

In this thesis, the problem of structural alignment of homologous RNA sequences is addressed. The structural alignment of a given set of RNA sequences is a secondary structure for each sequence, which are common to each other, and a sequence alignment between the sequences that is conforming with the secondary structures. A solution to this problem was proposed by Sankoff as a dynamic programming algorithm, whose time and memory complexities are polynomial in the length of shortest sequence and exponential in the number of input sequences, respectively. Variants of Sankoff's method employ constraints that reduce the computation by restricting the allowed alignments or structures. In the first stage, a new methodology is presented for the purpose of establishing alignment constraints based on nucleotide alignment and insertion posterior probabilities. Using a hidden Markov model, posterior probabilities of alignment and insertion are computed and these probabilities are additively combined to obtain probabilities of co-incidence. The constraints on alignments are computed by adaptively thresholding these probabilities to determine high probability co-incidence events. The proposed constraints are implemented into Dynalign, a free energy minimization algorithm for structural alignment. The probabilistic constraints reduces computation time by a factor of 2 with marginal increase in base pair prediction accuracy. Next, a novel algorithm for structural alignment of two RNA sequences is presented. The joint consideration of common secondary structures and alignment is accomplished by structural alignment over a search space defined by the newly introduced motif called matched helical regions. The matched helical region formulation generalizes previously employed constraints by other methods for structural alignment and thereby better accommodates the structural variability within RNA families. A probabilistic model based on pseudo free energies obtained from precomputed base pairing and alignment probabilities is utilized for scoring structural alignments. Three different approaches are addressed for structural alignment prediction as inferences from the distribution of structural alignments: 1. Maximum a posteriori (MAP) structural alignment, the structural alignment with highest probability in the space, 2. Computation of base pairing probabilities of nucleotides in each sequence as determined by an efficient dynamic programming algorithm, 3. The sampling of structural alignment space for analysis of modes of distribution for structural alignments. Finally, the problem for structure prediction for an arbitrary number of homologous sequences is addressed. To circumvent the computational complexity, the prediction is broken down into prediction of structures of individual sequences by utilizing a low complexity soft-decision folding algorithm for each sequence. For each sequence, the folding algorithm is fed with extrinsic information for base pairing for the corresponding sequence. The extrinsic information serves the purpose of integrating the base pairing information in all other sequences into the folding of the sequence. The computation of extrinsic information involves "mapping" of base pairing probabilities in other sequences to base pairing probabilities of the sequence as directed by probabilities of alignment. The computation of base pairing probabilities and extrinsic information is iteratively performed for each sequence. It is shown that the iterative approach and utilization of extrinsic information in soft-decision folding algorithm improves the accuracy of structure predictions above other higher complexity structure prediction algorithms.