Probalign

Probalign is a sequence alignment tool that calculates a maximum expected accuracy alignment using partition function posterior probabilities.^[1] Base pair probabilities are estimated from a Boltzmann-like distribution over alignments. The partition function is calculated by dynamic programming.

Algorithm

The following describes the algorithm used by Probalign to determine the base pair probabilities.^[2]

Alignment score

To score an alignment of two sequences, two things are needed:

a similarity function $\sigma (x,y)$ (e.g. PAM, BLOSUM, ...)
an affine gap penalty $g(k)=\alpha +\beta k$ for a gap of length $k$

The score $S(a)$ of an alignment $a$ is defined as:

$S(a)=\sum _{x_{i}-y_{j}\in a}\sigma (x_{i},y_{j})+{\text{gap cost}}$
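The score definition above can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration: `toy_sigma` is a hypothetical match/mismatch function standing in for a PAM or BLOSUM matrix, the values of `alpha` and `beta` are arbitrary, gap penalties are taken to be negative so that they are added to the score, and the total gap cost is assumed to be the sum of $g(k)$ over each maximal run of $k$ gap columns.

```python
from itertools import groupby

def toy_sigma(a, b):
    """Hypothetical similarity function: +1 for a match, -1 for a mismatch."""
    return 1.0 if a == b else -1.0

def gap_penalty(k, alpha, beta):
    """Affine gap penalty g(k) = alpha + beta * k (alpha, beta assumed <= 0)."""
    return alpha + beta * k

def column_kind(column):
    """Classify an alignment column as match (M), insertion (I) or deletion (D)."""
    a, b = column
    if a == "-":
        return "I"
    if b == "-":
        return "D"
    return "M"

def alignment_score(columns, sigma=toy_sigma, alpha=-2.0, beta=-0.5):
    """Sum sigma over aligned pairs and add g(k) for every maximal gap run."""
    score = 0.0
    for kind, run in groupby(columns, key=column_kind):
        run = list(run)
        if kind == "M":
            score += sum(sigma(a, b) for a, b in run)
        else:
            score += gap_penalty(len(run), alpha, beta)
    return score

# Alignment of x = ACGT with y = AGT, with C aligned to a gap.
columns = [("A", "A"), ("C", "-"), ("G", "G"), ("T", "T")]
score = alignment_score(columns)
```

With these assumed parameters the three aligned pairs contribute 3 and the single gap of length 1 contributes $g(1)=-2.5$.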

The Boltzmann-weighted score of an alignment $a$ is then:

$e^{\frac {S(a)}{T}}=e^{\frac {\sum _{x_{i}-y_{j}\in a}\sigma (x_{i},y_{j})+{\text{gap cost}}}{T}}=\left(\prod _{x_{i}-y_{j}\in a}e^{\frac {\sigma (x_{i},y_{j})}{T}}\right)\cdot e^{\frac {\text{gap cost}}{T}}$

where $T$ is a scaling factor.

The probability of an alignment, assuming a Boltzmann distribution, is given by

$Pr[a|x,y]={\frac {e^{\frac {S(a)}{T}}}{Z}}$

where $Z$ is the partition function, i.e. the sum of the Boltzmann weights of all alignments.
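As a small numerical illustration of this normalisation, the sketch below converts a list of alignment scores into probabilities. The scores are hypothetical, and $Z$ here sums only over the listed alignments, whereas the real partition function runs over every possible alignment of the two sequences.

```python
import math

def boltzmann_probabilities(scores, T=1.0):
    """Map alignment scores S(a) to probabilities e^{S(a)/T} / Z."""
    weights = [math.exp(s / T) for s in scores]
    Z = sum(weights)  # partition function over the listed alignments only
    return [w / Z for w in weights]

# Hypothetical scores of three candidate alignments.
probs = boltzmann_probabilities([0.5, -1.0, -4.0], T=1.0)
```

A larger $T$ flattens the distribution, while a smaller $T$ concentrates the probability on the highest-scoring alignment.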

Dynamic programming

Let $Z_{i,j}$ denote the partition function of the prefixes $x_{0},x_{1},...,x_{i}$ and $y_{0},y_{1},...,y_{j}$ . Three different cases are considered:

$Z_{i,j}^{M}:$ the partition function of all alignments of the two prefixes that end in a match.
$Z_{i,j}^{I}:$ the partition function of all alignments of the two prefixes that end in an insertion $(-,y_{j})$ .
$Z_{i,j}^{D}:$ the partition function of all alignments of the two prefixes that end in a deletion $(x_{i},-)$ .

Then we have: $Z_{i,j}=Z_{i,j}^{M}+Z_{i,j}^{D}+Z_{i,j}^{I}$

Initialization

The matrices are initialized as follows:

$Z_{0,j}^{M}=Z_{i,0}^{M}=0$ for $i,j>0$
$Z_{0,0}^{M}=1$
$Z_{0,j}^{D}=0$
$Z_{i,0}^{I}=0$

Recursion

The partition function for the alignments of two sequences $x$ and $y$ is given by $Z_{|x|,|y|}$ , which can be recursively computed:

$Z_{i,j}^{M}=Z_{i-1,j-1}\cdot e^{\frac {\sigma (x_{i},y_{j})}{T}}$
$Z_{i,j}^{D}=Z_{i-1,j}^{D}\cdot e^{\frac {\beta }{T}}+Z_{i-1,j}^{M}\cdot e^{\frac {g(1)}{T}}+Z_{i-1,j}^{I}\cdot e^{\frac {g(1)}{T}}$
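$Z_{i,j}^{I}=Z_{i,j-1}^{I}\cdot e^{\frac {\beta }{T}}+Z_{i,j-1}^{M}\cdot e^{\frac {g(1)}{T}}+Z_{i,j-1}^{D}\cdot e^{\frac {g(1)}{T}}$ (analogous to $Z_{i,j}^{D}$, with the roles of $x$ and $y$ exchanged)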
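The initialization and recursion can be sketched in Python as follows. This is an illustration under stated assumptions, not Probalign's own implementation: indices $i$ and $j$ are treated as prefix lengths (so $i=0$ is the empty prefix), the insertion recursion is written out by symmetry with the deletion case, `toy_sigma` is a hypothetical similarity function, and the gap parameters are arbitrary negative values.

```python
import math

def partition_matrices(x, y, sigma, alpha, beta, T):
    """Fill Z^M, Z^D and Z^I for all prefix lengths i (of x) and j (of y)."""
    n, m = len(x), len(y)
    ZM = [[0.0] * (m + 1) for _ in range(n + 1)]
    ZD = [[0.0] * (m + 1) for _ in range(n + 1)]
    ZI = [[0.0] * (m + 1) for _ in range(n + 1)]
    ZM[0][0] = 1.0                          # the empty alignment has weight 1
    w_open = math.exp((alpha + beta) / T)   # e^{g(1)/T}: first column of a gap
    w_ext = math.exp(beta / T)              # e^{beta/T}: each further gap column
    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0 and j > 0:
                prev = ZM[i - 1][j - 1] + ZD[i - 1][j - 1] + ZI[i - 1][j - 1]
                ZM[i][j] = prev * math.exp(sigma(x[i - 1], y[j - 1]) / T)
            if i > 0:
                ZD[i][j] = (ZD[i - 1][j] * w_ext
                            + (ZM[i - 1][j] + ZI[i - 1][j]) * w_open)
            if j > 0:
                ZI[i][j] = (ZI[i][j - 1] * w_ext
                            + (ZM[i][j - 1] + ZD[i][j - 1]) * w_open)
    return ZM, ZD, ZI

def partition_function(x, y, sigma, alpha, beta, T):
    """Return Z_{|x|,|y|} = Z^M + Z^D + Z^I at the last cell."""
    ZM, ZD, ZI = partition_matrices(x, y, sigma, alpha, beta, T)
    return ZM[-1][-1] + ZD[-1][-1] + ZI[-1][-1]

def toy_sigma(a, b):
    """Hypothetical similarity function: +1 for a match, -1 for a mismatch."""
    return 1.0 if a == b else -1.0

Z = partition_function("A", "A", toy_sigma, alpha=-2.0, beta=-0.5, T=1.0)
```

For two sequences of length 1 there are exactly three alignments (one match, and the two orders of a deletion plus an insertion), so under these assumptions $Z=e^{1}+2e^{-5}$, which gives a quick hand check of the recursion.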

Base pair probability

Finally, the probability that positions $x_{i}$ and $y_{j}$ are aligned with each other (form a base pair) is given by:

$P(x_{i}-y_{j}|x,y)={\frac {Z_{i-1,j-1}\cdot e^{\frac {\sigma (x_{i},y_{j})}{T}}\cdot Z'_{i',j'}}{Z_{|x|,|y|}}}$

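Here $Z'$ is the partition function recalculated on the reversed sequences, and $i'$ and $j'$ are the indices in the reversed sequences that correspond to the parts of $x$ and $y$ following $x_{i}$ and $y_{j}$.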
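A minimal Python sketch of this posterior computation is given below. It rests on assumptions that the text does not spell out: positions are 1-based, $Z'$ is obtained by running the same recursion on both sequences reversed, and the reversed indices are taken as $i'=|x|-i$ and $j'=|y|-j$ (the lengths of the suffixes after $x_{i}$ and $y_{j}$). The function names, `toy_sigma` and the parameter values are hypothetical.

```python
import math

def total_partition(x, y, sigma, alpha, beta, T):
    """Return the matrix Z[i][j] = Z^M + Z^D + Z^I over all prefix lengths."""
    n, m = len(x), len(y)
    ZM = [[0.0] * (m + 1) for _ in range(n + 1)]
    ZD = [[0.0] * (m + 1) for _ in range(n + 1)]
    ZI = [[0.0] * (m + 1) for _ in range(n + 1)]
    ZM[0][0] = 1.0
    w_open = math.exp((alpha + beta) / T)   # e^{g(1)/T}
    w_ext = math.exp(beta / T)              # e^{beta/T}
    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0 and j > 0:
                prev = ZM[i - 1][j - 1] + ZD[i - 1][j - 1] + ZI[i - 1][j - 1]
                ZM[i][j] = prev * math.exp(sigma(x[i - 1], y[j - 1]) / T)
            if i > 0:
                ZD[i][j] = (ZD[i - 1][j] * w_ext
                            + (ZM[i - 1][j] + ZI[i - 1][j]) * w_open)
            if j > 0:
                ZI[i][j] = (ZI[i][j - 1] * w_ext
                            + (ZM[i][j - 1] + ZD[i][j - 1]) * w_open)
    return [[ZM[i][j] + ZD[i][j] + ZI[i][j] for j in range(m + 1)]
            for i in range(n + 1)]

def pair_posteriors(x, y, sigma, alpha, beta, T):
    """P[i-1][j-1] is the probability that x_i is aligned to y_j (1-based)."""
    n, m = len(x), len(y)
    fwd = total_partition(x, y, sigma, alpha, beta, T)
    rev = total_partition(x[::-1], y[::-1], sigma, alpha, beta, T)  # Z'
    Z = fwd[n][m]
    P = [[0.0] * m for _ in range(n)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            w = math.exp(sigma(x[i - 1], y[j - 1]) / T)
            # Assumed index mapping: i' = n - i, j' = m - j.
            P[i - 1][j - 1] = fwd[i - 1][j - 1] * w * rev[n - i][m - j] / Z
    return P

def toy_sigma(a, b):
    """Hypothetical similarity function: +1 for a match, -1 for a mismatch."""
    return 1.0 if a == b else -1.0

P = pair_posteriors("ACG", "AG", toy_sigma, alpha=-2.0, beta=-0.5, T=1.0)
```

Because each position of $x$ is aligned to at most one position of $y$ in any alignment, every row and every column of `P` should sum to at most 1, which is a convenient sanity check on the index mapping.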

References

^ U. Roshan and D. R. Livesay, "Probalign: multiple sequence alignment using partition function posterior probabilities", Bioinformatics, 22(22):2715–2721, 2006.
^ Lecture "Bioinformatics II" at the University of Freiburg.