Probalign

Probalign is a sequence alignment tool that calculates a maximum expected accuracy alignment using partition function posterior probabilities.[1] Base pair probabilities are estimated using an estimate similar to Boltzmann distribution. The partition function is calculated using a dynamic programming approach.

Algorithm

The following describes the algorithm used by probalign to determine the base pair probabilities.[2]

Alignment score

To score an alignment of two sequences two things are needed:

  • a similarity function σ ( x , y ) {\displaystyle \sigma (x,y)} (e.g. PAM, BLOSUM,...)
  • affine gap penalty: g ( k ) = α + β k {\displaystyle g(k)=\alpha +\beta k}

The score S ( a ) {\displaystyle S(a)} of an alignment a is defined as:

S ( a ) = x i y j a σ ( x i , y j ) + gap cost {\displaystyle S(a)=\sum _{x_{i}-y_{j}\in a}\sigma (x_{i},y_{j})+{\text{gap cost}}}

Now the boltzmann weighted score of an alignment a is:

e S ( a ) T = e x i y j a σ ( x i , y j ) + gap cost T = ( x i y i a e σ ( x i , y j ) T ) e g a p c o s t T {\displaystyle e^{\frac {S(a)}{T}}=e^{\frac {\sum _{x_{i}-y_{j}\in a}\sigma (x_{i},y_{j})+{\text{gap cost}}}{T}}=\left(\prod _{x_{i}-y_{i}\in a}e^{\frac {\sigma (x_{i},y_{j})}{T}}\right)\cdot e^{\frac {gapcost}{T}}}

Where T {\displaystyle T} is a scaling factor.

The probability of an alignment assuming boltzmann distribution is given by

P r [ a | x , y ] = e S ( a ) T Z {\displaystyle Pr[a|x,y]={\frac {e^{\frac {S(a)}{T}}}{Z}}}

Where Z {\displaystyle Z} is the partition function, i.e. the sum of the boltzmann weights of all alignments.

Dynamic programming

Let Z i , j {\displaystyle Z_{i,j}} denote the partition function of the prefixes x 0 , x 1 , . . . , x i {\displaystyle x_{0},x_{1},...,x_{i}} and y 0 , y 1 , . . . , y j {\displaystyle y_{0},y_{1},...,y_{j}} . Three different cases are considered:

  1. Z i , j M : {\displaystyle Z_{i,j}^{M}:} the partition function of all alignments of the two prefixes that end in a match.
  2. Z i , j I : {\displaystyle Z_{i,j}^{I}:} the partition function of all alignments of the two prefixes that end in an insertion ( , y j ) {\displaystyle (-,y_{j})} .
  3. Z i , j D : {\displaystyle Z_{i,j}^{D}:} the partition function of all alignments of the two prefixes that end in a deletion ( x i , ) {\displaystyle (x_{i},-)} .

Then we have: Z i , j = Z i , j M + Z i , j D + Z i , j I {\displaystyle Z_{i,j}=Z_{i,j}^{M}+Z_{i,j}^{D}+Z_{i,j}^{I}}

Initialization

The matrixes are initialized as follows:

  • Z 0 , j M = Z i , 0 M = 0 {\displaystyle Z_{0,j}^{M}=Z_{i,0}^{M}=0}
  • Z 0 , 0 M = 1 {\displaystyle Z_{0,0}^{M}=1}
  • Z 0 , j D = 0 {\displaystyle Z_{0,j}^{D}=0}
  • Z i , 0 I = 0 {\displaystyle Z_{i,0}^{I}=0}

Recursion

The partition function for the alignments of two sequences x {\displaystyle x} and y {\displaystyle y} is given by Z | x | , | y | {\displaystyle Z_{|x|,|y|}} , which can be recursively computed:

  • Z i , j M = Z i 1 , j 1 e σ ( x i , y j ) T {\displaystyle Z_{i,j}^{M}=Z_{i-1,j-1}\cdot e^{\frac {\sigma (x_{i},y_{j})}{T}}}
  • Z i , j D = Z i 1 , j D e β T + Z i 1 , j M e g ( 1 ) T + Z i 1 , j I e g ( 1 ) T {\displaystyle Z_{i,j}^{D}=Z_{i-1,j}^{D}\cdot e^{\frac {\beta }{T}}+Z_{i-1,j}^{M}\cdot e^{\frac {g(1)}{T}}+Z_{i-1,j}^{I}\cdot e^{\frac {g(1)}{T}}}
  • Z i , j I {\displaystyle Z_{i,j}^{I}} analogously

Base pair probability

Finally the probability that positions x i {\displaystyle x_{i}} and y j {\displaystyle y_{j}} form a base pair is given by:

P ( x i y j | x , y ) = Z i 1 , j 1 e σ ( x i , y j ) T Z i , j Z | x | , | y | {\displaystyle P(x_{i}-y_{j}|x,y)={\frac {Z_{i-1,j-1}\cdot e^{\frac {\sigma (x_{i},y_{j})}{T}}\cdot Z'_{i',j'}}{Z_{|x|,|y|}}}}

Z , i , j {\displaystyle Z',i',j'} are the respective values for the recalculated Z {\displaystyle Z} with inversed base pair strings.

See also

  • ProbCons
  • Multiple Sequence Alignment

References

  1. ^ U. Roshan and D. R. Livesay, Probalign: multiple sequence alignment using partition function posterior probabilities, Bioinformatics, 22(22):2715-21, 2006 (PDF)
  2. ^ Lecture "Bioinformatics II" at University of Freiburg
  • Probalign Webservice