PR2ALIGN: protein sequence alignment using weighted profiles of biochemical properties

Main page: http://pr2align.rit.albany.edu


Summary

Traditional amino acid sequence alignment methods score matches and mismatches using a single fixed 20x20 matrix obtained by averaging substitution patterns over a large collection of different protein families. In contrast, PR2ALIGN measures the distance between pairs of aligned residues using a set of weighted amino acid properties and finds an optimal minimum-distance global alignment between two amino acid sequences by comparing multiple profiles of their biochemical properties. The user can provide any number of amino acid properties (such as hydrophobicity, size, charge, etc.) and specify a weight for each property. The higher the weight for a given property, the more this property contributes to the final alignment.

Algorithm

Let X={x1,…,xn} and Y={y1,…,ym} be two amino acid sequences of length n and m residues, respectively. PR2ALIGN uses the following function to calculate the score for a global alignment between X and Y, Score(X,Y):

Score(X,Y) = SUM d(xi, yj) + SUM g(r)

Where the first sum runs over all aligned residue pairs (xi, yj), the second sum runs over all gaps in the alignment, d(xi, yj) is the distance between the aligned pair of characters xi and yj (the i-th character in sequence X and the j-th character in sequence Y), and g is the affine gap penalty function computed in the following way:

g(r) = A + (r-1)*B

Where A is the gap opening penalty, B is the gap extension penalty (A >= 0 and B >= 0), and r is the gap length. Biochemical properties of each of the 20 amino acid types are represented by a numerical property vector, P, of dimension k (where k is the number of amino acid properties used for the alignment, k >= 1). Thus, a residue xi in sequence position i is represented by its k biochemical properties:

P(xi) = {p1(xi),…,pk(xi)}

Then the distance d(xi, yj) is computed according to the following equation:



Where w(b) is the user-supplied weight for the amino acid property b.
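
The role of the weights w(b) can be illustrated with a short Python sketch. The exact distance equation is not reproduced in this text, so the code below ASSUMES one plausible form, a weighted sum of absolute property differences; PR2ALIGN itself may use a different formula. The names residue_distance, toy_props and toy_weights are hypothetical, and the property values are illustrative only.

```python
from typing import Dict, Sequence


def residue_distance(a: str, b: str,
                     props: Dict[str, Sequence[float]],
                     weights: Sequence[float]) -> float:
    """Weighted distance between residues a and b.

    ASSUMPTION: a weighted sum of absolute property differences.
    The formula actually used by PR2ALIGN is not shown in this text.
    """
    pa, pb = props[a.upper()], props[b.upper()]
    return sum(w * abs(x - y) for w, x, y in zip(weights, pa, pb))


# Toy two-property vectors (hydrophobicity, size), already scaled to [0, 1].
toy_props = {"A": (0.5, 0.25), "G": (0.5, 0.0), "W": (0.75, 1.0)}
toy_weights = (0.5, 0.5)  # one weight per property, summing to 1.0

print(residue_distance("A", "G", toy_props, toy_weights))
print(residue_distance("G", "W", toy_props, toy_weights))
```

Under this assumed form, raising the weight of one property makes differences in that property more costly, which is how a higher weight comes to influence the final alignment.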

The optimal global alignment score, D[n,m], for sequences X and Y is found by applying the dynamic programming recursion:

E[i,j] = min{D[i,j-1]+A, E[i,j-1]+B}
F[i,j] = min{D[i-1,j]+A, F[i-1,j]+B}
D[i,j] = min{D[i-1,j-1] + d(xi,yj), E[i,j], F[i,j]}

E[0,0] = F[0,0] = D[0,0] = 0
E[i,0] = D[i,0] = g(i)
F[0,j] = D[0,j] = g(j)
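
The recursion and boundary values above can be sketched in Python as follows. This is a minimal illustration, not the PR2ALIGN source code: it returns only the score D[n,m] and performs no traceback. The function names pr2align_score and toy_distance are hypothetical, and toy_distance (0 for identical residues, 1 otherwise) merely stands in for d(xi, yj).

```python
from typing import Callable


def pr2align_score(x: str, y: str,
                   d: Callable[[str, str], float],
                   A: float, B: float) -> float:
    """Return D[n][m] using the recursion and boundary values given in the text."""
    n, m = len(x), len(y)

    def g(r: int) -> float:
        return A + (r - 1) * B  # affine penalty for a gap of length r

    inf = float("inf")
    D = [[inf] * (m + 1) for _ in range(n + 1)]
    E = [[inf] * (m + 1) for _ in range(n + 1)]
    F = [[inf] * (m + 1) for _ in range(n + 1)]

    # Boundary values exactly as listed above.
    D[0][0] = E[0][0] = F[0][0] = 0.0
    for i in range(1, n + 1):
        E[i][0] = D[i][0] = g(i)
    for j in range(1, m + 1):
        F[0][j] = D[0][j] = g(j)

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            E[i][j] = min(D[i][j - 1] + A, E[i][j - 1] + B)
            F[i][j] = min(D[i - 1][j] + A, F[i - 1][j] + B)
            D[i][j] = min(D[i - 1][j - 1] + d(x[i - 1], y[j - 1]),
                          E[i][j], F[i][j])
    return D[n][m]


def toy_distance(a: str, b: str) -> float:
    """Hypothetical stand-in for d(xi, yj): 0 if identical, 1 otherwise."""
    return 0.0 if a == b else 1.0


print(pr2align_score("ACD", "ACD", toy_distance, A=0.7, B=0.1))
print(pr2align_score("ACD", "AD", toy_distance, A=0.7, B=0.1))
```

With these toy settings, identical sequences score 0, and aligning ACD with AD costs one gap of length 1, that is, A = 0.7. Entries of E and F that the recursion never reads are left at infinity.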


Description of the input parameters

Upload amino acid sequences

The user is asked to provide two amino acid sequences to be aligned (sequence 1 and 2). These amino acid sequences must be in FASTA format (see format description below). The maximum allowed sequence length is 5,000 residues.

FASTA sequence file format:

A FASTA file consists of a header line that begins with the ">" character followed by an optional sequence name. The sequence itself starts on the next line and may span several lines:

>Sequence name goes here
MALTNAQILAVIDSWEETVGQFPVITHHVPLGGGLQGTLHCYEIPLAAPYGVGFAKNGPT
RWQYKRTINQVVHRWGSHTVPFLLEPDNINGKTCTASHLCHNTRCHNPLHLCWESLDDNK
GRNWCPGPNGGCVHAVVCLRQGPLYGPGATVAGPQQRGSHFVV

The amino acid sequence should be represented using the standard one-letter amino acid code (upper or lower case), which consists of twenty characters for the twenty amino acid types:

  A  Alanine          M  Methionine
  C  Cysteine         N  Asparagine
  D  Aspartate        P  Proline
  E  Glutamate        Q  Glutamine
  F  Phenylalanine    R  Arginine
  G  Glycine          S  Serine
  H  Histidine        T  Threonine
  I  Isoleucine       V  Valine
  K  Lysine           W  Tryptophan
  L  Leucine          Y  Tyrosine
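
Before submitting a sequence, it can be checked locally against the format rules above. The Python sketch below is a hypothetical helper (read_fasta_record is not part of PR2ALIGN) that reads one FASTA record, folds it to upper case, and enforces the 5,000-residue limit. Rejecting characters outside the twenty-letter code is an ASSUMPTION about how strictly the input is validated.

```python
from typing import Tuple

AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")  # the twenty standard one-letter codes
MAX_LENGTH = 5000                          # maximum sequence length stated above


def read_fasta_record(text: str) -> Tuple[str, str]:
    """Return (name, sequence) for a single FASTA record given as a string."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError('FASTA input must begin with a ">" header line')
    name = lines[0][1:].strip()        # the name after ">" is optional
    seq = "".join(lines[1:]).upper()   # lower case is allowed, so fold to upper
    if not seq:
        raise ValueError("no sequence found after the header line")
    unknown = sorted(set(seq) - AMINO_ACIDS)
    if unknown:                        # ASSUMPTION: non-standard letters are rejected
        raise ValueError("non-standard characters: " + "".join(unknown))
    if len(seq) > MAX_LENGTH:
        raise ValueError("sequence is longer than 5,000 residues")
    return name, seq


name, seq = read_fasta_record(">demo\nmalt\nNAQILAV\n")
print(name, seq)
```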

Amino acid properties to use for the alignment

The user can either use the amino acid properties provided on the input page ("Use amino acid properties listed below" option) or upload a file with amino acid properties ("Upload a file with amino acid properties" option). Hundreds of amino acid property scales can be retrieved from the AAindex database.

Use the amino acid properties listed below

If this option is selected, the web-server will use the four built-in amino acid properties: Hydrophobicity, Size, Coil propensity, and Thiol group. The user has to provide a weight for each property. Each weight must be greater than or equal to 0. The higher the weight for a given property, the more this property affects the final alignment. To exclude a particular property, enter a weight of 0 for it. Please note that the user-supplied weights are normalized so that the sum of all weights is equal to 1.0.

By default, the web-server uses property weights and gap penalties optimized for aligning homologous proteins with sequence identity between 30 and 40%. For other ranges of sequence identity, please refer to the table below:

Sequence   Weight for      Weight for  Weight for   Weight for       Gap initiation  Gap extension
identity   hydrophobicity  size        thiol group  coil propensity  penalty         penalty

0-10%      0.7             0.15        0.05         0.1              0.8             0.2
10-20%     0.3             0.2         0.35         0.15             0.6             0.1
20-30%     0.3             0.2         0.35         0.15             0.7             0.1
30-40%     0.25            0.2         0.4          0.15             0.7             0.1
Above 40%  0.2             0.2         0.35         0.25             0.6             0.1
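
The recommended settings and the weight normalization described above can be expressed as a small Python lookup. This is a sketch, not server code: the names PRESETS, select_preset and normalize_weights are hypothetical. Two points are ASSUMPTIONS: a value on a range boundary (for example exactly 10%) is assigned to the higher range, and an all-zero set of weights is treated as an error.

```python
from typing import Dict, List, Tuple

# (exclusive upper bound of the identity range in %, property weights,
#  gap initiation penalty, gap extension penalty), copied from the table.
PRESETS: List[Tuple[float, Dict[str, float], float, float]] = [
    (10.0, {"hydrophobicity": 0.7, "size": 0.15, "thiol": 0.05, "coil": 0.1}, 0.8, 0.2),
    (20.0, {"hydrophobicity": 0.3, "size": 0.2, "thiol": 0.35, "coil": 0.15}, 0.6, 0.1),
    (30.0, {"hydrophobicity": 0.3, "size": 0.2, "thiol": 0.35, "coil": 0.15}, 0.7, 0.1),
    (40.0, {"hydrophobicity": 0.25, "size": 0.2, "thiol": 0.4, "coil": 0.15}, 0.7, 0.1),
    (float("inf"), {"hydrophobicity": 0.2, "size": 0.2, "thiol": 0.35, "coil": 0.25}, 0.6, 0.1),
]


def select_preset(identity_percent: float) -> Tuple[Dict[str, float], float, float]:
    """Return (weights, gap initiation, gap extension) for a percent identity."""
    if not 0.0 <= identity_percent <= 100.0:
        raise ValueError("percent identity must be between 0 and 100")
    for upper, weights, gap_init, gap_ext in PRESETS:
        if identity_percent < upper:  # ASSUMPTION: boundaries go to the higher range
            return dict(weights), gap_init, gap_ext
    raise AssertionError("unreachable: the last range is open-ended")


def normalize_weights(weights: Dict[str, float]) -> Dict[str, float]:
    """Rescale non-negative weights so that they sum to 1.0."""
    if any(w < 0 for w in weights.values()):
        raise ValueError("weights must be greater than or equal to 0")
    total = sum(weights.values())
    if total <= 0:                    # ASSUMPTION: all-zero weights are an error
        raise ValueError("at least one weight must be positive")
    return {name: w / total for name, w in weights.items()}


print(select_preset(35.0))
print(normalize_weights({"hydrophobicity": 2.0, "size": 6.0}))
```

For example, user-supplied weights of 2 and 6 become 0.25 and 0.75 after normalization, so only the ratio between weights matters.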

Autoselect weights and gap penalties

If this option is checked, the web-server will attempt to estimate the expected percentage of sequence identity and will automatically select the property weights and gap penalties based on this estimate. This option works only for the four built-in amino acid properties listed above. The expected percentage of sequence identity is estimated by aligning the input sequences using a conventional pairwise sequence alignment with the VTML200 amino acid similarity matrix, a gap initiation penalty of -15 and a gap extension penalty of -1. The user should be aware that this is only a rough estimate, which may differ from the percentage sequence identity displayed in the PR2ALIGN output.
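
To make the notion of percentage sequence identity concrete, the Python sketch below computes it from two already aligned, gapped strings. It does not perform the VTML200 alignment itself. The text does not say which denominator the web-server uses, so dividing by the number of columns in which both sequences have a residue is an ASSUMPTION, and percent_identity is a hypothetical name.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent identity over columns where both sequences have a residue.

    ASSUMPTION: gapped columns are excluded from the denominator; other
    conventions (e.g. dividing by the alignment length) give other values.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    pairs = [(a, b)
             for a, b in zip(aligned_a.upper(), aligned_b.upper())
             if a != gap and b != gap]
    if not pairs:
        return 0.0
    identical = sum(1 for a, b in pairs if a == b)
    return 100.0 * identical / len(pairs)


# Four residue-residue columns, three of them identical.
print(percent_identity("AC-DE", "ACWDF"))
```

Because the result depends on the alignment and on the chosen denominator, an estimate obtained this way can differ from the identity reported for the final alignment.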

Upload a file with amino acid properties

This option allows the user to upload a file with one or more amino acid properties (an example file is available for download). The format of the properties file is described below; user-supplied files must follow the same format. The "LINE n:" labels in the example are for convenience only and are not part of the file. The file must begin with a header line (line 1 in the example below). The second line must contain the 20 standard amino acid letters in the specified order (line 2 in the example below). For each property, the file must contain a line that begins with "#PROPERTY " followed by the property name (lines 3 and 6 in the example below). The line after the property name must contain 20 comma-delimited numbers quantifying this property for each individual amino acid (lines 4 and 7 in the example below). These numbers must be in the same order as the 20 amino acid letters listed in line 2. For instance, in the example below the first hydrophobicity number of 0.250 corresponds to A, the second hydrophobicity number of -1.760 corresponds to R, etc. The line that begins with "W:" after each property gives the weight assigned to this property (lines 5 and 8 in the example below). For instance, in the example below "Hydrophobicity" has a weight of 0.6 and "Size" has a weight of 0.4.

LINE 1: A header line goes here
LINE 2: A,R,N,D,C,Q,E,G,H,I,L,K,M,F,P,S,T,W,Y,V
LINE 3: #PROPERTY Hydrophobicity
LINE 4: 0.250,-1.760,-0.640,-0.720,0.040,-0.690,-0.620,0.160,-0.400,0.730,0.530,-1.100,0.260,0.610,-0.070,-0.260,-0.180,0.370,0.020,0.540
LINE 5: W:0.6
LINE 6: #PROPERTY Size
LINE 7: 27.500,105.000,58.700,40.000,44.600,80.700,62.000,0.000,79.000,93.500,93.500,100.000,94.100,115.500,41.900,29.300,51.300,145.500,117.300,71.500
LINE 8: W:0.4
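
The file layout described above can be checked with a short Python parser. This is a sketch under stated assumptions rather than the parser used by the web-server: parse_properties is a hypothetical name, blank lines are skipped, and any deviation from the three-lines-per-property layout is treated as an error. The EXAMPLE string is the example file shown above without the "LINE n:" labels.

```python
from typing import Dict, Tuple

EXAMPLE = """A header line goes here
A,R,N,D,C,Q,E,G,H,I,L,K,M,F,P,S,T,W,Y,V
#PROPERTY Hydrophobicity
0.250,-1.760,-0.640,-0.720,0.040,-0.690,-0.620,0.160,-0.400,0.730,0.530,-1.100,0.260,0.610,-0.070,-0.260,-0.180,0.370,0.020,0.540
W:0.6
#PROPERTY Size
27.500,105.000,58.700,40.000,44.600,80.700,62.000,0.000,79.000,93.500,93.500,100.000,94.100,115.500,41.900,29.300,51.300,145.500,117.300,71.500
W:0.4
"""


def parse_properties(text: str) -> Tuple[Dict[str, Dict[str, float]], Dict[str, float]]:
    """Return ({property: {letter: value}}, {property: weight})."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if len(lines) < 5:
        raise ValueError("need a header, the amino acid line and one property")
    letters = [c.strip().upper() for c in lines[1].split(",")]  # line 1 is the header
    if len(letters) != 20 or len(set(letters)) != 20:
        raise ValueError("line 2 must list 20 distinct amino acid letters")
    body = lines[2:]
    if len(body) % 3 != 0:
        raise ValueError("each property needs a name, a value and a weight line")
    values: Dict[str, Dict[str, float]] = {}
    weights: Dict[str, float] = {}
    for k in range(0, len(body), 3):
        head, nums, wline = body[k:k + 3]
        if not head.startswith("#PROPERTY ") or not wline.startswith("W:"):
            raise ValueError("malformed property block: " + head)
        name = head[len("#PROPERTY "):].strip()
        scale = [float(v) for v in nums.split(",")]
        if len(scale) != 20:
            raise ValueError(name + ": expected 20 comma-delimited numbers")
        values[name] = dict(zip(letters, scale))  # same order as line 2
        weights[name] = float(wline[2:])
    return values, weights


values, weights = parse_properties(EXAMPLE)
print(values["Hydrophobicity"]["R"], weights)
```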

 

Other alignment parameters

Normalize amino acid properties

If this option is enabled, each amino acid property will be normalized to be in the range [0,1]. It is strongly recommended that this option be enabled.
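
A simple way to bring a property into the range [0,1] is min-max scaling, sketched below in Python. The text does not state which normalization the web-server applies, so min-max scaling is an ASSUMPTION, and normalize_property is a hypothetical name. The size values are taken from the example properties file above.

```python
from typing import List, Sequence


def normalize_property(values: Sequence[float]) -> List[float]:
    """Min-max scale the values to [0, 1] (ASSUMED normalization scheme)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("a constant property cannot be rescaled")
    return [(v - lo) / (hi - lo) for v in values]


# "Size" values from the example properties file (order A,R,N,D,C,Q,E,G,...).
size = [27.5, 105.0, 58.7, 40.0, 44.6, 80.7, 62.0, 0.0, 79.0, 93.5,
        93.5, 100.0, 94.1, 115.5, 41.9, 29.3, 51.3, 145.5, 117.3, 71.5]
scaled = normalize_property(size)
print(min(scaled), max(scaled))
```

In the example file, size spans 0 to 145.5 while hydrophobicity spans only -1.76 to 0.73. Without a common scale, the property with the larger numeric range would likely dominate the distance regardless of the weights, which is a plausible reason for the recommendation.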

Gap initiation penalty and Gap extension penalty

The web-server uses an affine gap penalty of the form g(r) = A + B*(r-1), where A is the gap initiation penalty, B is the gap extension penalty, and r is the gap length. Both A and B must be greater than or equal to 0.
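
As a worked example of the formula above, the following Python sketch evaluates g(r) with the default penalties for 30-40% sequence identity (A = 0.7, B = 0.1). The function name gap_penalty is hypothetical.

```python
def gap_penalty(r: int, A: float, B: float) -> float:
    """Affine gap penalty g(r) = A + B*(r-1) for a gap of length r >= 1."""
    if r < 1:
        raise ValueError("gap length must be at least 1")
    if A < 0 or B < 0:
        raise ValueError("A and B must be greater than or equal to 0")
    return A + B * (r - 1)


for r in (1, 2, 3):
    print(r, round(gap_penalty(r, A=0.7, B=0.1), 3))
```

A gap of length 1 costs A = 0.7, and each additional gapped position adds B = 0.1, so a gap of length 3 costs 0.7 + 2*0.1 = 0.9. One long gap is therefore cheaper than several short gaps covering the same number of positions.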

 

Retrieval of the results

We recommend that the user enter an e-mail address so that the web-server can automatically e-mail the results. Alternatively, the user may select the second option to get a temporary URL link to the submitted job and wait until the self-reloading page returns the alignment. This temporary link can also be bookmarked and checked later. The results will be kept on the web-server for 24 hours from the moment of submission and deleted afterwards.

Wait time: Since the web-server executes only one process at a given moment, all submissions are placed in a job queue. The wait time for a given submission will be affected by the number of jobs submitted earlier and waiting in the job queue. If there are no prior submissions in the job queue, the estimated wait time is up to 1 minute. A large number of jobs in the queue will result in a longer wait time.

An example of a PR2ALIGN output page

Source code download

The source code of the standalone alignment program, written in C++, is available for download.

Citation:

I.B. Kuznetsov and M. McDuffie, 2015, PR2ALIGN: a stand-alone software program and a web-server for protein sequence alignment using weighted biochemical properties of amino acids. BMC Research Notes, 8:187.


Please address your questions and comments to Igor Kuznetsov
Web-server design: Michael McDuffie