PP Help 09: Hints

HINTS FOR USERS

What can you expect from secondary structure prediction?

How accurate are the predictions? The expected levels of accuracy (PHDsec = 72±11%, three-state per-residue accuracy; PHDacc = 75±7%, two-state per-residue accuracy; PHDhtm = 94±6%, two-state per-residue accuracy) are valid for typical globular, water-soluble proteins (PHDsec, PHDacc) or helical transmembrane proteins (PHDhtm) when the multiple alignment contains many and diverse sequences. High values for the reliability indices indicate more accurate predictions. (Note: for alignments with little variation in the sequences, the reliability indices adopt misleadingly high values.) PHDsec predictions tend to be relatively accurate for porins; however, for helical membrane proteins other programs ought to be used.

Confusion between strand and helix? PHD (as well as other methods) focuses on predicting hydrogen bonds. Consequently, helices predicted strongly (high reliability index) are occasionally observed as strands, and vice versa (see the expected accuracy of PHDsec).

Strong signal from secondary structure caps? The ends of helices and strands contain a strong signal. However, on average PHD predicts the core of helices and strands more accurately than the caps (B. Rost and C. Sander, 1D secondary structure prediction through evolutionary profiles, in: H. Bohr and S. Brunak (eds.), Protein Structure by Distance Analysis, Amsterdam: IOS Press, 257-276 (1994)). This also seems to hold for other methods (Garnier, priv. comm.).

Are internal helices predicted poorly? Steven Benner has indicated that internal, buried helices are particularly difficult to predict. On average, this is not the case for PHD predictions (see the expected accuracy of PHDsec for buried helices).

Accessibility useful to provide upper limits for contacts? The predicted solvent accessibility (PHDacc) can be translated into a prediction of the number of water molecules around a given residue. Consequently, PHDacc can be used to derive upper and lower limits for the number of inter-residue contacts of a given residue (such an estimate could improve predictions of inter-residue contacts).

How to predict porins? PHDhtm predicts only transmembrane helices, and PHDsec has been trained on globular, water-soluble proteins. How, then, to predict 1D structure for porins? As porins are partly accessible to solvent, the prediction accuracy of PHDsec was relatively high (70%) for the known structures. Thus, PHDsec appears to be applicable.

How to use the prediction of transmembrane helices? One possible application of PHDhtm is to scan, e.g., entire chromosomes for possible transmembrane proteins. Classifying a protein as transmembrane does not reveal its function, but may shed some light on the puzzle of genome analysis. When using PHDhtm for this purpose, keep in mind that on average about 5% of the globular proteins are falsely predicted to have transmembrane helices.

What about protein design and synthesised peptides? The PHD networks are trained on naturally evolved proteins. However, the predictions have proven useful in some cases for investigating the influence of single mutations (e.g. for Chameleon, or for Janus; Rost, unpublished). For short polypeptides, the following should be taken into account: the network input consists of 17 adjacent residues; thus, shorter sequences may be dominated by the ends (which are treated as solvent).
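
To illustrate why short peptides are dominated by their ends, the following minimal sketch computes, for each residue, the fraction of a 17-residue input window that falls outside the sequence. It assumes a single window centred on the residue being predicted, which is a simplification; the function names are hypothetical and not part of PHD.

```python
WINDOW = 17  # input window size stated in the text


def padding_fraction(seq_len: int, window: int = WINDOW) -> list[float]:
    """Per residue: fraction of a centred window lying beyond the chain ends.

    Assumption (not from the PHD documentation): the window is symmetric
    around the residue, and positions beyond the termini count as solvent.
    """
    half = window // 2
    fractions = []
    for i in range(seq_len):
        left_pad = max(0, half - i)                    # slots before the N-terminus
        right_pad = max(0, i + half - (seq_len - 1))   # slots after the C-terminus
        fractions.append((left_pad + right_pad) / window)
    return fractions


def mean_padding(seq_len: int, window: int = WINDOW) -> float:
    """Average share of window slots filled with solvent over the whole chain."""
    fractions = padding_fraction(seq_len, window)
    return sum(fractions) / len(fractions)


if __name__ == "__main__":
    for length in (5, 20, 200):
        print(length, round(mean_padding(length), 3))
```

Under these assumptions, every window of a 5-residue peptide consists mostly of solvent slots, whereas for a chain of 200 residues only the windows near the two termini are affected.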

In a nutshell: how to avoid pitfalls?

70% correct implies 30% incorrect. The most accurate methods for predicting secondary structure reach sustained levels of about 70% accuracy. When interpreting predictions for a particular protein, it is often instructive to mark the 30% of the residues you suspect to be falsely predicted.

Spread of prediction accuracy. An expected accuracy of 70% does NOT imply that 70% of all residues are correctly predicted for your protein U. Instead, published values for prediction accuracy are averaged over hundreds of unique proteins. An expected accuracy of 70±10% (one standard deviation) implies that, on average, for two thirds of all proteins between 60% and 80% of the residues will be predicted correctly (see the expected accuracy of PHDsec). Thus, prediction accuracy can be higher than 80% or lower than 60% for your protein. Few methods supply well-tested indices for the reliability of predictions. Such indices can help to reduce or increase your trust in a particular prediction.

Special classes of proteins. Prediction methods are usually derived from knowledge contained in subsets of proteins from databases. Consequently, they should not be applied to classes of proteins that were not included in those subsets. For example, methods for predicting helices in globular proteins are likely to fail when applied to transmembrane helices. In general, results should be taken with caution for proteins with unusual features, such as proline-rich regions or unusually many cysteine bonds, and for domain interfaces.

Better alignments yield better predictions. Multiple alignment-based predictions are substantially more accurate than single sequence-based predictions. How many sequences do you need in your alignment to expect an improvement, and how sensitive are prediction methods to errors in the alignment? The more divergent sequences the alignment contains, the better (two distantly related sequences often improve secondary structure predictions by several percentage points). Regions with few aligned sequences yield less reliable predictions. The sensitivity to alignment errors depends on the method; e.g., secondary structure prediction is less sensitive to alignment errors than accessibility prediction.

Better + worse = even better? Today, several automatic services provide secondary structure predictions. Some users fall into the what-is-common-is-correct trap, i.e., they average over all prediction methods and consider regions predicted identically as more reliable. Exceptionally, such a majority vote may be beneficial. Frequently, however, the result will be the worst-of-all prediction. Often, it is preferable to use the reliability indices provided by some methods. Such indices answer questions like: how reliably is the tryptophan at position 307 predicted in a surface loop? (Note: the correlation between such indices and prediction accuracy has been sufficiently tested for only a few methods.)

1D structure may or may not be sufficient to infer 3D structure. Say you obtain the following prediction for regular secondary structure: helix-strand-strand-helix-strand-strand (H-E-E-H-E-E). Assume you find a protein of known structure with the same motif (H-E-E-H-E-E). Can you conclude that the two proteins have the same fold? Yes and no: your guess may be correct, but there are various ways to realise the given motif by completely different structures. For example, the secondary structure motif 'H-E-E-H-E-E' is contained in at least 16 structurally unrelated proteins.
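
As a small illustration of how much information a segment motif discards, the sketch below collapses a per-residue prediction string into a motif such as H-E-E-H-E-E. The per-residue alphabet (H = helix, E = strand, anything else = loop) and the function name are assumptions made for this example, not a description of the PHD output format.

```python
from itertools import groupby


def segment_motif(states: str, min_len: int = 1) -> str:
    """Collapse a per-residue string into a motif of helix/strand segments.

    Assumption: 'H' marks helix, 'E' marks strand, and every other symbol
    is treated as loop. Segments shorter than min_len are ignored.
    """
    segments = []
    for state, run in groupby(states):
        length = sum(1 for _ in run)
        if state in ("H", "E") and length >= min_len:
            segments.append(state)
    return "-".join(segments)


if __name__ == "__main__":
    # Two made-up predictions with different segment lengths and loop sizes
    a = "LLHHHHLLEEEELLEEEELLHHHHLEEELLEEE"
    b = "HHHHHHHHHHHHLEEELLLLLLLEEEEEEELHHHLLEEEEELEEEL"
    print(segment_motif(a))  # H-E-E-H-E-E
    print(segment_motif(a) == segment_motif(b))
```

Both strings reduce to the same motif although their segment lengths and loop sizes differ, which is one reason why an identical motif alone does not establish an identical fold.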

Nuts and bolts: what to keep in mind?

Information content in multiple sequence alignment

If the multiple sequence alignment contains only a few proteins very similar to the one you sent (pairwise sequence identity > 90%), the expected accuracy for 1D structure predictions (secondary structure, accessibility, transmembrane helices) drops significantly. Note: this implies a reduction of the expected accuracy for threading. The scores for expected accuracy (PHDsec, PHDacc, PHDhtm) are valid for typical alignments as found in the HSSP database. The information content of the alignment is difficult to measure. Two important parameters are:
NOTE HOWEVER:

Cut-off for including homologues in alignment

In the multiple sequence alignment returned to you, only homologues down to levels of 30% pairwise sequence identity over 80 or more residues are included. This cut-off is five percentage points above the threshold for structural homology (Sander & Schneider, 1990), in an attempt to stay clearly out of the twilight zone of sequence similarity and to provide high-quality multiple alignments in an automated fashion.
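
The cut-off described above can be sketched as a simple filter over alignment hits. This is an illustrative sketch only: the `Hit` record and its field names are hypothetical, and the treatment of alignments shorter than 80 residues (here simply rejected) is an assumption, since the text does not say how they are handled.

```python
from dataclasses import dataclass

MIN_IDENTITY = 30.0  # percent pairwise sequence identity (from the text)
MIN_ALIGNED = 80     # aligned residues (from the text)


@dataclass
class Hit:
    """Hypothetical record for one aligned homologue."""
    name: str
    identity: float      # percent pairwise sequence identity
    aligned_length: int  # number of aligned residues


def passes_cutoff(hit: Hit) -> bool:
    """Keep hits with at least 30% identity over 80 or more residues.

    Assumption: shorter alignments are rejected outright.
    """
    return hit.identity >= MIN_IDENTITY and hit.aligned_length >= MIN_ALIGNED


def filter_hits(hits: list[Hit]) -> list[Hit]:
    """Return only the hits that satisfy the inclusion cut-off."""
    return [hit for hit in hits if passes_cutoff(hit)]


if __name__ == "__main__":
    # Made-up example hits
    hits = [
        Hit("close_homologue", 62.0, 210),
        Hit("twilight_zone", 24.0, 180),
        Hit("short_match", 45.0, 40),
    ]
    print([hit.name for hit in filter_hits(hits)])
```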

Quality of multiple sequence alignment

On average, more residues are falsely aligned at lower levels of pairwise sequence identity. Down to levels of about 30%, the automatic MaxHom alignments are usually quite accurate. However, for many families there are regions for which the 'correct' alignment is, in principle, not well defined. One way to spot such regions is to check the stability of the alignment with respect to including or excluding some of the aligned sequences. By providing different lists of sequences (input option 'PIR list') you can monitor the stability of the alignment. Such regions often form surface loops, and predictions may be less accurate there.

Minimal length of sequences

The PHD programs treat the N- and C-terminal ends of proteins as solvent molecules. The size of the input window for predicting 1D structure is up to 17 residues. Thus, the first and the last 17 residues of your sequence will 'see solvent'. Especially for short fragments that you cut out of large proteins, this may result in false predictions.

Insertions in multiple sequence alignment
'Untypical' proteins
Prediction of transmembrane helices (HTMs) and topology

Reliability indices for PHD predictions

The reliability indices of the PHD methods correlate well with prediction accuracy. In other words, residues predicted with high reliability (0 = low, 9 = high) are more likely to be predicted correctly. However, when basing the prediction on single sequences (rather than multiple alignments), the scale has to be shifted. For instance, values of RI > 4 usually imply an expected accuracy of > 80% for PHDsec. When using a single sequence as input, the same level of accuracy is reached only for residues predicted at RI > 7.
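
The shifted scale can be sketched as a threshold that depends on the kind of input. The following minimal example masks residues whose reliability index does not exceed the threshold quoted above (RI > 4 for alignments, RI > 7 for single sequences). The string-of-digits representation of the indices and all names are assumptions made for illustration.

```python
# Thresholds quoted in the text for an expected PHDsec accuracy of > 80%
RI_THRESHOLD = {"alignment": 4, "single_sequence": 7}


def reliable_positions(ri: str, input_kind: str = "alignment") -> list[int]:
    """Return 0-based positions whose reliability index exceeds the threshold.

    Assumption: `ri` holds one digit (0 = low, 9 = high) per residue.
    """
    threshold = RI_THRESHOLD[input_kind]
    return [i for i, digit in enumerate(ri) if int(digit) > threshold]


def mask_unreliable(prediction: str, ri: str, input_kind: str = "alignment") -> str:
    """Replace predictions at or below the threshold with '.'."""
    if len(prediction) != len(ri):
        raise ValueError("prediction and reliability strings differ in length")
    keep = set(reliable_positions(ri, input_kind))
    return "".join(s if i in keep else "." for i, s in enumerate(prediction))


if __name__ == "__main__":
    # Made-up prediction and reliability strings
    prediction = "HHHHHLEEEL"
    ri = "0358947612"
    print(mask_unreliable(prediction, ri, "alignment"))
    print(mask_unreliable(prediction, ri, "single_sequence"))
```

With the same reliability string, fewer residues survive the mask when the prediction is based on a single sequence, which reflects the stricter threshold.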

Combination of results with that of other methods

A combination of two prediction methods is likely to improve the accuracy only if the following points are met:

Homologue of known structure

Ab initio prediction (e.g. by PHD) is, in general, less accurate than homology modelling. Thus, if a protein of known structure has > 25% pairwise sequence identity to your sequence, you ought to make use of the known structure by homology modelling.

Prediction-based threading

Copyright © 2008 Burkhard Rost, CUBIC all rights reserved.