PredictProtein - [Output Example]

Specification of SAF (Simple Alignment Format)

Examples for SAF alignments



Compulsory features:

Each ROW

Each row contains two columns, separated by blanks or tabs.
  • FIRST COLUMN:
    name (protein identifier, shorter than 14 characters; names should be unique; blanks in names are not accepted)
  • SECOND COLUMN:
    one-letter sequence (any number of characters). Insertions are written as dots (.), hyphens (-), or additional blanks.
    Note: blanks in sequences are ignored; the first residue (one-letter code) or the first insertion (.) is taken as the beginning of the sequence line.
  • EXAMPLE:
    t2_11751  GGAPTLPETL NVAGGAPTLP ETLNVAGGAP TLPETLNV
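
The row rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the parser PredictProtein uses: the function name `parse_saf_row` and the constant `MAX_NAME_LEN` are hypothetical, and the sketch simply drops blanks from the sequence column while keeping dots and hyphens as insertion characters.

```python
import re

MAX_NAME_LEN = 13  # assumption: "shorter than 14 characters" means at most 13


def parse_saf_row(line):
    """Split one SAF row into (name, sequence). Hypothetical helper."""
    # The first run of blanks or tabs separates the two columns.
    parts = line.split(None, 1)
    if len(parts) != 2:
        raise ValueError(f"expected a name and a sequence: {line!r}")
    name, raw = parts
    if len(name) > MAX_NAME_LEN:
        raise ValueError(f"name longer than {MAX_NAME_LEN} characters: {name!r}")
    # Blanks inside the sequence are ignored; dots and hyphens are kept.
    sequence = re.sub(r"\s+", "", raw)
    return name, sequence


name, sequence = parse_saf_row("t2_11751  GGAPTLPETL NVAGGAPTLP ETLNVAGGAP TLPETLNV")
print(name, len(sequence))
```

Grouping residues in tens, as in the example row, is purely cosmetic under this reading: the blanks vanish before the sequence is stored.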
    


Each BLOCK

Blocks contain the sequences of the proteins aligned to the guide sequence (first sequence) between residues N1 and N2.
  • FIRST ROW:
    The first row must be the guide sequence.
  • SECOND - N-th ROW:
    The sequences aligned in the region between residues N1 and N2 of the guide sequence. Note:
    • the order of sequences may differ between blocks
    • NOT all of the sequences 2-N have to occur in each block
    • C-terminal insertions may be left out for sequences 2-N (but NOT for the guide sequence)
    • BUT: whenever a sequence is present, it should have dots for insertions rather than blanks at the N-term of each row
  • COMMENTS:
    • rows beginning with a hash (#) will be ignored
    • rows containing only blanks, dots, numbers will also be ignored (in particular you may insert lines with residue numbers)
  • EXAMPLE:
    t2_11751 GGAPTLPETL NVAGGAPTLP ETLNVAGGAP TLPETLNV
    name_22  .......... .......... .......... ........
    name_1   GGAPTLPETL NVAGGAPTLP ETLNVAGGAP TLPETLNV
    name_2   .......... NVAGGAPTLP 
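
As a rough sketch of how the block rules could be applied, the Python below skips comment and numbering rows, splits the input into blocks, and concatenates each sequence across blocks. It rests on assumptions that the specification does not state: a new block is taken to start whenever the guide name (the first name in the file) reappears, and sequences that are absent from a block or shorter than the guide row are padded with dots. The names `is_ignored` and `read_saf` are hypothetical.

```python
import re


def is_ignored(line):
    """True for '#' comment rows and rows with only blanks, dots, and digits."""
    stripped = line.strip()
    return stripped.startswith("#") or re.fullmatch(r"[\s.\d]*", stripped) is not None


def read_saf(text):
    """Return {name: aligned sequence}, guide sequence first. Hypothetical helper."""
    blocks, guide = [], None
    for line in text.splitlines():
        if is_ignored(line):
            continue
        parts = line.split(None, 1)
        if len(parts) != 2:
            raise ValueError(f"row without a sequence: {line!r}")
        name, sequence = parts[0], "".join(parts[1].split())
        if guide is None:
            guide = name              # the first row is the guide sequence
        if name == guide:
            blocks.append({})         # assumption: a guide row opens a new block
        # Within a block, only the first row for a given name is kept.
        blocks[-1].setdefault(name, sequence)

    alignment = {}
    for block in blocks:
        for name in block:
            alignment.setdefault(name, "")
    for block in blocks:
        width = len(block[guide])
        for name in alignment:
            # Pad absent or C-terminally truncated rows to the guide width.
            alignment[name] += block.get(name, "").ljust(width, ".")
    return alignment
```

Calling `read_saf` on the example block above would give every sequence the same length as the guide row, with `name_2` padded by dots after `NVAGGAPTLP`.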
    


WARNING

The 'freedom' provided by SAF has various consequences:
  • identical names in different rows of the same block are not reported. Instead, whenever this happens, the second (third, ...) sequence with that name is ignored. For example:
    t2_11751 EFQEDQENVN 
    name-1   ...EDQENvk
    name-1   GGAPTLPETL
    will be interpreted as:
    t2_11751 EFQEDQENVN 
    name-1   ...EDQENvk
    whereas:
    t2_11751 EFQEDQENVN 
    name-1   ...EDQENvk
    name_1   GGAPTLPETL
    is interpreted as 3 different sequences (note the use of a hyphen (-) in the name of the second protein, and that of an underscore (_) in the name of the third).
  • spelling mistakes in the names will not be detected
  • missing lines will also not be detected, e.g., you may want to give:
    t2_11751 GGAPTLPETL 
    name_1   DEAPTLPETL 
    t2_11751 NVAGGAPTLP 
    name_1   SEGTTAPTLS 
    t2_11751 ETLNVAGGAP
    name_1   E...VAGGAP
    and you may actually have given
    t2_11751 GGAPTLPETL 
    name_1   DEAPTLPETL 
    t2_11751 NVAGGAPTLP 
    t2_11751 ETLNVAGGAP
    name_1   E...VAGGAP
    then PP would use the following alignment
    t2_11751 GGAPTLPETL ETLNVAGGAP
    name_1   DEAPTLPETL SEGTTAPTLS 
    instead of
    t2_11751 GGAPTLPETL NVAGGAPTLP ETLNVAGGAP
    name_1   DEAPTLPETL SEGTTAPTLS E...VAGGAP
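
Because none of these problems is reported, it can help to check a file before submitting it. The following Python sketch flags the three cases listed above. It is an illustration only: `lint_saf` is a hypothetical name, blocks are assumed to start whenever the guide name (first name in the file) reappears, and a sequence missing from a block is reported as a hint rather than an error, since the format allows it.

```python
import re
from collections import Counter


def lint_saf(text):
    """Return warnings about rows SAF would silently drop or misread."""
    blocks, guide = [], None
    for line in text.splitlines():
        stripped = line.strip()
        # Skip comment rows and rows with only blanks, dots, and digits.
        if stripped.startswith("#") or re.fullmatch(r"[\s.\d]*", stripped):
            continue
        name = stripped.split(None, 1)[0]
        if guide is None:
            guide = name
        if name == guide:
            blocks.append([])  # assumption: a guide row opens a new block
        blocks[-1].append(name)

    warnings = []
    all_names = {name for block in blocks for name in block}
    for number, block in enumerate(blocks, start=1):
        for name, count in Counter(block).items():
            if count > 1:
                warnings.append(
                    f"block {number}: {name!r} occurs {count} times; "
                    "only the first row is used"
                )
        for name in sorted(all_names - set(block)):
            warnings.append(
                f"block {number}: {name!r} is absent (allowed, but check for a missing line)"
            )

    # Names differing only in '-' versus '_' count as different sequences.
    groups = {}
    for name in sorted(all_names):
        groups.setdefault(name.replace("-", "_"), []).append(name)
    for group in groups.values():
        if len(group) > 1:
            warnings.append("similar names treated as different sequences: " + ", ".join(group))
    return warnings
```

Run on the five-row input with the missing `name_1` line, this sketch would report that `name_1` is absent from the second block, which is the hint needed to spot the omission.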

Copyright © 2008 Burkhard Rost, CUBIC. All rights reserved.