Specification of SAF (Simple Alignment Format)
Compulsory features:
Each ROW
Each row contains two columns, separated by blanks or tabs.
- FIRST COLUMN:
name (protein identifier, shorter than 14 characters; names should be unique, and blanks in names are not accepted)
- SECOND COLUMN:
one-letter sequence (any number of characters). Insertions are written as dots (.), hyphens (-), or additional blanks.
Note: blanks in sequences are ignored; the first residue (one-letter code) or the first insertion character (.) is taken as the beginning of the sequence line.
- EXAMPLE:
t2_11751 GGAPTLPETL NVAGGAPTLP ETLNVAGGAP TLPETLNV
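The row rules above can be illustrated with a short Python sketch. It is not the parser used by any particular program; the function name parse_row is hypothetical, and rejecting names of 14 or more characters is one possible reading of "shorter than 14 characters".

```python
def parse_row(line):
    """Split one SAF row into (name, sequence). Hypothetical helper."""
    parts = line.split()              # splits on any run of blanks or tabs
    if len(parts) < 2:
        raise ValueError("a SAF row needs a name and a sequence")
    name = parts[0]
    if len(name) >= 14:               # assumption: enforce the name length limit
        raise ValueError("name must be shorter than 14 characters: " + name)
    sequence = "".join(parts[1:])     # blanks inside the sequence are ignored
    return name, sequence


name, sequence = parse_row("t2_11751 GGAPTLPETL NVAGGAPTLP ETLNVAGGAP TLPETLNV")
print(name, sequence)
```

Because blanks are dropped, grouping residues in tens (as in the example row) is purely cosmetic and does not change the sequence that is read.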
Each BLOCK
Blocks contain the sequences of the proteins aligned to the guide sequence (first sequence) between residues N1 and N2.
- FIRST ROW:
The first row must be the guide sequence.
- SECOND - N-th ROW:
The sequences aligned in the region between residues N1 and N2 of the guide sequence. Note:
- the order of sequences may differ between blocks
- NOT all of the sequences 2-N have to occur in each block
- C-terminal insertions may be left out for sequences 2-N (but NOT for the guide sequence)
- BUT: whenever a sequence is present, it should have dots for insertions, rather than blanks, at the N-terminus of each row
- COMMENTS:
- rows beginning with a hash (#) will be ignored
- rows containing only blanks, dots, and numbers will also be ignored (in particular, you may insert lines with residue numbers)
- EXAMPLE:
t2_11751 GGAPTLPETL NVAGGAPTLP ETLNVAGGAP TLPETLNV
name_22 .......... .......... .......... ........
name_1 GGAPTLPETL NVAGGAPTLP ETLNVAGGAP TLPETLNV
name_2 .......... NVAGGAPTLP
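As a rough illustration of the comment rules, the following Python sketch decides whether a row would be skipped. The helper name is_ignored is hypothetical, and allowing leading blanks before the hash is an assumption that the text above does not state.

```python
import re

# A row made up only of blanks, dots, and digits (e.g. a residue-number ruler).
_ONLY_BLANKS_DOTS_DIGITS = re.compile(r"^[\s.\d]*$")


def is_ignored(line):
    """Return True for rows that carry no alignment data. Hypothetical helper."""
    if line.lstrip().startswith("#"):     # assumption: leading blanks are tolerated
        return True
    return bool(_ONLY_BLANKS_DOTS_DIGITS.match(line))


lines = [
    "# guide sequence first",
    "         1          11",
    "t2_11751 GGAPTLPETL NVAGGAPTLP",
    "name_22  .......... ..........",
]
kept = [line for line in lines if not is_ignored(line)]
print(kept)
```

Note that a row such as the name_22 row is kept even though its sequence consists only of dots, because the name contains letters.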
WARNING
The 'freedom' provided by SAF has various consequences:
- identical names in different rows of the same block are not reported as an error. Instead, the second (third, ...) sequence with that name is ignored. For example:
t2_11751 EFQEDQENVN
name-1 ...EDQENvk
name-1 GGAPTLPETL
will be interpreted as:
t2_11751 EFQEDQENVN
name-1 ...EDQENvk
whereas:
t2_11751 EFQEDQENVN
name-1 ...EDQENvk
name_1 GGAPTLPETL
is interpreted as 3 different sequences (note the use of a hyphen (-) in the name of the second protein and of an underscore (_) in the name of the third).
- spelling mistakes in the names will not be detected
- missing lines will not be detected either. For example, you may intend to give:
t2_11751 GGAPTLPETL
name_1 DEAPTLPETL
t2_11751 NVAGGAPTLP
name_1 SEGTTAPTLS
t2_11751 ETLNVAGGAP
name_1 E...VAGGAP
but you may actually have given:
t2_11751 GGAPTLPETL
name_1 DEAPTLPETL
t2_11751 NVAGGAPTLP
t2_11751 ETLNVAGGAP
name_1 E...VAGGAP
In that case PP would use the following alignment:
t2_11751 GGAPTLPETL ETLNVAGGAP
name_1 DEAPTLPETL SEGTTAPTLS
instead of
t2_11751 GGAPTLPETL NVAGGAPTLP ETLNVAGGAP
name_1 DEAPTLPETL SEGTTAPTLS E...VAGGAP
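Since SAF itself does not flag these problems, it can help to check a file before submitting it. The Python sketch below shows one way to do that. It is not how PP reads SAF; the names group_blocks and missing_per_block are hypothetical, and it assumes that a new block starts every time the guide name (the name of the very first row) appears again.

```python
def group_blocks(rows):
    """Group (name, sequence) rows into blocks; later duplicates are dropped."""
    blocks = []
    guide = None
    for name, sequence in rows:
        if guide is None:
            guide = name              # the first row is the guide sequence
        if name == guide:
            blocks.append({})         # assumption: a guide row opens a new block
        block = blocks[-1]
        if name not in block:         # a repeated name in the same block is ignored
            block[name] = sequence
    return blocks


def missing_per_block(blocks):
    """For each block, list the names seen in the file but absent from that block."""
    seen = []
    for block in blocks:
        for name in block:
            if name not in seen:
                seen.append(name)
    return [[name for name in seen if name not in block] for block in blocks]


rows = [
    ("t2_11751", "GGAPTLPETL"), ("name_1", "DEAPTLPETL"),
    ("t2_11751", "NVAGGAPTLP"),
    ("t2_11751", "ETLNVAGGAP"), ("name_1", "E...VAGGAP"),
]
for number, missing in enumerate(missing_per_block(group_blocks(rows)), start=1):
    if missing:
        print(f"block {number}: no row for {', '.join(missing)}")
# prints: block 2: no row for name_1
```

A missing row is only reported, not rejected, because the format explicitly allows a sequence to be absent from a block. Misspelled names would show up in the same report as names that occur in only one block.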