SNAP.pl (Synonymous Nonsynonymous Analysis Program)

Written by: Bette Korber, T10 MS K710, LANL, Los Alamos, NM 87545
(505) 665-4453, btk@t10.lanl.gov

You should be familiar with the paper by Nei and Gojobori, Mol Biol Evolution, 3:418-426 (1986) before using this program.

This code is written in perl and has been run on UNIX operating systems (SUN and LINUX). It should also work on DOS and MAC)

USAGE:

SNAP.pl filename

INPUT:

Sequences should be provided with codons aligned, in caps, and in frame.

They should be provided in table format:


Seq1	ACTGCCTTTGGC...
Seq2    ACTGCCTATGGG... 

The first field is the sequence name, the second field the sequence, then return to a new line for the second sequence.

Single or double insertions will throw the sequence out of frame, and care must be taken to keep codons intact.

A single insertion could be treated in the following way to keep the codons ACT and GGC intact:


ACTTGCC ==> ACTT--GCC
ACTGGC  ==> ACT---GCC

The letter N should be used to represent ambiguous bases. A dash, '-' for insertions, Only A C G T N - are allowed.

TO DO:

Incorporate a better correction than Jukes-Cantor into the algorithm.

OUTPUT:

This program calculates the number of synonymous vs. non-synonymous base substitutions as described in Nei and Gojobori for all pairwise comparisons of sequences in an alignment. The number of synonymous and non-synonymous codon changes are counted, as well as the number of potential synonymous and non-synonymous changes when comparing two sequences. Ambiguous codons or codons with insertions are excluded from the tally of compared codons. The overall sequence distances are calculated as well as a codon by codon summary.

The pid (process ID) is appended to the output filename, i.e., "outputfilename.[pid]".

The examples below are based on input of 4 HIV protease sequences: 92UG037, RF, 92BRO25, ELI


_______________________________________________________
SUMMARY OF THE SYNONYMOUS AND NONSYNONYMOUS INFORMATION
_______________________________________________________

summary.pid:

Compare Sequences_names           Sd      Sn       S       N      ps      pn      ds      dn   ds/dn
0   1   92UG037   RF           18.00    9.00   68.50  228.50    0.26    0.04    0.32    0.04    8.00
0   2   92UG037   92BRO25      22.50   10.50   69.00  228.00    0.33    0.05    0.43    0.05    9.00
0   3   92UG037   ELI          17.00   10.00   68.50  228.50    0.25    0.04    0.30    0.05    6.68
1   2   RF        92BRO25      20.50   11.50   68.50  228.50    0.30    0.05    0.38    0.05    7.33
1   3   RF        ELI          11.00    5.00   68.00  229.00    0.16    0.02    0.18    0.02    8.22
2   3   92BRO25   ELI          18.50   10.50   68.50  228.50    0.27    0.05    0.33    0.05    7.06
Averages of all pairwise comparisons: ds =  0.62, dn =  0.11, ds/dn =  6.60
Averages of the first sequence compared to others: ds =  0.49, dn =  0.09, ds/dn =  7.09

Compare: Lists the two sequences compared, starting with 0 (4 sequences are seqs 0-3)

Sequences_names: The names of the two sequences being compared.

Sd: The number of observed synonymous substitutions

Sn: The number of observed non-synonymous substitutions

S: The number of potential synonymous substitutions (the average for the two compared sequences)

N: The number of potential non-synonymous substitutions (the average for the two compared sequences)

ps: The proportion of observed synonymous substitutions: Sd/S

pn: The proportion of observed non-synonymous substitutions: Sd/S

ds: The Jukes-Cantor correction for multiple hits of ps

dn: The Jukes-Cantor correction for multiple hits of pn

ds/dn: The ratio of synonymous to non-synonymous substitutions

NOTE1: An example of how Sd and Sn is determined for a single codon:

Imagine you have TTA in one sequence, and CTT in the other. Two bases are different. TTA encodes Leu, as does CTT. If you assume that the bases have changed one at a time, there are two paths:


TTA (L)  --> CTA (L) --> CTT (L) or
TTA (L)  --> TTT (F) --> CTT (L).

From first path, count = .5syn + .5syn

From second path, count = .5nonsyn + .5nonsyn

So this two base change would be 1 synonymous, 1 non-synonymous.

NOTE2: Some thoughts on statistical comparisons of the output:

One must be wary when doing typical statistical analysis of these values. One thing I have found to be very common are distributions of values that are NOT close to a Gaussian distribution, so you should either check to see if you have a Gaussian distribution, or default to the use of non-parametric statistics, like a Wilcoxon rank sum test.

Therefore the averages given at the bottom are only meant as a crude guide.

Also, if one uses the full column of values for all pairwise comparisons (say all values of dn for one set, compared to all values for another set) there is a non-independence of points issue to be considered. An alternative is the use of a sequence like a consensus or a best estimate of an ancestral sequence as the first sequence in the alignment, and then just use the comparison of the first sequence to all others rather than all pairwise comparisons. Sequence 0 is the first in the set, so that would be all lines that start with 0.

NOTE3: NA and mutational saturation.

If ps or pn has a value >= .75, saturation has been reached and a Jukes-Cantor transformation cannot be done, so the value of NA is returned.

If either ds or dn is NA or 0, the ds/dn ratio is not calculated.

PHYLOGENETIC TREE INPUT

Two distance matrices that are compatible input for PHYLIP's neighbor program are created, based on either the ds or dn values. A 5 sequence example is shown below, for dn values:

dndist.17985 (a PHYLIP neighbor infile):


  5
92UG037       0.0000  0.0400  0.0500  0.0500  0.2000
RF            0.0400  0.0000  0.0500  0.0200  0.2200
92BRO25       0.0500  0.0500  0.0000  0.0500  0.2200
ELI           0.0500  0.0200  0.0500  0.0000  0.2200
ANT70         0.2000  0.2200  0.2200  0.2200  0.000

The PHYLIP neighbor output:


  +92UG037
  !
  !     +RF
  !  +--1
--3--2  +ELI
  !  !
  !  +92BRO25
  !
  +----------ANT70

Between        And            Length
-------        ---            ------
   3          92UG037           0.01375
   3             2              0.00875
   2             1              0.01375
   1          RF                0.00833
   1          ELI               0.01167
   2          92BRO25           0.02625
   3          ANT70             0.18625

These are examples of neighbor joining trees that can be produced using the "dsdist.[pid]" and "dndist.[pid]" input files:

ds.treefile.ps: Synonymous Substitution Tree For Protease.

dn.treefile.ps: Nonsynonymous Substitution Tree For Protease.

BACKGROUND ABOUT THE DATA SET

background.pid for the four sequence example used in summary.pid:


The input file has 4 sequences
Sequence Length: 297
Compare Sequences_names     Codons  Compared Ambiguous Indels     Ns
0   1   92UG037   RF          99        99        0        0      0
0   2   92UG037   92BRO25     99        99        0        0      0
0   3   92UG037   ELI         99        99        0        0      0
1   2   RF        92BRO25     99        99        0        0      0
1   3   RF        ELI         99        99        0        0      0
2   3   92BRO25   ELI         99        99        0        0      0

Codons: Total number of codons in the input file (number of bases/3)

Compared: (Codons - Ambiguous)

Ambiguous: Codons that contain either dashes "-" for indels or N's.

(Indel + N's) = Ambiguous

Indels: Codons that contain dashes "-" for indels

Ns: Codons that contain N's.

NOTE: There were no indels or Ns in the protease alignment file.

ALL CODON COMPARISONS

codons.pid

For every codon in each pairwise comparison, it is identified as "identity" if there is no change, synon if the nucleotides change but the encoded amino acid does not, nonsyn if the nucleotides change as well as the encoded amino acid, indel if there is an insertion or deletion, ambiguous if there is an N. Indel supersedes N if both are present.

The Nei and Gojobori syn/non-syn value for each codon is recorded.


This is comparison 0 x 1: 92UG037 U455
Codon# class     1   2   aa1 aa2  syn     non
1      identity  CCT CCT P   P
2      identity  CAA CAA Q   Q
3      identity  ATC ATC I   I
4      identity  ACT ACT T   T
5      identity  CTT CTT L   L
6      identity  TGG TGG W   W
7      identity  CAA CAA Q   Q
8      identity  CGA CGA R   R
9      synon     CCT CCC P   P    1.00    0.00
10     identity  CTT CTT L   L
11     identity  GTC GTC V   V
12     identity  ACA ACA T   T
13     identity  GTA GTA V   V
14     synon     AAG AAA K   K    1.00    0.00
15     identity  ATA ATA I   I
16     synon     GGG GGA G   G    1.00    0.00
17     identity  GGA GGA G   G
18     identity  CAG CAG Q   Q
19     synon     CTA CTG L   L    1.00    0.00
20     nonsynon  AAA ATA K   I    0.00    1.00
21     nonsynon  AAA GAA K   E    0.00    1.00 ...

The codons.pid file is input for the perl script codons-xyplot.pl.

USAGE:


codons-xyplot.pl codons.pid

This will create an output called "codon.data".

"codon.xy" is an input file for the program xyplot. "codon.xy" includes the file "codon.data" and makes an xyplot of this data if the program xyplot is available.

USAGE:


xyplot -ps codon.xy > codon.ps

The information in codon.data is the average behavior of each codon for all pairwise comparisons for indels, syn, and nonsyn mutations. In very variable positions, the value can be over 1, because there are 3 positions in each codon, and more than one change can occur.


#k The number of codons is: 99
#k The average behavior for 153 comparisons:
#k             Cumulative             Per Codon
#k         __________________    __________________
#kcodon    indel  syn  nonsyn    indel  syn  nonsyn  aa
   1       0.00   0.00   0.00    0.00   0.00   0.00  P
   2       0.00   0.47   0.00    0.00   0.47   0.00  Q
   3       0.00   0.47   0.00    0.00   0.00   0.00  I
   4       0.00   0.47   0.21    0.00   0.00   0.21  P
   5       0.00   0.47   0.21    0.00   0.00   0.00  L
   6       0.00   0.47   0.32    0.00   0.00   0.11  W
   7       0.00   0.73   0.74    0.00   0.25   0.42  D
   8       0.00   0.93   0.74    0.00   0.21   0.00  R
   9       0.00   1.32   0.74    0.00   0.39   0.00  P
  10       0.00   1.92   1.03    0.00   0.60   0.29  I
  11       0.00   2.22   1.03    0.00   0.29   0.00  V ...

These are some examples of xyplots for synonymous and nonsynonymous changes for the protease region of HIV-1 pol:

protease.xy: Cumulative Synonymous and Nonsynonymous substitutions in protease.

protease2.xy: Codon-by-codon Synonymous and Nonsynonymous substitutions in protease.


FUTURE PLANS:

Add a better model of evolutionary distance.

Make this site interactive.



NOTICE:

6/15/98

Unless otherwise indicated, this information, consisting of source code, documentation, and executable programs, has been authored by an employee or employees of the University of California under LACC # ______ , operator of the Los Alamos National Laboratory under Contract No. W-7405-ENG-36 with the U.S. Department of Energy. The U.S. Government has rights to use, reproduce, and distribute this information. The public may copy and use this information without charge, make derivative works, distribute, and publicly display provided that this Notice and any statement of authorship are reproduced on all copies. However, the public may not incorporate this information in any commercial or proprietary product. Neither the Government nor the University makes any warranty, express or implied, or assumes any liability or responsibility for the use of this information.