Identify evolutionary signatures from biological sequences
Quickstart for beginners
2020-09-15
Amo (AmoAi V2.7.5) is a computational platform for analyses of bioligical data and applications in bioengineering. This tutorial provides its principles, architecture, and usage for users.
Protein sequence analysis & design
Choose which computational tool to use
Enter a job name (optional)
Enter your name (optional)
Enter your email address (required), a notice email will be sent to this address when job is completed
Select the type of a given file. If the option of multiple aligned sequence is chosen, all analyses are based on the user-defined sequences. Otherwise, a multiple sequence alignment is achived by searching a single sequence provided by user against the database if the option of single sequence is selected
Paste the sequence profile (FASTA format)
Upload the sequence profile (FASTA format)
Upload the associated PDB (optional). The inferred residue communities are mapped to the tertiary structure if the PDB is provided
Offset (optional), set it if the starting index of the first amino acid is not 0
Numer of mutants (optional). It is used to set the number of mutations that can be mutated together
Databases
Protein sequence statistics
In the calculations, the distribution of similarities between pairwise sequences is computed from a generated multiple sequence alignment (MSA) of a given protein (left). The degree of conservation at each position (not include the gaps) in the MSA is measured by relative entropy (right), and it is to indicate how much an amino acid at a specific position could be conserved. The degree of conservation can also how an amino acid at an evolutionarily conserved position is relavent to functionally important sites of protein, moreover, mutations in these positions could be harmful to protein function.
Evolutionary signatures inferred from sequences
The evolutionary process of a protein family can be mathematically modeled by a sequence generator at equilibrium1. The generator produces a sequence with a probability (Eq. 1) from a distribution over the space of all possible sequences in the family.
where is the partition function that normalizes the distribution by summing over the Boltzmann factors of all possible sequences in the family. The total energy of the sequence is
where and are, respectively, site-specific bias terms of an amino acid and coupling terms between pairwise amino acids.
The coupled interactions between pairwise amino acids of the protein are analyzed by spectrum analysis and ranked by the eigenvalues2. To make the residue communities as much statistically independent as possible, the top three residue communities are defined based on two of the top five eigenvalues and their corresponding eigenvectors () of the matrix (Eq. 2) as:
Community I consists of residues at the ith position of ;
Community II includes residues at the ith position of ; and
Community III includes residues at the ith position of .
We use an as a threshold to project the amino acids reduced from the coupling matrix and extract meaningful residue communities.
The reduced matrix of interactions are achived (left). In order to look into the networks of residues that play important roles in protein function, the residue communities are defined based on the eigenvalues and preserve the positive interactions among amino acids (right).
Protein single mutation
Complete single mutagenesis. The matrix that is computed by the SAEC method shows ΔE — the energy difference of each mutant sequence with each mutation τ at the ith site and wild-type sequence, negative values representing favourable while positive representing unfavourable mutations.
Protein coupled mutations
Based on the best-so-far sequence (designed), one can conduct a signle or multiple (coupled) mutants on the WT sequence, and the energy of the mutant sequence is computed for evaluating the mutant(s). As an example, each two amino acids are combined to be mutated together and the energy differences are to measure quanlity of the coupled mutants.
Protein sequence design
The energy trajectories of designing the sequence from the WT sequence, and the WT and mutant sequences are listed below.
WT sequence (energy: -147.990):
123456789|123456789|123456789|123456789|123456789|123
VCSEQAETGPCRAMISRWYFDVTEGKCAPFFYGGCGGNRNNFDTEEYCMAVCG