Sequence Analysis

Release Notes

v2.5.0 series come with new and improved sequence, structure, and dynamics analysis features. See release notes for details.

How to Cite

Bakan A, Meireles LM, Bahar I ProDy: Protein Dynamics Inferred from Theory and Experiments
Bioinformatics 2011 27(11):1575-1577.

Bakan A, Dutta A, Mao W, Liu Y, Chennubhotla C, Lezon TR, Bahar I Evol and ProDy for Bridging Protein Sequence Evolution and Structural Dynamics
Bioinformatics 2014 30(18):2681-2683.

Zhang S, Krieger JM, Zhang Y, Kaya C, Kaynak B, Mikulska-Ruminska K, Doruker P, Li H, Bahar I ProDy 2.0: Increased scale and scope after 10 years of protein dynamics modelling with Python
Bioinformatics 2021 37(20):3657-3659.

Sequence Analysis¶

Evol component of ProDy package brought new fast, flexible, and efficient features for handling multiple sequence alignments and analyzing sequence evolution. Here, we just give a brief introduction to these features. For more detailed examples, see Evol Tutorial.

In [1]: from prody import *

In [2]: from pylab import *

In [3]: ion()

Access Pfam¶

First, let’s fetch an MSA file from Pfam database:

In [4]: filename = fetchPfamMSA('pkinase', alignment='seed')

In [5]: filename

We downloaded the seed alignment for Protein Kinase (Pkinase) family.

Parse MSA¶

As you might guess, we will parse this file using parseMSA() function:

In [6]: msa = parseMSA('pkinase_seed.sth')

In [7]: msa
Out[7]: <MSA: pkinase_seed (38 sequences, 419 residues)>

Sequences¶

You can access Sequence objects by indexing MSA:

In [8]: seq = msa[0]

In [9]: seq
Out[9]: <Sequence: TTK_HUMAN (pkinase_seed[0]; length 419; 267 residues and 152 gaps)>

In [10]: print(seq)
YSILKQIGSGGSSKV..FQVLN.EKKQIYAIKYVNLEEADNQTL......DSYRNEIAYLNKLQQ.HSDKIIRLYDYEIT.............DQYIY..MVMECGNID.....LNSWLK.......KKKS.IDPW.......ERK.SYWKNMLEAVHTIH.QHG...IVHSDLKPANFLIV............DGM........LKLIDFGIANQMQPDTTS................VVKDSQVGTVNYMPPEAIKDMSSSRENGKS...KSKISPKSDVWSLGCILYYMTYGKTPFQQIINQISKLHAIID.................................PNHEI.........................EFPDIPEKDLQDVLKCCLKRDPKQRIS.....IPELLAHPYV

You can also slice MSA objects and iterate over sequences:

In [11]: for seq in msa[:4]:
   ....:    print(repr(seq))
   ....: 
<Sequence: TTK_HUMAN (pkinase_seed[0]; length 419; 267 residues and 152 gaps)>
<Sequence: MKK1_YEAST (pkinase_seed[1]; length 419; 268 residues and 151 gaps)>
<Sequence: STE7_YEAST (pkinase_seed[2]; length 419; 276 residues and 143 gaps)>
<Sequence: BYR1_SCHPO (pkinase_seed[3]; length 419; 255 residues and 164 gaps)>

Analysis¶

Evol component includes several functions for calculating conservation and coevolution properties of amino acids, which are shown in Evol Tutorial. Here, let’s take a look at calcMSAOccupancy() and showMSAOccupancy() functions:

In [12]: occ = calcMSAOccupancy(msa, count=True)

In [13]: occ.min()
Out[13]: 1.0

This shows that an amino acid is present only in one of the sequences in the MSA.

In [14]: showMSAOccupancy(msa, count=True);

../../_images/prody_tutorial_sequence_occ.png

You see that many residues are not present in all sequences. You will see how to refine such MSA instances in Evol Tutorial.