MSA Files¶
This part shows how to parse, refine, filter, slice, and write MSA files.
In [1]: from prody import *
In [2]: from matplotlib.pylab import *
In [3]: ion() # turn interactive mode on
Let’s get the Pfam MSA file for the protein family that contains PIWI_ARCFU:
Parsing MSA files¶
This shows how to use MSAFile
or parseMSA()
to read the
MSA file.
Reading using MSAFile
yields an MSAFile object. Iterating over the
object will yield an object of Sequence
from which labels, sequence
can be obtained. Like any file object, you can over iterate over a
MSAFile
object once.
In [4]: msafile = 'PF02171_seed.sth'
In [5]: msafobj = MSAFile(msafile)
In [6]: msafobj
Out[6]: <MSAFile: PF02171_seed (Stockholm; mode 'rt')>
In [7]: msa_seq_list = list(msafobj)
In [8]: msa_seq_list[0]
Out[8]: <Sequence: TAG76_CAEEL (length 395; 307 residues and 88 gaps)>
parseMSA()
returns an MSA
object. We can parse
compressed files, but reading uncompressed files is much faster.
In [9]: fetchPfamMSA('PF02171', compressed=True)
Out[9]: 'PF02171_full.sth.gz'
In [10]: parseMSA('PF02171_full.sth.gz')
Out[10]: <MSA: PF02171_full (2067 sequences, 1392 residues)>
In [11]: fetchPfamMSA('PF02171', format='fasta')
Out[11]: 'PF02171_full.fasta.gz'
In [12]: parseMSA('PF02171_full.fasta.gz')
Out[12]: <MSA: PF02171_full (2067 sequences, 1392 residues)>
Iterating over a file will yield sequence id, sequence, residue start and end indices:
In [13]: msa = MSAFile('PF02171_seed.sth')
In [14]: seq_list = [seq for seq in msa]
In [15]: seq_list
Out[15]:
[<Sequence: TAG76_CAEEL (length 395; 307 residues and 88 gaps)>,
<Sequence: O16720_CAEEL (length 395; 302 residues and 93 gaps)>,
<Sequence: AGO10_ARATH (length 395; 322 residues and 73 gaps)>,
<Sequence: AGO1_SCHPO (length 395; 300 residues and 95 gaps)>,
<Sequence: AGO6_ARATH (length 395; 311 residues and 84 gaps)>,
<Sequence: AGO4_ARATH (length 395; 309 residues and 86 gaps)>,
<Sequence: YQ53_CAEEL (length 395; 328 residues and 67 gaps)>,
<Sequence: Q21691_CAEEL (length 395; 329 residues and 66 gaps)>,
<Sequence: O62275_CAEEL (length 395; 331 residues and 64 gaps)>,
<Sequence: O16386_CAEEL (length 395; 300 residues and 95 gaps)>,
<Sequence: Q21495_CAEEL (length 395; 285 residues and 110 gaps)>,
<Sequence: O76922_DROME (length 395; 298 residues and 97 gaps)>,
<Sequence: PIWI_DROME (length 395; 292 residues and 103 gaps)>,
<Sequence: Q17567_CAEEL (length 395; 312 residues and 83 gaps)>,
<Sequence: PIWL1_HUMAN (length 395; 293 residues and 102 gaps)>,
<Sequence: PIWI_ARCFU (length 395; 297 residues and 98 gaps)>,
<Sequence: Y1321_METJA (length 395; 274 residues and 121 gaps)>,
<Sequence: O67434_AQUAE (length 395; 276 residues and 119 gaps)>]
Filtering and Slicing¶
This shows how to use the MSAFile
object or MSA
object to
refine MSA using filters and slices.
Filtering¶
Any function that takes label and sequence arguments and returns a boolean value can be used for filtering the sequences. A sequence will be yielded if the function returns True. In the following example, sequences from organism ARATH are filtered:
In [16]: msafobj = MSAFile(msafile, filter=lambda lbl, seq: 'ARATH' in lbl)
In [17]: seq_list2 = [seq for seq in msafobj]
In [18]: seq_list2
Out[18]:
[<Sequence: AGO10_ARATH (length 395; 322 residues and 73 gaps)>,
<Sequence: AGO6_ARATH (length 395; 311 residues and 84 gaps)>,
<Sequence: AGO4_ARATH (length 395; 309 residues and 86 gaps)>]
Slicing¶
A list of integers can be used to slice sequences as follows. This enables selective parsing of the MSA file.
In [19]: msafobj = MSAFile(msafile, slice=list(range(10)) + list(range(374,384)))
In [20]: list(msafobj)[0]
Out[20]: <Sequence: TAG76_CAEEL (length 20; 19 residues and 1 gaps)>
Slicing can also be done using MSA
. The MSA
object offers
other functionalities like querying, indexing, slicing row and columns and
refinement.
MSA objects¶
Indexing¶
Retrieving a sequence at a given index or by label will give an object of
Sequence
. Here’s an example using an index.
In [21]: msa = parseMSA(msafile)
In [22]: seq = msa[0]
In [23]: seq
Out[23]: <Sequence: TAG76_CAEEL (PF02171_seed[0]; length 395; 307 residues and 88 gaps)>
In [24]: str(seq)
Out[24]: 'CIIVVLQS.KNSDI.YMTVKEQSDIVHGIMSQCVLMKNVSRP.........TPATCANIVLKLNMKMGGIN..SRIVADKITNKYLVDQPTM.........VVGIDVTHPTQAEM.......RMNMPSVAAIVANVD.LLPQSYGANVKVQKKCRESVVY.LLD............AIRERIITFYRHTKQ.KPARIIVYRDGVSEGQFSEVLREEIQSIRTACL.......AIAEDFR..PPITYIVVQKRHHARIFCKYQNDM.....................VGKAKNVPPGTTV...DTGIVSPEGFDFYLCSHYGVQGTSRPARYHVLLDECKFTADEI.QSI......TYGMCHTYGRCT....RSVSIPTPVYYADLVATRARCHVK'
Here we retrieve a sequence by UniProt ID:
In [25]: msa['YQ53_CAEEL']
Out[25]: <Sequence: YQ53_CAEEL (PF02171_seed[6]; length 395; 328 residues and 67 gaps)>
Querying¶
You can query whether a sequence in contained in the instance using the UniProt identifier of the sequence as follows:
In [26]: 'YQ53_CAEEL' in msa
Out[26]: True
Slicing¶
Slice an MSA instance to give a new MSA
. object :
In [27]: new_msa = msa[:2]
In [28]: new_msa
Out[28]: <MSA: PF02171_seed (2 sequences, 395 residues)>
Slice using a list of UniProt IDs:
In [29]: msa[['TAG76_CAEEL', 'O16720_CAEEL']]
Out[29]: <MSA: PF02171_seed (2 sequences, 395 residues)>
Retrieve a character or a slice of a sequence:
In [30]: msa[0,0]
Out[30]: <Sequence: TAG76_CAEEL (length 1; 1 residues and 0 gaps)>
In [31]: msa[0,0:10]
Out[31]: <Sequence: TAG76_CAEEL (length 10; 9 residues and 1 gaps)>
Slice MSA rows and columns:
In [32]: msa[:10,20:40]
Out[32]: <MSA: PF02171_seed (10 sequences, 20 residues)>
Merging MSAs¶
mergeMSA()
can be used to merge two or more MSAs. Based on their labels
only those sequences that appear in both MSAs are retained, and concatenated
horizontally to give a joint or merged MSA. This can be useful while evaluating
covariance patterns for proteins with multiple domains or protein-protein
interactions. The example shows merging for the multi-domain receptor
3KG2 containing pfam domains PF01094 and PF00497.
In [33]: fetchPfamMSA('PF01094', alignment='seed')
Out[33]: 'PF01094_seed.sth'
In [34]: fetchPfamMSA('PF00497', alignment='seed')
Out[34]: 'PF00497_seed.sth'
Let’s parse and merge the two files:
In [35]: msa1 = parseMSA('PF01094_seed.sth')
In [36]: msa1
Out[36]: <MSA: PF01094_seed (67 sequences, 686 residues)>
In [37]: msa2 = parseMSA('PF00497_seed.sth')
In [38]: msa2
Out[38]: <MSA: PF00497_seed (230 sequences, 503 residues)>
In [39]: merged = mergeMSA(msa1, msa2)
In [40]: merged
The merged MSA contains 4889 sequences with common labels.
Writing MSAs¶
writeMSA()
can be used to write MSA. It takes filename as input
which should contain appropriate extension that can be ".slx"
or
".sth"
or ".fasta"
or format should be specified as "SELEX"
,
"Stockholm"
or "FASTA"
. Input MSA should be MSAFile
or
MSA
object. Filename can contain ".gz"
extension, in which case
a compressed file will be written.
In [41]: writeMSA('sliced_MSA.gz', msa, format='SELEX')
Out[41]: 'sliced_MSA.gz'
In [42]: writeMSA('sliced_MSA.fasta', msafobj)
Out[42]: 'sliced_MSA.fasta'
writeMSA()
returns the name of the MSA file that is written.