Multiple Sequence Alignment

This module defines MSA analysis functions.

class MSA(msa, title='Unknown', labels=None, **kwargs)

Store and manipulate multiple sequence alignments.

msa must be a 2D NumPy character array. labels is a list of sequence labels (or titles). mapping, if given as a keyword argument, should map a label or part of a label to a sequence index in the msa array. If mapping is not given, one will be built from labels.

countLabel(label): Returns the number of sequences that label maps onto.

extend(other): Adds other to this MSA.

getArray(): Returns a copy of the MSA character array.

getIndex(label): Returns the index of the sequence that label maps onto. If label maps onto multiple sequences or label is a list of labels, a list of indices is returned. If an index for a label is not found, None is returned.

getLabel(index, full=False): Returns the label of the sequence at the given index. Residue numbers will be removed from the sequence label, unless full is True.

getLabels(full=False): Returns all labels.

getResnums(index): Returns starting and ending residue numbers (resnum) for the sequence at the given index.

getTitle(): Returns the title of the instance.

isAligned(): Returns True if the MSA is aligned.

iterLabels(full=False): Yields sequence labels. By default, the part of the label used for indexing sequences is yielded.

numIndexed(): Returns the number of sequences that are indexed using the identifier part or all of their labels. The return value should be equal to the number of sequences.

numResidues(): Returns the number of residues (or columns in the MSA), if the MSA is aligned.

numSequences(): Returns the number of sequences.

setTitle(title): Sets the title of the instance.

split: Return split label when iterating or indexing.
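
The difference between full and shortened labels described for getLabel() and getResnums() can be pictured with a small sketch. It assumes the Pfam-style convention in which a full label has the form identifier/start-end. The helper name split_label and the example label are hypothetical and are not part of ProDy.

```python
def split_label(full_label):
    """Split 'identifier/start-end' into (identifier, start, end).

    Assumes Pfam-style labels (an assumption, not ProDy's own parser).
    start and end are None when the label carries no residue range.
    """
    identifier, sep, resrange = full_label.rpartition('/')
    if not sep:
        return full_label, None, None
    start, dash, end = resrange.partition('-')
    if dash and start.isdigit() and end.isdigit():
        return identifier, int(start), int(end)
    # The part after '/' is not a residue range, so keep the label whole.
    return full_label, None, None


# Hypothetical label used only for illustration.
label, start, end = split_label('PROT_HUMAN/10-120')
```

In this picture, the identifier corresponds to what getLabel() returns by default, and the start and end values correspond to what getResnums() reports.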

refineMSA(msa, index=None, label=None, rowocc=None, seqid=None, colocc=None, **kwargs)

Refine msa by removing sequences (rows) and residues (columns) that contain gaps.

Parameters:

msa (MSA) – multiple sequence alignment
index (int) – remove columns that are gaps in the sequence with that index
label (str) – remove columns that are gaps in the sequence matching label, msa.getIndex(label) must return a sequence index, a PDB identifier is also acceptable
rowocc (float) – row occupancy, sequences with less occupancy will be removed after label refinement is applied
seqid (float) – keep unique sequences at specified sequence identity level, unique sequences are identified using uniqueSequences()
colocc (float) – column occupancy, residue positions with less occupancy will be removed after other refinements are applied
keep (bool) – keep columns corresponding to residues not resolved in the PDB structure, default is False, applies when label is a PDB identifier

For Pfam MSA data, label is the UniProt entry name of the protein. You may also pass a PDB structure identifier, optionally followed by a chain identifier, e.g. '1p38' or '1p38A', as the label argument; UniProt entry names will then be parsed using the parsePDBHeader() function (see also Polymer and DBRef).

Refinements are applied in the order of the arguments. If both label and seqid (unique sequence filtering) are specified, the sequence matching label will be kept in the refined MSA even though it may be similar to some other sequence.
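
The order of the occupancy-based steps can be illustrated with a minimal pure-Python sketch. It is not ProDy's implementation: the function names are hypothetical, gaps are assumed to be '-' or '.', occupancy is assumed to mean the fraction of non-gap characters, and the seqid step is left out for brevity.

```python
GAPS = frozenset('-.')  # assumption: these characters mark gaps


def occupancy(chars):
    """Return the fraction of non-gap characters in a row or column."""
    chars = list(chars)
    return sum(c not in GAPS for c in chars) / len(chars) if chars else 0.0


def refine(rows, index=None, rowocc=None, colocc=None):
    """Apply column and row filters in the order the arguments are listed.

    rows is a list of equal-length strings. Illustration only; the
    seqid (unique sequence) step is omitted.
    """
    rows = list(rows)
    if index is not None:
        # Drop columns that are gaps in the reference sequence.
        keep = [j for j, c in enumerate(rows[index]) if c not in GAPS]
        rows = [''.join(r[j] for j in keep) for r in rows]
    if rowocc is not None:
        # Drop sequences whose occupancy is below the threshold.
        rows = [r for r in rows if occupancy(r) >= rowocc]
    if colocc is not None and rows:
        # Drop columns whose occupancy is below the threshold.
        keep = [j for j in range(len(rows[0]))
                if occupancy(r[j] for r in rows) >= colocc]
        rows = [''.join(r[j] for j in keep) for r in rows]
    return rows


# Made-up toy alignment, one string per sequence.
alignment = ['AC-DE', 'A--DE', '----E', 'ACWDE']
refined = refine(alignment, index=0, rowocc=0.5, colocc=0.75)
```

Here the reference sequence first removes the third column, the mostly empty third sequence is then dropped by the row filter, and the column filter finally removes the position that is occupied in only two of the three remaining sequences.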

mergeMSA(*msa, **kwargs)

Returns an MSA obtained from merging parts of the sequences of proteins present in multiple msa instances. Sequences are matched based on protein identifiers found in the sequence labels. Order of sequences in the merged MSA will follow the order of sequences in the first msa instance. Note that protein identifiers that map to multiple sequences will be excluded.

MSAs with different identifiers can be merged with the ignore_ids kwarg. This only works when all MSAs have the same number of sequences.

Parameters:

msa (list, tuple, ndarray) – a set of MSA objects to be merged
ignore_ids (bool) – whether to ignore identifiers instead of matching them, default is False
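
The matching rules described for mergeMSA() can be sketched in plain Python. This is an illustration of the documented behaviour under simplifying assumptions, not ProDy's implementation: each alignment is represented as a list of (identifier, sequence) pairs, and the function name and example identifiers are hypothetical.

```python
from collections import Counter


def merge_by_identifier(*alignments):
    """Concatenate sequences that share an identifier across alignments.

    Each alignment is a list of (identifier, sequence) pairs.
    Illustrative only; this is not ProDy's implementation.
    """
    lookups = []
    for aln in alignments:
        counts = Counter(ident for ident, _ in aln)
        # Identifiers that map to more than one sequence are excluded.
        lookups.append({ident: seq for ident, seq in aln
                        if counts[ident] == 1})
    merged = []
    # The first alignment dictates the order of the merged sequences.
    for ident, _ in alignments[0]:
        if all(ident in lookup for lookup in lookups):
            merged.append(
                (ident, ''.join(lookup[ident] for lookup in lookups)))
    return merged


# Made-up identifiers and sequences for illustration.
first = [('P1_HUMAN', 'AC-'), ('P2_MOUSE', 'ACD'), ('P3_YEAST', 'A-D')]
second = [('P3_YEAST', 'KL'), ('P1_HUMAN', 'K-'), ('P2_MOUSE', 'KL'),
          ('P2_MOUSE', 'RL')]
merged = merge_by_identifier(first, second)
```

In this toy input, P2_MOUSE appears twice in the second alignment, so it is left out of the result, while the other two entries keep the order they have in the first alignment.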

specMergeMSA(*msa, **kwargs)

Returns an MSA obtained from merging parts of the sequences of proteins present in multiple msa instances. Sequences are matched based on species section of protein identifiers found in the sequence labels. Order of sequences in the merged MSA will follow the order of sequences in the first msa instance. Note that protein identifiers that map to multiple sequences will be excluded.
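
As a rough sketch of what matching on the species section could look like, the snippet below extracts a species key from each identifier and keeps only the keys that occur once. It assumes UniProt-style entry names in which the species mnemonic follows the last underscore, and it assumes the exclusion rule is applied to the species key. Both function names and the example identifiers are hypothetical, not ProDy API.

```python
from collections import Counter


def species_of(identifier):
    """Return the species section of an identifier such as 'PROT_HUMAN'.

    Assumes UniProt-style entry names where the species mnemonic
    follows the last underscore.
    """
    return identifier.rpartition('_')[2]


def unambiguous_species(identifiers):
    """Return species that occur exactly once, in order of appearance."""
    counts = Counter(species_of(i) for i in identifiers)
    return [species_of(i) for i in identifiers
            if counts[species_of(i)] == 1]


# Made-up identifiers: two human entries make 'HUMAN' ambiguous.
species = unambiguous_species(['KINA_HUMAN', 'KINB_HUMAN', 'KINA_MOUSE'])
```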