`tasks`

These classes define the prediction tasks we support. See our tasks tutorial for a full example of the Task class, and this tutorial to learn how to create your own tasks.

This snippet highlights the most common usage for a Task object.

>>> from proteinshake.tasks import EnzymeClassTask
>>> task = EnzymeClassTask()
>>> dataset = task.dataset.to_graph(eps=8).pyg()
>>> pred = model(dataset[task.train_index]) # assuming you have implemented model() elsewhere
>>> task.evaluate(pred)
{'precision': 0.5333515066547034, 'recall': 0.4799021029676011, 'accuracy': 0.6675514266755143}

`Task`	Base class for task-related utilities.
`GeneOntologyTask`	Predict the Gene Ontology terms describing the functional roles of a given protein in the cell.
`EnzymeClassTask`	Predict the type of reaction catalyzed by the given protein as given by the Enzyme Commission database.
`ProteinFamilyTask`	Predict the protein family classification of a protein structure which groups proteins into evolutionarily-related families.
`LigandAffinityTask`	Predict the dissociation constant (Kd) for a protein and a small molecule.
`BindingSiteDetectionTask`	Identify the binding residues (binding pocket) of a protein-small molecule binding site.
`ProteinProteinInterfaceTask`	Identify the binding interface of a protein-protein complex.
`StructuralClassTask`	Predict the SCOP class of a protein structure.
`StructureSimilarityTask`	Predict the structural similarity between two proteins.
`StructureSearchTask`	Retrieve similar proteins to a query based on structural similarity.
`VirtualScreenTask`	Test an affinity scoring model on a virtual screen.

class Task(root='data', split='random', split_similarity_threshold=0.7, **kwargs)

Base class for task-related utilities. This class wraps a proteinshake dataset and exposes split indices, integer-coded labels for classification tasks, and an evaluator function.

Sample usage (assuming you have a model in the namespace):

>>> from proteinshake.tasks import EnzymeClassTask
>>> task = EnzymeClassTask()
>>> data = task.dataset.to_graph(eps=8).pyg()
>>> y_pred = model(data[task.train])
>>> task.evaluate(y_pred)
{'roc_auc_score': 0.7}

Parameters:

dataset (pytorch.datasets.Dataset) – Dataset to use for this task.
split (str, default='random') – How to split the data. Can be ‘random’, ‘sequence’, or ‘structure’.
split_similarity_threshold (float) – Maximum similarity to allow between train and test samples. Can be any of 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9 for split="sequence", and 0.5, 0.6, 0.7, 0.8, 0.9 for split="structure".
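
The allowed combinations of split and split_similarity_threshold listed above can be captured in a small lookup. The sketch below is only an illustration: check_split and ALLOWED_THRESHOLDS are hypothetical names, not part of proteinshake, and the values are transcribed from the parameter description.

```python
# Allowed thresholds per split type, transcribed from the parameter description.
ALLOWED_THRESHOLDS = {
    "sequence": (0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9),
    "structure": (0.5, 0.6, 0.7, 0.8, 0.9),
}


def check_split(split="random", split_similarity_threshold=0.7):
    """Hypothetical helper: True if the combination is documented as valid."""
    if split == "random":
        # No threshold constraint is documented for random splits.
        return True
    if split not in ALLOWED_THRESHOLDS:
        return False
    return split_similarity_threshold in ALLOWED_THRESHOLDS[split]


print(check_split("sequence", 0.3))   # True
print(check_split("structure", 0.3))  # False
```

Checking the combination before instantiating a task can avoid requesting a split that is not documented as available.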

class GeneOntologyTask(branch='molecular_function', *args, **kwargs)

Predict the Gene Ontology terms describing the functional roles of a given protein in the cell. This is a protein-level multi-label prediction.

The prediction should be an n_samples x n_classes matrix, where the columns are ordered according to self.classes. If your model does not predict or handle a certain class, assign a zero value.
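
As a rough sketch of this layout, the snippet below arranges per-protein scores into rows whose columns follow a given class order, filling unhandled classes with zero. The helper to_prediction_matrix, the classes list, and the scores are illustrative assumptions; in practice the column order would come from the task's classes attribute, whose exact type is not specified here.

```python
def to_prediction_matrix(per_protein_scores, classes):
    """Arrange {term: score} dicts into rows whose columns follow `classes`."""
    return [
        [float(scores.get(term, 0.0)) for term in classes]
        for scores in per_protein_scores
    ]


# Hypothetical stand-in for the task's class ordering.
classes = ["GO:0003677", "GO:0005524", "GO:0016301"]

# Hypothetical model output: one dict of term scores per protein.
model_output = [
    {"GO:0005524": 0.9, "GO:0003677": 0.2},
    {"GO:0016301": 0.7},
]

matrix = to_prediction_matrix(model_output, classes)
print(matrix)  # [[0.2, 0.9, 0.0], [0.0, 0.0, 0.7]]
```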

Task Summary

Input: one protein
Output: n_classes gene ontology terms
Evaluation: Fmax (Radivojac, Predrag, et al. “A large-scale evaluation of computational protein function prediction.” Nature methods 10.3 (2013): 221-227.)
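
For intuition, here is a minimal pure-Python sketch of the protein-centric Fmax described by Radivojac et al. It is not the evaluator used by the task. The threshold grid and the choice to skip proteins without annotations are assumptions made for this example.

```python
def fmax(y_true, y_score, thresholds=None):
    """Protein-centric Fmax over a grid of decision thresholds.

    y_true:  list of 0/1 rows, one row per protein.
    y_score: list of score rows with the same shape.
    """
    if thresholds is None:
        thresholds = [i / 100 for i in range(1, 101)]  # assumed grid
    best = 0.0
    for t in thresholds:
        precisions, recalls = [], []
        for truth, scores in zip(y_true, y_score):
            n_true = sum(truth)
            if n_true == 0:
                continue  # assumption: skip proteins without annotations
            predicted = [s >= t for s in scores]
            n_pred = sum(predicted)
            n_hit = sum(1 for p, y in zip(predicted, truth) if p and y)
            recalls.append(n_hit / n_true)
            if n_pred > 0:
                # Precision only averages proteins with at least one prediction.
                precisions.append(n_hit / n_pred)
        if not precisions or not recalls:
            continue
        pr = sum(precisions) / len(precisions)
        rc = sum(recalls) / len(recalls)
        if pr + rc > 0:
            best = max(best, 2 * pr * rc / (pr + rc))
    return best


y_true = [[1, 0, 0], [0, 1, 1]]
y_score = [[0.9, 0.1, 0.2], [0.1, 0.8, 0.7]]
print(fmax(y_true, y_score))  # 1.0
```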

class EnzymeClassTask(ec_level=0, *args, **kwargs)

Predict the type of reaction catalyzed by the given protein as given by the Enzyme Commission database. The Enzyme Commission classification is hierarchically organized, giving rise to one prediction task per level in the hierarchy. We default to the top-most level, which specifies the generic class of the enzyme, but this can be changed by setting ec_level when instantiating the task.

This is a protein-level multi-class prediction.

Task Summary

Input: one protein
Output: enzyme class label (7 classes)
Evaluation: Accuracy (Ryu, Jae Yong, Hyun Uk Kim, and Sang Yup Lee. “Deep learning enables high-quality and high-throughput prediction of enzyme commission numbers.” Proceedings of the National Academy of Sciences 116.28 (2019): 13996-14001.)
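
To illustrate how one EC number yields a different label at each level of the hierarchy, the sketch below truncates a dotted EC number to a given level and scores predictions by accuracy. Both ec_label and accuracy are hypothetical helpers. The mapping from ec_level to a label prefix is an assumption, and the library may encode labels differently (for example as integers).

```python
def ec_label(ec_number, ec_level=0):
    """Assumed mapping: keep the first ec_level + 1 fields of the EC number."""
    parts = ec_number.split(".")
    return ".".join(parts[: ec_level + 1])


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label exactly."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


print(ec_label("3.4.21.4"))     # 3
print(ec_label("3.4.21.4", 1))  # 3.4
print(accuracy(["1", "2", "3", "3"], ["1", "2", "3", "1"]))  # 0.75
```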

class ProteinFamilyTask(*args, **kwargs)

Predict the protein family classification of a protein structure which groups proteins into evolutionarily-related families. This is a protein-level multi-class prediction.

Task Summary

Input: one protein
Output: protein family class (5163 classes)
Evaluation: Accuracy (custom task)

class LigandAffinityTask(root='data', split='random', split_similarity_threshold=0.7, **kwargs)

Predict the dissociation constant (Kd) for a protein and a small molecule. Accurately estimating the binding strength between a protein and a small-molecule ligand is a crucial step in understanding the regulation of protein function and in efficiently searching the massive space of small molecules for new therapies. Small-molecule ligand information is stored as dataset[i].smiles for a SMILES string, or as pre-computed molecular fingerprints in dataset[i].fp_maccs and dataset[i].fp_morgan_r2.

Task Summary

Input: one protein and one ligand SMILES string
Output: predicted dissociation constant (scalar)
Evaluation: R2 score (Stepniewska-Dziubinska, Marta M., Piotr Zielenkiewicz, and Pawel Siedlecki. “Development and evaluation of a deep learning model for protein–ligand binding affinity prediction.” Bioinformatics 34.21 (2018): 3666-3674.)
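
As a reminder of what the R2 score measures, here is a minimal pure-Python sketch of the coefficient of determination. The function name r_squared and the toy values are illustrative only. This is not the task's own evaluator.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R2 is undefined when all targets are identical")
    return 1.0 - ss_res / ss_tot


print(r_squared([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))  # 1.0 (perfect prediction)
print(r_squared([1.0, 2.0, 3.0], [2.0, 2.0, 2.0]))  # 0.0 (predicting the mean)
```

A model that always predicts the mean target scores 0, and models worse than that baseline score below 0.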

class BindingSiteDetectionTask(root='data', split='random', split_similarity_threshold=0.7, **kwargs)

Identify the binding residues (binding pocket) of a protein-small molecule binding site. An important step in drug discovery for proteins is to find potential cavities where small molecules can bind the protein based on the whole protein’s structure. Pocket atoms/residues are taken directly from PDBBind annotations.

Task Summary

Input: one protein
Output: binary label for each atom/residue
Evaluation: Matthews Correlation Coefficient (Gallo Cassarino, Tiziano, Lorenza Bordoli, and Torsten Schwede. “Assessment of ligand binding site predictions in CASP10.” Proteins: Structure, Function, and Bioinformatics 82 (2014): 154-163.)
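
The Matthews Correlation Coefficient for binary per-residue labels can be sketched in a few lines of standard-library Python. The function mcc below is a hypothetical helper for illustration, not the task's evaluator. Returning 0.0 when the denominator vanishes is a common convention that is assumed here.

```python
import math


def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary (0/1) labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # assumed convention for degenerate cases
    return (tp * tn - fp * fn) / denom


labels = [1, 0, 1, 0]
print(mcc(labels, [1, 0, 1, 0]))  # 1.0
print(mcc(labels, [0, 1, 0, 1]))  # -1.0
print(mcc(labels, [0, 0, 0, 0]))  # 0.0
```

Because binding-site residues are usually a small minority, MCC is more informative than plain accuracy, which a model can inflate by predicting the negative class everywhere.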

class ProteinProteinInterfaceTask(root='data', split='random', split_similarity_threshold=0.7, **kwargs)

Identify the binding interface of a protein-protein complex. Protein function is driven in large part by binding events between different protein chains to form ‘complexes’. Understanding how proteins interact with each other has implications for unraveling complex biological mechanisms and for designing proteins with desirable interactions. The underlying data is taken from the PDBBind database. All pairs of residues from different protein chains that lie within 6 Å of each other (Townshend et al., 2019) are labeled as positive examples.
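
The distance-based labeling can be sketched as follows, assuming each residue is represented by a single 3D coordinate. Which atom represents a residue is not stated here, so that choice is an assumption. interface_pairs and residue_labels are hypothetical helpers, not proteinshake functions.

```python
import math


def interface_pairs(chain_a, chain_b, cutoff=6.0):
    """Return (i, j) residue index pairs closer than `cutoff` angstroms."""
    pairs = []
    for i, a in enumerate(chain_a):
        for j, b in enumerate(chain_b):
            if math.dist(a, b) <= cutoff:
                pairs.append((i, j))
    return pairs


def residue_labels(n_residues, interface_indices):
    """Binary label per residue: 1 if it takes part in any interface pair."""
    return [1 if i in interface_indices else 0 for i in range(n_residues)]


# Toy coordinates (in angstroms) for two short chains.
chain_a = [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
chain_b = [(0.0, 0.0, 5.0), (30.0, 0.0, 0.0)]

pairs = interface_pairs(chain_a, chain_b)
labels_a = residue_labels(len(chain_a), {i for i, _ in pairs})
labels_b = residue_labels(len(chain_b), {j for _, j in pairs})
print(pairs)     # [(0, 0)]
print(labels_a)  # [1, 0]
print(labels_b)  # [1, 0]
```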

Task Summary

Input: two protein chains
Output: binary label for each residue in both chains (1 if the residue belongs to the interface, 0 otherwise)
Evaluation: AUROC (Fout, Alex, et al. “Protein interface prediction using graph convolutional networks.” Advances in neural information processing systems 30 (2017))

update_index(): Transform to pairwise indexing

compute_pairs(index): Grab all pairs of chains that share an interface

class StructuralClassTask(scop_level='SCOP-FA', *args, **kwargs)

Predict the SCOP class of a protein structure. SCOP labels proteins according to a hierarchy of structural and evolutionary information. The default level is SCOP-FA; you can customize the task to use a different level by setting scop_level to SCOP-{level}, where level is any of TP=protein type, CL=protein class, CF=fold, SF=superfamily, FA=family. This is a protein-level multi-class prediction.

Task Summary

Input: one protein
Output: SCOP class (3042 classes)
Evaluation: Accuracy (custom task)

class StructureSimilarityTask(*args, **kwargs)

Predict the structural similarity between two proteins. This is a pair-wise protein-level regression task. Ground truth is computed using the TMAlign software. Split indices are stored as tuples which contain two indices in the underlying dataset.

Task Summary

Input: pair of proteins
Output: Local Distance Difference Test score (lDDT)
Evaluation: Spearman correlation (custom task)
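
Spearman correlation is the Pearson correlation of the ranks of the two score lists. The sketch below shows the computation in pure Python, with average ranks for ties. average_ranks and spearman are hypothetical helpers for illustration, not the task's evaluator.

```python
import math


def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def spearman(x, y):
    """Pearson correlation computed on the ranks of x and y."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("Spearman correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


true_similarity = [0.1, 0.5, 0.9]
print(spearman(true_similarity, [1.0, 2.0, 3.0]))  # 1.0
print(spearman(true_similarity, [3.0, 2.0, 1.0]))  # -1.0
```

Since only the ordering matters, a model is rewarded for ranking protein pairs correctly even if its predicted scores are on a different scale than the ground truth.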

update_index(): Transform to pairwise indexing

class StructureSearchTask(min_sim=0.8, *args, **kwargs)

Retrieve similar proteins to a query based on structural similarity. Evaluation is cast in the setting of recommender systems where we wish to retrieve ‘relevant’ documents from a large pool of documents. Here, a protein is a document and the relevant ones are all proteins with a minimum similarity to the query protein.

Task Summary

Input: one protein
Output: list of similar proteins from dataset
Evaluation: precision@k (Aung, Zeyar, and Kian-Lee Tan. “Rapid 3D protein structure database searching using information retrieval techniques.” Bioinformatics 20.7 (2004): 1045-1052.)
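
A minimal sketch of the retrieval setting: the relevant set for a query contains every protein whose similarity reaches min_sim, and precision@k is the fraction of the top k retrieved proteins that are relevant. The helper names, the protein identifiers, and the use of >= at the threshold are assumptions made for this example.

```python
def relevant_set(similarities, min_sim=0.8):
    """Proteins whose similarity to the query is at least `min_sim`."""
    return {name for name, sim in similarities.items() if sim >= min_sim}


def precision_at_k(retrieved, relevant, k):
    """Fraction of the first k retrieved items that are relevant."""
    hits = sum(1 for item in retrieved[:k] if item in relevant)
    return hits / k


# Hypothetical ground-truth similarities between a query and four proteins.
similarities = {"p1": 0.95, "p2": 0.85, "p3": 0.40, "p4": 0.81}
relevant = relevant_set(similarities)

# Hypothetical model ranking, most similar first.
retrieved = ["p1", "p3", "p2", "p4"]
print(precision_at_k(retrieved, relevant, k=2))  # 0.5
print(precision_at_k(retrieved, relevant, k=4))  # 0.75
```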

property targets: Precompute the set of similar proteins for each query

class VirtualScreenTask(*args, **kwargs)

Test an affinity scoring model on a virtual screen. The goal of a virtual screen is, for a given protein and a library of potential binders, to bring the binders to the top of the list. In this task, the model is given a protein and a list of ligands to score. The model assigns each ligand in the library a score representing the likelihood that the protein and ligand will bind. This can be a docking score, an energy calculation, or just a probability. Each protein’s ligand library contains a certain number of active molecules (ligands) and a certain (larger) number of decoys (non-binders). We use the predicted scores to sort the whole library and calculate the position of each active ligand in the sorted library. Active ligands ranked in the top percentiles (above the cutoff) contribute a 1 to the score, and those below the cutoff contribute a 0.

Warning

This is a zero-shot task so we use the whole dataset in evaluation. No train/test split.

Task Summary

Input: one protein and a list of ligands to score
Output: list of molecules sorted by model
Evaluation: Enrichment Factor (Chen, Hongming, et al. “On evaluating molecular-docking methods for pose prediction and enrichment factors.” Journal of chemical information and modeling 46.1 (2006): 401-415). Note: to keep scores between 0 and 1, we also return a normalized version of EF, which is the mean rank of active compounds (mean_active_rank in the evaluation dictionary).
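
As an illustration of the ranking-based evaluation, the sketch below computes a standard enrichment factor: the fraction of actives in the top slice of the ranked library, divided by the fraction of actives in the whole library. The function enrichment_factor, the top_fraction parameter, and the assumption that higher scores mean stronger predicted binding are choices made for this example. The cutoff and normalization used by the task itself may differ.

```python
def enrichment_factor(scores, is_active, top_fraction=0.1):
    """Enrichment of actives in the top `top_fraction` of the ranked library."""
    n_total = len(scores)
    n_active = sum(is_active)
    if n_total == 0 or n_active == 0:
        raise ValueError("need a non-empty library with at least one active")
    n_top = max(1, int(round(top_fraction * n_total)))
    # Assumption: higher score = more likely to bind, so sort in descending order.
    ranked = sorted(range(n_total), key=lambda i: scores[i], reverse=True)
    hits = sum(is_active[i] for i in ranked[:n_top])
    return (hits / n_top) / (n_active / n_total)


scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.0]
actives_first = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
actives_last = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
print(enrichment_factor(scores, actives_first, top_fraction=0.2))  # 5.0
print(enrichment_factor(scores, actives_last, top_fraction=0.2))   # 0.0
```

An enrichment factor of 1 corresponds to random ranking, and values above 1 indicate that actives are concentrated near the top of the list.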