representations

Protein representations take a parsed protein structure file and convert it to a data structure that deep learning models can consume. We currently support graph, voxel, and point cloud representations. These classes are instantiated inside the dataset objects, so you should not need to access them directly unless you want to add your own.

 from proteinshake.datasets import RCSBDataset
 dataset = RCSBDataset().to_graph(eps=8)

GraphDataset

Graph representation of a protein structure dataset.

PointDataset

Point representation of a protein structure dataset.

VoxelDataset

Voxel representation of a protein structure dataset.

class GraphDataset(proteins, root, name, resolution='residue', eps=None, k=None, weighted_edges=False, verbosity=2)[source]

Graph representation of a protein structure dataset. Converts a protein object to a graph by using a k-nearest-neighbor or epsilon-neighborhood approach. Define either k or eps to determine which one is used.

Parameters:
  • proteins (generator) – A generator of protein objects from a Dataset.

  • size (int) – The size of the dataset.

  • path (str) – Path to save the processed dataset.

  • resolution (str, default 'residue') – Resolution of the proteins to use in the graph representation. Can be ‘atom’ or ‘residue’.

  • eps (float) – The epsilon radius to be used in graph construction (in Angstrom).

  • k (int) – The number of neighbors to be used in the k-NN graph.

  • weighted_edges (bool, default False) – If True, edges are attributed with their euclidean distance. If False, edges are unweighted.
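
To make the difference between the two construction modes concrete, here is a minimal sketch in plain Python. It is not ProteinShake's implementation: `build_edges` is a hypothetical helper, the coordinates are made up, and raising an error when both or neither of `eps` and `k` are given is an assumption of this sketch rather than documented library behavior.

```python
import math


def build_edges(coords, eps=None, k=None, weighted_edges=False):
    """Hypothetical helper: return directed edges (i, j) or (i, j, distance)."""
    # Assumption of this sketch: exactly one of eps / k must be set.
    if (eps is None) == (k is None):
        raise ValueError("Define exactly one of eps or k.")
    edges = []
    for i, p in enumerate(coords):
        # Euclidean distance from node i to every other node.
        dists = [(math.dist(p, q), j) for j, q in enumerate(coords) if j != i]
        if eps is not None:
            # Epsilon neighborhood: keep every node within the radius.
            neighbors = [(d, j) for d, j in dists if d <= eps]
        else:
            # k-NN: keep the k closest nodes, regardless of distance.
            neighbors = sorted(dists)[:k]
        for d, j in neighbors:
            edges.append((i, j, d) if weighted_edges else (i, j))
    return edges


# Four made-up residue positions (in Angstrom); the last one is far away.
coords = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (3.0, 4.0, 0.0), (20.0, 0.0, 0.0)]

eps_edges = build_edges(coords, eps=8)                      # isolated node gets no edges
knn_edges = build_edges(coords, k=1, weighted_edges=True)   # every node gets one edge
```

With an epsilon radius, a residue with no neighbor inside the radius stays isolated, whereas k-NN always connects every residue to `k` others, however distant they are.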

class PointDataset(proteins, root, name, resolution='residue', verbosity=2)[source]

Point representation of a protein structure dataset.

Parameters:
  • proteins (generator) – A generator of protein objects from a Dataset.

  • size (int) – The size of the dataset.

  • path (str) – Path to save the processed dataset.

  • resolution (str, default 'residue') – Resolution of the proteins to use in the point representation. Can be ‘atom’ or ‘residue’.

class VoxelDataset(proteins, root, name, resolution='residue', gridsize=None, voxelsize=10, aggregation='mean', verbosity=2)[source]

Voxel representation of a protein structure dataset. Voxelizes a protein structure by imposing a regular grid and determining which atoms or amino acids occupy each voxel. Voxel features are computed over one-hot encodings of the occupying atom/amino acid identities using the aggregation function.

Parameters:
  • proteins (generator) – A generator of protein objects from a Dataset.

  • size (int) – The size of the dataset.

  • path (str) – Path to save the processed dataset.

  • resolution (str, default 'residue') – Resolution of the proteins to use in the voxel representation. Can be ‘atom’ or ‘residue’.

  • gridsize (tuple, default None) – The size of the grid in voxels as a 3-tuple of x,y,z edge lengths. If None (default), the dimensions of the largest protein in the dataset are used.

  • voxelsize (float, default 10) – The size of a voxel (in Angstrom).

  • aggregation (str, default 'mean') – How to aggregate labels of a voxel.
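
The voxelization described above can be illustrated with a small plain-Python sketch. It is not the library's code: `voxelize` is a hypothetical helper, the grid is stored sparsely as a dictionary instead of a dense array sized by `gridsize`, and the `'sum'` option is included only for contrast; the source documents `'mean'` as the default and does not list other valid values.

```python
import math
from collections import defaultdict


def voxelize(coords, labels, n_classes, voxelsize=10, aggregation="mean"):
    """Hypothetical helper: map occupied voxel indices to aggregated one-hot features."""
    # Shift coordinates so the smallest value on each axis sits at the grid origin.
    origin = [min(c[axis] for c in coords) for axis in range(3)]
    buckets = defaultdict(list)
    for c, label in zip(coords, labels):
        # Integer voxel index along x, y, z.
        index = tuple(
            int(math.floor((c[axis] - origin[axis]) / voxelsize)) for axis in range(3)
        )
        one_hot = [1.0 if i == label else 0.0 for i in range(n_classes)]
        buckets[index].append(one_hot)

    grid = {}
    for index, vectors in buckets.items():
        summed = [sum(column) for column in zip(*vectors)]
        if aggregation == "mean":
            grid[index] = [s / len(vectors) for s in summed]
        elif aggregation == "sum":  # assumption: not a documented option
            grid[index] = summed
        else:
            raise ValueError(f"Unknown aggregation: {aggregation}")
    return grid


# Three made-up residues with identity labels 0, 1, 1 (two classes in total).
coords = [(0.0, 0.0, 0.0), (4.0, 0.0, 0.0), (12.0, 0.0, 0.0)]
labels = [0, 1, 1]
grid = voxelize(coords, labels, n_classes=2, voxelsize=10)
```

Here the first two residues share a voxel, so its feature is the mean of their one-hot vectors, while the third residue occupies the neighboring voxel alone.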