datasets

These classes define our currently supported datasets. See our datasets tutorial for a quick intro to datasets, and this tutorial to learn how to create your own datasets.

Dataset

Base dataset class.

RCSBDataset

Experimental structures from the RCSB Protein Data Bank.

AlphaFoldDataset

3D structures predicted by AlphaFold.

GeneOntologyDataset

Proteins with annotated Gene Ontology (GO) terms.

EnzymeCommissionDataset

Enzymes with annotated enzyme commission (EC) numbers.

ProteinFamilyDataset

Proteins with annotated protein families (Pfam).

ProteinProteinInterfaceDataset

Protein-protein complexes from PDBBind with annotated interfaces.

ProteinLigandInterfaceDataset

Proteins bound to small molecules from PDBBind with annotated binding site, ligand and affinity information.

SCOPDataset

Proteins with annotated SCOP class.

TMAlignDataset

Proteins that were aligned with TMalign annotated with distance/similarity metrics.

ProteinLigandDecoysDataset

Proteins (targets) from DUDE-Z with annotated ligands and decoys.

class Dataset(root='data', use_precomputed=True, release='latest', only_single_chain=False, check_sequence=False, n_jobs=1, minimum_length=10, maximum_length=2048, exclude_ids=[], skip_signature_check=False, verbosity=2)[source]

Base dataset class. Holds the logic for downloading and parsing PDB files. If use_precomputed=True, fetches pre-processed data from Zenodo. Otherwise, builds the dataset from scratch: download() fetches the structures in PDB format, then parse() is applied to each file to extract the relevant information and store it in a protein dictionary with three outer keys: 'protein', 'residue', and 'atom'. Overriding add_protein_attributes() in a subclass lets the user include custom attributes.

Note

All child classes inherit these attributes and optionally add their own.

Annotations

| Attribute | Key | Sample value |
| --- | --- | --- |
| Protein identifier | ['protein']['ID'] | '1JC8' |
| Sequence | ['protein']['sequence'] | 'MIWGDSGKL...' |
| Assigned train/val/test split | ['protein']['sequence_split_<CUTOFF>'], ['protein']['structure_split_<CUTOFF>'] | 'train' |
| Residue position on chain | ['residue']['residue_number'] | [1, 2, 3, ...] |
| Amino acid type (single letter) | ['residue']['residue_type'] | ['M', 'I', ...] |
| 3D coordinates | [{'residue' \| 'atom'}][{'x' \| 'y' \| 'z'}] | [5.191, ...] |
| Solvent accessible surface area | [{'residue' \| 'atom'}]['SASA'] | [242.031, ...] |
| Relative accessible surface area | ['residue']['RSA'] | [1.377, ...] |
| Atom position | ['atom']['atom_number'] | [1, 2, 3, ...] |
| Atom type | ['atom']['atom_type'] | ['N', 'CA', ...] |
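As a rough illustration of the layout described above, the following sketch builds a small protein dictionary by hand and reads a few fields from it. Only the key names come from the annotation table; the values and the helper residue_coordinates are hypothetical, and real dictionaries produced by the library may carry more fields.

```python
# A hand-built protein dictionary mimicking the documented layout.
# All values are made up for illustration.
protein = {
    "protein": {"ID": "1JC8", "sequence": "MIW"},
    "residue": {
        "residue_number": [1, 2, 3],
        "residue_type": ["M", "I", "W"],
        "x": [5.191, 6.0, 7.5],
        "y": [0.0, 1.0, 2.0],
        "z": [-1.0, 0.5, 3.25],
    },
    "atom": {
        "atom_number": [1, 2],
        "atom_type": ["N", "CA"],
        "x": [5.0, 5.2],
        "y": [0.1, 0.0],
        "z": [-1.1, -1.0],
    },
}


def residue_coordinates(protein):
    """Return one (x, y, z) tuple per residue (hypothetical helper)."""
    residue = protein["residue"]
    return list(zip(residue["x"], residue["y"], residue["z"]))


coords = residue_coordinates(protein)
# Residue types are stored as a list of single letters.
sequence_from_residues = "".join(protein["residue"]["residue_type"])
```

Per-residue and per-atom annotations are parallel lists, so the i-th entry of every list under 'residue' refers to the same residue.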

Parameters:
  • root (str, default 'data') – The data root directory to store both raw and parsed data.

  • use_precomputed (bool, default True) – If True, downloads the processed dataset from the ProteinShake repository (recommended). If False, downloads the raw data from the original sources and processes it on your device; use this option to create a custom dataset. Processing from scratch is compute-intensive, so consider increasing n_jobs.

  • release (str, default 'latest') – The tag of the dataset release. See https://github.com/BorgwardtLab/proteinshake/releases for all available releases. 'latest' is recommended.

  • only_single_chain (bool, default False) – If True, will only use single-chain proteins.

  • check_sequence (bool, default False) – If True, discards proteins whose primary sequence is not identical to the sequence of amino acids in the structure. This can happen if the structure is incomplete (e.g. for parts that could not be crystallized).

  • n_jobs (int, default 1) – The number of jobs for downloading and parsing files. It is recommended to increase the number of jobs with use_precomputed=False.

  • minimum_length (int, default 10) – Proteins shorter than minimum_length residues are skipped.

  • maximum_length (int, default 2048) – Proteins longer than maximum_length residues are skipped.

  • exclude_ids (list, default []) – PDB IDs to exclude from the dataset.

  • skip_signature_check (bool, default False) – If True, skips the signature check.

  • verbosity (int, default 2) – Verbosity level of output logging. 2: full output, 1: no progress bars, 0: only warnings and errors, -1: only errors, -2: no output.
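The filtering parameters above can be pictured with a small sketch. It assumes that length is measured as the number of residues in the sequence and that both bounds are inclusive; keep_protein is a hypothetical helper, not part of the library, and the library's own checks may differ in detail.

```python
def keep_protein(protein, minimum_length=10, maximum_length=2048, exclude_ids=()):
    """Return True if the protein passes the ID and length filters (hypothetical helper)."""
    if protein["protein"]["ID"] in exclude_ids:
        return False
    n_residues = len(protein["protein"]["sequence"])
    # Proteins shorter than minimum_length or longer than maximum_length are skipped.
    return minimum_length <= n_residues <= maximum_length


candidates = [
    {"protein": {"ID": "AAAA", "sequence": "M" * 5}},     # too short
    {"protein": {"ID": "BBBB", "sequence": "M" * 50}},    # kept
    {"protein": {"ID": "CCCC", "sequence": "M" * 3000}},  # too long
    {"protein": {"ID": "DDDD", "sequence": "M" * 100}},   # excluded by ID
]
kept = [p["protein"]["ID"] for p in candidates if keep_protein(p, exclude_ids={"DDDD"})]
```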

check_signature_same_as_hosted()[source]

Safety check to ensure the provided dataset arguments are the same as were used to precompute the datasets. Only relevant with use_precomputed=True.

proteins(resolution='residue')[source]

Returns a generator of proteins from the avro file.

Parameters:

resolution (str, default 'residue') – The resolution of the proteins. Can be ‘atom’ or ‘residue’.

Returns:

An avro reader object.

Return type:

generator

>>> from proteinshake.datasets import RCSBDataset
>>> protein = next(RCSBDataset().proteins())

download()[source]

Implement me in a subclass!

This method is responsible for downloading and extracting raw PDB files from a databank source. All PDB files should be placed in f'{self.root}/raw/files'. See PDBBindRefined for an example.

add_protein_attributes(protein_dict)[source]

Implement me in a subclass!

This method annotates protein objects with additional information, such as functional labels or classes. It takes a protein object (a dictionary), modifies it, and returns it. Usually, this would use the ID attribute to load an annotation file or to query information from a database.

Parameters:

protein_dict (dict) – A protein_dict object. See proteinshake.datasets.Dataset.parse_pdb() for details.

Returns:

The protein_dict object with a new attribute added.

Return type:

dict
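A minimal sketch of such an annotation hook is shown below. It is written as a plain function so that it runs without the library; in practice the same body would go into a Dataset subclass as a method. The lookup table ANNOTATIONS and the attribute name 'my_label' are hypothetical.

```python
# Hypothetical annotation table keyed by protein ID.
ANNOTATIONS = {"1JC8": "transferase"}


def add_protein_attributes(protein_dict, annotations=ANNOTATIONS):
    """Attach a label looked up by the protein ID and return the dictionary."""
    protein_id = protein_dict["protein"]["ID"]
    # Store the new attribute next to the other protein-level annotations.
    protein_dict["protein"]["my_label"] = annotations.get(protein_id)
    return protein_dict


annotated = add_protein_attributes({"protein": {"ID": "1JC8", "sequence": "MIW"}})
```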

parse()[source]

Parses all PDB files returned from proteinshake.datasets.Dataset.get_raw_files() and saves them to disk. Can run in parallel.

parse_pdb(path)[source]

Parses a single PDB file first into a DataFrame, then into a protein object (a dictionary). Also validates the PDB file and provides the hook for add_protein_attributes. Returns None if the protein was found to be invalid.

Parameters:

path (str) – Path to PDB file.

Returns:

A protein object.

Return type:

dict

pdb2df(path)[source]

Parses a single PDB file to a DataFrame (with biopandas). Also deals with multiple structure models in a PDB (e.g. from NMR) by only selecting the first model.

Parameters:

path (str) – Path to PDB file.

Returns:

A biopandas DataFrame of the PDB file.

Return type:

DataFrame
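To illustrate what selecting the first model means, here is a sketch in plain Python that does not use biopandas. It relies only on the MODEL/ENDMDL records of the PDB format; first_model_atom_lines is a hypothetical helper, and the coordinates in the sample text are made up.

```python
def first_model_atom_lines(pdb_text):
    """Return the ATOM/HETATM lines of the first model in a PDB-format string."""
    lines = []
    for line in pdb_text.splitlines():
        record = line[:6].strip()  # the record name occupies the first six columns
        if record == "ENDMDL":
            break  # end of the first model; ignore all later models
        if record in ("ATOM", "HETATM"):
            lines.append(line)
    return lines


PDB_TEXT = """\
MODEL        1
ATOM      1  N   MET A   1      11.104   6.134  -6.504  1.00  0.00           N
ATOM      2  CA  MET A   1      11.639   6.071  -5.147  1.00  0.00           C
ENDMDL
MODEL        2
ATOM      1  N   MET A   1      10.900   6.000  -6.400  1.00  0.00           N
ATOM      2  CA  MET A   1      11.500   6.100  -5.100  1.00  0.00           C
ENDMDL
END
"""

atoms = first_model_atom_lines(PDB_TEXT)
```

Files without MODEL records (typical for crystal structures) contain no ENDMDL line, so all atom lines are kept.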

to_graph(resolution='residue', transform=<proteinshake.transforms.transforms.IdentityTransform object>, **kwargs)[source]

Converts the raw dataset to a graph dataset. See proteinshake.representations.GraphDataset() for arguments.

Returns:

The dataset in graph representation.

Return type:

proteinshake.representations.GraphDataset

to_point(resolution='residue', transform=<proteinshake.transforms.transforms.IdentityTransform object>, **kwargs)[source]

Converts the raw dataset to a point cloud dataset. See proteinshake.representations.PointDataset() for arguments.

Returns:

The dataset in point cloud representation.

Return type:

proteinshake.representations.PointDataset

to_voxel(resolution='residue', transform=<proteinshake.transforms.transforms.IdentityTransform object>, **kwargs)[source]

Converts the raw dataset to a voxel dataset. See proteinshake.representations.VoxelDataset() for arguments.

Returns:

The dataset in voxel representation.

Return type:

proteinshake.representations.VoxelDataset

class RCSBDataset(query=[], from_list=None, only_single_chain=True, max_requests=20, **kwargs)[source]

Experimental structures from the RCSB Protein Data Bank. This class also serves as a base class for all RCSB-derived datasets. It can be subclassed by defining a default query argument. The query is a list of triplets (attribute, operator, value) that is passed to the REST API call to RCSB. See the GeneOntologyDataset subclass for an example. To find the right attributes, construct a query with the advanced search at RCSB and export it to JSON. Also compare the API call in the download() method.

It uses RCSB’s integrated sequence similarity filtering to remove redundant proteins.

Also, only single chain proteins are used. Change the REST payload in download to override this behaviour.

Please cite

Berman, H M et al. “The Protein Data Bank.” Nucleic acids research vol. 28,1 (2000): 235-42. doi:10.1093/nar/28.1.235

Source

Raw data was obtained and modified from RCSB Protein Data Bank, originally licensed under CC0 1.0.

Parameters:

query (list) – A list of triplets (attribute, operator, value) to be added to the REST API call to RCSB.
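The sketch below shows one way such triplets could be turned into a search payload. It assumes the node layout of the RCSB Search API text service (terminal nodes combined in an "and" group); triplets_to_query is a hypothetical helper, and the payload actually sent by download() may differ.

```python
import json


def triplets_to_query(triplets):
    """Convert (attribute, operator[, value]) entries into a grouped query (hypothetical helper)."""
    nodes = []
    for triplet in triplets:
        parameters = {"attribute": triplet[0], "operator": triplet[1]}
        # Operators such as 'exists' take no value, so the third entry is optional.
        if len(triplet) > 2:
            parameters["value"] = triplet[2]
        nodes.append({"type": "terminal", "service": "text", "parameters": parameters})
    return {"type": "group", "logical_operator": "and", "nodes": nodes}


query = triplets_to_query([
    ["rcsb_polymer_entity_annotation.type", "exact_match", "GO"],
    ["rcsb_polymer_entity.rcsb_ec_lineage.name", "exists"],
])
# Assumed request envelope; no network call is made here.
payload = json.dumps({"query": query, "return_type": "entry"})
```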

download()[source]

Fetches PDBs from RCSB with an API call. The default query selects protein-only structures with a single chain.

class AlphaFoldDataset(organism='swissprot', version='v4', only_single_chain=True, **kwargs)[source]

3D structures predicted by AlphaFold. Requires the organism name to be specified. See https://alphafold.ebi.ac.uk/download for a full list of available organisms. Pass the full Latin organism name separated by a space or underscore. organism can also be 'swissprot', in which case the full SwissProt structure predictions will be downloaded (ca. 500,000).

Please cite

Jumper, John, et al. “Highly accurate protein structure prediction with AlphaFold.” Nature 596.7873 (2021): 583-589.

Varadi, Mihaly, et al. “AlphaFold Protein Structure Database: massively expanding the structural coverage of protein-sequence space with high-accuracy models.” Nucleic acids research 50.D1 (2022): D439-D444.

Source

Raw data was obtained and modified from AlphaFoldDB, originally licensed under CC-BY-4.0.

Data Properties

| organism | # proteins |
| --- | --- |
| 'arabidopsis_thaliana' | 27,386 |
| 'caenorhabditis_elegans' | 19,613 |
| 'candida_albicans' | 5,951 |
| 'danio_rerio' | 24,430 |
| 'dictyostelium_discoideum' | 12,485 |
| 'drosophila_melanogaster' | 13,318 |
| 'escherichia_coli' | 4,362 |
| 'glycine_max' | 55,696 |
| 'homo_sapiens' | 23,172 |
| 'methanocaldococcus_jannaschii' | 1,773 |
| 'mus_musculus' | 21,398 |
| 'oryza_sativa' | 43,631 |
| 'rattus_norvegicus' | 21,069 |
| 'saccharomyces_cerevisiae' | 6,016 |
| 'schizosaccharomyces_pombe' | 5,104 |
| 'zea_mays' | 39,203 |
| 'swissprot' | 541,143 |

Parameters:

organism (str) – The organism name or ‘swissprot’.
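Since the organism may be given with spaces or underscores, a caller might normalize names to the lower-case, underscore-separated form used in the table of organisms. The helper below is a hypothetical sketch of that normalization; whether the library performs the same step internally is an assumption.

```python
def normalize_organism(name):
    """Map e.g. 'Homo sapiens' to 'homo_sapiens' (hypothetical helper)."""
    # Treat underscores like spaces, collapse repeated whitespace, then rejoin.
    return "_".join(name.strip().lower().replace("_", " ").split())


human = normalize_organism("Homo sapiens")
```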

download()[source]

Implement me in a subclass!

This method is responsible for downloading and extracting raw PDB files from a databank source. All PDB files should be placed in f'{self.root}/raw/files'. See PDBBindRefined for an example.

class GeneOntologyDataset(query=[['rcsb_polymer_entity_annotation.type', 'exact_match', 'GO']], **kwargs)[source]

Proteins with annotated Gene Ontology (GO) terms.

Each protein in the dataset has a GO attribute which stores the path from the root to the leaves along the GO hierarchy. The GeneOntologyDataset also has a godag attribute, which stores the GO hierarchy (see [goatools.obo_parser.GODag](https://github.com/tanghaibao/goatools)).

Please cite

Ashburner, Michael, et al. “Gene Ontology: tool for the unification of biology.” Nature genetics 25.1 (2000): 25-29.

Gene Ontology Consortium, et al. “The Gene Ontology knowledgebase in 2023.” Genetics 224.1 (2023).

Berman, H M et al. “The Protein Data Bank.” Nucleic acids research vol. 28,1 (2000): 235-42. doi:10.1093/nar/28.1.235

Source

Raw data was obtained and modified from RCSB Protein Data Bank, originally licensed under CC0 1.0.

Dataset stats

# proteins

32633

Annotations

| Attribute | Key | Sample value |
| --- | --- | --- |
| Molecular Function | protein['protein']['molecular_function'] | ['GO:0003674', 'GO:0005198'] |
| Localization | protein['protein']['cellular_component'] | ['GO:0005575', 'GO:0018995', ..] |
| Biological process | protein['protein']['biological_process'] | |

download()[source]

Fetches PDBs from RCSB with an API call. The default query selects protein-only structures with a single chain.

add_protein_attributes(protein)[source]

Implement me in a subclass!

This method annotates protein objects with additional information, such as functional labels or classes. It takes a protein object (a dictionary), modifies it, and returns it. Usually, this would use the ID attribute to load an annotation file or to query information from a database.

Parameters:

protein_dict (dict) – A protein_dict object. See proteinshake.datasets.Dataset.parse_pdb() for details.

Returns:

The protein_dict object with a new attribute added.

Return type:

dict

class EnzymeCommissionDataset(query=[['rcsb_polymer_entity.rcsb_ec_lineage.name', 'exists']], **kwargs)[source]

Enzymes with annotated enzyme commission (EC) numbers.

Please cite

Berman, H M et al. “The Protein Data Bank.” Nucleic acids research vol. 28,1 (2000): 235-42. doi:10.1093/nar/28.1.235

Source

Raw data was obtained and modified from RCSB Protein Data Bank, originally licensed under CC0 1.0.

Dataset stats

# proteins

15603

add_protein_attributes(protein)[source]

Fetch the enzyme class for each protein.

class ProteinFamilyDataset(pfam_version='34.0', query=[['rcsb_polymer_entity_annotation.type', 'exact_match', 'Pfam']], **kwargs)[source]

Proteins with annotated protein families (Pfam).

Each protein in the dataset has a Pfam attribute which stores the list of protein families.

Please cite

Berman, H M et al. “The Protein Data Bank.” Nucleic acids research vol. 28,1 (2000): 235-42. doi:10.1093/nar/28.1.235

Source

Raw data was obtained and modified from RCSB Protein Data Bank, originally licensed under CC0 1.0.

Dataset stats

# proteins

31109

Annotations

| Attribute | Key | Sample value |
| --- | --- | --- |
| Pfam accession code | protein['protein']['Pfam'] | ['PF00102'] |

add_protein_attributes(protein)[source]

Implement me in a subclass!

This method annotates protein objects with additional information, such as functional labels or classes. It takes a protein object (a dictionary), modifies it, and returns it. Usually, this would use the ID attribute to load an annotation file or to query information from a database.

Parameters:

protein_dict (dict) – A protein_dict object. See proteinshake.datasets.Dataset.parse_pdb() for details.

Returns:

The protein_dict object with a new attribute added.

Return type:

dict

class ProteinProteinInterfaceDataset(cutoff=6, version='2020', **kwargs)[source]

Protein-protein complexes from PDBBind with annotated interfaces.

Residues and atoms in each protein are marked with a boolean is_interface to indicate residues/atoms defined to belong to the interface of two protein chains. The default threshold for determining interface residues is 6 Angstroms (used by DIPS). See proteinshake.utils.get_interfaces() for details.

Please cite

Wang, Renxiao, et al. “The PDBbind database: Collection of binding affinities for protein-ligand complexes with known three-dimensional structures.” Journal of medicinal chemistry 47.12 (2004): 2977-2980.

Source

Raw data was obtained and modified with permission from PDBbind-CN, originally licensed under the End User Agreement for Access to the PDBbind-CN Database and Web Site.

Parameters:
  • root (str) – Root directory where the dataset should be saved.

  • name (str) – The name of the dataset.

  • version (str) – PDBBind database version to use.

  • cutoff (float) – Distance in angstroms within which a pair of residues is considered to belong to the interface.

Dataset stats

# proteins

2839

Annotations

| Attribute | Key | Sample value |
| --- | --- | --- |
| Chain Identifier | protein[{'residue' \| 'atom'}]['chain_id'] | ['X', 'X', .., 'Y', 'Y'] |
| Binding interface | protein[{'residue' \| 'atom'}]['is_interface'] | [0, 0, .., 1, 0] |

get_contacts(protein, cutoff=6)[source]

Obtains the interfacing residues between the polymer chains of a single structure. Uses a KDTree data structure for the neighbor search.

Parameters:

protein (dict) – Parsed protein dictionary.

Returns:

A 2-level dictionary mapping a pair of chains to the set of interfacing residue positions. For example, interfaces['A']['B'] = {(1, 3), (2, 4)} says that residues 1 and 2 of chain A are in contact with residues 3 and 4 of chain B, respectively. The positions are indices in the residue/atom list.

Return type:

dict
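The shape of this return value can be reproduced with a small sketch. It swaps the KDTree for a brute-force pairwise distance check, which is only practical for small inputs, and assumes that a distance equal to the cutoff counts as a contact. The function name and its arguments (parallel lists of chain IDs and coordinates) are hypothetical.

```python
from collections import defaultdict
from itertools import combinations
from math import dist


def get_contacts_bruteforce(chain_ids, coords, cutoff=6.0):
    """Map each ordered chain pair to the set of index pairs within `cutoff` angstroms."""
    interfaces = defaultdict(lambda: defaultdict(set))
    for i, j in combinations(range(len(coords)), 2):
        a, b = chain_ids[i], chain_ids[j]
        # Only pairs from different chains can form an interface.
        if a != b and dist(coords[i], coords[j]) <= cutoff:
            interfaces[a][b].add((i, j))
            interfaces[b][a].add((j, i))
    return interfaces


chain_ids = ["A", "A", "B", "B"]
coords = [(0.0, 0.0, 0.0), (20.0, 0.0, 0.0), (3.0, 0.0, 0.0), (40.0, 0.0, 0.0)]
interfaces = get_contacts_bruteforce(chain_ids, coords)
```

Here only index 0 (chain A) and index 2 (chain B) lie within the cutoff, so each direction of the chain pair holds a single index pair.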

parse_interfaces()[source]

Gets all interfaces and stores them in a dictionary.

download()[source]

Implement me in a subclass!

This method is responsible for downloading and extracting raw PDB files from a databank source. All PDB files should be placed in f'{self.root}/raw/files'. See PDBBindRefined for an example.

chain_split(dest)[source]

Split all the raw PDBs in path to individual ones by chain. Replaces original PDB file in place.

class ProteinLigandInterfaceDataset(version='2020', **kwargs)[source]

Proteins bound to small molecules from PDBBind with annotated binding site, ligand and affinity information.

Please cite

Wang, Renxiao, et al. “The PDBbind database: Collection of binding affinities for protein-ligand complexes with known three-dimensional structures.” Journal of medicinal chemistry 47.12 (2004): 2977-2980.

Source

Raw data was obtained and modified with permission from PDBbind-CN, originally licensed under the End User Agreement for Access to the PDBbind-CN Database and Web Site.

Parameters:
  • root (str) – Root directory where the dataset should be saved.

  • name (str) – The name of the dataset.

  • version (str) – PDBBind version to use.

Dataset stats

# proteins

4642

Annotations

| Attribute | Key | Sample value |
| --- | --- | --- |
| Dissociation constant (kd) | protein['protein']['kd'] | 77.0 |
| Affinity | protein['protein']['neglog_aff'] | 4.11000 |
| Resolution (Angstroms) | protein['protein']['resolution'] | 2.20 |
| Year solved | protein['protein']['year'] | 2016 |
| Ligand identifier (PDB code) | protein['protein']['ligands_id'] | IEE |
| Ligand SMILES | protein['protein']['ligand_smiles'] | 'Cc1ccc(CNc2cc(Cl)nc(N)n2)cc1' |
| Molecular fingerprints | protein['protein']['fp_maccs'], protein['protein']['fp_morgan_r2'] | [.., 0, 0, 1, 0, 1, 0, 0, 0, ..] |
| Binding site (1 if in binding site, 0 else) | protein['residue']['binding_site'] | [.., 0, 0, 1, 0, 1, 0, 0, 0, ..] |

affinity_parse(s)[source]

Parses the affinity string, e.g. Kd=30uM.

Parameters:

s (str) – Affinity measurement string to parse.

Returns:

Dictionary containing the parsed affinity information. The value key stores the float value of the measurement, operator is the relational operator (e.g. =, >) applied to the value, unit is the unit (e.g. uM, nM, pM), and measure is the type of experimental measurement (e.g. Kd, Ki, IC50).

Return type:

dict
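A regular expression is one way to split such a string into the four documented keys. The sketch below is an illustration only: the pattern, the function name affinity_parse_sketch, and the choice to return None on malformed input are assumptions, and the library's own parser may behave differently.

```python
import re

# measure (e.g. Kd, IC50), operator (e.g. =, >, <=), numeric value, unit (e.g. uM)
AFFINITY_PATTERN = re.compile(
    r"^(?P<measure>[A-Za-z0-9]+)"
    r"(?P<operator>[<>=~]+)"
    r"(?P<value>\d+(?:\.\d+)?)"
    r"(?P<unit>[A-Za-z]+)$"
)


def affinity_parse_sketch(s):
    """Parse strings like 'Kd=30uM' into a dict, or return None if malformed."""
    match = AFFINITY_PATTERN.match(s.strip())
    if match is None:
        return None
    return {
        "measure": match["measure"],
        "operator": match["operator"],
        "value": float(match["value"]),
        "unit": match["unit"],
    }


parsed = affinity_parse_sketch("Kd=30uM")
```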

parse_pdbbind_PL_index(index_path)[source]

> INDEX_refined_data.2020

    # ==============================================================================
    # List of the protein-ligand complexes in the PDBbind refined set v.2020
    # 5316 protein-ligand complexes in total, which are ranked by binding data
    # Latest update: July 2021
    # PDB code, resolution, release year, -logKd/Ki, Kd/Ki, reference, ligand name
    # ==============================================================================
    2r58  2.00  2007  2.00  Kd=10mM    // 2r58.pdf (MLY)
    3c2f  2.35  2008  2.00  Kd=10.1mM  // 3c2f.pdf (PRP)
    3g2y  1.31  2009  2.00  Ki=10mM    // 3g2y.pdf (GF4)
    3pce  2.06  1998  2.00  Ki=10mM    // 3pce.pdf (3HP)
    4qsu  1.90  2014  2.00  Kd=10mM    // 4qsu.pdf (TDR)
    4qsv  1.90  2014  2.00  Kd=10mM    // 4qsv.pdf (THM)
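A single data line of the index shown above could be parsed along the following lines. This is a sketch under the assumption that every data line has five whitespace-separated fields before the // separator and a ligand name in parentheses after it; parse_index_line and the returned key names are hypothetical.

```python
def parse_index_line(line):
    """Parse one line of the index file; return None for comments and blank lines."""
    if not line.strip() or line.lstrip().startswith("#"):
        return None
    data, _, comment = line.partition("//")
    pdb_code, resolution, year, neglog_aff, affinity = data.split()
    # The ligand name sits between the first '(' and the last ')' of the comment.
    ligand = comment[comment.index("(") + 1 : comment.rindex(")")]
    # Keep None when the resolution field is not a plain number.
    is_number = resolution.replace(".", "", 1).isdigit()
    return {
        "pdb_code": pdb_code,
        "resolution": float(resolution) if is_number else None,
        "year": int(year),
        "neglog_aff": float(neglog_aff),
        "affinity": affinity,
        "ligand": ligand,
    }


record = parse_index_line("2r58  2.00  2007  2.00  Kd=10mM    // 2r58.pdf (MLY)")
```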

download()[source]

Implement me in a subclass!

This method is responsible for downloading and extracting raw PDB files from a databank source. All PDB files should be placed in f'{self.root}/raw/files'. See PDBBindRefined for an example.

add_protein_attributes(protein)[source]

Implement me in a subclass!

This method annotates protein objects with additional information, such as functional labels or classes. It takes a protein object (a dictionary), modifies it, and returns it. Usually, this would use the ID attribute to load an annotation file or to query information from a database.

Parameters:

protein_dict (dict) – A protein_dict object. See proteinshake.datasets.Dataset.parse_pdb() for details.

Returns:

The protein_dict object with a new attribute added.

Return type:

dict

class SCOPDataset(query=[], from_list=None, only_single_chain=True, max_requests=20, **kwargs)[source]

Proteins with annotated SCOP class.

Please cite

Murzin, Alexey G., et al. “SCOP: a structural classification of proteins database for the investigation of sequences and structures.” Journal of molecular biology 247.4 (1995): 536-540.

Berman, H M et al. “The Protein Data Bank.” Nucleic acids research vol. 28,1 (2000): 235-42. doi:10.1093/nar/28.1.235

Source

Raw data was obtained and modified from RCSB Protein Data Bank, originally licensed under CC0 1.0.

Dataset stats

# proteins

10066

download()[source]

Fetches PDBs from RCSB with an API call. The default query selects protein-only structures with a single chain.

add_protein_attributes(protein)[source]

Annotates the protein with the SCOP classification at each level.

SCOPCLA - SCOP domain classification. The abbreviations denote: TP=protein type, CL=protein class, CF=fold, SF=superfamily, FA=family

class TMAlignDataset(**kwargs)[source]

Proteins that were aligned with TMalign annotated with distance/similarity metrics. The dataset provides the TM-score, RMSD, Global Distance Test (GDT), and Local Distance Difference Test (LDDT).

Please cite

Zhang, Yang, and Jeffrey Skolnick. “TM-align: a protein structure alignment algorithm based on the TM-score.” Nucleic acids research 33.7 (2005): 2302-2309.

Berman, H M et al. “The Protein Data Bank.” Nucleic acids research vol. 28,1 (2000): 235-42. doi:10.1093/nar/28.1.235

Source

Raw data was obtained and modified from RCSB Protein Data Bank, originally licensed under CC0 1.0.

Dataset stats

# proteins

994

from proteinshake.datasets import TMAlignDataset

dataset = TMAlignDataset()
proteins = dataset.proteins()
# Take the IDs of the first two proteins yielded by the generator.
protein_1 = next(proteins)['protein']['ID']
protein_2 = next(proteins)['protein']['ID']

# Pairwise metrics for the two proteins (sample outputs shown as comments).
dataset.tm_score(protein_1, protein_2)  # 0.03
dataset.rmsd(protein_1, protein_2)      # 3.64
dataset.gdt(protein_1, protein_2)       # 0.61
dataset.lddt(protein_1, protein_2)      # 0.65

align_structures()[source]

Calls TMalign on all pairs of structures and saves the output.

class ProteinLigandDecoysDataset(root='data', use_precomputed=True, release='latest', only_single_chain=False, check_sequence=False, n_jobs=1, minimum_length=10, maximum_length=2048, exclude_ids=[], skip_signature_check=False, verbosity=2)[source]

Proteins (targets) from DUDE-Z with annotated ligands and decoys.

Each molecule is encoded as a SMILES string and is meant to be used in a virtual screening setting. In this setting, a model is given a protein and a ligand and outputs a score reflecting the likelihood that the given molecule is a binder. This score is then used to sort the union of all ligands and decoys. A good model places true ligands at the top of this list. This is known as enrichment factor analysis.

Please cite

Stein, Reed M et al. “Property-Unmatched Decoys in Docking Benchmarks.” Journal of chemical information and modeling vol. 61,2 (2021): 699-714. doi:10.1021/acs.jcim.0c00598
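The ranking-based evaluation described for this dataset can be sketched with the common definition of the enrichment factor: the rate of true ligands among the top-ranked fraction divided by their rate in the whole list. The function name, the default fraction of 0.1, and the toy scores are assumptions made for illustration; the library's own evaluation may use a different cutoff or formula.

```python
def enrichment_factor(scores, is_active, fraction=0.1):
    """Active rate in the top `fraction` of the ranking divided by the overall active rate."""
    if not scores or len(scores) != len(is_active):
        raise ValueError("scores and is_active must be non-empty and of equal length")
    actives_total = sum(is_active)
    if actives_total == 0:
        raise ValueError("at least one active molecule is required")
    # Sort molecules from highest to lowest predicted score.
    ranked = sorted(zip(scores, is_active), key=lambda pair: pair[0], reverse=True)
    n_top = max(1, int(round(fraction * len(ranked))))
    actives_top = sum(active for _, active in ranked[:n_top])
    return (actives_top / n_top) / (actives_total / len(ranked))


# Toy example: 2 ligands (label 1) and 8 decoys (label 0); the model ranks both ligands first.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
labels = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
ef = enrichment_factor(scores, labels, fraction=0.2)
```

A value of 1 corresponds to a random ranking; in the toy example the top 20% contains only ligands, which gives the maximum possible enrichment for that fraction.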

pdb2df(path)[source]

Parses a single PDB file to a DataFrame (with biopandas). Also deals with multiple structure models in a PDB (e.g. from NMR) by only selecting the first model.

Parameters:

path (str) – Path to PDB file.

Returns:

A biopandas DataFrame of the PDB file.

Return type:

DataFrame

download()[source]

Implement me in a subclass!

This method is responsible for downloading and extracting raw PDB files from a databank source. All PDB files should be placed in f'{self.root}/raw/files'. See PDBBindRefined for an example.

add_protein_attributes(protein)[source]

We annotate each protein with a list of decoys and a list of active SMILES strings and molecule IDs.