utils

onehot

Compute the one-hot encoding of a protein sequence.

tokenize

Tokenizes the sequence.

positional_encoding

Sinusoidal encoding of sequence position.

compose_embeddings

Composes multiple embeddings into one by concatenating the results.

save

Saves an object to either pickle, json, or json.gz (determined by the extension in the file name).

load

Loads a pickle, json or json.gz file.

download_url

Downloads a file from a URL.

extract_tar

Extracts a tar file.

zip_file

Zips a file.

unzip_file

Unzips a .gz file.

write_avro

Writes a list of protein dictionaries to an avro file.

uniprot_query

uniprot_map

onehot(sequence, resolution='residue')

Compute the one-hot encoding of a protein sequence.

Parameters:
  • sequence (str) – The protein sequence.

  • resolution (str, default 'residue') – Resolution of the protein. ‘residue’ or ‘atom’.

Returns:

The embedded sequence.

Return type:

ndarray
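As a rough illustration of what a residue-level one-hot encoding looks like, the sketch below builds one row per residue with a single 1 at that residue's index. The 20-letter alphabet, its ordering, and the name onehot_sketch are assumptions made for this example. The library's own alphabet and its handling of the 'atom' resolution may differ, and it returns an ndarray where this dependency-free sketch returns nested lists.

```python
# Hypothetical alphabet and ordering, chosen only for this example.
ALPHABET = "ACDEFGHIKLMNPQRSTVWY"


def onehot_sketch(sequence):
    """Return one row per residue, with a single 1 at the residue's alphabet index."""
    index = {aa: i for i, aa in enumerate(ALPHABET)}
    rows = []
    for aa in sequence:
        row = [0] * len(ALPHABET)
        row[index[aa]] = 1  # raises KeyError for letters outside ALPHABET
        rows.append(row)
    return rows


encoded = onehot_sketch("ACD")  # 3 rows of 20 columns each
```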

tokenize(sequence, resolution='residue')

Tokenizes the sequence.

Parameters:
  • sequence (str) – The protein sequence.

  • resolution (str, default 'residue') – Resolution of the protein. ‘residue’ or ‘atom’.

Returns:

The tokenized sequence.

Return type:

ndarray
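Tokenization at residue resolution can be pictured as mapping each letter to an integer index. The sketch below assumes a hypothetical 20-letter alphabet and ordering; the token ids the library actually assigns may differ, and it returns an ndarray, not a list.

```python
# Hypothetical alphabet and ordering, chosen only for this example.
ALPHABET = "ACDEFGHIKLMNPQRSTVWY"


def tokenize_sketch(sequence):
    """Map each residue letter to its integer position in ALPHABET."""
    index = {aa: i for i, aa in enumerate(ALPHABET)}
    return [index[aa] for aa in sequence]


tokens = tokenize_sketch("ACDY")
```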

positional_encoding(sequence, dim=128)

Sinusoidal encoding of sequence position.

Parameters:

  • sequence (str) – The protein sequence.

  • dim (int, default 128) – Dimension of the encoding.

Returns:

The embedded sequence.

Return type:

ndarray
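For readers unfamiliar with sinusoidal encodings, here is a minimal sketch of the common Transformer-style formulation: even columns hold sin(pos / 10000^(2i/dim)) and odd columns the matching cosine. Whether the library uses exactly this base and column layout is an assumption, and positional_encoding_sketch is a hypothetical name.

```python
import math


def positional_encoding_sketch(sequence, dim=128):
    """One row per sequence position: sine on even columns, cosine on odd columns."""
    rows = []
    for pos in range(len(sequence)):
        row = []
        for k in range(dim):
            # Columns 2i and 2i+1 share the frequency 1 / 10000^(2i/dim).
            angle = pos / (10000 ** ((2 * (k // 2)) / dim))
            row.append(math.sin(angle) if k % 2 == 0 else math.cos(angle))
        rows.append(row)
    return rows


encoding = positional_encoding_sketch("MKT", dim=8)  # 3 positions, 8 columns
```

Only the length of the sequence matters here; the residue letters themselves do not enter the encoding.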

compose_embeddings(embeddings)

Composes multiple embeddings into one by concatenating the results.

Parameters:

embeddings (list) – A list of embeddings

Returns:

A substitute embedding function.

Return type:

function
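The idea of returning a single function that concatenates the outputs of several embeddings can be sketched as follows. The sketch assumes each embedding is a callable that returns one feature row per residue and that concatenation happens along the feature axis. The helper names are hypothetical, and plain lists stand in for arrays.

```python
def compose_sketch(embeddings):
    """Return a function that applies every embedding and joins the rows per position."""
    def composed(sequence):
        parts = [embed(sequence) for embed in embeddings]
        # Row i of every embedding is concatenated into one longer row.
        return [
            sum((list(part[i]) for part in parts), [])
            for i in range(len(sequence))
        ]
    return composed


# Two toy per-residue embeddings, invented for this example.
def is_glycine(seq):
    return [[1 if aa == "G" else 0] for aa in seq]


def position(seq):
    return [[i] for i in range(len(seq))]


embed = compose_sketch([is_glycine, position])
result = embed("AGA")
```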

save(obj, path)

Saves an object to either pickle, json, or json.gz (determined by the extension in the file name).

Parameters:
  • obj – The object to be saved.

  • path – The path to save the object.

load(path)

Loads a pickle, json or json.gz file.

Parameters:

path – The path to be loaded.

Returns:

The loaded object.

Return type:

object
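Extension-based dispatch for saving and loading, as described for save and load, might look roughly like the sketch below. It uses only the standard library; treating every path that does not end in .json or .json.gz as pickle is an assumption, and save_sketch and load_sketch are hypothetical names.

```python
import gzip
import json
import pickle


def save_sketch(obj, path):
    """Pick the serializer from the file extension."""
    if path.endswith(".json.gz"):
        with gzip.open(path, "wt", encoding="utf-8") as f:
            json.dump(obj, f)
    elif path.endswith(".json"):
        with open(path, "w", encoding="utf-8") as f:
            json.dump(obj, f)
    else:
        # Assumed fallback: anything else is written as pickle.
        with open(path, "wb") as f:
            pickle.dump(obj, f)


def load_sketch(path):
    """Mirror of save_sketch: pick the deserializer from the file extension."""
    if path.endswith(".json.gz"):
        with gzip.open(path, "rt", encoding="utf-8") as f:
            return json.load(f)
    if path.endswith(".json"):
        with open(path, "r", encoding="utf-8") as f:
            return json.load(f)
    with open(path, "rb") as f:
        return pickle.load(f)
```

The .json.gz check comes first because such a path does not end in .json, and checking in this order keeps the two JSON branches unambiguous.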

download_url(url, out_path, verbosity=2, chunk_size=10485760)

Downloads a file from a URL. If out_path is a directory, the file is saved under the URL's basename.

Parameters:
  • url (str) – The URL to download.

  • out_path (str) – Path to save the downloaded file.

  • verbosity (int, default 2) – Verbosity level; controls whether a progress bar is shown.

  • chunk_size (int, default 10485760) – The chunk size of the download.
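Two details of the description above, the directory rule for out_path and the chunked transfer, can be sketched without touching the network. Both helper names are hypothetical, the example URL is a placeholder, and the library's own implementation may differ.

```python
import os
from urllib.parse import urlparse


def resolve_out_path(url, out_path):
    """If out_path is an existing directory, save under the URL's basename."""
    if os.path.isdir(out_path):
        return os.path.join(out_path, os.path.basename(urlparse(url).path))
    return out_path


def copy_in_chunks(src, dst, chunk_size=10485760):
    """Copy between binary streams, reading at most chunk_size bytes at a time."""
    total = 0
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        dst.write(chunk)
        total += len(chunk)
    return total
```

Reading in fixed-size chunks keeps memory use bounded by chunk_size regardless of the size of the downloaded file.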

extract_tar(tar_path, out_path, extract_members=False, strip=0, verbosity=2)

Extracts a tar file.

Parameters:
  • tar_path – The path to the tar file.

  • out_path – The directory to extract to.

  • extract_members (bool, default False) – If True, the members of the tar file are extracted directly into out_path instead of into a subdirectory.

  • strip (int, default 0) – Number of leading folder levels to remove from the paths of the extracted files.
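The effect of strip on a member path can be illustrated with a small helper. The sketch assumes strip behaves like the --strip-components option of the tar command line tool; strip_components is a hypothetical name, not a function of the library.

```python
from pathlib import PurePosixPath


def strip_components(member_name, strip=0):
    """Drop the first `strip` folder levels from a tar member path."""
    parts = PurePosixPath(member_name).parts
    return "/".join(parts[strip:])


one_level = strip_components("archive/data/1abc.pdb", strip=1)
two_levels = strip_components("archive/data/1abc.pdb", strip=2)
```

A member whose path has no more components than strip is reduced to an empty string, so an extraction routine built on this helper would typically skip it.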

zip_file(path)

Zips a file.

Parameters:

path – The path to the file.

unzip_file(path, remove=True)

Unzips a .gz file.

Parameters:

path – The path to the .gz file.
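A gzip round trip with the standard library gives a sense of what a zip and unzip pair for .gz files involves. The sketch assumes that zipping writes path + '.gz' next to the input and that remove=True deletes the .gz archive after decompression. Neither behaviour is confirmed by the descriptions above, and both function names are hypothetical.

```python
import gzip
import os
import shutil


def gzip_sketch(path):
    """Compress path into path + '.gz' and return the new path."""
    with open(path, "rb") as f_in, gzip.open(path + ".gz", "wb") as f_out:
        shutil.copyfileobj(f_in, f_out)
    return path + ".gz"


def gunzip_sketch(path, remove=True):
    """Decompress a .gz file next to itself; optionally delete the archive."""
    if not path.endswith(".gz"):
        raise ValueError("expected a .gz file")
    out_path = path[:-3]
    with gzip.open(path, "rb") as f_in, open(out_path, "wb") as f_out:
        shutil.copyfileobj(f_in, f_out)
    if remove:
        os.remove(path)  # assumed meaning of remove=True
    return out_path
```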

write_avro(proteins, path)

Writes a list of protein dictionaries to an avro file.

Parameters:
  • proteins (list) – The list of proteins.

  • path – The path to the output file.