Utility

deside.utility.log_exp2cpm(exp_df: DataFrame | array, log_base=2, correct=1) -> DataFrame | array

Convert log2(CPM + 1) to non-log space values (CPM / TPM)

Parameters:
  • exp_df – samples by genes

  • log_base – the base of log transform

  • correct – the pseudocount that was added before the log transform to avoid log(0), default is 1

Returns:

counts per million (CPM) or transcripts per million (TPM)
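The inverse transform described above can be sketched with NumPy, assuming the forward transform was log_base(x + correct). The helper name inverse_log_transform is hypothetical and not part of deside; the library's own implementation may differ in details such as rescaling after the conversion.

```python
import numpy as np


def inverse_log_transform(log_values, log_base=2, correct=1):
    """Invert y = log_base(x + correct), i.e. x = log_base ** y - correct.

    Hypothetical helper for illustration only, not the deside implementation.
    """
    log_values = np.asarray(log_values, dtype=float)
    return np.power(float(log_base), log_values) - correct


# Two samples (rows) by three genes (columns), stored as log2(CPM + 1).
original = np.array([[0.0, 3.0, 7.0], [1.0, 15.0, 1023.0]])
log_cpm = np.log2(original + 1)

# Applying the inverse recovers the non-log values.
recovered = inverse_log_transform(log_cpm, log_base=2, correct=1)
```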

deside.utility.non_log2cpm(exp_df, sum_exp=1000000.0) -> DataFrame

Normalize gene expression to CPM / TPM for non-log space

Parameters:
  • exp_df – gene expression profile in non-log space, sample by gene

  • sum_exp – sum of gene expression for each sample, default is 1e6

Returns:

counts per million (CPM) or transcripts per million (TPM)
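As a rough illustration of this kind of normalization, the sketch below rescales each sample (row) so that its expression values sum to sum_exp. The function name normalize_to_sum is hypothetical, and the code is an assumption about the general technique rather than a copy of the deside source.

```python
import numpy as np


def normalize_to_sum(exp, sum_exp=1e6):
    """Rescale each row (sample) so that its values sum to sum_exp.

    Hypothetical helper for illustration only. A row whose values sum to
    zero would produce NaN here; real code should decide how to handle it.
    """
    exp = np.asarray(exp, dtype=float)
    row_sums = exp.sum(axis=1, keepdims=True)
    return exp / row_sums * sum_exp


# Two samples by three genes, non-log space.
raw = np.array([[10.0, 30.0, 60.0], [1.0, 1.0, 2.0]])
cpm = normalize_to_sum(raw)
```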

deside.utility.non_log2log_cpm(input_file_path: str | DataFrame, result_file_path: str | None = None, transpose: bool = True, correct: int = 1)

Convert non-log expression data to log2(CPM + 1) or log2(TPM + 1)

Parameters:
  • input_file_path – non-log space expression file, genes by samples

  • result_file_path – file path, samples by genes

  • transpose – if input file is samples by genes, set to False, otherwise set to True

  • correct – the pseudocount added before the log transform to avoid log(0), default is 1

Returns:

log2(CPM + 1) values, returned or saved to file; samples by genes if transpose is True, otherwise genes by samples
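The conversion can be sketched in pandas as three steps: optionally transpose a genes-by-samples table, normalize each sample to one million, then apply log2(x + correct). The function to_log2_cpm and the toy data are hypothetical; this is an illustration of the transform, not the deside implementation, and it omits file reading and writing.

```python
import numpy as np
import pandas as pd


def to_log2_cpm(exp_df, transpose=True, correct=1):
    """Non-log expression -> log2(CPM + correct). Illustration only."""
    if transpose:
        # genes by samples -> samples by genes
        exp_df = exp_df.T
    # Normalize each sample (row) so that it sums to one million.
    cpm = exp_df.div(exp_df.sum(axis=1), axis=0) * 1e6
    return np.log2(cpm + correct)


# Toy input: genes (index) by samples (columns), non-log space.
genes_by_samples = pd.DataFrame(
    {"s1": [10, 30, 60], "s2": [0, 5, 5]},
    index=["geneA", "geneB", "geneC"],
)
log_cpm = to_log2_cpm(genes_by_samples, transpose=True)
```

With correct=1, a gene with zero expression maps to log2(1) = 0, so zeros stay zeros after the transform.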

deside.utility.read_data_from_h5ad(h5ad_file_path: str) -> dict

Read simulated bulk gene expression profiles (GEPs) from .h5ad file

Parameters:

h5ad_file_path – the file path of simulated bulk GEPs file (.h5ad)

Returns:

a dict containing the bulk GEPs (bulk_exp) and the cell fractions (cell_frac) used to generate this dataset

class deside.utility.read_file.ReadExp(exp_file, exp_type='TPM', transpose: bool = False)

Read gene expression file, and convert to specific format (TPM / CPM, log2cpm1p)

  • TPM: transcripts per million

  • CPM: counts per million, i.e. UMI reads per million (3’ end scRNA-seq); equivalent to TPM in full-length RNA-seq of bulk cells

  • log_space: log2(CPM + 1), or log2(TPM + 1)

  • non_log: non-log space, can be normalized to TPM

  • Data from full-length protocols may benefit from normalization methods that take into account gene length (e.g. Patel et al, 2014; Kowalczyk et al, 2015; Soneson & Robinson, 2018), while 3’ enrichment data do not.

  • A commonly used normalization method for full-length scRNA-seq data is TPM normalization (Li et al, 2009), which comes from bulk RNA-seq analysis. (Luecken, M. D. & Theis, F. J., Mol. Syst. Biol. 15, e8746 (2019))

Parameters:
  • exp_file – file path or DataFrame, samples by genes

  • exp_type – TPM / CPM, log_space, non_log

  • transpose – transpose if exp_file is formatted as genes (index) by samples (columns)

align_with_gene_list(gene_list: list | None = None, fill_not_exist=False, pathway_list: bool = False)

Align the expression matrix with a gene list and rescale to TPM or log2(TPM + 1)

Parameters:
  • gene_list – gene list

  • fill_not_exist – if True, fill with 0 any gene in the provided gene_list that does not exist in the expression matrix

  • pathway_list – set to True if gene_list contains pathway names, in which case TPM normalization is not suitable
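A minimal pandas sketch of aligning and rescaling, assuming the behavior described for fill_not_exist=True: genes missing from the matrix are added as zero columns, columns are ordered by the gene list, and each sample is rescaled to sum to one million. The function align_and_rescale is hypothetical; how deside behaves when fill_not_exist is False, or when pathway_list is True, is not covered here.

```python
import pandas as pd


def align_and_rescale(exp_df, gene_list, sum_exp=1e6):
    """Reorder columns to gene_list, fill missing genes with 0, rescale rows.

    Hypothetical helper for illustration only, not the deside implementation.
    """
    # reindex keeps only the genes in gene_list, in that order;
    # genes absent from exp_df become columns filled with 0.
    aligned = exp_df.reindex(columns=gene_list, fill_value=0.0)
    # Rescale so that each sample (row) sums to sum_exp again.
    return aligned.div(aligned.sum(axis=1), axis=0) * sum_exp


# One sample by three genes, already in TPM.
tpm = pd.DataFrame({"A": [2e5], "B": [3e5], "C": [5e5]}, index=["s1"])

# "D" is not in the matrix and "B" is not in the gene list.
aligned = align_and_rescale(tpm, gene_list=["C", "A", "D"])
```

Rescaling after alignment matters because dropping or adding genes changes the per-sample total, so the values would otherwise no longer sum to one million.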

do_scaling()

Scale GEPs by sample to [0, 1], as in Scaden
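Per-sample scaling to [0, 1] is commonly done with min-max scaling, which the sketch below illustrates on a samples-by-genes array. The helper min_max_by_sample is hypothetical and only demonstrates the general technique; it is not taken from deside or Scaden.

```python
import numpy as np


def min_max_by_sample(exp):
    """Min-max scale each row (sample) to [0, 1]. Illustration only.

    A row with identical values everywhere has max == min and would
    divide by zero; real code needs to handle that case.
    """
    exp = np.asarray(exp, dtype=float)
    row_min = exp.min(axis=1, keepdims=True)
    row_max = exp.max(axis=1, keepdims=True)
    return (exp - row_min) / (row_max - row_min)


# Two samples by three genes in log space.
log_exp = np.array([[0.0, 5.0, 10.0], [2.0, 4.0, 6.0]])
scaled = min_max_by_sample(log_exp)
```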

do_scaling_by_constant(divide_by=20)

Scale GEPs by dividing by a constant in log space, so that all expression values fall in [0, 1)
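One way to see why the default constant of 20 yields values in [0, 1) is to check the largest value log2(TPM + 1) can take. The arithmetic below assumes each sample sums to 1e6 in non-log space, so no single gene can exceed 1e6; the variable names are illustrative only.

```python
import math

# Upper bound: a single gene carrying the whole 1e6 of a sample.
max_log_value = math.log2(1e6 + 1)

# Dividing by the default constant keeps the bound strictly below 1.
divide_by = 20
max_scaled = max_log_value / divide_by

# Lower bound: a gene with zero expression maps to log2(0 + 1) = 0.
min_scaled = math.log2(0 + 1) / divide_by
```

Since log2(1e6 + 1) is slightly below 20, every scaled value stays below 1, and zero expression maps to exactly 0.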

get_exp() -> DataFrame

Get the expression matrix

get_file_type() -> str

Get the file type

save(file_path, sep=',', transpose: bool = False)

Save the expression matrix to file

Parameters:
  • file_path – file path

  • sep – separator, default is ‘,’

  • transpose – transpose index and columns

to_log2cpm1p()

Convert to log2(TPM + 1)

to_tpm()

Convert to TPM

class deside.utility.read_file.ReadH5AD(file_path: str, show_info: bool = False)

Read an .h5ad file; the values are usually log2 transformed

Parameters:
  • file_path – the file path of .h5ad file, samples by genes, log2cpm1p format

  • show_info – whether to show the information of the dataset after reading

get_cell_fraction() -> None | DataFrame

Get cell fraction, cells by cell types

get_df(result_file_path: str | None = None, convert_to_tpm: bool = False, scaling_by_sample: bool = False) -> DataFrame

Convert to DataFrame, samples by genes, log space (log2cpm1p)

Parameters:
  • result_file_path

  • convert_to_tpm – whether to convert log2cpm1p to TPM

  • scaling_by_sample – whether to scale the expression values of each sample to [0, 1] by ‘min_max’

get_h5ad()

Get the .h5ad file