Utility

deside.utility.log_exp2cpm(exp_df: DataFrame | array, log_base=2, correct=1) → DataFrame | array

Convert log2(CPM + 1) to non-log space values (CPM / TPM)

Parameters:

exp_df – samples by genes
log_base – the base of the log transform
correct – the pseudocount (default 1) that was added before the log transform to avoid log(0)

Returns:

counts per million (CPM) or transcripts per million (TPM)
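
The conversion described above can be sketched as follows. This is a minimal illustration that assumes the function only inverts the log transform (log_base raised to the value, minus the pseudocount); the helper name `inverse_log_transform` is hypothetical and is not part of deside, and the real function may do additional work.

```python
import numpy as np
import pandas as pd


def inverse_log_transform(exp_df, log_base=2, correct=1):
    """Undo log_base(x + correct): return log_base ** value - correct."""
    # Hypothetical helper for illustration; not the deside implementation.
    return np.power(log_base, exp_df) - correct


# Two samples (rows) by three genes (columns), stored as log2(x + 1).
log_exp = pd.DataFrame(
    np.log2(np.array([[0.0, 3.0, 7.0], [1.0, 15.0, 31.0]]) + 1),
    index=["s1", "s2"],
    columns=["g1", "g2", "g3"],
)
non_log = inverse_log_transform(log_exp)
```

The same expression works on a NumPy array, which matches the `DataFrame | array` input type in the signature.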

deside.utility.non_log2cpm(exp_df, sum_exp=1000000.0) → DataFrame

Normalize gene expression to CPM / TPM for non-log space

Parameters:

exp_df – gene expression profile in non-log space, samples by genes
sum_exp – sum of gene expression for each sample, default is 1e6

Returns:

counts per million (CPM) or transcripts per million (TPM)
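
As a rough sketch of this normalization, assuming each sample (row) is rescaled so that its values sum to sum_exp. The helper name `normalize_to_cpm` is hypothetical and only mirrors the documented parameters; it is not the deside implementation.

```python
import pandas as pd


def normalize_to_cpm(exp_df, sum_exp=1e6):
    """Rescale each sample (row) so that its values sum to sum_exp."""
    # Hypothetical helper; assumes samples are rows and genes are columns.
    row_sums = exp_df.sum(axis=1)
    return exp_df.div(row_sums, axis=0) * sum_exp


counts = pd.DataFrame(
    [[10, 30, 60], [5, 5, 90]],
    index=["s1", "s2"],
    columns=["g1", "g2", "g3"],
)
cpm = normalize_to_cpm(counts)
```

After this step every row of `cpm` sums to 1e6, so values are comparable across samples with different sequencing depths.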

deside.utility.non_log2log_cpm(input_file_path: str | DataFrame, result_file_path: str | None = None, transpose: bool = True, correct: int = 1)

Convert non-log expression data to log2(CPM + 1) or log2(TPM + 1)

Parameters:

input_file_path – file path or DataFrame of non-log space expression values, genes by samples
result_file_path – file path for saving the result, samples by genes
transpose – set to False if the input is samples by genes, otherwise set to True
correct – the pseudocount (default 1) added before the log transform to avoid log(0)

Returns:

log2(CPM + 1) values, or the result is saved to file; samples by genes if transpose is True, otherwise genes by samples
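
The transform itself can be sketched as below. This is an illustration only: the helper name `to_log2_cpm1p` is hypothetical, it accepts a DataFrame and does no file reading or writing, and it follows the parameter description (transpose=True means the input is genes by samples and is transposed first).

```python
import numpy as np
import pandas as pd


def to_log2_cpm1p(exp_df, transpose=True, correct=1):
    """Return log2(CPM + correct) with samples as rows."""
    # Hypothetical helper; not the deside function.
    if transpose:
        exp_df = exp_df.T  # genes by samples -> samples by genes
    # Normalize each sample to sum to 1e6, then move to log space.
    cpm = exp_df.div(exp_df.sum(axis=1), axis=0) * 1e6
    return np.log2(cpm + correct)


# Input arranged as genes (rows) by samples (columns).
raw = pd.DataFrame(
    [[10, 5], [30, 5], [60, 90]],
    index=["g1", "g2", "g3"],
    columns=["s1", "s2"],
)
log_cpm = to_log2_cpm1p(raw)
```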

deside.utility.read_data_from_h5ad(h5ad_file_path: str) → dict

Read simulated bulk gene expression profiles (GEPs) from a .h5ad file

Parameters:

h5ad_file_path – the file path of the simulated bulk GEPs file (.h5ad)

Returns:

a dict containing the bulk GEPs (bulk_exp) and the cell fractions (cell_frac) used to generate this dataset

class deside.utility.read_file.ReadExp(exp_file, exp_type='TPM', transpose: bool = False)

Read a gene expression file and convert it to a specific format (TPM / CPM, log2cpm1p)

TPM: transcripts per million
CPM: UMI counts per million (3’ end scRNA-seq), the same as TPM for full-length bulk RNA-seq
log_space: log2(CPM + 1), or log2(TPM + 1)
non_log: non-log space, could be normalized to TPM
Data from full-length protocols may benefit from normalization methods that take into account gene length (e.g. Patel et al, 2014; Kowalczyk et al, 2015; Soneson & Robinson, 2018), while 3’ enrichment data do not.
A commonly used normalization method for full-length scRNA-seq data is TPM normalization (Li et al, 2009), which comes from bulk RNA-seq analysis. (Luecken, M. D. & Theis, F. J., Mol. Syst. Biol. 15, e8746 (2019))

Parameters:

exp_file – file path or DataFrame, samples by genes
exp_type – TPM / CPM, log_space, non_log
transpose – set to True if exp_file is arranged as genes (index) by samples (columns)

align_with_gene_list(gene_list: list | None = None, fill_not_exist=False, pathway_list: bool = False)

Align the expression matrix with a gene list and rescale to TPM or log2(TPM + 1)

Parameters:

gene_list – the list of genes to align with
fill_not_exist – if True, fill with 0 any gene in gene_list that is missing from the expression matrix
pathway_list – set to True if gene_list contains pathway names, in which case TPM normalization is not suitable
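
A minimal sketch of the alignment step, under stated assumptions: the helper name `align_genes` is hypothetical, the input is non-log and samples by genes, and when fill_not_exist is False the missing genes are simply dropped (the actual method may behave differently in that case).

```python
import pandas as pd


def align_genes(exp_df, gene_list, fill_not_exist=False):
    """Select the columns in gene_list and rescale each sample to sum to 1e6."""
    # Hypothetical helper; not the deside implementation.
    if fill_not_exist:
        # Genes absent from exp_df become all-zero columns.
        aligned = exp_df.reindex(columns=gene_list, fill_value=0)
    else:
        # Keep only the requested genes that are present, in gene_list order.
        aligned = exp_df[[g for g in gene_list if g in exp_df.columns]]
    # Re-normalize, because dropping genes changes each sample's total.
    return aligned.div(aligned.sum(axis=1), axis=0) * 1e6


tpm = pd.DataFrame(
    [[200000.0, 300000.0, 500000.0]],
    index=["s1"],
    columns=["g1", "g2", "g3"],
)
aligned = align_genes(tpm, ["g3", "g1", "g9"], fill_not_exist=True)
```

The rescaling matters because a subset of genes no longer sums to 1e6 per sample, so the values would otherwise stop being TPM.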

do_scaling(): Scale GEPs by sample to [0, 1], same as Scaden

do_scaling_by_constant(divide_by=20): Scale GEPs by dividing by a constant in log space, so all expression values are in [0, 1)
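
The two scaling options above can be sketched as follows, assuming the input is log2(TPM + 1), samples by genes. The function names are hypothetical and the code is an illustration rather than the deside implementation. Dividing by 20 keeps values below 1 because a TPM value cannot exceed 1e6 and log2(1e6 + 1) is about 19.93.

```python
import numpy as np
import pandas as pd


def min_max_by_sample(log_exp):
    """Scale each sample (row) to [0, 1] using its own min and max."""
    # A row with identical values has zero range and is not handled here.
    row_min = log_exp.min(axis=1)
    row_range = log_exp.max(axis=1) - row_min
    return log_exp.sub(row_min, axis=0).div(row_range, axis=0)


def scale_by_constant(log_exp, divide_by=20):
    """Divide log2(TPM + 1) values by a constant."""
    return log_exp / divide_by


tpm = pd.DataFrame(
    [[0.0, 1e6, 0.0], [250000.0, 250000.0, 500000.0]],
    index=["s1", "s2"],
    columns=["g1", "g2", "g3"],
)
log_exp = np.log2(tpm + 1)
by_sample = min_max_by_sample(log_exp)
by_constant = scale_by_constant(log_exp)
```

Min-max scaling maps every sample onto the full [0, 1] range, while dividing by a constant preserves the relative expression levels between samples.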

get_exp() → DataFrame: Get the expression matrix

get_file_type() → str: Get the file type

save(file_path, sep=',', transpose: bool = False)

Save the expression matrix to file

Parameters:

file_path – file path
sep – separator, default is ‘,’
transpose – transpose index and columns

to_log2cpm1p(): Convert to log2(TPM + 1)

to_tpm(): Convert to TPM

class deside.utility.read_file.ReadH5AD(file_path: str, show_info: bool = False)

Read a .h5ad file; the values are usually log2-transformed

Parameters:

file_path – the file path of .h5ad file, samples by genes, log2cpm1p format
show_info – whether to show the information of the dataset after reading

get_cell_fraction() → None | DataFrame: Get cell fraction, cells by cell types

get_df(result_file_path: str | None = None, convert_to_tpm: bool = False, scaling_by_sample: bool = False) → DataFrame

Convert to DataFrame, samples by genes, log space (log2cpm1p)

Parameters:

result_file_path –
convert_to_tpm – whether to convert log2cpm1p to TPM
scaling_by_sample – whether to scale the expression values of each sample to [0, 1] by ‘min_max’

get_h5ad(): Get the .h5ad file