Utility
- deside.utility.log_exp2cpm(exp_df: DataFrame | array, log_base=2, correct=1) → DataFrame | array
Convert log2(CPM + 1) to non-log space values (CPM / TPM)
- Parameters:
exp_df – samples by genes
log_base – the base of log transform
correct – pseudocount that was added before the log transform to avoid log(0), default is 1
- Returns:
counts per million (CPM) or transcripts per million (TPM)
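The inverse transform described above can be sketched as follows. This is a minimal illustration, assuming the forward transform was log_base(x + correct); `log_to_linear` is a hypothetical helper written for this example, not the library's implementation.

```python
import pandas as pd


def log_to_linear(exp_df, log_base=2, correct=1):
    """Hypothetical helper: invert log_base(x + correct), giving x."""
    return log_base ** exp_df - correct


# Two samples (rows) by three genes (columns), stored as log2(CPM + 1).
log_exp = pd.DataFrame(
    [[0.0, 1.0, 3.0], [2.0, 0.0, 1.0]],
    index=["s1", "s2"],
    columns=["g1", "g2", "g3"],
)
linear_exp = log_to_linear(log_exp)
print(linear_exp)
```

A value of 0 in log space maps back to 0 in non-log space, which is why the same `correct` value must be used in both directions.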
- deside.utility.non_log2cpm(exp_df, sum_exp=1000000.0) → DataFrame
Normalize gene expression to CPM / TPM for non-log space
- Parameters:
exp_df – gene expression profile in non-log space, sample by gene
sum_exp – sum of gene expression for each sample, default is 1e6
- Returns:
counts per million (CPM) or transcripts per million (TPM)
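As a rough sketch of this kind of normalization, each sample (row) can be rescaled so that its values sum to `sum_exp`. The helper name `normalize_to_sum` is an assumption made for illustration and may differ from how the library does it internally.

```python
import pandas as pd


def normalize_to_sum(exp_df, sum_exp=1e6):
    """Hypothetical helper: rescale each sample (row) to total sum_exp."""
    row_sums = exp_df.sum(axis=1)
    return exp_df.div(row_sums, axis=0) * sum_exp


# Two samples (rows) by three genes (columns), non-log space.
counts = pd.DataFrame(
    [[10, 30, 60], [5, 5, 0]],
    index=["s1", "s2"],
    columns=["g1", "g2", "g3"],
)
cpm = normalize_to_sum(counts)
print(cpm.sum(axis=1))  # each sample now sums to approximately 1e6
```

In this sketch a sample whose expression sums to 0 would produce NaN values, so such samples would need to be filtered out beforehand.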
- deside.utility.non_log2log_cpm(input_file_path: str | DataFrame, result_file_path: str | None = None, transpose: bool = True, correct: int = 1)
Convert non-log expression data to log2(CPM + 1) or log2(TPM + 1)
- Parameters:
input_file_path – non-log space expression file, genes by samples
result_file_path – file path, samples by genes
transpose – if input file is samples by genes, set to False, otherwise set to True
correct – pseudocount added before the log transform to avoid log(0), default is 1
- Returns:
log2(CPM + 1) or save result to file, samples by genes if transpose is True, otherwise genes by samples
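The conversion can be pictured as three steps: optionally transpose to samples by genes, normalize each sample to one million, then take log2 after adding the pseudocount. The sketch below works on an in-memory DataFrame only and follows the `transpose` parameter description; `to_log_cpm` is a hypothetical name, and file reading and writing are left out.

```python
import numpy as np
import pandas as pd


def to_log_cpm(exp_df, transpose=True, correct=1):
    """Hypothetical helper: non-log expression -> log2(CPM + correct)."""
    if transpose:
        exp_df = exp_df.T  # genes by samples -> samples by genes
    cpm = exp_df.div(exp_df.sum(axis=1), axis=0) * 1e6
    return np.log2(cpm + correct)


# Three genes (rows) by two samples (columns), non-log space.
raw = pd.DataFrame(
    {"s1": [10, 30, 60], "s2": [5, 5, 0]},
    index=["g1", "g2", "g3"],
)
log_cpm = to_log_cpm(raw)
print(log_cpm.shape)
```

Genes with zero expression stay at 0 after the transform because log2(0 + 1) = 0.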
- deside.utility.read_data_from_h5ad(h5ad_file_path: str) → dict
Read simulated bulk gene expression profiles (GEPs) from .h5ad file
- Parameters:
h5ad_file_path – the file path of simulated bulk GEPs file (.h5ad)
- Returns:
a dict containing the bulk GEPs (bulk_exp) and the cell fractions (cell_frac) used to generate this dataset
- class deside.utility.read_file.ReadExp(exp_file, exp_type='TPM', transpose: bool = False)
Read gene expression file, and convert to specific format (TPM / CPM, log2cpm1p)
TPM: transcripts per million
CPM: UMI reads per million (3’ end scRNA-seq), same as TPM in the full-length RNA-seq of bulk cells
log_space: log2(CPM + 1), or log2(TPM + 1)
non_log: non log space, could be normalized to TPM
Data from full-length protocols may benefit from normalization methods that take into account gene length (e.g. Patel et al, 2014; Kowalczyk et al, 2015; Soneson & Robinson, 2018), while 3’ enrichment data do not.
A commonly used normalization method for full-length scRNA-seq data is TPM normalization (Li et al, 2009), which comes from bulk RNA-seq analysis. (Luecken, M. D. & Theis, F. J., Mol. Syst. Biol. 15, e8746 (2019))
- Parameters:
exp_file – file path or DataFrame, samples by genes
exp_type – TPM / CPM, log_space, non_log
transpose – transpose if exp_file is arranged as genes (index) by samples (columns)
- align_with_gene_list(gene_list: list | None = None, fill_not_exist=False, pathway_list: bool = False)
Align the expression matrix with a gene list and rescale to TPM or log2(TPM + 1)
- Parameters:
gene_list – gene list
fill_not_exist – if True, fill with 0 any gene in the provided gene_list that does not exist in the expression matrix
pathway_list – set to True if the gene list contains pathway names, in which case TPM normalization is not suitable
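One way to picture the alignment is to reorder the columns to match the gene list, optionally add all-zero columns for missing genes, and then rescale each sample back to one million. The sketch below is an assumption about the general approach, with a hypothetical helper `align_genes`; it covers only non-log TPM input and ignores the `pathway_list` case.

```python
import pandas as pd


def align_genes(exp_df, gene_list, fill_not_exist=False):
    """Hypothetical helper: reorder columns to gene_list, then rescale to 1e6."""
    if fill_not_exist:
        # Genes absent from the matrix become all-zero columns.
        aligned = exp_df.reindex(columns=gene_list, fill_value=0)
    else:
        # Keep only the requested genes that are present, in gene_list order.
        aligned = exp_df[[g for g in gene_list if g in exp_df.columns]]
    return aligned.div(aligned.sum(axis=1), axis=0) * 1e6


# One sample by three genes, already in TPM.
tpm = pd.DataFrame(
    [[200000.0, 300000.0, 500000.0]], index=["s1"], columns=["g1", "g2", "g3"]
)
aligned = align_genes(tpm, ["g3", "g1", "g9"], fill_not_exist=True)
print(aligned)
```

Rescaling after the alignment matters because dropping genes changes the per-sample total, so the values would otherwise no longer sum to one million.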
- do_scaling_by_constant(divide_by=20)
Scale GEPs by dividing by a constant in log space, so that all expression values fall in [0, 1)
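The choice of 20 as the default divisor can be checked with a short calculation, assuming the values are log2(TPM + 1) and each sample sums to 1e6 in non-log space. In that case no single value can exceed log2(1e6 + 1), which is just under 20.

```python
import math

divide_by = 20
# Largest possible value in log2(TPM + 1) space: one gene holds all 1e6 units.
max_log_value = math.log2(1e6 + 1)
scaled_max = max_log_value / divide_by
print(max_log_value, scaled_max)
```

Since the maximum is about 19.93, dividing by 20 keeps every scaled value strictly below 1, while zero expression stays at 0.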
- class deside.utility.read_file.ReadH5AD(file_path: str, show_info: bool = False)
Read a .h5ad file; the values are usually log2-transformed
- Parameters:
file_path – the file path of .h5ad file, samples by genes, log2cpm1p format
show_info – whether to show the information of the dataset after reading
- get_df(result_file_path: str | None = None, convert_to_tpm: bool = False, scaling_by_sample: bool = False) → DataFrame
Convert to DataFrame, samples by genes, log space (log2cpm1p)
- Parameters:
result_file_path – file path for saving the result, optional
convert_to_tpm – whether to convert log2cpm1p to TPM
scaling_by_sample – whether to scale the expression values of each sample to [0, 1] by ‘min_max’
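Min-max scaling by sample can be sketched as follows: for each sample (row), subtract the row minimum and divide by the row range. `min_max_by_sample` is a hypothetical helper used only to illustrate the idea, not the library's own code.

```python
import pandas as pd


def min_max_by_sample(exp_df):
    """Hypothetical helper: min-max scale each sample (row) to [0, 1]."""
    row_min = exp_df.min(axis=1)
    row_range = exp_df.max(axis=1) - row_min
    return exp_df.sub(row_min, axis=0).div(row_range, axis=0)


# Two samples (rows) by three genes (columns), log space.
log_exp = pd.DataFrame(
    [[0.0, 5.0, 10.0], [2.0, 4.0, 6.0]],
    index=["s1", "s2"],
    columns=["g1", "g2", "g3"],
)
scaled = min_max_by_sample(log_exp)
print(scaled)
```

Unlike dividing by a fixed constant, this scaling depends on each sample's own minimum and maximum, so the same raw value can map to different scaled values in different samples.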