Simulation

Simulating bulk expression profile

class deside.simulation.BulkGEPGenerator(simu_bulk_dir, merged_sc_dataset_file_path, sct_dataset_file_path, cell_type2subtype: dict, sc_dataset_ids: list, bulk_dataset_name: str | None = None, check_basic_info: bool = True, zero_ratio_threshold: float = 0.97, sc_dataset_gep_type: str = 'log_space', tcga2cancer_type_file_path: str | None = None, total_rna_coefficient: dict | None = None, subtype_col_name: str | None = None, cell_type_col_name: str | None = None)

Generate bulk GEPs from single cell datasets

Parameters:

simu_bulk_dir – the directory to save simulated bulk cell GEPs
merged_sc_dataset_file_path – the file path of pre-merged single cell datasets
sct_dataset_file_path – the file path of single cell datasets (scGEP, dataset S1)
cell_type2subtype – cell types used when generating bulk GEPs, {‘cell_type’: [‘subtype1’, ‘subtype2’, …], …}
sc_dataset_ids – IDs of the single cell datasets used when generating bulk GEPs
bulk_dataset_name – the name of generated bulk dataset, only for naming
check_basic_info – whether to check basic information of single cell datasets
zero_ratio_threshold – the maximum allowed ratio of zero-valued genes in a single cell GEP; a GEP is removed if its zero ratio > threshold
sc_dataset_gep_type – the type of single cell GEPs, log_space or linear_space
tcga2cancer_type_file_path – the file path of tcga_sample_id2cancer_type.csv, which contains the cancer type of TCGA samples
total_rna_coefficient – the coefficient of total RNA, used to correct the difference of total RNA for each cell type
subtype_col_name – the column name of subtype in single cell datasets
cell_type_col_name – the column name of cell type in single cell datasets
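
The zero_ratio_threshold rule can be illustrated with a minimal, standard-library sketch. It is not DeSide's implementation: the helper names (zero_ratio, drop_sparse_geps) and the toy data are hypothetical, and it assumes the zero ratio is the fraction of genes with a value of exactly zero in one single cell GEP.

```python
def zero_ratio(gep):
    """Fraction of genes with zero expression in one single-cell GEP."""
    return sum(1 for value in gep if value == 0) / len(gep)


def drop_sparse_geps(geps, threshold=0.97):
    """Keep only GEPs whose zero ratio does not exceed the threshold."""
    return {cell_id: gep for cell_id, gep in geps.items()
            if zero_ratio(gep) <= threshold}


# Hypothetical toy data: three cells, four genes each.
toy_geps = {
    "cell_a": [0.0, 1.2, 0.0, 3.4],  # zero ratio 0.50
    "cell_b": [0.0, 0.0, 0.0, 0.0],  # zero ratio 1.00, removed
    "cell_c": [2.1, 0.0, 0.7, 0.9],  # zero ratio 0.25
}
kept = drop_sparse_geps(toy_geps, threshold=0.97)
print(sorted(kept))  # ['cell_a', 'cell_c']
```

With the default threshold of 0.97, only cells in which more than 97% of genes are zero are discarded.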

generate_gep(n_samples, sampling_range: dict | None = None, sampling_method: str = 'segment', total_cell_number: int = 100, n_threads: int = 10, filtering: bool = True, reference_file: str | DataFrame | None = None, ref_exp_type: str | None = None, gep_filtering_quantile: tuple = (None, 0.95), log_file_path: str | None = None, n_top: int = 20, simu_method='mul', filtering_method='media_gep', add_noise: bool = False, noise_params: tuple = (), filtering_ref_types: list | None = None, show_filtering_info: bool = False, cell_prop_prior: dict | None = None, high_corr_gene_list: list | None = None, filtering_by_gene_range: bool = False, min_percentage_within_gene_range: float = 0.95, gene_quantile_range: list | None = None, filtering_in_pca_space: bool = False, pca_n_components: int | float = 0.9, norm_ord=1)

Generating simulated bulk GEPs from scGEP dataset (S1)

Parameters:

n_samples – the number of GEPs to generate
total_cell_number – N, the total number of cells sampled from merged single cell dataset and averaged to simulate a single bulk RNA-seq sample
sampling_method – segment or random, method to generate cell fractions
sampling_range – the range of sampling, such as {‘cell_type1’: [0.1, 0.9], ‘cell_type2’: [0.1, 0.9], …}, optional, only used when sampling_method is random
n_threads – number of threads used for parallel computing
filtering – whether to filter the generated GEPs
reference_file – the file path of reference dataset for filtering
ref_exp_type – the type of expression values in reference dataset, TPM / log_space
gep_filtering_quantile – quantile of the nearest distance of each pair in the reference dataset; a smaller quantile gives a smaller radius, so fewer simulated GEPs will be kept
log_file_path – the path of log file
n_top – if too many neighbors are found for a single sample, only keep n_top neighbors; used in marker ratio filtering
simu_method – the method to generate simulated bulk GEPs: ave (average all selected single cell GEPs) or mul (multiply GEPs by cell fractions)
filtering_method – marker_ratio (l2 distance with marker gene ratio), median_gep (l1 distance with median expression value for each gene), or linear_mmd (A. Gretton et al., J. Mach. Learn. Res. 13, 723–773 (2012); see also https://github.com/jindongwang/transferlearning/blob/master/code/distance/mmd_numpy_sklearn.py)
add_noise – whether to add noise to generated bulk GEPs
noise_params – parameters for noise generation, (f, max_sum), ref: Hao, Yuning, et al. PLoS Computational Biology, 2019
filtering_ref_types – the cancer types used for filtering
show_filtering_info – whether to show filtering information
cell_prop_prior – a prior range of cell proportions for each cell type in solid tumors
high_corr_gene_list – a list of genes whose expression values are highly correlated with the cell proportions of at least one cell type
filtering_by_gene_range – whether to filter GEPs by gene expression range, i.e., by the percentage of genes within a specific quantile range in TCGA
min_percentage_within_gene_range – the minimal percentage of genes within a specific quantile range in TCGA
gene_quantile_range – the quantile range of gene expression values in TCGA for gene based filtering
filtering_in_pca_space – whether to filter GEPs in PCA space
pca_n_components – the number of components used for PCA, or the explained variance ratio of PCA if float
norm_ord – the order of norm used for filtering, 1 for l1 norm, 2 for l2 norm
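
The difference between the two simu_method options can be sketched as follows. This is only an illustration under stated assumptions, not the library's code: the function names, the per-cell-type mean GEPs, and the cell pools are hypothetical. It assumes that mul forms a fraction-weighted sum of one GEP per cell type, and that ave draws about total_cell_number cells in proportion to the fractions and averages them.

```python
import random

# Hypothetical inputs in linear space, three genes per GEP.
fractions = {"T cell": 0.75, "B cell": 0.25}
mean_gep = {"T cell": [10.0, 0.0, 2.0], "B cell": [0.0, 8.0, 4.0]}
cell_pool = {
    "T cell": [[12.0, 0.0, 2.0], [8.0, 0.0, 2.0]],
    "B cell": [[0.0, 8.0, 4.0], [0.0, 8.0, 4.0]],
}


def simulate_mul(fractions, mean_gep):
    """'mul': weight one GEP per cell type by its fraction and sum."""
    n_genes = len(next(iter(mean_gep.values())))
    return [sum(fractions[ct] * mean_gep[ct][g] for ct in fractions)
            for g in range(n_genes)]


def simulate_ave(fractions, cell_pool, total_cell_number, rng):
    """'ave': sample cells in proportion to the fractions, then average."""
    sampled = []
    for cell_type, frac in fractions.items():
        n_cells = round(frac * total_cell_number)
        # Sampling with replacement from the pool of this cell type.
        sampled.extend(rng.choice(cell_pool[cell_type]) for _ in range(n_cells))
    n_genes = len(sampled[0])
    return [sum(cell[g] for cell in sampled) / len(sampled)
            for g in range(n_genes)]


bulk_mul = simulate_mul(fractions, mean_gep)
bulk_ave = simulate_ave(fractions, cell_pool, total_cell_number=4,
                        rng=random.Random(0))
print(bulk_mul)  # [7.5, 2.0, 2.5]
print(bulk_ave)
```

In this toy setting the two approaches agree on genes whose expression is constant within a cell type and differ only through sampling noise on the others.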

class deside.simulation.SingleCellTypeGEPGenerator(merged_sc_dataset_file_path, cell_type2subtype, sc_dataset_ids, simu_bulk_dir, bulk_dataset_name, zero_ratio_threshold: float = 0.97, sc_dataset_gep_type: str = 'log_space', subtype_col_name: str | None = None, cell_type_col_name: str = 'cell_type')

Generating single cell type GEPs (sctGEPs)

Parameters:

simu_bulk_dir – the directory to save simulated bulk cell GEPs
merged_sc_dataset_file_path – the file path of pre-merged single cell datasets
cell_type2subtype – cell types used when generating bulk GEPs, {cell_type: [sub_cell_type1, sub_cell_type2, …], …}
sc_dataset_ids – IDs of the single cell datasets used when generating bulk GEPs
bulk_dataset_name – the name of generated bulk dataset, only for naming
zero_ratio_threshold – the maximum allowed ratio of zero-valued genes in a single cell GEP; a GEP is removed if its zero ratio > threshold
sc_dataset_gep_type – the type of single cell GEPs, log_space or linear_space

generate_frac_sc(sample_prefix: str | None = None, sample_type: str = 'positive') → DataFrame

Generate cell fractions for single cell samples. Positive samples contain only one specific cell type; negative samples contain >= 2 cell types in equal proportions.

Parameters:

sample_prefix – prefix of sample names
sample_type – positive samples (one specific cell type) or negative samples (>= 2 cell types)

Returns:

generated cell fraction, sample by cell type
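
The distinction between positive and negative samples can be sketched with plain dictionaries. The helper names, the sample naming scheme, and the use of dictionaries instead of a DataFrame are assumptions made for illustration only.

```python
import random


def positive_fractions(cell_types):
    """One sample per cell type: the target has fraction 1, all others 0."""
    return {
        f"{target}_pos": {ct: 1.0 if ct == target else 0.0 for ct in cell_types}
        for target in cell_types
    }


def negative_fraction(cell_types, n_mixed, rng):
    """One sample mixing n_mixed (>= 2) cell types in equal proportions."""
    if n_mixed < 2:
        raise ValueError("negative samples need at least 2 cell types")
    chosen = set(rng.sample(cell_types, n_mixed))
    share = 1.0 / n_mixed
    return {ct: share if ct in chosen else 0.0 for ct in cell_types}


cell_types = ["B cell", "T cell", "NK cell", "Fibroblast"]  # hypothetical
pos = positive_fractions(cell_types)
neg = negative_fraction(cell_types, n_mixed=2, rng=random.Random(0))
print(pos["T cell_pos"])
print(neg)
```

Each row (sample) sums to 1, matching the "sample by cell type" layout of the returned fractions.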

generate_samples(n_sample_each_cell_type: int = 10000, n_base_for_positive_samples: int = 100, sample_type: str = 'positive', sep_by_patient=False, simu_method='ave', test_set: bool = False, minimum_n_base: int = 1, ref_gene_list_file_path: str | None = None)

Parameters:

n_sample_each_cell_type – the number of samples to generate for each cell type
n_base_for_positive_samples – the number of single cells to average
sample_type – positive means only 1 cell type is used, negative means more than 1 cell type is used
sep_by_patient – only sampling from one patient in the original dataset if True
simu_method – one of: ave (averaging all GEPs); scale_by_mGEP (scaling by the mean GEP of all samples in the TCGA dataset); random_replacement (replacing each gene expression value < 1 with another value selected randomly from the same cell type)
test_set – if True, generate a test set using the same cell types as the training set, otherwise generate data set with all cell types
minimum_n_base – the minimum number of single cells to average for each bulk sample
ref_gene_list_file_path – the file path of a reference gene list, i.e., the genes used to filter the single cell dataset when generating the SCT dataset; mainly useful with a customized scRNA-seq dataset
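
As a rough illustration of the ave method for positive samples, the sketch below averages n_base cells drawn from a single cell type and optionally restricts sampling to one patient, as sep_by_patient describes. The function name, the (patient_id, GEP) data layout, and sampling with replacement are assumptions, not details confirmed by the documentation.

```python
import random

# Hypothetical single cells of one cell type: (patient_id, GEP) pairs.
t_cells = [
    ("patient_1", [4.0, 0.0]),
    ("patient_1", [6.0, 2.0]),
    ("patient_2", [1.0, 1.0]),
    ("patient_2", [3.0, 5.0]),
]


def simulate_positive_sample(cells, n_base, rng, sep_by_patient=False):
    """Average n_base single-cell GEPs drawn from one cell type."""
    pool = cells
    if sep_by_patient:
        # Restrict the pool to one randomly chosen patient.
        patient = rng.choice(sorted({pid for pid, _ in cells}))
        pool = [cell for cell in cells if cell[0] == patient]
    picked = [rng.choice(pool)[1] for _ in range(n_base)]  # with replacement
    n_genes = len(picked[0])
    return [sum(gep[g] for gep in picked) / n_base for g in range(n_genes)]


rng = random.Random(42)
mixed = simulate_positive_sample(t_cells, n_base=100, rng=rng)
single = simulate_positive_sample(t_cells, n_base=100, rng=rng,
                                  sep_by_patient=True)
print(mixed)
print(single)
```

When sep_by_patient is enabled, the averaged profile stays within the range spanned by that patient's cells.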

deside.simulation.filtering_by_gene_list_and_pca_plot(bulk_exp: DataFrame, tcga_exp: DataFrame, gene_list: list, result_dir: str, simu_dataset_name: str, n_components=5, pca_model_name_postfix='', bulk_exp_type='log_space', tcga_exp_type='TPM', pca_model_file_path=None, pca_data_file_path=None, h5ad_file_path=None, cell_frac_file: DataFrame | None = None, figsize=(5, 5), if_plot_pca: bool = False)

Apply gene-level filtering based on a specific gene list to simulated bulk GEPs. After filtering, the bulk GEPs are rescaled to log2(TPM+1). Then perform PCA and plot the results for both the filtered bulk GEPs and the TCGA samples.

Parameters:

bulk_exp – pd.DataFrame, Simulated bulk expression data in log2cpm1p format.
tcga_exp – pd.DataFrame, TCGA expression data in TPM format.
gene_list – list, List of gene names used to filter the bulk GEPs.
result_dir – str, Directory where results will be saved.
simu_dataset_name – str, Name of the simulated dataset
n_components – int, Number of PCA components to compute. Must be >= 2.
pca_model_name_postfix – str, Postfix for naming the PCA model file.
bulk_exp_type – str, Type of bulk GEPs, either ‘TPM’ or ‘log_space’.
tcga_exp_type – str, Type of TCGA samples, either ‘TPM’ or ‘log_space’.
pca_model_file_path – str, Path to save the fitted PCA model.
pca_data_file_path – str, Path to save the PCA-transformed data.
h5ad_file_path – str, Path to save the filtered bulk GEPs as a .h5ad file, if not None.
cell_frac_file – pd.DataFrame, Cell proportion matrix used to generate simulated bulk GEPs. Required if h5ad_file_path is specified.
figsize – tuple, Size of the figure for PCA plotting (width, height).
if_plot_pca – bool, if True, plot PCA components using both simulated bulk GEPs and TCGA samples.

Returns:

None
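
The filter-then-rescale step can be sketched for a single sample as below. This is a hedged illustration, not the function's actual code: filter_and_rescale is a hypothetical helper, and it assumes that log space means log2(x + 1) and that rescaling means renormalizing the retained genes so their linear values sum to 10^6.

```python
import math


def filter_and_rescale(log_gep, gene_list):
    """Keep the genes in gene_list and rescale the sample to log2(TPM + 1)."""
    # Back-transform the retained genes from log2(x + 1) to linear space.
    linear = {gene: 2.0 ** log_gep[gene] - 1.0
              for gene in gene_list if gene in log_gep}
    total = sum(linear.values())
    if total == 0:
        raise ValueError("all retained genes have zero expression")
    # Renormalize to one million, then return to log space.
    return {gene: math.log2(value / total * 1e6 + 1.0)
            for gene, value in linear.items()}


# Hypothetical sample with three genes in log2(x + 1) format.
sample = {
    "GENE_A": math.log2(301.0),
    "GENE_B": math.log2(101.0),
    "GENE_C": math.log2(601.0),
}
rescaled = filter_and_rescale(sample, gene_list=["GENE_A", "GENE_B"])
print(rescaled)
```

After dropping GENE_C, the remaining two genes keep their 3:1 ratio in linear space but now sum to one million.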

deside.simulation.get_gene_list_for_filtering(bulk_exp_file: str, tcga_file: str, result_file_path: str, q_col_name: list | None = None, filtering_type: str = 'quantile_range', corr_threshold: float = 0.3, n_gene_max: int = 1000, corr_result_fp: str | None = None, quantile_range: list | None = None) → list

Perform gene-level filtering based on the specific filtering type.

Parameters:

bulk_exp_file – str, Path to the simulated bulk expression file in log2(TPM + 1) format to be filtered.
tcga_file – str, Path to the TCGA reference file in TPM format
result_file_path – str, Path where the filtered gene list will be saved.
q_col_name – list, Optional. Column names for quantile ranges.
filtering_type – str, Specify the method of filtering. Options include: ‘high_corr_gene’, ‘quantile_range’, ‘all_genes’, ‘high_corr_gene_and_quantile_range’
    - ‘high_corr_gene’: Select genes whose expression values are highly correlated with the cell proportions of any cell type
    - ‘quantile_range’: Select genes whose median expression value lies within the [q_5, q_95] quantile range of their counterparts in TCGA (default)
corr_result_fp – str, path to the result file after ‘high_corr_gene’ filtering
quantile_range – tuple, (lower_quantile, median_quantile, upper_quantile). A gene is removed if its median expression in the simulated bulk GEPs falls outside this range of the corresponding gene in the TCGA dataset
corr_threshold – float, correlation threshold for gene filtering
n_gene_max – int, maximum number of genes to select for each cell type during ‘high_corr_gene’ filtering

Returns:

list, filtered gene list
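
The ‘quantile_range’ option can be illustrated with a small standard-library sketch. The helper names and toy data are hypothetical, the quantile uses simple linear interpolation, and both inputs are assumed to be on the same expression scale already (the real function takes files in different formats, so a conversion step would be needed first).

```python
import statistics


def quantile(values, q):
    """Quantile with linear interpolation between order statistics."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def genes_within_quantile_range(simulated, reference, lower=0.05, upper=0.95):
    """Select genes whose simulated median lies inside the reference range."""
    selected = []
    for gene, sim_values in simulated.items():
        if gene not in reference:
            continue
        median = statistics.median(sim_values)
        q_low = quantile(reference[gene], lower)
        q_high = quantile(reference[gene], upper)
        if q_low <= median <= q_high:
            selected.append(gene)
    return selected


# Hypothetical per-gene expression values across samples.
reference = {"GENE_A": list(range(101)), "GENE_B": list(range(101))}
simulated = {"GENE_A": [40, 50, 60], "GENE_B": [98, 99, 100]}
print(genes_within_quantile_range(simulated, reference))  # ['GENE_A']
```

GENE_B is dropped because its simulated median (99) lies above the 95th percentile of the reference values.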

deside.simulation.random_generation_fraction(n_samples: int = 100, cell_types: list = (), sample_prefix: str | None = None, fixed_range: dict | None = None) → DataFrame

Create pure random cell fractions, same as Scaden

Parameters:

n_samples – number of samples to create
cell_types – a list of cell types
sample_prefix – prefix of sample names
fixed_range – the range of cell fraction for each cell type, {‘cell_type’: (0, 100), ‘’: (), …}

Returns:

generated cell fraction, sample by cell type
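
A minimal sketch of purely random fractions, assuming the Scaden-style approach of drawing one uniform random number per cell type and normalizing each sample to sum to 1. The function name, the sample naming, and the dictionary output are hypothetical, and the fixed_range option is not covered here.

```python
import random


def random_fractions(cell_types, rng):
    """Draw one uniform value per cell type and normalize to sum to 1."""
    raw = [rng.random() for _ in cell_types]
    total = sum(raw)
    return {ct: value / total for ct, value in zip(cell_types, raw)}


cell_types = ["B cell", "T cell", "NK cell"]  # hypothetical
rng = random.Random(7)
samples = {f"s_{i}": random_fractions(cell_types, rng) for i in range(5)}
for name, fracs in samples.items():
    print(name, {ct: round(v, 3) for ct, v in fracs.items()})
```

Each entry of samples corresponds to one row of a sample-by-cell-type table.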

deside.simulation.segment_generation_fraction(n_samples: int | None = None, max_value: int = 10000, cell_types: list | None = None, sample_prefix: str | None = None, cell_prop_prior: dict | None = None) → DataFrame

Generate cell fractions by fixing a specific percentage (gradient) range, i.e., from 1% to 100%, for each cell type in turn, generating n samples for each gradient of each cell type

Parameters:

n_samples – the total number of samples to generate
max_value – cell proportion will be sampled from U(0, max_value), and then scaled to [0, 1]
cell_types – None or a list of cell types. Using all valid cell types if None. All valid cell types can be found by list(deside.utility.cell_type2abbr.keys()).
sample_prefix – only for naming
cell_prop_prior – the prior range of cell proportions for each cell type, {‘cell_type’: (0, 0.1), ‘’: (0, 0.2), …}

Returns:

generated cell fraction, sample by cell type
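
One way to picture gradient-based sampling is sketched below. It is an illustration under explicit assumptions and should not be read as the algorithm used by segment_generation_fraction: the function name, the ten equal-width segments, and the way the remaining proportion is shared among the other cell types (integers drawn from U(0, max_value) and then rescaled) are all hypothetical choices.

```python
import random


def segment_fractions(cell_types, target, low, high, rng, max_value=10000):
    """Fix the target's fraction inside [low, high]; randomize the rest."""
    target_frac = rng.uniform(low, high)
    others = [ct for ct in cell_types if ct != target]
    raw = [rng.randint(0, max_value) for _ in others]
    total = sum(raw)
    if total == 0:  # extremely unlikely; fall back to equal shares
        raw = [1] * len(others)
        total = len(others)
    rest = 1.0 - target_frac
    fractions = {ct: rest * value / total for ct, value in zip(others, raw)}
    fractions[target] = target_frac
    return fractions


cell_types = ["B cell", "T cell", "NK cell"]  # hypothetical
rng = random.Random(1)
samples = []
for target in cell_types:
    for step in range(10):  # assumed gradients: 0-10%, 10-20%, ..., 90-100%
        low, high = step / 10, (step + 1) / 10
        fracs = segment_fractions(cell_types, target, low, high, rng)
        samples.append((target, low, high, fracs))
print(len(samples))  # 30
```

Cycling through every cell type and every segment gives each cell type coverage of the full proportion range, which purely random sampling rarely achieves for high proportions.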