spared.filtering.filter_dataset
- spared.filtering.filter_dataset(adata: AnnData, param_dict: dict) → AnnData
Perform complete filtering pipeline of a slide collection.
This function takes a completely unfiltered and unprocessed (in raw counts) slide collection and filters it (both samples and genes) according to the param_dict argument. A summary list of the steps is the following:
- Filter out observations with total_counts outside the range [param_dict['cell_min_counts'], param_dict['cell_max_counts']]. This filters out low quality observations not suitable for analysis.
- Compute the exp_frac for each gene. This means that for each slide in the collection we compute the fraction of the spots that express each gene and then take the minimum across all the slides (see get_exp_frac function for more details).
- Compute the glob_exp_frac for each gene. This is similar to the exp_frac but instead of computing for each slide and taking the minimum we compute it for the whole collection. Slides don’t matter here (see get_glob_exp_frac function for more details).
- Filter out genes depending on the param_dict['wildcard_genes'] value, the options are the following:
  - param_dict['wildcard_genes'] == 'None':
    - Filter out genes that are not expressed in at least param_dict['min_exp_frac'] of spots in each slide.
    - Filter out genes that are not expressed in at least param_dict['min_glob_exp_frac'] of spots in the whole collection.
    - Filter out genes with counts outside the range [param_dict['gene_min_counts'], param_dict['gene_max_counts']].
  - param_dict['wildcard_genes'] != 'None':
    - Read .txt file specified by param_dict['wildcard_genes'] and leave only the genes that are in this file.
- If there are spots with zero counts in all genes after gene filtering, remove them.
- Compute quality control metrics using scanpy’s sc.pp.calculate_qc_metrics function.
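The difference between exp_frac and glob_exp_frac is easiest to see on a toy example. The following is a minimal sketch in plain Python so that it runs without scanpy or AnnData installed; it is not the spared implementation. The helper names (expression_fraction, exp_frac_sketch, glob_exp_frac_sketch), the toy counts and the threshold values are all assumptions made for illustration.

```python
def expression_fraction(spots):
    """Fraction of spots with a nonzero count, one value per gene (column)."""
    n_genes = len(spots[0])
    return [
        sum(1 for spot in spots if spot[g] > 0) / len(spots)
        for g in range(n_genes)
    ]


def exp_frac_sketch(counts, slide_ids):
    """Per-slide expression fraction, then the minimum across slides."""
    slides = {}
    for spot, slide_id in zip(counts, slide_ids):
        slides.setdefault(slide_id, []).append(spot)
    per_slide = [expression_fraction(spots) for spots in slides.values()]
    # zip(*per_slide) groups the per-slide values of each gene together.
    return [min(fracs) for fracs in zip(*per_slide)]


def glob_exp_frac_sketch(counts):
    """Expression fraction over the whole collection, ignoring slides."""
    return expression_fraction(counts)


# Toy collection: 4 spots (rows) x 2 genes (columns), two slides of two spots.
counts = [
    [1, 0],  # slide s1
    [2, 0],  # slide s1
    [0, 3],  # slide s2
    [4, 5],  # slide s2
]
slide_ids = ["s1", "s1", "s2", "s2"]

exp_frac = exp_frac_sketch(counts, slide_ids)   # [0.5, 0.0]
glob_exp_frac = glob_exp_frac_sketch(counts)    # [0.75, 0.5]

# Illustrative thresholds (assumed values, not library defaults).
min_exp_frac, min_glob_exp_frac = 0.25, 0.5
keep = [
    e >= min_exp_frac and g >= min_glob_exp_frac
    for e, g in zip(exp_frac, glob_exp_frac)
]
print(keep)  # [True, False]
```

The second gene is expressed in half of all spots, so it passes the global threshold, but it is absent from slide s1, so its exp_frac is 0 and it is dropped. This is why the two fractions are checked separately when param_dict['wildcard_genes'] == 'None'.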
- Parameters:
adata (ad.AnnData) – An unfiltered (unexpressed genes are encoded as 0 on the adata.X matrix) slide collection.
param_dict (dict) – Dictionary that contains filtering and processing parameters. Keys that must be present are:
  - 'cell_min_counts' (int): Minimum total counts for a spot to be valid.
  - 'cell_max_counts' (int): Maximum total counts for a spot to be valid.
  - 'gene_min_counts' (int): Minimum total counts for a gene to be valid.
  - 'gene_max_counts' (int): Maximum total counts for a gene to be valid.
  - 'min_exp_frac' (float): Minimum fraction of spots in each slide that must express a gene for it to be valid.
  - 'min_glob_exp_frac' (float): Minimum fraction of spots in the whole collection that must express a gene for it to be valid.
  - 'wildcard_genes' (str): Path to a .txt file with the genes to keep or 'None' to filter genes based on the other keys.
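As an illustration, a param_dict with every required key might be built as in the sketch below. The numeric values are placeholders chosen for the example, not recommended settings, and check_param_dict is a hypothetical helper that is not part of spared. Note that 'wildcard_genes' takes the string 'None', not the Python None object.

```python
# Required keys and their documented types, taken from the parameter list above.
REQUIRED_KEYS = {
    "cell_min_counts": int,
    "cell_max_counts": int,
    "gene_min_counts": int,
    "gene_max_counts": int,
    "min_exp_frac": float,
    "min_glob_exp_frac": float,
    "wildcard_genes": str,
}


def check_param_dict(param_dict):
    """Hypothetical helper (not part of spared): fail early on bad input."""
    missing = sorted(key for key in REQUIRED_KEYS if key not in param_dict)
    if missing:
        raise KeyError(f"param_dict is missing required keys: {missing}")
    for key, expected in REQUIRED_KEYS.items():
        value = param_dict[key]
        if not isinstance(value, expected):
            raise TypeError(
                f"{key!r} should be {expected.__name__}, "
                f"got {type(value).__name__}"
            )


# Placeholder values for illustration only.
param_dict = {
    "cell_min_counts": 1000,
    "cell_max_counts": 100000,
    "gene_min_counts": 1000,
    "gene_max_counts": 1000000,
    "min_exp_frac": 0.8,
    "min_glob_exp_frac": 0.8,
    "wildcard_genes": "None",  # the string 'None', or a path to a .txt file
}

check_param_dict(param_dict)

# With spared installed and an unfiltered AnnData collection in `adata`:
# from spared.filtering import filter_dataset
# filtered = filter_dataset(adata, param_dict)
```

Validating the dictionary before the call is a design choice for this sketch: it reports a missing or mistyped key up front rather than partway through the filtering steps.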
- Returns:
The filtered adata collection.
- Return type:
ad.AnnData