spared.filtering.filter_dataset

spared.filtering.filter_dataset(adata: AnnData, param_dict: dict) → AnnData

Run the complete filtering pipeline on a slide collection.

This function takes a completely unfiltered and unprocessed slide collection (in raw counts) and filters both its observations and its genes according to the param_dict argument. In summary, the steps are:

  1. Filter out observations with total_counts outside the range [param_dict['cell_min_counts'], param_dict['cell_max_counts']]. This removes low-quality observations that are not suitable for analysis.

  2. Compute exp_frac for each gene: for each slide in the collection, compute the fraction of spots that express the gene, then take the minimum across all slides (see the get_exp_frac function for more details).

  3. Compute glob_exp_frac for each gene. This is similar to exp_frac, but instead of computing the fraction per slide and taking the minimum, it is computed once over all spots in the collection, ignoring slide boundaries (see the get_glob_exp_frac function for more details).

  4. Filter out genes depending on the value of param_dict['wildcard_genes']. The options are:

    1. param_dict['wildcard_genes'] == 'None':

      • Filter out genes that are not expressed in at least param_dict['min_exp_frac'] of spots in each slide.

      • Filter out genes that are not expressed in at least param_dict['min_glob_exp_frac'] of spots in the whole collection.

      • Filter out genes with total counts outside the range [param_dict['gene_min_counts'], param_dict['gene_max_counts']].

    2. param_dict['wildcard_genes'] != 'None':

      • Read the .txt file specified by param_dict['wildcard_genes'] and keep only the genes listed in it.

  5. If there are spots with zero counts in all genes after gene filtering, remove them.

  6. Compute quality control metrics using scanpy’s sc.pp.calculate_qc_metrics function.
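
The difference between exp_frac and glob_exp_frac (steps 2 and 3) and the threshold-based gene filter (step 4, first option) can be sketched in plain Python. This is only an illustration under stated assumptions, not the spared implementation: the helpers expression_fractions and keep_gene_mask are hypothetical, the toy data stands in for adata.X and a per-spot slide label, and the inclusive comparisons follow the wording "at least" and the closed ranges above.

```python
from collections import defaultdict


def expression_fractions(counts, slide_ids):
    """Return (exp_frac, glob_exp_frac), one value per gene.

    counts: list of per-spot rows of raw counts (hypothetical stand-in for adata.X).
    slide_ids: slide label of each spot, same length as counts.
    """
    n_genes = len(counts[0])
    expressed = defaultdict(lambda: [0] * n_genes)  # per slide: spots expressing each gene
    n_spots = defaultdict(int)                      # per slide: number of spots
    for row, slide in zip(counts, slide_ids):
        n_spots[slide] += 1
        for g, value in enumerate(row):
            if value > 0:
                expressed[slide][g] += 1

    # exp_frac: fraction of expressing spots per slide, then the minimum across slides.
    exp_frac = [
        min(expressed[s][g] / n_spots[s] for s in n_spots) for g in range(n_genes)
    ]
    # glob_exp_frac: fraction of expressing spots over the whole collection.
    total_spots = len(counts)
    glob_exp_frac = [
        sum(expressed[s][g] for s in n_spots) / total_spots for g in range(n_genes)
    ]
    return exp_frac, glob_exp_frac


def keep_gene_mask(counts, slide_ids, min_exp_frac, min_glob_exp_frac,
                   gene_min_counts, gene_max_counts):
    """Boolean mask of genes that pass all three threshold filters (assumed inclusive)."""
    exp_frac, glob_exp_frac = expression_fractions(counts, slide_ids)
    gene_totals = [sum(column) for column in zip(*counts)]
    return [
        e >= min_exp_frac
        and g >= min_glob_exp_frac
        and gene_min_counts <= t <= gene_max_counts
        for e, g, t in zip(exp_frac, glob_exp_frac, gene_totals)
    ]


# Toy collection: 4 spots, 3 genes, 2 slides.
counts = [
    [5, 0, 1],  # slide A
    [3, 0, 0],  # slide A
    [2, 4, 0],  # slide B
    [1, 6, 0],  # slide B
]
slide_ids = ["A", "A", "B", "B"]

exp_frac, glob_exp_frac = expression_fractions(counts, slide_ids)
mask = keep_gene_mask(counts, slide_ids, 0.5, 0.5, 1, 100)
```

In this toy data the second gene is expressed in half of all spots (glob_exp_frac of 0.5) but in none of the spots of slide A, so its exp_frac is 0.0 and it is dropped once min_exp_frac is above zero.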

Parameters:
  • adata (ad.AnnData) – An unfiltered slide collection in raw counts. Unexpressed genes are encoded as 0 in the adata.X matrix.

  • param_dict (dict) –

    Dictionary that contains filtering and processing parameters. Keys that must be present are:

    • 'cell_min_counts' (int): Minimum total counts for a spot to be valid.

    • 'cell_max_counts' (int): Maximum total counts for a spot to be valid.

    • 'gene_min_counts' (int): Minimum total counts for a gene to be valid.

    • 'gene_max_counts' (int): Maximum total counts for a gene to be valid.

    • 'min_exp_frac' (float): Minimum fraction of spots in each slide that must express a gene for it to be valid.

    • 'min_glob_exp_frac' (float): Minimum fraction of spots in the whole collection that must express a gene for it to be valid.

    • 'wildcard_genes' (str): Path to a .txt file listing the genes to keep, or the string 'None' to filter genes based on the other keys.
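
As a minimal sketch, a param_dict with all required keys might look like the following. The numeric values are purely illustrative and are not recommended defaults, and REQUIRED_KEYS and missing_keys are hypothetical helpers that are not part of spared. Note that 'wildcard_genes' takes the string 'None', not Python's None.

```python
# Keys listed as required in the parameter description above.
REQUIRED_KEYS = {
    "cell_min_counts",
    "cell_max_counts",
    "gene_min_counts",
    "gene_max_counts",
    "min_exp_frac",
    "min_glob_exp_frac",
    "wildcard_genes",
}

# Illustrative values only; choose thresholds appropriate for your data.
param_dict = {
    "cell_min_counts": 1000,
    "cell_max_counts": 100000,
    "gene_min_counts": 1000,
    "gene_max_counts": 1000000,
    "min_exp_frac": 0.8,
    "min_glob_exp_frac": 0.8,
    "wildcard_genes": "None",  # the string 'None', so threshold-based gene filtering is used
}


def missing_keys(params):
    """Return the required keys absent from params, sorted (hypothetical helper)."""
    return sorted(REQUIRED_KEYS - params.keys())


if missing_keys(param_dict):
    raise KeyError(f"param_dict is missing keys: {missing_keys(param_dict)}")

# With spared installed and an unfiltered AnnData collection `adata` at hand,
# the call would then be:
# filtered = spared.filtering.filter_dataset(adata, param_dict)
```

Checking the keys up front gives a single clear error message before any filtering runs.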

Returns:

The filtered adata collection.

Return type:

ad.AnnData