33. Normalization
33.1. Motivation
Unlike UMI counts, which follow a negative binomial distribution, ADT data is less sparse, with a negative peak from non-specific antibody binding and a positive peak reflecting enrichment of specific cell surface proteins [Zheng et al., 2022]. The capture efficiency varies from cell to cell due to differences in biophysical properties. Since CITE-seq experiments enrich for a priori selected features, compositional biases are more severe. As with scRNA-seq data, many approaches to normalization exist. We cover the two most widely used methods, which require different input data and starting points.
ADT data can be normalized using the Centered Log-Ratio (CLR) transformation [Stoeckius et al., 2017]. In addition, a more recent low-level normalization method tailored to the challenges of this modality exists: DSB (denoised and scaled by background). DSB normalization removes two kinds of noise. First, it uses empty droplets to estimate the ambient background noise and removes it. Second, it uses the background population mean and isotype controls (antibodies that bind non-specifically to the cells) to define and remove cell-to-cell technical noise [Mulè et al., 2022].
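The first DSB step can be illustrated with a minimal NumPy sketch. This is not the muon implementation: the function name `rescale_by_background`, the simulated counts, and the pseudocount of 10 are assumptions chosen for illustration. The idea is to log-transform the counts and express each protein in units of standard deviations above its mean in empty droplets.

```python
import numpy as np


def rescale_by_background(cells, empty, pseudocount=10.0):
    """Hypothetical helper: scale each protein by its empty-droplet background.

    cells, empty: arrays of shape (n_droplets, n_proteins) with raw ADT counts.
    """
    cells_log = np.log(np.asarray(cells, dtype=float) + pseudocount)
    empty_log = np.log(np.asarray(empty, dtype=float) + pseudocount)
    background_mean = empty_log.mean(axis=0)  # per-protein mean in empty droplets
    background_sd = empty_log.std(axis=0)  # per-protein SD in empty droplets
    return (cells_log - background_mean) / background_sd


# Simulated toy data (an assumption): protein 1 is truly expressed on cells,
# proteins 0 and 2 only show ambient-level signal.
rng = np.random.default_rng(0)
empty = rng.poisson(lam=[2, 5, 3], size=(1000, 3))
cells = rng.poisson(lam=[2, 60, 3], size=(200, 3))

scaled = rescale_by_background(cells, empty)
print(scaled.mean(axis=0))
```

In this toy setup, proteins at ambient level end up close to zero, while the expressed protein sits several background standard deviations above zero.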
33.2. Environment setup
import muon as mu
import pandas as pd
import pooch
import scanpy as sc
import warnings
warnings.filterwarnings("ignore")
sc.settings.verbosity = 0
sc.settings.set_figure_params(
    dpi=80,
    facecolor="white",
    frameon=False,
)
33.3. Loading the data
cite_quality_control = pooch.retrieve(
    url="https://figshare.com/ndownloader/files/41452449",
    fname="cite_quality_control.h5mu",
    path=".",
    known_hash=None,
    progressbar=True,
)
We simply load the MuData object that was saved at the end of the quality control chapter.
mdata = mu.read("cite_quality_control.h5mu")
mdata
MuData object with n_obs × n_vars = 118563 × 36741
  var: 'gene_ids', 'feature_types'
  2 modalities
    rna: 118563 x 36601
      obs: 'donor', 'batch'
      var: 'gene_ids', 'feature_types'
    prot: 118563 x 140
      obs: 'donor', 'batch', 'n_genes_by_counts', 'log1p_n_genes_by_counts', 'total_counts', 'log1p_total_counts', 'n_counts', 'outliers'
      var: 'gene_ids', 'feature_types', 'n_cells_by_counts', 'mean_counts', 'log1p_mean_counts', 'pct_dropout_by_counts', 'total_counts', 'log1p_total_counts'
# Unfiltered object, including empty droplets; it serves as the background for DSB.
# This assumes cite_raw.h5mu is already present in the working directory.
mdata_raw = mu.read("cite_raw.h5mu")
mdata_raw
MuData object with n_obs × n_vars = 24807643 × 36741
  var: 'gene_ids', 'feature_types'
  2 modalities
    rna: 24807643 x 36601
      obs: 'donor', 'batch'
      var: 'gene_ids', 'feature_types'
    prot: 24807643 x 140
      obs: 'donor', 'batch'
      var: 'gene_ids', 'feature_types'
33.4. DSB normalization
We are now ready to normalize the data. Because the unfiltered data is available, we can use the distribution of the raw data, which includes the empty droplets, as background. We also have isotype controls to define and remove cell-to-cell technical variation.
Isotype controls are antibodies that bind non-specifically to the cells present in this study, meaning that you would not expect a significant difference in their abundance between cells. Thus, we can use the values of the isotype controls to normalize technical differences.
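To make the role of the isotype controls concrete, here is a deliberately simplified sketch. It is not the DSB algorithm itself: DSB combines the isotype controls with a per-cell background mean to derive its technical component, whereas this sketch only uses the mean isotype signal per cell and regresses it out of every protein with ordinary least squares. The function name `remove_technical_component` and the simulated data are assumptions.

```python
import numpy as np


def remove_technical_component(values, isotype_idx):
    """Hypothetical helper: regress the mean isotype signal out of each protein.

    values: array of shape (n_cells, n_proteins), already log-scaled.
    isotype_idx: column indices of the isotype controls.
    """
    values = np.asarray(values, dtype=float)
    factor = values[:, isotype_idx].mean(axis=1)  # one technical value per cell
    centered = factor - factor.mean()
    # Least-squares slope of every protein on the centered technical factor.
    slopes = centered @ (values - values.mean(axis=0)) / (centered @ centered)
    # Subtract the fitted technical part; per-protein means are preserved.
    return values - np.outer(centered, slopes)


# Simulated toy data (an assumption): a hidden per-cell technical effect is
# added to all proteins; columns 3 and 4 act as isotype controls.
rng = np.random.default_rng(1)
n_cells = 500
technical = rng.normal(0.0, 1.0, size=n_cells)
signal = rng.normal(0.0, 1.0, size=(n_cells, 5))
signal[:, 3:] = rng.normal(0.0, 0.1, size=(n_cells, 2))  # isotypes: almost no biology
values = signal + technical[:, None]

corrected = remove_technical_component(values, isotype_idx=[3, 4])
print(np.corrcoef(values[:, 0], technical)[0, 1])
print(np.corrcoef(corrected[:, 0], technical)[0, 1])
```

Because the isotype controls carry almost no biological signal in this simulation, their per-cell mean tracks the technical effect closely, and the correlation between a protein and the technical effect drops sharply after correction.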
We call the normalization function mu.prot.pp.dsb with the filtered and raw MuData objects as well as the names of the isotype controls.
isotype_controls = ["Mouse-IgG1", "Mouse-IgG2a", "Mouse-IgG2b", "Rat-IgG2b"]
# Keep a copy of the raw counts in a layer before DSB overwrites .X
mdata["prot"].layers["counts"] = mdata["prot"].X.copy()
mu.prot.pp.dsb(mdata, mdata_raw, isotype_controls=isotype_controls)
Let’s have a look at the counts before denoising and normalization.
pd.Series(mdata["prot"].layers["counts"][:100, :100].toarray().flatten()).value_counts()
1.0 1090
0.0 1045
2.0 918
3.0 691
4.0 581
...
350.0 1
706.0 1
296.0 1
970.0 1
763.0 1
Name: count, Length: 524, dtype: int64
After denoising and normalization, the values are no longer non-negative integer counts but continuous values that can also be negative.
pd.Series(mdata["prot"].X[:100, :100].flatten()).value_counts()
0.311677 2
-1.002554 1
2.573147 1
1.804169 1
-0.403206 1
..
0.142890 1
0.268634 1
-0.078150 1
0.258447 1
-0.271008 1
Name: count, Length: 9999, dtype: int64
33.5. Centered Log-Ratio normalization
If you don’t have the unfiltered data available, you can also normalize the ADT data with mu.prot.pp.clr, which implements Centered Log-Ratio normalization. This type of normalization involves no denoising. Instead, we assume that the geometric mean is a good reference against which to express all other values (by dividing by it) [Quinn et al., 2018]. In effect, we take the natural log ratio of each protein in each cell relative to either the other proteins or the other cells, depending on the implementation. Originally, the ratio was computed across proteins, but it was later changed to across cells, which made the normalization less dependent on the antibody panel [Mulè et al., 2022].
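The two CLR variants can be sketched in a few lines of NumPy. This is an illustrative pseudocount variant, not muon's exact implementation: it computes log1p of the counts and subtracts the mean of the logged values, which is equivalent to dividing the counts plus one by their geometric mean before taking the log. The function name `clr_sketch` and the toy matrix are assumptions.

```python
import numpy as np


def clr_sketch(counts, axis=0):
    """Hypothetical helper: centered log-ratio with a pseudocount of one.

    counts: array of shape (n_cells, n_proteins).
    axis=0 centers each protein across cells,
    axis=1 centers each cell across proteins.
    """
    logged = np.log1p(np.asarray(counts, dtype=float))
    # Subtracting the mean log equals dividing by the geometric mean.
    return logged - logged.mean(axis=axis, keepdims=True)


# Toy matrix (an assumption): 3 cells x 3 proteins.
counts = np.array(
    [
        [0, 10, 200],
        [1, 12, 50],
        [3, 8, 400],
    ]
)

across_cells = clr_sketch(counts, axis=0)  # reference: same protein in other cells
across_proteins = clr_sketch(counts, axis=1)  # reference: other proteins in same cell
print(across_cells)
print(across_proteins)
```

With `axis=1`, the value of one protein depends on which other proteins are in the panel, while with `axis=0` each protein is only compared to itself in other cells, which is why the across-cells variant is less dependent on the antibody panel.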
mdata
MuData object with n_obs × n_vars = 118563 × 36741
  var: 'gene_ids', 'feature_types'
  2 modalities
    rna: 118563 x 36601
      obs: 'donor', 'batch'
      var: 'gene_ids', 'feature_types'
    prot: 118563 x 140
      obs: 'donor', 'batch', 'n_genes_by_counts', 'log1p_n_genes_by_counts', 'total_counts', 'log1p_total_counts', 'n_counts', 'outliers'
      var: 'gene_ids', 'feature_types', 'n_cells_by_counts', 'mean_counts', 'log1p_mean_counts', 'pct_dropout_by_counts', 'total_counts', 'log1p_total_counts'
      layers: 'counts'
mdata.write("cite_normalization.h5mu")
33.6. References
Matthew P. Mulè, Andrew J. Martins, and John S. Tsang. Normalizing and denoising protein expression data from droplet-based single cell profiling. Nature Communications, 13(1):2099, Apr 2022. doi:10.1038/s41467-022-29356-8.
Thomas P Quinn, Ionas Erb, Mark F Richardson, and Tamsyn M Crowley. Understanding sequencing data as compositions: an outlook and review. Bioinformatics, 34(16):2870–2878, Aug 2018. doi:10.1093/bioinformatics/bty175.
Marlon Stoeckius, Christoph Hafemeister, William Stephenson, Brian Houck-Loomis, Pratip K. Chattopadhyay, Harold Swerdlow, Rahul Satija, and Peter Smibert. Simultaneous epitope and transcriptome measurement in single cells. Nature Methods, 14(9):865–868, Sep 2017. doi:10.1038/nmeth.4380.
Ye Zheng, Seong-Hwan Jun, Yuan Tian, Florian Mair, and Raphael Gottardo. Robust normalization and integration of single-cell protein expression across CITE-seq datasets. bioRxiv, 2022. URL: https://www.biorxiv.org/content/early/2022/05/01/2022.04.29.489989, doi:10.1101/2022.04.29.489989.
33.7. Contributors
We gratefully acknowledge the contributions of:
33.7.2. Reviewers
Lukas Heumos
Anna Schaar