23. Single-cell ATAC sequencing#
23.1. Motivation#
Every cell of an organism shares the same DNA with the same set of functional units referred to as genes. With this in mind, what determines the tremendous diversity of cells, ranging from natural killer cells of the immune system to neurons transmitting electrochemical signals throughout the body? In the previous chapters, we saw that cell identity and function can be inferred from the gene expression profile of each cell. The control of gene expression is driven by a complex interplay of regulatory mechanisms such as DNA methylation, histone modifications, and transcription factor activity. Chromatin accessibility largely reflects the combined regulatory state of a cell and serves as a layer of information orthogonal to mRNA levels for describing cell identity. Furthermore, exploring the chromatin accessibility profile provides additional insights into gene regulatory mechanisms and cell differentiation processes that might not be captured by scRNA-seq data.
As depicted above, chromatin accessibility is influenced by factors ranging from higher-order structure down to low-level DNA modifications. (1) Chromatin scaffolding driven by scaffold/matrix attachment regions (S/MARs) and proteins in the nuclear periphery, such as nuclear pore complexes (NPCs) or lamins, influences chromatin compactness and gene expression [Buchwalter et al., 2019, Narwade et al., 2019]. (2, 3) More local accessibility, often described as densely packed heterochromatin versus open euchromatin, can be actively controlled by ATP-dependent and ATP-independent chromatin remodeling complexes and by histone modifications such as acetylation, methylation and phosphorylation. (4) The binding of transcription factors can also influence nucleosome positioning and lead to the recruitment of histone-modifying enzymes and chromatin remodelers. (5) At the DNA level, methylation of CpG sites influences the binding affinity of various proteins, including transcription factors and histone-modifying enzymes, which in combination can lead to the silencing of the corresponding genomic regions. For an animated visualization, we also recommend this 2 minute video on epigenetics and the regulation of gene activity (credits to Nicole Ethen from the SQE, University of Illinois). For a comprehensive and up-to-date review on genome regulation and TF activity, we refer to [Isbel et al., 2022].
Taken together, an essential component defining cell identity is the regulatory state of each cell. In this chapter, we focus on chromatin accessibility data measured by the Single-Cell Assay for Transposase-Accessible Chromatin with High-Throughput Sequencing (scATAC-seq) or as part of the 10x Multiome assay (scATAC combined with scRNA-seq).
After walking you through the preprocessing steps, this analysis will allow us to:
characterize cell identity with an orthogonal approach to scRNA-seq analysis
identify cell state specific transcriptional regulators
link gene expression to sequence features
disentangle epigenetic mechanisms driving cell differentiation and disease states
23.2. Experimental assay#
Commercially available kits are currently the most widely used experimental protocols. We therefore showcase our analysis on data generated with the 10x Multiome assay (with minor changes, this also applies to data generated with the unimodal 10x single cell ATAC-seq assay).
The key principle used to measure chromatin accessibility is the Assay for Transposase-Accessible Chromatin with High-Throughput Sequencing. The starting point is a single-cell suspension of the tissue of interest. Nuclei are extracted, and the transposition is performed in bulk using a Tn5 transposase, which binds to open regions in the chromatin and generates tagmented DNA fragments. Nuclei are then loaded onto a 10x Chromium Controller, and droplets containing gel beads and single nuclei, also referred to as Gel Beads-in-Emulsion (GEMs), are formed. Within each droplet, RNA molecules and DNA fragments are barcoded, and after dissolving the GEMs, the barcoded molecules are pre-amplified to obtain the final scATAC-seq and scRNA-seq libraries.
In the figure below, we illustrate the fragmentation process of the scATAC-seq part [Martens et al., 2022]. scATAC-seq uses the Tn5 transposase enzyme to insert sequencing adapters into open chromatin regions of single cells, which results in the cleavage of DNA and the attachment of sequencing adapters to create Tn5 fragments. Two Tn5 insertions create one fragment with sequencing adapters, and the orientation of insertion is crucial as only fragments flanked with two distinct adapters can be captured and amplified. The amplified fragments are then sequenced paired-end and aligned to the reference genome.
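The adapter-orientation constraint can be made concrete with a small enumeration. The sketch below assumes, purely for illustration, that each Tn5 insertion attaches one of two adapters (named "A" and "B" here, hypothetical labels) with equal probability and independently of the neighboring insertion. Under that assumption, only half of all fragments carry two distinct adapters.

```python
from itertools import product

# Hypothetical adapter labels; assumption: both are inserted with equal probability.
adapters = ("A", "B")

# Every fragment is flanked by two insertions, each carrying one adapter.
combinations = list(product(adapters, repeat=2))

# Only fragments flanked by two distinct adapters can be captured and amplified.
amplifiable = [pair for pair in combinations if pair[0] != pair[1]]

fraction_amplifiable = len(amplifiable) / len(combinations)
print(combinations)
print(f"amplifiable fraction: {fraction_amplifiable}")
```

In real experiments the fraction may deviate from this idealized value, since adapter loading and insertion are not guaranteed to be perfectly balanced.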
23.3. Data characteristics - feature definition and sparsity#
Single-cell ATAC-seq data measures chromatin accessibility across the entire genome. Since this includes coding and non-coding regions, genes cannot be used as pre-defined features, as is the case for scRNA-seq data. Instead, the most common approach to define biologically meaningful features is to detect regions of high accessibility compared to a background, i.e. peaks in the distribution of fragment counts along the genome. Peaks in coding regions indicate that a gene might be transcribed, while in non-coding regions, accessibility is seen as a prerequisite for, or a result of, the binding of regulatory proteins such as transcription factors. However, calling peaks on all cells of a dataset can hide cell-type-specific accessibility or the accessibility profiles of rare cell types. A proposed solution is therefore to call cluster-specific peaks, which requires a prior, peak-independent clustering of the cells. SnapATAC [Fang et al., 2021] and ArchR [Granja et al., 2021] suggest a binning strategy that creates features by dividing the entire genome into uniformly sized windows and uses this feature set to cluster the cells.
Once the feature set is defined in one way or the other, a measure of Tn5 activity in those features is computed for each cell. Three main approaches are used: counting reads overlapping a feature, counting fragments overlapping a feature, and binarization. While the 10x Genomics Cell Ranger ATAC pipeline counts reads overlapping peak regions, the widely used Signac framework [Stuart et al., 2020] counts the number of fragments overlapping a feature. ArchR [Granja et al., 2021], on the other hand, counts read ends and binarizes them by default.
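The binning strategy can be sketched in a few lines of plain Python. This is a minimal illustration, not the implementation used by SnapATAC or ArchR: the function name `bin_fragments`, the toy fragments, and the window size of 5,000 bp are assumptions chosen for readability, and fragments are treated as half-open intervals `[start, end)` that contribute one count to every window they overlap.

```python
from collections import Counter

def bin_fragments(fragments, bin_size=5000):
    """Count fragments per (barcode, chromosome, window index).

    Assumption: a fragment adds one count to each fixed-width window
    that it overlaps; real tools may count insertion sites instead.
    """
    counts = Counter()
    for chrom, start, end, barcode in fragments:
        first_bin = start // bin_size
        last_bin = (end - 1) // bin_size  # end is exclusive
        for bin_index in range(first_bin, last_bin + 1):
            counts[(barcode, chrom, bin_index)] += 1
    return counts

# Toy fragments: (chromosome, start, end, cell barcode)
fragments = [
    ("chr1", 100, 350, "cellA"),
    ("chr1", 4900, 5100, "cellA"),  # spans the boundary between windows 0 and 1
    ("chr1", 5200, 5400, "cellB"),
    ("chr2", 12000, 12180, "cellB"),
]

tile_counts = bin_fragments(fragments, bin_size=5000)
for key, value in sorted(tile_counts.items()):
    print(key, value)
```

Because the windows are defined without looking at the data, the resulting cell-by-window matrix can be used for a first clustering before any peaks are called.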
It is important to note that counting reads and counting fragments are not equivalent. In scATAC-seq, paired-end sequencing generates two reads per fragment that are usually in close proximity to each other. As a result, odd counts are only generated if one read of a pair lies outside the feature [Martens et al., 2022]. This means that the chosen counting strategy can impact the resulting count distribution. It has been shown that read counting leads to a count distribution with many more even than odd counts, while fragment counting does not have this effect [Martens et al., 2022, Miao and Kim, 2022].
Another important characteristic of scATAC-seq data is its high sparsity. Since there are only two copies of DNA in each cell of a diploid organism, the maximum number of counts for a given base-pair position is 2 (note that a peak or bin can still have more than two counts since it spans many base pairs). This can lead to many features having zero counts, resulting in a highly sparse count matrix. To account for this, some methods binarize the counts, which means that a feature is called accessible as soon as one read or fragment overlaps it [Ashuach et al., 2022, Bravo González-Blas et al., 2019, Granja et al., 2021]. However, binarization can result in a loss of information and can be less sensitive in detecting small differences in accessibility [Martens et al., 2022].
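The difference between the two counting strategies can be illustrated with a toy example. The sketch below makes a simplifying assumption: each read of a pair is represented only by the position of its fragment end, so "counting reads" means counting fragment ends that fall inside the feature. The function names and coordinates are hypothetical.

```python
def count_reads(fragments, peak_start, peak_end):
    """Count fragment ends (one per read of a pair) inside [peak_start, peak_end)."""
    n_reads = 0
    for start, end in fragments:
        for position in (start, end - 1):  # end is exclusive
            if peak_start <= position < peak_end:
                n_reads += 1
    return n_reads

def count_fragments(fragments, peak_start, peak_end):
    """Count fragments that overlap [peak_start, peak_end) by at least one base."""
    return sum(1 for start, end in fragments if start < peak_end and end > peak_start)

peak = (1000, 1500)

# Cell 1: both fragments lie entirely inside the peak.
cell_1 = [(1050, 1200), (1300, 1450)]
# Cell 2: the second fragment extends beyond the peak, so one read falls outside.
cell_2 = [(1050, 1200), (1400, 1600)]

for name, cell in (("cell_1", cell_1), ("cell_2", cell_2)):
    print(name, "reads:", count_reads(cell, *peak), "fragments:", count_fragments(cell, *peak))
```

Both cells have two fragments overlapping the peak, yet the read count is even for the first cell and odd only for the second, where a fragment crosses the peak boundary.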
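To make sparsity and binarization tangible, the following sketch uses a small, made-up cell-by-feature count matrix. The values and helper names are illustrative assumptions; real scATAC-seq matrices are far larger and are typically stored in a sparse format.

```python
# Toy count matrix: rows are cells, columns are peaks or bins (made-up values).
counts = [
    [0, 2, 0, 0, 1, 0],
    [0, 0, 0, 4, 0, 0],
    [1, 0, 0, 0, 0, 0],
]

def sparsity(matrix):
    """Fraction of entries that are zero."""
    values = [value for row in matrix for value in row]
    return sum(1 for value in values if value == 0) / len(values)

def binarize(matrix):
    """Call a feature accessible (1) as soon as it has at least one count."""
    return [[1 if value > 0 else 0 for value in row] for row in matrix]

binary = binarize(counts)

print("sparsity:", round(sparsity(counts), 3))
print("binarized:", binary)
print("total signal before/after:", sum(map(sum, counts)), sum(map(sum, binary)))
```

Binarization leaves the zero pattern untouched but collapses all positive counts to 1, which is the loss of quantitative information mentioned above.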
It is important to note that the best counting strategy for scATAC-seq data is still under debate, and further independent benchmarking is needed. Ultimately, the choice of counting strategy will depend on the specific research question and the characteristics of the dataset.
23.4. Overview of the data analysis workflow#
In the following sections, we will guide you through a standard workflow for analyzing scATAC-seq data. The accompanying overview figure presents the various stages of the analysis and highlights differences between popular frameworks used for this purpose. To begin with, we will explain the concepts of quality control and dimensionality reduction using Python and muon. Towards the end, we will demonstrate how to transfer your muon object to R and perform data visualization and interpretation using Signac.
23.5. References#
Tal Ashuach, Daniel A. Reidenbach, Adam Gayoso, and Nir Yosef. PeakVI: A deep generative model for single-cell chromatin accessibility analysis. Cell Reports Methods, pages 100182, March 2022. URL: https://www.sciencedirect.com/science/article/pii/S2667237522000376 (visited on 2022-03-22).
Carmen Bravo González-Blas, Liesbeth Minnoye, Dafni Papasokrati, Sara Aibar, Gert Hulselmans, Valerie Christiaens, Kristofer Davie, Jasper Wouters, and Stein Aerts. cisTopic: cis-regulatory topic modeling on single-cell ATAC-seq data. Nature Methods, 16(5):397–400, May 2019. URL: https://www.nature.com/articles/s41592-019-0367-1 (visited on 2022-03-10), doi:10.1038/s41592-019-0367-1.
Abigail Buchwalter, Jeanae M. Kaneshiro, and Martin W. Hetzer. Coaching from the sidelines: the nuclear periphery in genome regulation. Nature Reviews Genetics, 20(1):39–50, 2019. URL: http://www.nature.com/articles/s41576-018-0063-5 (visited on 2023-04-01), doi:10.1038/s41576-018-0063-5.
Rongxin Fang, Sebastian Preissl, Yang Li, Xiaomeng Hou, Jacinta Lucero, Xinxin Wang, Amir Motamedi, Andrew K. Shiau, Xinzhu Zhou, Fangming Xie, Eran A. Mukamel, Kai Zhang, Yanxiao Zhang, M. Margarita Behrens, Joseph R. Ecker, and Bing Ren. Comprehensive analysis of single cell ATAC-seq data with SnapATAC. Nature Communications, 12(1):1337, February 2021. URL: https://www.nature.com/articles/s41467-021-21583-9 (visited on 2022-03-31), doi:10.1038/s41467-021-21583-9.
Jeffrey M. Granja, M. Ryan Corces, Sarah E. Pierce, S. Tansu Bagdatli, Hani Choudhry, Howard Y. Chang, and William J. Greenleaf. ArchR is a scalable software package for integrative single-cell chromatin accessibility analysis. Nature Genetics, 53(3):403–411, March 2021. URL: https://www.nature.com/articles/s41588-021-00790-6 (visited on 2022-03-27), doi:10.1038/s41588-021-00790-6.
Luke Isbel, Ralph S. Grand, and Dirk Schübeler. Generating specificity in genome regulation through transcription factor sensitivity to chromatin. Nature Reviews. Genetics, 23(12):728–740, December 2022. doi:10.1038/s41576-022-00512-6.
Laura D. Martens, David S. Fischer, Fabian J. Theis, and Julien Gagneur. Modeling fragment counts improves single-cell ATAC-seq analysis. bioRxiv, May 2022. URL: https://www.biorxiv.org/content/10.1101/2022.05.04.490536v1 (visited on 2023-03-29), doi:10.1101/2022.05.04.490536.
Zhen Miao and Junhyong Kim. Is single nucleus ATAC-seq accessibility a qualitative or quantitative measurement? bioRxiv, April 2022. URL: https://www.biorxiv.org/content/10.1101/2022.04.20.488960v1 (visited on 2022-04-28), doi:10.1101/2022.04.20.488960.
Nitin Narwade, Sonal Patel, Aftab Alam, Samit Chattopadhyay, Smriti Mittal, and Abhijeet Kulkarni. Mapping of scaffold/matrix attachment regions in human genome: a data mining exercise. Nucleic Acids Research, 47(14):7247–7261, 2019. URL: https://academic.oup.com/nar/article/47/14/7247/5527279 (visited on 2023-04-01), doi:10.1093/nar/gkz562.
Tim Stuart, Avi Srivastava, Caleb Lareau, and Rahul Satija. Multimodal single-cell chromatin analysis with Signac. bioRxiv, November 2020. URL: https://www.biorxiv.org/content/10.1101/2020.11.09.373613v1 (visited on 2022-03-31), doi:10.1101/2020.11.09.373613.
23.6. Contributors#
We gratefully acknowledge the contributions of:
23.6.2. Reviewers#
Lukas Heumos