PAGA Explained: Graphical Abstractions of Single-Cell Data

Author:Murphy | View: 20384 | Time: 2025-03-22 21:26:11

In single-cell genomics, we profile tens of thousands of features in individual cells, whether at the level of gene expression, protein expression, or some other genome-wide modality, and we often want a high-level summary of the resulting data. One route is differential gene expression, where we perform statistical tests between clusters or cell populations to determine which genes differ significantly between them. Another is data visualization, where we compress the many measured features into 2 or 3 dimensions to make sense of the data.

One effective way of making sense of single-cell data is to convert it into a graph, i.e., a tuple of nodes and edges. In this scenario, nodes are cells, and edges define connections between cells. This data structure provides a flexible way to group cells, explore the relationships between different types of cells (e.g., diseased versus normal cells, or different stages of development), and visualize the overall structure of the dataset.
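
As a minimal sketch of this data structure, a graph can be stored as a set of nodes plus a set of edges, from which each cell's neighbors follow directly. The cell names, edges, and the helper `neighbors_of` below are made up for illustration.

```python
# Hypothetical toy graph: four cells and three undirected connections.
nodes = {"cell_a", "cell_b", "cell_c", "cell_d"}
edges = {("cell_a", "cell_b"), ("cell_b", "cell_c"), ("cell_c", "cell_d")}
graph = (nodes, edges)


def neighbors_of(graph):
    """Build a {cell: set of neighboring cells} mapping from (nodes, edges)."""
    nodes, edges = graph
    adjacency = {node: set() for node in nodes}
    for u, v in edges:
        # Edges are undirected, so record the connection in both directions.
        adjacency[u].add(v)
        adjacency[v].add(u)
    return adjacency


adjacency = neighbors_of(graph)
print(sorted(adjacency["cell_b"]))  # ['cell_a', 'cell_c']
```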

This is where methods like PAGA can prove very useful. PAGA (partition-based graph abstraction) is a method for visualizing and analyzing high-dimensional single-cell genomics data. It was introduced in a 2019 paper by Wolf et al. [1] and has since become a valuable tool in the field.

PAGA workflow. Figure from the original paper, "PAGA: graph abstraction reconciles clustering with trajectory inference through a topology preserving map of single cells" [1], licensed under Creative Commons License 4.0.

At its core, PAGA is a graph-based approach for visualizing the relationships between individual cells in a dataset. It starts by constructing a graph in which each cell is a node and the edges between nodes represent similarities between cells. This is a k-nearest-neighbor (kNN) graph, where nearest neighbors are identified using Euclidean distance in a chosen embedding (though the authors note that the user can opt for other distance metrics). The graph's edges are then weighted using a kernel function applied to the embedding coordinates, with the option of either an adaptive Gaussian kernel or an exponential kernel.
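
To make the construction concrete, here is a small plain-Python sketch of a kNN graph with Gaussian edge weights. It is an illustration under stated assumptions, not the implementation used in the paper or in scanpy: the function names and toy coordinates are made up, and the kernel bandwidth is set per cell to the distance of its k-th neighbor, which is one simple way to make a Gaussian kernel adaptive.

```python
import math


def knn_graph(points, k):
    """Return {i: [(j, distance), ...]} holding the k nearest neighbors of each point."""
    graph = {}
    for i, p in enumerate(points):
        # Euclidean distance from point i to every other point.
        distances = [(j, math.dist(p, q)) for j, q in enumerate(points) if j != i]
        distances.sort(key=lambda pair: pair[1])
        graph[i] = distances[:k]
    return graph


def gaussian_weights(graph):
    """Weight each edge with exp(-d^2 / sigma_i^2).

    Assumption for this sketch: sigma_i is the distance from cell i to its
    farthest retained neighbor, so the bandwidth adapts to local density.
    """
    weighted = {}
    for i, neighbors in graph.items():
        sigma = neighbors[-1][1]
        weighted[i] = [(j, math.exp(-(d ** 2) / sigma ** 2)) for j, d in neighbors]
    return weighted


# Hypothetical 2-D embedding: two tight groups of three cells each.
toy_points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
              (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
toy_knn = knn_graph(toy_points, k=2)
toy_weights = gaussian_weights(toy_knn)
print(toy_knn[0])
print(toy_weights[0])
```

With k=2, every cell in this toy embedding connects only to the two other cells of its own group, and closer neighbors receive larger weights.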

After the graph has been constructed, it is coarse-grained via clustering. In the paper, the authors use the popular community-detection algorithm Louvain, but note that you can easily use any desired clustering algorithm, such as Leiden, as we will demonstrate later in this article. Treating the graph as undirected, they then build the PAGA graph, in which each node is a cluster (a partition of the cell graph) and each edge is weighted by what they define as the PAGA connectivity measure. This is a test statistic that quantifies how connected a pair of clusters is, defined as the ratio of the number of edges between the clusters (the inter-edges) to the number of inter-edges expected under a random assignment of edges.
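
The ratio described above can be sketched in a few lines of plain Python. This is a simplified illustration: the null model here assumes the graph's edges are placed uniformly at random among all pairs of cells, whereas the model used in the paper and in scanpy differs in its details. The function name and the toy graph are hypothetical.

```python
from collections import Counter
from itertools import combinations


def connectivity_ratio(edges, labels):
    """Observed / expected number of inter-cluster edges for each cluster pair.

    edges  : iterable of undirected (u, v) pairs between cell indices
    labels : labels[u] is the cluster of cell u
    """
    edges = list(edges)
    n_cells = len(labels)
    sizes = Counter(labels)

    # Count observed edges between each unordered pair of distinct clusters.
    observed = Counter()
    for u, v in edges:
        if labels[u] != labels[v]:
            observed[frozenset((labels[u], labels[v]))] += 1

    # Simplified null model (an assumption of this sketch): the same number of
    # edges spread uniformly over all n * (n - 1) / 2 possible cell pairs.
    edge_density = len(edges) / (n_cells * (n_cells - 1) / 2)

    ratios = {}
    for a, b in combinations(sorted(sizes), 2):
        expected = edge_density * sizes[a] * sizes[b]
        ratios[(a, b)] = observed[frozenset((a, b))] / expected
    return ratios


# Two triangles of cells joined by a single bridging edge.
toy_edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
toy_labels = ["A", "A", "A", "B", "B", "B"]
toy_ratios = connectivity_ratio(toy_edges, toy_labels)
print(toy_ratios)
```

In this toy graph the two clusters share one edge where about four would be expected by chance, so the ratio comes out near 0.24, indicating a weak connection.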

One of the unique features of PAGA is that it partitions the graph into discrete clusters based on the connectivity between nodes. This allows researchers to identify groups of cells that are more closely related to each other than to cells in other clusters. The clusters can then be further analyzed to identify genes or pathways that are specifically associated with the cells in each cluster.

In addition to clustering, PAGA can also be used to infer developmental trajectories from single-cell genomics data. The idea is to estimate how likely transitions between cells are based on their positions in the graph, and to read paths through the coarse-grained graph as candidate differentiation routes. This allows researchers to identify the most likely path of development from one cell type to another, and to identify key genes or pathways that are involved in this process.
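
One common way to obtain transition probabilities from a weighted graph is to normalize each cell's outgoing edge weights so that they sum to one, which defines a random walk on the graph. The sketch below illustrates only this general idea on made-up weights; it is not the exact procedure from the paper, and the function and cell names are hypothetical.

```python
def transition_probabilities(weights):
    """Row-normalize {cell: {neighbor: weight}} into random-walk probabilities."""
    probabilities = {}
    for cell, neighbors in weights.items():
        total = sum(neighbors.values())
        probabilities[cell] = {nbr: w / total for nbr, w in neighbors.items()}
    return probabilities


# Hypothetical symmetric edge weights between four cells.
toy_edge_weights = {
    "stem": {"progenitor": 3.0, "other": 1.0},
    "progenitor": {"stem": 3.0, "mature": 3.0},
    "mature": {"progenitor": 3.0},
    "other": {"stem": 1.0},
}
toy_transitions = transition_probabilities(toy_edge_weights)
print(toy_transitions["stem"])  # {'progenitor': 0.75, 'other': 0.25}
```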

Let's try out PAGA firsthand. We will use the scanpy package [4] for analyzing single-cell RNA-sequencing data in Python. It can be installed via either conda or pip as follows:

conda install -c conda-forge scanpy python-igraph leidenalg

pip install 'scanpy[leiden]'

Scanpy has several built-in published datasets for pedagogical use. We will use a small, synthetic dataset of blood stem cell differentiation, a process known as hematopoiesis [3, 5]. This dataset models the differentiation of stem cells into one of 4 cell types: monocytes, megakaryocytes, neutrophils, and erythrocytes. These cells support our immune system (monocytes and neutrophils), produce platelets that assist with clotting (megakaryocytes), and transport oxygen while carrying the pigment that gives our blood its iconic red color (erythrocytes).

import scanpy as sc
adata = sc.datasets.krumsiek11()

From here, we can calculate a nearest-neighbor graph with 30 neighbors per cell and draw it using the Fruchterman-Reingold algorithm [2]. In brief, this algorithm treats nodes (cells, in our case) as steel rings and edges as springs joining them, and it moves the nodes until the system stabilizes, i.e., until the forces acting on each node net out to zero.

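sc.pp.neighbors(adata, n_neighbors=30)
# layout='fr' requests the Fruchterman-Reingold layout explicitly
sc.tl.draw_graph(adata, layout='fr')
sc.pl.draw_graph(adata)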

Visualization of synthetic single-cell graph, simulating blood cell development. Image by Author.
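
For readers curious how the rings-and-springs picture becomes an algorithm, here is a minimal plain-Python sketch of a Fruchterman-Reingold-style layout. It is not scanpy's or igraph's implementation: the function name, the unit-area canvas, the starting temperature, and the cooling rate are all illustrative assumptions.

```python
import math
import random


def fruchterman_reingold(n_nodes, edges, iterations=200, seed=0):
    """Tiny force-directed layout; returns a list of (x, y) positions."""
    rng = random.Random(seed)
    pos = [[rng.random(), rng.random()] for _ in range(n_nodes)]
    k = math.sqrt(1.0 / n_nodes)   # ideal edge length for a unit-area canvas
    temperature = 0.1              # caps how far a node may move per step

    for _ in range(iterations):
        disp = [[0.0, 0.0] for _ in range(n_nodes)]

        # Repulsion between every pair of nodes, with magnitude k^2 / d.
        for i in range(n_nodes):
            for j in range(i + 1, n_nodes):
                dx = pos[i][0] - pos[j][0]
                dy = pos[i][1] - pos[j][1]
                d = max(math.hypot(dx, dy), 1e-9)
                force = k * k / d
                disp[i][0] += dx / d * force
                disp[i][1] += dy / d * force
                disp[j][0] -= dx / d * force
                disp[j][1] -= dy / d * force

        # Attraction along each edge (the "spring"), with magnitude d^2 / k.
        for i, j in edges:
            dx = pos[i][0] - pos[j][0]
            dy = pos[i][1] - pos[j][1]
            d = max(math.hypot(dx, dy), 1e-9)
            force = d * d / k
            disp[i][0] -= dx / d * force
            disp[i][1] -= dy / d * force
            disp[j][0] += dx / d * force
            disp[j][1] += dy / d * force

        # Move each node along its net force, limited by the temperature.
        for i in range(n_nodes):
            length = max(math.hypot(disp[i][0], disp[i][1]), 1e-9)
            step = min(length, temperature)
            pos[i][0] += disp[i][0] / length * step
            pos[i][1] += disp[i][1] / length * step

        temperature *= 0.97        # cool down so the layout settles

    return [tuple(p) for p in pos]


# Two triangles joined by one edge, laid out from a fixed random start.
toy_graph_edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
toy_layout = fruchterman_reingold(6, toy_graph_edges)
for node, (x, y) in enumerate(toy_layout):
    print(node, round(x, 3), round(y, 3))
```

The shrinking temperature is what lets the layout settle: early steps can be large, while later steps are tiny adjustments around the point where attraction and repulsion balance.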

We can add some annotation to the cells by clustering the data using the Leiden algorithm. This takes a resolution parameter: the larger its value, the more clusters we obtain. We don't want a lot of clusters, since we're working with only 5 populations here (stem cells plus the 4 mature cell types), so we'll keep this parameter relatively low:

sc.tl.leiden(adata, resolution=0.3)
sc.pl.draw_graph(adata, color='leiden', legend_loc='on data')

Clustered single-cell graph of synthetic blood cell development using the Leiden algorithm. Image by Author

We can further annotate these cells based on the known biology. As I'm already well-versed in hematopoiesis, I can readily annotate the clusters, but for datasets that are not as well-described, we can perform additional analyses such as differential gene expression to identify significant genes defining each cluster. After that, we can run PAGA and perform the actual partitioning and coarse-graining of the data described earlier:

adata.obs['clusters'] = adata.obs['leiden']
new_categories = ['0', '1/Erythrocytes', '2', '3/Monocytes', '4', '5/Stem Cells',
                  '6', '7/Megakaryocytes', '8/Neutrophils']
adata.rename_categories('clusters', new_categories)

sc.tl.paga(adata, groups='clusters')
sc.pl.paga(adata, frameon=False)

PAGA embedding of the single-cell graph modeling blood cell development. Image by Author

Here, stem cells sit in the center and can differentiate into one of the 4 cell types. The edges connecting clusters indicate how strongly connected they are, with thicker edges denoting higher PAGA connectivity. This can be further organized into a tree graph by specifying a "root". Since mature cells emerge from stem cells, we will specify the stem cell cluster as our root node. For more complex systems, we can specify additional roots.

sc.pl.paga(adata, layout='rt', root=[5])

Tree representation of PAGA partition of the cells. Image by Author
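
To illustrate what choosing a root does, the sketch below assigns each node of a small, made-up cluster graph a depth by breadth-first search from the root; tree layouts arrange nodes according to such depths. The cluster names and edges are hypothetical, and scanpy's own `rt` layout relies on a dedicated tree-drawing algorithm rather than this code.

```python
from collections import deque


def depths_from_root(edges, root):
    """Breadth-first search: number of edges from the root to each reachable node."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)

    depth = {root: 0}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        # Visit neighbors in sorted order so the result is reproducible.
        for neighbor in sorted(adjacency.get(node, ())):
            if neighbor not in depth:
                depth[neighbor] = depth[node] + 1
                queue.append(neighbor)
    return depth


# Hypothetical coarse-grained graph: a stem cluster, two intermediate
# clusters, and two mature clusters.
toy_cluster_edges = [("stem", "p1"), ("stem", "p2"),
                     ("p1", "erythrocyte"), ("p2", "monocyte")]
toy_depths = depths_from_root(toy_cluster_edges, root="stem")
print(toy_depths)
```

Choosing a different root changes every depth, which is why the root should reflect known biology, such as the stem cell cluster in our example.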

Here, we obtain a cleaner presentation of the original graph, showing the differentiation hierarchy from stem cells to each of the 4 mature blood cell types. In this layout, monocytes and megakaryocytes sit close together, whereas erythrocytes and neutrophils lie further apart. Such proximity is best read as a hypothesis about shared progenitors rather than shared function, since megakaryocytes are platelet precursors and not white blood cells. For systems that are not as well described, we can use this coarse-grained representation as a global window into the network and as fuel for hypothesis generation, leading to deeper analyses of the genetic signatures underlying the various clusters so that we can annotate them accordingly.

Conclusion

PAGA is a powerful method for analyzing and visualizing high-dimensional single-cell genomics data. Its ability to identify discrete clusters and reveal developmental trajectories gives us a broad view of a dataset while hinting at deeper, putative causes, and this has made it a popular tool in the field. It is especially useful for biological systems that are not as well understood as our example, enabling a top-down approach to interrogating their latent properties.

References:

[1] F. A. Wolf, F. K. Hamey, M. Plass, J. Solana, J. S. Dahlin, B. Göttgens, N. Rajewsky, L. Simon, F. J. Theis, PAGA: graph abstraction reconciles clustering with trajectory inference through a topology preserving map of single cells (2019), Genome Biology

[2] https://github.com/gephi/gephi/wiki/Fruchterman-Reingold

[3] https://my.clevelandclinic.org/health/articles/24287-hematopoiesis

[4] https://scanpy.readthedocs.io/en/stable/index.html

[5] J. Krumsiek, C. Marr, T. Schroeder, F. J. Theis, Hierarchical Differentiation of Myeloid Progenitors Is Encoded in the Transcription Factor Network, (2011), PLOS One
